(Not intended to be exhaustive. 😉)
- Why did my job fail?
- Ok, I saw the error code, but what did that actually mean?
- Can you make changes on the cluster itself so this will succeed next time?
- What physical node did my job run on?
- How can I make sure my retries don’t run on that node again?
- Ok, I understand that the machines are identical, but can I just make sure to never run on that node ever again?
- What can I do to prevent this failure from happening again?
On priorities or “quality of service” mechanisms:
- Why is my job taking so long to run? (And/or, why are my requests being throttled?)
- How can I increase my priority to avoid this throttling again?
- Why is that other user’s job running with a higher priority than mine?
- Can you please throttle their jobs which are getting an unfair priority?
- I have a critical deadline! Can you please override the priority score so my job runs immediately?
- Can you reserve some set of resources for my dedicated use for some amount of time?
- Can you provide some guarantee that my jobs will always run within a specified time?
- Can I get an interactive way to run on the machine? I don’t want to deal with writing a job script.
- (This may be requested for a single node, or at any scale up to the full cluster!)
On the development environment:
- Can you please provide package X on the system? (Where X may be something the admin team has never heard of)
- Why doesn’t the cluster provide the newest version of X? (Or an updated version of some API)
- When can I get access to the newest version?
- Why did you upgrade the cluster so quickly? Now my workflow is broken.
- Can you please go back to the version that works for my job?
Now, I’ll grant you that I can get a little snarky sometimes, but all of these questions may have some valid business reason!
Even seemingly obnoxious requests, like the user who shows up on a Friday asking for exclusive access to the full cluster, might actually turn out to be the most important thing to do in that context.
And valid or not, most of these are questions the users of any shared resource will eventually ask! I’ve run across too many cluster stacks that can’t actually inspect their priority system, don’t provide any tracing for failures, or can’t even tell the user which machines they ran on.
One way or another, you should probably be able to answer these kinds of questions, or you’ll have a lot of trouble operating your system over time.
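To make that a bit more concrete, here is a minimal sketch of the kind of per-job record that would let you answer the "where did it run, why that priority, why did it fail" questions. Everything in it is hypothetical: `JobRecord`, `explain`, the field names, and the sample values are my own illustration, not the schema of any particular scheduler.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical per-job record; every field name here is illustrative."""
    job_id: str
    user: str
    nodes: list[str]                      # physical nodes the job ran on
    priority: float                       # final score the scheduler used
    priority_factors: dict[str, float] = field(default_factory=dict)
    exit_code: Optional[int] = None       # None while the job is still running
    failure_reason: Optional[str] = None  # human-readable cause, if known


def explain(job: JobRecord) -> str:
    """Summarize where a job ran, how it was prioritized, and how it ended."""
    lines = [f"job {job.job_id} ({job.user}) ran on: {', '.join(job.nodes)}"]
    factors = ", ".join(
        f"{name}={value:g}" for name, value in sorted(job.priority_factors.items())
    )
    lines.append(f"priority {job.priority:g} from: {factors or 'no factors recorded'}")
    if job.exit_code is None:
        lines.append("status: still running")
    elif job.exit_code == 0:
        lines.append("status: succeeded")
    else:
        reason = job.failure_reason or "no reason recorded"
        lines.append(f"status: failed with exit code {job.exit_code} ({reason})")
    return "\n".join(lines)


# Made-up example values, purely for illustration.
job = JobRecord(
    job_id="1234",
    user="alice",
    nodes=["node017"],
    priority=0.75,
    priority_factors={"fairshare": 0.5, "age": 0.25},
    exit_code=137,
    failure_reason="killed by the out-of-memory handler",
)
print(explain(job))
```

The point is less the code than the habit: if the node list, the priority inputs, and a readable failure reason are stored with every job, most of the questions above become a lookup instead of an investigation.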