Questions your users will probably ask about the shared cluster

(Not intended to be exhaustive.)

On failures:

  • Why did my job fail?
    • Ok, I saw the error code, but what did that actually mean?
    • Can you make changes on the cluster itself so this will succeed next time?
  • What physical node did my job run on?
    • How can I make sure my retries don’t run on that node again?
    • Ok, I understand that the machines are identical, but can I just make sure to never run on that node ever again?
  • What can I do to prevent this failure from happening again?
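
Several of the failure questions above come down to keeping a small record per job attempt. As a rough sketch (the `Attempt` record, the node names, and the `nodes_to_avoid` helper are all hypothetical and not taken from any particular scheduler), something like this is enough to say what an exit code meant, where the job ran, and which nodes a retry might skip. It assumes the common shell convention that an exit status above 128 means "killed by signal N":

```python
from __future__ import annotations

import signal
from dataclasses import dataclass


@dataclass
class Attempt:
    """Hypothetical record a scheduler could keep for each job attempt."""
    job_id: str
    nodes: list[str]   # physical nodes the attempt ran on
    exit_code: int     # shell-style exit status


def explain_exit(code: int) -> str:
    """Translate a shell-style exit status into something readable."""
    if code == 0:
        return "succeeded"
    if code > 128:
        # Assumed convention: 128 + N means "killed by signal N".
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"killed by unknown signal {code - 128}"
    return f"exited with application error code {code}"


def nodes_to_avoid(attempts: list[Attempt]) -> set[str]:
    """Collect every node that hosted a failed attempt, for a retry exclude list."""
    return {node for a in attempts if a.exit_code != 0 for node in a.nodes}


# Made-up history: one failed attempt, then a successful retry elsewhere.
history = [
    Attempt("job-1", ["node017", "node018"], 137),
    Attempt("job-1", ["node042"], 0),
]
for attempt in history:
    print(attempt.job_id, attempt.nodes, explain_exit(attempt.exit_code))
print("avoid on retry:", sorted(nodes_to_avoid(history)))
```

Whether excluding nodes is a good idea is a separate policy question, but the record itself is what lets you answer the user at all.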

On priorities or “quality of service” mechanisms:

  • Why is my job taking so long to run? (And/or, why are my requests being throttled?)
  • How can I increase my priority to avoid this throttling again?
  • Why is that other user’s job running with a higher priority than mine?
    • Can you please throttle their jobs, which are getting an unfair priority?
  • I have a critical deadline! Can you please override the priority score so my job runs immediately?
  • Can you reserve some set of resources for my dedicated use for some amount of time?
  • Can you provide some guarantee that my jobs will always run within a specified time?
  • Can I get an interactive way to run on the machine? I don’t want to deal with writing a job script.
    • (This may be requested for a single node, or at any scale up to the full cluster!)
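
Most of the priority questions above are easier to answer when the score can be broken down on demand. Here is a minimal sketch, assuming a made-up weighted-sum policy. The factor names, the weights, and the users are purely illustrative, and a real scheduler will have its own formula:

```python
from __future__ import annotations

# Hypothetical weights for an illustrative weighted-sum priority policy.
WEIGHTS = {"queue_age": 1000, "fair_share": 2000, "qos": 500}


def priority_breakdown(factors: dict[str, float]) -> dict[str, float]:
    """Return each factor's contribution to the score (factors assumed in [0, 1])."""
    return {name: WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS}


def explain(user: str, factors: dict[str, float]) -> str:
    """Render a one-line, human-readable account of a job's priority."""
    parts = priority_breakdown(factors)
    total = sum(parts.values())
    detail = ", ".join(f"{name}={value:.0f}" for name, value in parts.items())
    return f"{user}: priority {total:.0f} ({detail})"


# Two made-up users: one has waited a long time, one has unused fair share.
print(explain("alice", {"queue_age": 0.9, "fair_share": 0.1, "qos": 0.0}))
print(explain("bob", {"queue_age": 0.2, "fair_share": 0.8, "qos": 1.0}))
```

With a breakdown like this, "why is their job ahead of mine?" becomes a matter of comparing two lines rather than guessing.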

On the development environment:

  • Can you please provide package X on the system? (Where X may be something the admin team has never heard of)
  • Why doesn’t the cluster provide the newest version of X? (Or an updated version of some API)
    • When can I get access to the newest version?
  • Why did you upgrade the cluster so quickly? Now my workflow is broken.
    • Can you please go back to the version that works for my job?

Now, I’ll grant you that I can get a little snarky sometimes, but all of these questions may have some valid business reason!

Even seemingly obnoxious requests, like the user who shows up on a Friday asking for exclusive access to the full cluster, might actually turn out to be the most important thing to do in that context.

And valid or not, most of these are questions the users of any shared resource will eventually ask! I’ve run across too many cluster stacks that can’t actually inspect their priority system, don’t provide any tracing for failures, or can’t even tell the user which machines they ran on.

One way or another, you should probably be able to answer these kinds of questions, or you’ll have a lot of trouble operating your system over time.