
(Not intended to be exhaustive.)

On failures:

Why did my job fail?
- Ok, I saw the error code, but what did that actually mean?
- Can you make changes on the cluster itself so this will succeed next time?
What physical node did my job run on?
- How can I make sure my retries don’t run on that node again?
- Ok, I understand that the machines are identical, but can I just make sure to never run on that node ever again?
What can I do to prevent this failure from happening again?
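
For the "don't retry on that node" questions, here is a rough sketch of what honoring the request could look like. It isn't taken from any real scheduler: the `Job` record, `pick_node`, and `record_failure` are all hypothetical names, and I'm assuming placement is as simple as "first available node that isn't excluded".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job record; the field names are illustrative only."""
    job_id: str
    excluded_nodes: set[str] = field(default_factory=set)
    attempts: list[str] = field(default_factory=list)  # node used by each failed attempt


def pick_node(job: Job, available: list[str]) -> str | None:
    """Return the first available node the job has not excluded, or None."""
    for node in available:
        if node not in job.excluded_nodes:
            return node
    return None


def record_failure(job: Job, node: str) -> None:
    """Remember where the failed attempt ran and avoid that node on retries."""
    job.attempts.append(node)
    job.excluded_nodes.add(node)


# Usage: the retry lands on a different node than the failed attempt.
job = Job("job-1")
nodes = ["node-a", "node-b"]
first = pick_node(job, nodes)
record_failure(job, first)
retry = pick_node(job, nodes)
```

Keeping the list of attempts around also answers the "what node did my job run on?" question for free.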

On priorities or “quality of service” mechanisms:

Why is my job taking so long to run? (And/or, why are my requests being throttled?)
How can I increase my priority to avoid this throttling again?
Why is that other user’s job running with a higher priority than mine?
- Can you please throttle their jobs which are getting an unfair priority?
I have a critical deadline! Can you please override the priority score so my job runs immediately?
Can you reserve some set of resources for my dedicated use for some amount of time?
Can you provide some guarantee that my jobs will always run within a specified time?
Can I get an interactive way to run on the machine? I don’t want to deal with writing a job script.
- (This may be requested for a single node, or at any scale up to the full cluster!)

On the development environment:

Can you please provide package X on the system? (Where X may be something the admin team has never heard of)
Why doesn’t the cluster provide the newest version of X? (Or an updated version of some API)
- When can I get access to the newest version?
Why did you upgrade the cluster so quickly? Now my workflow is broken.
- Can you please go back to the version that works for my job?

Now, I’ll grant you that I can get a little snarky sometimes, but all of these questions may have some valid business reason!

Even seemingly obnoxious requests, like the user who shows up on a Friday asking for exclusive access to the full cluster, might actually turn out to be the most important thing to do in that context.

And valid or not, most of these are questions the users of any shared resource will eventually ask! I’ve run across too many cluster stacks that can’t actually inspect their priority system, don’t provide any tracing for failures, or can’t even tell the user which machines they ran on.

One way or another, you should probably be able to answer these kinds of questions, or you’ll have a lot of trouble operating your system over time.
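
To make that a bit more concrete, here is a minimal sketch of a per-job record that could answer the three things I complained about above: where the job ran, how its priority was computed, and what happened when it failed. Everything here is an assumption for illustration: `JobRecord`, its field names, and the idea that priority is a plain sum of named factors are all made up, not a description of any real cluster stack.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class JobRecord:
    """Hypothetical per-job record; every field name here is illustrative."""
    job_id: str
    user: str
    nodes: list[str] = field(default_factory=list)
    exit_code: int | None = None
    failure_events: list[str] = field(default_factory=list)
    # Assumption: the priority score is a plain sum of named factors.
    priority_factors: dict[str, float] = field(default_factory=dict)

    def priority(self) -> float:
        """Total priority score, kept inspectable via its named factors."""
        return sum(self.priority_factors.values())

    def explain(self) -> str:
        """Summarize where the job ran, how it was prioritized, and why it failed."""
        factors = ", ".join(f"{k}={v:g}" for k, v in self.priority_factors.items())
        lines = [
            f"job {self.job_id} (user {self.user})",
            f"nodes: {', '.join(self.nodes) or 'not scheduled yet'}",
            f"priority: {self.priority():g} ({factors or 'no factors recorded'})",
        ]
        if self.exit_code not in (None, 0):
            lines.append(f"exit code: {self.exit_code}")
            lines.extend(f"  event: {event}" for event in self.failure_events)
        return "\n".join(lines)


# Usage with made-up values.
record = JobRecord(
    job_id="1234",
    user="alice",
    nodes=["node-07"],
    exit_code=137,
    failure_events=["killed after exceeding its memory limit"],
    priority_factors={"fair_share": 10.0, "queue_wait": 2.5},
)
summary = record.explain()
print(summary)
```

The point isn't the data structure itself. It's that each of the user questions maps to something you actually recorded, rather than something you have to reconstruct from logs after the fact.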

Questions your users will probably ask about the shared cluster