(Not intended to be exhaustive.)
- Why did my job fail?
- Ok, I saw the error code, but what did that actually mean?
- Can you make changes on the cluster itself so this will succeed next time?
- What physical node did my job run on?
- How can I make sure my retries don’t run on that node again?
- Ok, I understand that the machines are identical, but can I just make sure to never run on that node ever again?
- What can I do to prevent this failure from happening again?
On priorities or “quality of service” mechanisms:
- Why is my job taking so long to run? (And/or, why are my requests being throttled?)
- How can I increase my priority to avoid this throttling again?
- Why is that other user’s job running with a higher priority than mine?
- Can you please throttle their jobs which are getting an unfair priority?
- I have a critical deadline! Can you please override the priority score so my job runs immediately?
- Can you reserve some set of resources for my dedicated use for some amount of time?
- Can you provide some guarantee that my jobs will always run within a specified time?
- Can I get an interactive way to run on the machine? I don’t want to deal with writing a job script.
- (This may be requested for a single node, or at any scale up to the full cluster!)
On the development environment:
- Can you please provide package X on the system? (Where X may be something the admin team has never heard of)
- Why doesn’t the cluster provide the newest version of X? (Or an updated version of some API)
- When can I get access to the newest version?
- Why did you upgrade the cluster so quickly? Now my workflow is broken.
- Can you please go back to the version that works for my job?
Now, I’ll grant you that I can get a little snarky sometimes, but all of these questions may have some valid business reason!
Even seemingly obnoxious requests, like the user who shows up on a Friday asking for exclusive access to the full cluster, might actually turn out to be the most important thing to do in that context.
And valid or not — most of these are questions the users of any shared resource will eventually ask! I’ve run across too many cluster stacks that can’t actually inspect their priority system; don’t provide any tracing for failures; or can’t even tell the user which machines they ran on.
One way or another, you should probably be able to answer these kinds of questions, or you’ll have a lot of trouble operating your system over time.
I have a lot of friends who are struggling with focus at work given our current shelter-in-place conditions. I’m running into a little bit of this myself — I think my productivity is a little bit lower than usual, even though I’m used to working at home. But for the most part I’m doing ok getting work done.
Instead, what I’m failing at is relaxing. I’m finding it increasingly hard to focus, or enjoy myself at all, when I don’t have a clear “to-do” item in front of me.
- I’m having trouble with any kind of fiction reading, which is usually not a problem at all.
- I’ve been taking more walks, but I spend them either thinking about work or worrying about the state of the world in general.
- I can occasionally fool myself into baking, but only if I think of it as “I need this loaf of bread” instead of “it’s fun to bake!”
- While Leigh and I are slowly working our way through Star Trek: Deep Space Nine, I’ve watched it enough times that it’s as much background noise as media for me.
Instead, when I have any kind of downtime I just feel anxious. I stare at Twitter or the news, or just sit there and worry.
Needless to say this is not any good for work-life balance! I’ve been doing okay at not over-working, mostly thanks to Leigh and the cats, but I’m not sure hours of free-floating anxiety is all that much better.
Anyway. Not sure there’s a point to this, but that’s my current quarantine experience. If this sounds familiar, feel free to shoot me an email; I’m happy to chat! (Apparently I could use the distraction…)
When you operate an evolving distributed system in production for a long time, you often accumulate a runbook of weird hacks for responding to rare events.
Three examples at random:
- A service my team was on-call for would occasionally get into a specific weird state, and start intermittently dropping requests. Getting it healthy again was a complex multi-step process. It was also expensive and had its own production impact, so you didn’t want to do it by mistake!
- Setting up new clusters for a different service required building multiple databases with very specific, environment-dependent configurations.
- Another system had very complex internal state, and inspecting that state involved some fairly arcane and expensive SQL queries. We didn’t have to dig into it often, but this was needed for certain debugging and auditing processes.
Given enough years of operation and a complex enough environment, you can accumulate a long list of these kinds of rare procedures.
Fully automating these procedures is often difficult, because they might require some human inputs or judgement. This is especially true when the situation is rare and occurs only in production, so the causes are poorly understood. Faced with these problems, I’ve seen a lot of teams end up with a big pile of wiki pages instead… which are not fun to parse at 3am when prod is broken.
However, I’m a big fan of building partial automation to handle these kinds of procedures. Instead of making someone copy/paste their way through a complex wiki page at 3am, they should have a tool that can guide them through the procedure. This tool can ask for user input in the places it’s needed, and build in guard rails and confirmation prompts when you’re doing something dangerous.
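A minimal sketch of what such a guided tool might look like, assuming the procedure can be broken into discrete steps. Everything here is hypothetical and for illustration only: the `Step` type, the `run_procedure` function, and the idea of a `dangerous` flag are not taken from any real tool. The point is just that the tool narrates each step and stops for an explicit confirmation before anything with production impact.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a runbook procedure (hypothetical structure)."""
    description: str
    action: Callable[[], None]
    dangerous: bool = False  # if True, require explicit confirmation first


def run_procedure(steps: List[Step], ask=input, say=print) -> List[str]:
    """Walk an operator through the steps in order.

    Returns the descriptions of the steps that actually ran.
    `ask` and `say` are injectable so the flow can be tested without a terminal.
    """
    completed = []
    for number, step in enumerate(steps, start=1):
        say(f"Step {number}/{len(steps)}: {step.description}")
        if step.dangerous:
            answer = ask("This step has production impact. Type 'yes' to continue: ")
            if answer.strip().lower() != "yes":
                say("Aborting; no further steps were run.")
                break
        step.action()
        completed.append(step.description)
    return completed
```

In a real tool, each `action` would call into your service or its admin API, and the prompts would likely show the relevant state (which cluster, how many requests are in flight) before asking for confirmation.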
The downside to building this tooling is that you now have a whole new software project to maintain! Because in my experience, you really do have to treat this as a first-class software project in its own right, maintained alongside your production services.
To put it another way, I’m not advocating for a big pile of scripts (though that’s better than nothing…). I’m saying you should build something like a kubectl or mysqladmin for your own services.
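As a rough illustration of that shape, here is a sketch of a single admin command with subcommands, built on Python's standard `argparse`. The tool name `svcadmin`, the `status` and `recover` subcommands, and the `--cluster` and `--yes` flags are all made up for this example, and the command bodies only return a description of what they would do.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the command-line interface for a hypothetical `svcadmin` tool."""
    parser = argparse.ArgumentParser(
        prog="svcadmin", description="Admin tool for a hypothetical service"
    )
    sub = parser.add_subparsers(dest="command", required=True)

    status = sub.add_parser("status", help="inspect service state (read-only)")
    status.add_argument("--cluster", required=True)

    recover = sub.add_parser("recover", help="run the expensive recovery procedure")
    recover.add_argument("--cluster", required=True)
    recover.add_argument("--yes", action="store_true",
                         help="skip the confirmation prompt")
    return parser


def main(argv=None, ask=input) -> str:
    """Dispatch on the subcommand; returns a message instead of touching prod."""
    args = build_parser().parse_args(argv)
    if args.command == "status":
        return f"would inspect state of {args.cluster}"
    # Guard rail: the recovery has production impact, so confirm unless --yes.
    if not args.yes and ask(f"Recover {args.cluster}? [y/N] ").strip().lower() != "y":
        return "aborted"
    return f"would run recovery on {args.cluster}"
```

Keeping every rare procedure behind one entry point like this means the tool can be versioned, tested, and reviewed like any other piece of software, rather than living as loose scripts on someone's laptop.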
In the long run, though, I find that this investment really pays off. Having good tooling improves the maintainability of your systems and makes the on-call experience easier. It also translates institutional memory into code, which I’ve found makes onboarding easier and gets people more comfortable with dealing with prod.