Invest in operational tooling

When you operate an evolving distributed system in production for a long time, you often accumulate a runbook of weird hacks for responding to rare events.

Three examples at random:

  • A service my team was on-call for would occasionally get into a specific weird state, and start intermittently dropping requests. Getting it healthy again was a complex multi-step process. It was also expensive and had its own production impact, so you didn’t want to do it by mistake!
  • Setting up new clusters for a different service required building multiple databases with very specific, environment-dependent configurations.
  • Another system had very complex internal state, and inspecting that state involved some fairly arcane and expensive SQL queries. We didn’t have to dig into it often, but this was needed for certain debugging and auditing processes.

Given enough years of operation and a complex enough environment, you can accumulate a long list of these kinds of rare procedures.

Fully automating these procedures is often difficult, because they might require some human inputs or judgement. This is especially true when the situation is rare and occurs only in production, so the causes are poorly understood. Faced with these problems, I’ve seen a lot of teams end up with a big pile of wiki pages instead… which are not fun to parse at 3am when prod is broken.

However, I’m a big fan of building partial automation to handle these kinds of procedures. Instead of making someone copy/paste their way through a complex wiki page at 3am, they should have a tool that can guide them through the procedure. This tool can ask for user input in the places it’s needed, and build in guard rails and confirmation prompts when you’re doing something dangerous.
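As a rough illustration of what "partial automation" can look like, here is a minimal sketch in Python. Everything in it is hypothetical: the `Step` type, the `run_procedure` function, and the prompt wording are invented for this example and are not taken from any real tool. The idea is only that the tool walks through the steps in order and stops for an explicit confirmation before anything marked as dangerous.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One step of a runbook procedure (hypothetical structure)."""
    description: str
    action: Callable[[], None]
    dangerous: bool = False  # dangerous steps need explicit confirmation


def run_procedure(steps: List[Step], ask=input, say=print) -> List[str]:
    """Walk an operator through the steps in order.

    Returns the descriptions of the steps that actually ran. `ask` and
    `say` default to input/print and can be swapped out for testing.
    """
    completed = []
    for number, step in enumerate(steps, start=1):
        say(f"Step {number}/{len(steps)}: {step.description}")
        if step.dangerous:
            # Guard rail: anything other than a literal "yes" aborts.
            answer = ask("This step has production impact. Type 'yes' to continue: ")
            if answer.strip().lower() != "yes":
                say("Stopping here; no further steps were run.")
                break
        step.action()
        completed.append(step.description)
    return completed
```

An operator would build a list of `Step` objects for a given procedure and call `run_procedure(steps)`. Passing `ask` and `say` as parameters is just one way to make the prompts testable without a terminal.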

The downside to building this tooling is that you now have a whole new software project to maintain! In my experience, you really do have to treat this as a first-class software project in its own right, maintained alongside your production services.

To put it another way, I’m not advocating for a big pile of scripts (though that’s better than nothing…). I’m saying you should build something like a kubectl or mysqladmin for your own services.
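To make the "one tool with subcommands" shape concrete, here is a small sketch using Python's standard argparse module. The tool name `svcadmin`, its `inspect-state` and `recover` subcommands, and the `--cluster` and `--yes` flags are all made up for illustration, and the handlers only return a description of what a real tool might do.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the parser for a hypothetical `svcadmin` tool."""
    parser = argparse.ArgumentParser(
        prog="svcadmin",
        description="Operational tooling for one service (illustrative only).",
    )
    subcommands = parser.add_subparsers(dest="command", required=True)

    inspect = subcommands.add_parser(
        "inspect-state", help="run the read-only state queries")
    inspect.add_argument("--cluster", required=True)

    recover = subcommands.add_parser(
        "recover", help="run the expensive recovery procedure")
    recover.add_argument("--cluster", required=True)
    recover.add_argument(
        "--yes", action="store_true",
        help="confirm that production impact is acceptable")
    return parser


def dispatch(argv) -> str:
    """Parse argv and return a message describing what would happen."""
    args = build_parser().parse_args(argv)
    if args.command == "inspect-state":
        return f"would run state queries against {args.cluster}"
    # Only "recover" remains; refuse unless the operator confirmed.
    if not args.yes:
        return f"refusing to recover {args.cluster} without --yes"
    return f"would start recovery on {args.cluster}"
```

For example, `dispatch(["recover", "--cluster", "staging"])` returns the refusal message, while adding `--yes` lets the command proceed. Putting every procedure behind one entry point also gives people a single `--help` to explore instead of a directory of scripts.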

In the long run, though, I find that this investment really pays off. Having good tooling improves the maintainability of your systems and makes the on-call experience easier. It also translates institutional memory into code, which I’ve found makes onboarding easier and gets people more comfortable with dealing with prod.

Practices of an intermittent developer

Hillel Wayne published a post yesterday on “The Hard Part of Learning a Language”, about all the little “getting started” challenges of learning a new programming language. It resonated with me so much, because I find myself going through this process pretty frequently.

I sometimes describe myself as an “intermittent” software developer, though really I’ve never worked as a developer: I’ve spent most of my career as either a scientist or in operational and support roles. (SRE, sysadmin, pick your job title…) While I’ve written code nearly every day for over a decade, I’ve rarely spent more than a few weeks at a time working on any given piece of software.

Instead, I’ve mostly worked on operational tooling and low-maintenance microservices, or written “one-off” code to support an analysis or reproduce an issue. I also spend a lot of time working on other people’s code, but mostly in the context of “fix the damn thing!” The result of this pattern is that I:

  • Frequently switch languages
  • Spend a lot more time reading and analyzing software than writing it
  • Often have weeks or months go by since the last time I touched a language or service
  • Rarely get to become deeply immersed in a given language’s idioms or practices

Because of this, I keep finding that the languages I like best are those that are relatively easy to put down for a while, and pick up again without a ton of friction. This isn’t exactly the same as having an easy learning curve, but more that they don’t require reloading a lot of mental context which is unique to them. The languages I like tend to have:

  • Large standard libraries
  • Minimal need for IDE support or editor plugins
  • Consistent community coding styles, and/or widely-used auto-formatting tools
  • Strong backward compatibility
  • Good documentation
  • Decent integration with Linux distro package managers
  • Communities that converge on “one way to do it” solutions, and make it obvious what they are!

So, for example, I’m a pretty big fan of Go. It’s not very interesting, and I find writing it a bit repetitive (if err != nil ...). But I can go six months without writing any Go, sit down to fix a bug in a project I’ve never worked on before, and generally expect to get my bearings fast. I also tend to like Python a lot, despite some messy spots, because I can almost always work within a pretty stable core consisting of the standard library and a few large, stable packages.

The biggest downside, though, is that I frequently bounce off of languages that I think are exciting but that feel like they’d require too much consistent attention to keep up with. For example, I think Rust is one of the most interesting languages out there today… but I’ve been challenged by the combination of a small standard library and relatively fast pace of change (in the ecosystem, not the language!). That combination makes me skeptical that I could follow any kind of “intermittent” pattern with Rust; I feel like I would keep getting lost every time I came back!

To be clear, I don’t think this means the languages I have trouble with should change! They’re clearly really successful, and many are doing really interesting things.

But I do think there’s a lot of value in building tools that are “low-maintenance”, and that language stability has a lot going for it. Without doing a real analysis, I suspect that communities with a lot of part-time developers will often gravitate to languages that change slowly. Certainly scientific computing seems to write a lot of Python, C++, and Fortran — and older versions of those languages at that! And the SRE community definitely publishes a lot of Go.

Then again, maybe I’m wrong! Are there any fast-changing languages popular with part-time developers? Feel free to shoot me an email and let me know. 🙂