A supportive job interview story

(adapted from an old lobste.rs comment)

My favorite interview ever was a systems interview that didn’t go as planned. This was for an SRE position, and while I expected the interview to be a distributed systems discussion, the interviewer instead wanted to talk kernel internals.

I was not at all prepared for this, and admitted it up front. The interviewer said something along the lines of, “well, why don’t we see how it goes anyway?”

He then proceeded to teach me a ton about how filesystem drivers work in Linux, by carefully leading me through the interview question he was nominally “asking” me. The interviewer was incredibly encouraging throughout, and we had a good discussion about why certain design decisions were made the way they were.

I ended the interview (a) convinced I had bombed it, but (b) having had an excellent time anyway and having learned a bunch of new things. I later learned the interviewer had recommended hiring me based on how our conversation had gone, though I didn’t end up taking the job for unrelated reasons (relocation).

I’ve given a number of similar interviews since, on system design or general sysadmin skills. I’ve always tried to go into these thinking about where I could learn, where I could teach, and how either outcome would give the candidate a chance to shine.

Developing managed vs self-hosted software

I’ve done some work lately with teams that deliver their products in very different ways, and it has me thinking about how much our “best practices” depend on a product’s delivery and operations model. I’ve had a bunch of conversations about this tension.

On the one hand, some of the teams I’ve worked with both develop and operate their own software services, and their customers (internal or external) directly use the operated service. These teams try to follow what I think of as “conventional” SaaS best practices:

  • Their development workflow prioritizes iteration speed above all else
  • They tend to deploy from HEAD, or close to it, in their source repository
    • In almost all cases, feature branches are short-lived
  • They’ve built good automated test suites and well-tuned CI/CD pipelines
  • Releases are very frequent
  • They make extensive use of observability tooling, often using third-party SaaS for this
  • Fast roll-back is prioritized over perfect testing ahead of time (sketched just after this list)
  • While their user documentation is mostly good, their operations documentation tends to be “just good enough” to onboard new team members, and a lot of it lives in Slack
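
To make that “deploy from HEAD, roll back fast” loop concrete, here’s a minimal sketch of one common shape for it. The service name, registry, and Kubernetes details are all hypothetical, not a description of any particular team’s setup:

    # Build and deploy whatever is at HEAD, with images tagged by commit
    SHA=$(git rev-parse --short HEAD)
    docker build -t registry.example.com/myapp:$SHA .
    docker push registry.example.com/myapp:$SHA
    kubectl set image deployment/myapp app=registry.example.com/myapp:$SHA

    # Rolling back does not need a new build, just a re-point
    kubectl rollout undo deployment/myapp

Because rollback is a single cheap operation, undoing a bad deploy can be faster than catching it in pre-release testing would have been.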

On the other hand, we also have plenty of customers who deploy our software to their own systems, whether in the cloud or on-premises. (Some of them don’t even connect to the Internet on a regular basis!) The development workflow for software aimed at these customers looks rather different:

  • Deploys are managed by the customer, and release cycles are longer
  • These teams do still have CI/CD and extensive automated tests… but they may also have explicit QA steps before releases
  • There tend to be lots of longer-lived version branches, and even “LTS” branches with their own roadmaps (see the sketch after this list)
  • Logging is prioritized over observability, because these teams can’t make assumptions about the customer’s tooling
  • They put a lot more effort into operational documentation, because most operators will not also be developers
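
As a sketch of what those long-lived branches can look like day to day (the branch, tag, and commit names here are all made up):

    # Cut a long-lived release branch from a tagged release
    git switch -c release-2.4 v2.4.0

    # Backport a fix from main onto the LTS line and ship it
    git switch lts-1.x
    git cherry-pick 3f2c1ab
    git tag v1.9.7
    git push origin lts-1.x v1.9.7

Each of those branches has to keep building, passing tests, and receiving security fixes, which accounts for a lot of the extra cost of this model.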

From a developer perspective, of course, this all feels much more painful! The managed service model is much more comfortable to develop for, and most of the community tooling and best practices for web development seem to optimize for that model.

But to a sysadmin who is used to mostly operating third-party software, the constraints of self-hosted development are all very familiar. And even managed service teams often depend on third-party software developed under this kind of model, running LTS releases of Linux distributions and pinning major versions of dependencies.
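
For instance, “pinning major versions” usually just means constraining dependency ranges so that only compatible updates get picked up automatically; the packages below are arbitrary examples:

    # Python: stay on the 2.x line of a library
    pip install 'requests>=2,<3'

    # npm: accept any 4.x release, never 5.x
    npm install 'express@^4'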

The biggest challenge I’ve seen, however, is when a development team tries to target the same software at both use cases. As far as I can tell, it’s very difficult to simultaneously operate a reliable service that is being continuously developed and deployed, and to provide predictable and high-quality releases to self-hosted customers.

So far, I’ve seen this tension resolved in three different ways:

  • The internal service becomes “just another customer”, operating something close to the latest external release, resulting in a slower internal release cycle
  • Fast development for the internal service gets prioritized, with external releases becoming less frequent and including bigger and bigger changes
  • Internal and external diverge completely, with separate development teams taking over (and often a name change for one of them)

I don’t really have a conclusion here, except that I don’t love any of these results. /sigh

If you’re reading this and have run into similar tensions, how have you seen them resolved? Have you seen any success stories in deploying the same code internally and externally? Or, alternatively, any interesting stories of failure to share? 😉 Feel free to send me an email; I’d be interested to hear from you.