Developing managed vs self-hosted software

I’ve done some work lately with teams that deliver their products in very different ways, and it has me thinking about how much our ā€œbest practicesā€ depend on a product’s delivery and operations model. I’ve had a bunch of conversations about this tension.

On the one hand, some of the teams I’ve worked with build software services that are developed and operated by the same team, and where the customers (internal or external) directly make use of the operated service. These teams try to follow what I think of as “conventional” SaaS best practices:

  • Their development workflow prioritizes iteration speed above all else
  • They tend to deploy from HEAD, or close to it, in their source repository
    • In almost all cases, branches are short-lived for feature development
  • They’ve built good automated test suites and well-tuned CI/CD pipelines
  • Releases are very frequent
  • They make extensive use of observability tooling, often using third-party SaaS for this
  • Fast roll-back is prioritized over perfect testing ahead of time
  • While their user documentation is mostly good, their operations documentation tends to be “just good enough” to onboard new team members, and a lot of it lives in Slack

On the other hand, we also have plenty of customers who deploy our software to their own systems, whether in the cloud or on-premises. (Some of them don’t even connect to the Internet on a regular basis!) The development workflow for software aimed at these customers looks rather different:

  • Deploys are managed by the customer, and release cycles are longer
  • These teams do still have CI/CD and extensive automated tests… but they may also have explicit QA steps before releases
  • There tend to be lots of longer-lived version branches, and even “LTS” branches with their own roadmaps
  • Logging is prioritized over observability, because these teams can’t make assumptions about what tooling their customers run
  • They put a lot more effort into operational documentation, because most operators will not also be developers

From a developer perspective, of course, this all feels much more painful! The managed service use case feels much more comfortable to develop for, and most of the community tooling and best practices for web development seem to optimize for that model.

But to a sysadmin who is used to operating mostly third-party software, the constraints of self-hosted development are all very familiar. And even managed service teams often depend on third-party software developed under this kind of model, pinning major versions of dependencies and sticking to LTS releases of Linux distributions.

The biggest challenge I’ve seen, however, is when a development team tries to target the same software at both use cases. As far as I can tell, it’s very difficult to simultaneously operate a reliable service that is being continuously developed and deployed, and to provide predictable and high-quality releases to self-hosted customers.

So far, I’ve seen this tension resolved in three different ways:

  • The internal service becomes “just another customer”, operating something close to the latest external release, resulting in a slower release cycle for the internal service
  • Fast development for the internal service gets prioritized, with external releases becoming less frequent and including bigger and bigger changes
  • Internal and external diverge completely, with separate development teams taking over (and often a name change for one of them)

I don’t really have a conclusion here, except that I don’t really love any of these results. /sigh

If you’re reading this and have run into similar tensions, how have you seen this resolved? Have you seen any success stories in deploying the same code internally and externally? Or alternatively — any interesting stories of failure to share? šŸ˜‰ Feel free to send me an email, I’d be interested to hear from you.

Toy programs for learning a new language

It used to be that I’d get interested in a new programming language but wouldn’t have a project to use it for, and I’d have trouble knowing how to start. I have trouble really grasping a new language without building something in it, and ā€œX by exampleā€ guides or working through a book don’t really do the job.

What’s helped me lately is to build an array of “standard” toy programs that I understand reasonably well, and that I can use to explore the new language and figure out how to do something real in it.

Right now, my toy program collection consists of:

  • A link shortening service, like bit.ly or tinyurl, along with an HTTP API for adding and removing links
  • A 2D diffusion simulation
  • A “system package inventory” program, that builds a list of all the RPMs/DEBs installed on a Linux machine and shoves them into a SQLite database

This is almost never what I’d call production-quality code. For example, when I’m writing these toy programs, I rarely write unit tests (until I start exploring the test libraries for the language!). But they’re still very valuable learning tools, and give me space to explore some very different use-cases.

I almost always write all three in a given language, but the order depends a lot on what I think the new language will be good for. For example, I’ll write the “system package inventory” program first if I think the new language might be handy for system administration tools. It’s a great way to see how well the language plays with a common Linux environment, how painful it is to use SQLite, and to get practice writing CLI tools in it. I’ll often augment the basic “scan and store” functionality with a CLI to do frequent queries, like “on what date was this package last upgraded”.
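To give a flavor of it, here’s a minimal sketch of the ā€œscan and storeā€ core, written in Go since that’s a language I reach for often. The table schema and the dpkg-query invocation are illustrative assumptions, not a finished design:

```go
// pkginv: a toy "system package inventory" sketch.
// Assumes a Debian-ish host (dpkg-query) and the mattn/go-sqlite3 driver;
// an RPM system would swap in `rpm -qa --qf '%{NAME}\t%{VERSION}\n'`.
package main

import (
	"bufio"
	"database/sql"
	"log"
	"os/exec"
	"strings"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "packages.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One row per package per scan, so "when did this package's
	// version last change?" falls out of plain SQL over the history.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS packages (
		name TEXT, version TEXT, scanned_at TIMESTAMP)`)
	if err != nil {
		log.Fatal(err)
	}

	out, err := exec.Command("dpkg-query", "-W", "-f", "${Package}\t${Version}\n").Output()
	if err != nil {
		log.Fatal(err)
	}

	now := time.Now()
	scanner := bufio.NewScanner(strings.NewReader(string(out)))
	for scanner.Scan() {
		fields := strings.SplitN(scanner.Text(), "\t", 2)
		if len(fields) != 2 {
			continue
		}
		if _, err := db.Exec(
			`INSERT INTO packages (name, version, scanned_at) VALUES (?, ?, ?)`,
			fields[0], fields[1], now); err != nil {
			log.Fatal(err)
		}
	}
}
```

Once a few scans have accumulated, queries like ā€œon what date was this package last upgradedā€ are just SQL over the history table.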

On the other hand, if I think I’m going to use the new language for a bunch of numerical work, I’ll start with the diffusion simulation. When I write that, I often start with a naive implementation and then start playing with profilers and other performance tools, or try to parallelize the simulation. This is also a great excuse to dig into any plotting tools commonly used with the language.
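For reference, the naive starting point is only a few dozen lines. Here’s a sketch in Go; the grid size, diffusion coefficient, and time step are arbitrary choices for illustration:

```go
// A deliberately naive 2D diffusion (heat equation) step:
// explicit finite differences on a square grid.
package main

import "fmt"

const (
	n     = 256 // grid points per side
	alpha = 0.1 // diffusion coefficient
	dt    = 0.1 // time step
	dx    = 1.0 // grid spacing
	steps = 1000
)

func main() {
	grid := make([][]float64, n)
	next := make([][]float64, n)
	for i := range grid {
		grid[i] = make([]float64, n)
		next[i] = make([]float64, n)
	}
	grid[n/2][n/2] = 1.0 // a point source in the middle

	for s := 0; s < steps; s++ {
		for i := 1; i < n-1; i++ {
			for j := 1; j < n-1; j++ {
				// 5-point Laplacian stencil
				lap := (grid[i-1][j] + grid[i+1][j] +
					grid[i][j-1] + grid[i][j+1] -
					4*grid[i][j]) / (dx * dx)
				next[i][j] = grid[i][j] + alpha*dt*lap
			}
		}
		grid, next = next, grid
	}
	fmt.Println("center value:", grid[n/2][n/2])
}
```

That triple-nested loop is exactly the kind of target that makes the profiling and parallelization experiments worthwhile.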

These toy programs are also handy if I want to explore new ways to integrate a service into a larger production environment. For example, I might start with the link shortening service, deploying the service itself statelessly and persisting the list of links into a PostgreSQL DB (there’s a sketch of this baseline after the list below). Then I start complicating things…

  • Let’s add logging!
  • And tracing!
  • It’s always a good idea to expose Prometheus metrics
  • And wouldn’t it be handy to support multiple database backends?
  • Now wrap it all in a Helm chart for handy deployment
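Before any of those complications, the baseline really is tiny. Here’s a hedged sketch in Go using the standard library plus a Postgres driver; the lib/pq import, table layout, and endpoints are assumptions for illustration:

```go
// A minimal, stateless link-shortener sketch: all state lives in
// PostgreSQL, so you can run as many replicas as you like.
package main

import (
	"database/sql"
	"log"
	"net/http"
	"os"

	_ "github.com/lib/pq"
)

func main() {
	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}

	// POST /links with code= and url= form values -> create a link
	http.HandleFunc("/links", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		code, url := r.FormValue("code"), r.FormValue("url")
		if _, err := db.Exec(
			`INSERT INTO links (code, url) VALUES ($1, $2)`, code, url); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusCreated)
	})

	// GET /{code} -> redirect to the stored URL
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		var url string
		err := db.QueryRow(
			`SELECT url FROM links WHERE code = $1`, r.URL.Path[1:]).Scan(&url)
		if err != nil {
			http.NotFound(w, r)
			return
		}
		http.Redirect(w, r, url, http.StatusFound)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Each bullet above then becomes a focused exercise layered onto this skeleton, which keeps the new-to-me parts (the tracing library, the metrics endpoint, the Helm chart) isolated from logic I already understand.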

I imagine I’m not the only person to have a standard collection of learning projects for new languages. If you do this too, what does your project list look like?

SRE to Solutions Architect

It’s been about two years since I joined NVIDIA as a Solutions Architect, which was a pretty big job change for me! Most of my previous work was in jobs that could fall under the heading of ā€œsite reliability engineeringā€, where I was actively responsible for the operations of computing systems, but my new job mostly has me helping customers design and build their own systems.

I’m finally starting to feel like I know what I’m doing at least 25% of the time, so I thought this would be a good time to reflect on the differences between these roles and what my past experience brings to the table for my (sort of) new job.

Continue reading

Sketching out HPC clusters at different scales

High-performance computing (HPC) clusters come in a variety of shapes and sizes, depending on the scale of the problems you’re working on, the number of different people using the cluster, and what kinds of resources they need to use.

However, it’s often not clear what kinds of differences separate the kind of cluster you might build for your small research team:

[Photo: my toy Raspberry Pi cluster. Note: do not use in production]

From the kind of cluster that might serve a large laboratory with many different researchers:

[Photo: the Trinity supercomputer at Los Alamos National Lab, also known as ā€œthat goddamn machineā€ when I used to get paged at 3am]

There are lots of differences between a supercomputer and my toy Raspberry Pi cluster, but also a lot in common. From a management perspective, a big part of the difference is how many different specialized node types you might find in the larger system.

Continue reading

Handy utilities for every HPC cluster

I’ve built a lot of HPC clusters, and they’ve often looked very different from each other depending on the particular hardware and target applications. But I almost always find myself installing a few common tools on them, to make their management easier, so I thought I’d share the list.

Continue reading

My default technology choices

I’ve written several partial versions of this post in various emails and Slack posts, and finally decided I should just put it on the blog.

The tech landscape is complex and picking the right tool is hard, but the vast majority of problems can be solved in a ā€œgood enoughā€ way using a wide variety of tools. The best choice is usually the one you know well already. So I tend to think most developers should have a ā€œdefault tech stackā€ that they use for most things, only switching when the problem constraints or early experience dictate otherwise.

And here’s mine! This is the list of tools I usually start with, and use most frequently in production. I will frequently adjust some part of this list for any given project, but I find these are usually useful choices. I don’t expect any of these to be very surprising, but I think there’s some value in writing them down.

Continue reading

Some thoughts after reading Vincenti’s “What Engineers Know and How They Know It”

A few weeks ago I watched Hillel Wayne’s recent talk “Are we really engineers?”, where he looked at the idea of whether software engineers get to call themselves “engineers” or not. (Spoiler: the answer is yes!)

During the Q&A, Wayne mentioned that while he had seen a lot of “philosophy of science”, there didn’t seem to be much “philosophy of engineering” out there. I remembered noticing the same thing, and on Twitter I asked for book recommendations on the topic. The always-reliable Lorin Hochstein obliged, and a week later I had some reading to do!

Just as a disclaimer: this post is very much in the spirit of ā€œthinking out loudā€, and it got a little long. šŸ™‚ This is mostly me discussing my experience of reading the book and some thoughts on software engineering I had after finishing it. Very likely nothing here is at all original, and I am not an expert, but I wanted to get my ideas down in text after finishing the read. And having done so, I thought it might be worthwhile to share.

Ok, let’s dive in.

Continue reading

Invest in operational tooling

When you operate an evolving distributed system in production for a long time, you often accumulate a runbook of weird hacks for responding to rare events.

Three examples at random:

  • A service my team was on-call for would occasionally get into a specific weird state, and start intermittently dropping requests. Getting it healthy again was a complex multi-step process. It was also expensive and had its own production impact, so you didn’t want to do it by mistake!
  • Setting up new clusters for a different service required building multiple databases with very specific, environment-dependent configurations.
  • Another system had very complex internal state, and inspecting that state involved some fairly arcane and expensive SQL queries. We didn’t have to dig into it often, but this was needed for certain debugging and auditing processes.

Given enough years of operation and a complex enough environment, you can accumulate a long list of these kinds of rare procedures.

Fully automating these procedures is often difficult, because they might require human input or judgement. This is especially true when the situation is rare and occurs only in production, so the causes are poorly understood. Faced with these problems, I’ve seen a lot of teams end up with a big pile of wiki pages instead… which are not fun to parse at 3am when prod is broken.

However, I’m a big fan of building partial automation to handle these kinds of procedures. Instead of making someone copy/paste their way through a complex wiki page at 3am, they should have a tool that can guide them through the procedure. This tool can ask for user input in the places it’s needed, and build in guard rails and confirmation prompts when you’re doing something dangerous.
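The skeleton of such a tool doesn’t need to be fancy. Here’s a sketch in Go of the shape I mean; the steps and names are invented for illustration:

```go
// Sketch of a "guided runbook" tool: partial automation with human
// checkpoints. The steps and names here are made up; the point is
// the shape, not the specifics.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

var stdin = bufio.NewReader(os.Stdin)

// confirm is the guard rail: nothing dangerous runs without a yes.
func confirm(prompt string) bool {
	fmt.Printf("%s [y/N] ", prompt)
	line, _ := stdin.ReadString('\n')
	return strings.TrimSpace(strings.ToLower(line)) == "y"
}

// ask collects the human judgement the automation can't supply.
func ask(prompt string) string {
	fmt.Printf("%s: ", prompt)
	line, _ := stdin.ReadString('\n')
	return strings.TrimSpace(line)
}

func main() {
	cluster := ask("Which cluster is affected")

	fmt.Println("Step 1: draining traffic away from", cluster)
	if !confirm("This will impact production capacity. Continue?") {
		log.Fatal("aborted before drain; no changes made")
	}
	// ... call the real drain automation here ...

	fmt.Println("Step 2: restarting the stuck workers")
	if !confirm("Workers drained and quiet? OK to restart?") {
		log.Fatal("aborted before restart; remember to undrain " + cluster)
	}
	// ... restart logic here ...

	fmt.Println("Done. Undrain", cluster, "once health checks pass.")
}
```

The real versions grow subcommands, logging, and dry-run flags, but the core pattern stays the same: automate the mechanical parts, and stop for a human wherever judgement is required.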

The downside to building this tooling is that you now have a whole new software project to maintain! In my experience, you really do have to treat it as a first-class project in its own right, maintained alongside your production services.

To put it another way, I’m not advocating for a big pile of scripts. (though that’s better than nothing…) I’m saying you should build something like a kubectl or mysqladmin for your own services.

In the long run, though, I find that this investment really pays off. Having good tooling improves the maintainability of your systems and makes the on-call experience easier. It also translates institutional memory into code, which I’ve found makes onboarding easier and gets people more comfortable with dealing with prod.

Practices of an intermittent developer

Hillel Wayne published a post yesterday on “The Hard Part of Learning a Language”, about all the little “getting started” challenges of learning a new programming language. It resonated with me so much, because I find myself going through this process pretty frequently.

I sometimes describe myself as an ā€œintermittentā€ software developer, though really I’ve never worked as a developer: I’ve spent most of my career either as a scientist or in operational and support roles. (SRE, sysadmin, pick your job title…) While I’ve written code nearly every day for over a decade, I’ve rarely spent more than a few weeks at a time working on any given piece of software.

Instead, I’ve mostly worked on operational tooling and low-maintenance microservices, or written ā€œone-offā€ code to support an analysis or reproduce an issue. I also spend a lot of time working on other people’s code, but mostly in the context of ā€œfix the damn thing!ā€ The result of this pattern is that I:

  • Frequently switch languages
  • Spend a lot more time reading and analyzing software than writing it
  • Often go weeks or months without touching a given language or service
  • Rarely get to become deeply immersed in a given language’s idioms or practices

Because of this, I keep finding that the languages I like best are those that are relatively easy to put down for a while and pick up again without a ton of friction. This isn’t exactly the same as having a gentle learning curve; it’s more that they don’t require reloading a lot of language-specific mental context. The languages I like tend to have:

  • Large standard libraries
  • Minimal need for IDE support or editor plugins
  • Consistent community coding styles, and/or widely-used auto-formatting tools
  • Strong backward compatibility
  • Good documentation
  • Decent integration with Linux distro package managers
  • Communities that converge on “one way to do it” solutions, and make it obvious what they are!

So, for example, I’m a pretty big fan of Go. It’s not very interesting, and I find writing it a bit repetitive (if err != nil ...). But I can go six months without writing any Go, sit down to fix a bug in a project I’ve never worked on before, and generally expect to get my bearings fast. I also tend to like Python a lot, despite some messy spots, because I can almost always work within a pretty stable core consisting of the standard library and a few large, stable packages.

The biggest downside, though, is that I frequently bounce off of languages that I think are exciting but feel like they’d require too much consistent attention to keep up with. For example, I think Rust is one of the most interesting languages out there today… but I’ve been challenged by the combination of a small standard library and relatively fast pace of change (in the ecosystem, not the language!). That combination makes me skeptical that I could follow any kind of “intermittent” pattern with Rust; I feel like I would keep getting lost every time I came back!

To be clear, I don’t think this means the languages I have trouble with should change! They’re clearly really successful, and many are doing really interesting things.

But I do think there’s a lot of value in building tools that are “low-maintenance”, and that language stability has a lot going for it. Without doing a real analysis, I suspect that communities with a lot of part-time developers will often gravitate to languages that change slowly. Certainly scientific computing seems to write a lot of Python, C++, and Fortran — and older versions of those languages at that! And the SRE community definitely publishes a lot of Go.

Then again, maybe I’m wrong! Are there any fast-changing languages popular with part-time developers? Feel free to shoot me an email and let me know. šŸ™‚