Why generic software design advice is often useless

In You can’t design software you don’t work on, Sean Goedecke discusses why generic advice on the design of software systems is often unhelpful.

When you’re doing real work, concrete factors dominate generic factors. Having a clear understanding of what the code looks like right now is far, far more important than having a good grasp on general design patterns or principles.

This tracks with my experience not just of software systems, but also systems with a hardware component (eg ML training clusters) or a facility component (eg datacenters). The specifics of your system absolutely dominate any general design guidance.

As the manager of a team that publishes reference architectures, I do think it's helpful to clearly understand where your specific design differs from generic advice. If you're going off the beaten path, you should know you're doing that! And you should be able to plan for any additional validation that entails.

But relatedly, this is part of why I think that any generic advice should be based on some actually existing system. If you are telling someone they should follow a given principle, you should be able to point to an implementation that does follow that principle.

Or else you’re just speculating into the void. Which admittedly can be fun but is not nearly as valuable as speaking from experience.

Large software systems

In Nobody understands how large software products work, Sean Goedecke makes a number of good points about how difficult it is to really grasp large software systems.

In particular, some features impact every part of the system in unforeseen ways:

Why are these features complicated? Because they affect every single other feature you build. If you add organizations and policy controls, you must build a policy control for every new feature you add. If you localize your product, you must include translations for every new feature. And so on. Eventually you’re in a position where you’re trying to figure out whether a self-hosted enterprise customer in the EU is entitled to access a particular feature, and nobody knows – you have to go and read through the code or do some experimenting to figure it out.

Sean also points out that eventually the code itself has to be the source of truth, and debugging requires deep investigation of the continually-changing system.

I’ve seen this happen in a bunch of different orgs, and it does seem to be true, especially for products with a large number of collaborating teams. I would add that in addition to reading the code itself, you often need to have conversations with the relevant teams to discern intent and history. Documentation only goes so far; eventually you need to talk to people.

The trap of prioritizing impact

(I wrote this originally as a comment in RLS in response to a staff-level engineer who was frustrated at how little they got to code anymore, and it resonated with enough folks that maybe it’s worth sharing here!)

There’s a trap I’ve seen a lot of staff+ folks fall into where they over-prioritize the idea that they should always be doing “the right, most effective thing for the company”. When I see engineers complain that they don’t get to code enough, I often suspect they’ve fallen prey to this.

I call it a trap because I see people do this at the expense of their own job satisfaction and growth, which is bad both for them and (eventually) for the company, which is likely to lose them.

I don’t blame people for falling into this trap, it’s what we’re rewarded for. I’ve fallen into it! I have stopped doing technical work I cared about, prioritized #impact, and fought fires wherever they arose. I have spent all my time mentoring and teaching and none coding. The result was often grateful colleagues, but also burnout and leaving jobs I otherwise liked.

Whereas when I’ve allowed myself to be like 30% selfish — picking some of my work because it was fun and technical, even when doing so was not the “most impactful” thing I could do — I was happier, learned more, and stayed in roles longer.

An example: I worked on a team that was doing capacity planning poorly and was buying too much hardware. (On-prem, physical hardware.) I could have solved the problem with a spreadsheet, but that was boring and made my soul hurt.

What I did instead was dig into how our container scheduling platform worked, and wrote a nifty little CLI tool that would look at the team’s configured workloads and spit out a capacity requirement calculation. It took about three times as long as the spreadsheet would have, but it was fun and accomplished the same goal and gave me some experience in the container platform. And it wasn’t that much of a time sink.
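The tool itself isn't public, but the core calculation was roughly this shape. This is a minimal sketch, not the actual tool: the workload format, field names, and flat 30% headroom factor are all illustrative assumptions, standing in for whatever the real container platform's workload specs exposed.

```python
# Hypothetical sketch of a capacity-requirement calculation like the one
# described above. Assumes each workload declares per-replica CPU (cores)
# and memory (GiB) requests plus a replica count -- a simplified stand-in
# for a real scheduler's workload spec.

def required_capacity(workloads, headroom=0.3):
    """Sum per-replica requests across all workloads, then add a flat
    headroom factor for failover and growth. Returns (cpu_cores, mem_gib)."""
    cpu = sum(w["cpu_per_replica"] * w["replicas"] for w in workloads)
    mem = sum(w["mem_gib_per_replica"] * w["replicas"] for w in workloads)
    return cpu * (1 + headroom), mem * (1 + headroom)

if __name__ == "__main__":
    # Example inputs; a real tool would read these from the platform's config.
    workloads = [
        {"name": "api", "cpu_per_replica": 2, "mem_gib_per_replica": 4, "replicas": 10},
        {"name": "worker", "cpu_per_replica": 4, "mem_gib_per_replica": 8, "replicas": 5},
    ]
    cpu, mem = required_capacity(workloads)
    print(f"Need {cpu:.0f} cores, {mem:.0f} GiB RAM")  # -> Need 52 cores, 104 GiB RAM
```

The spreadsheet version would have encoded the same arithmetic; the win of the tool was reading the live workload configs instead of hand-entered numbers.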

Was that better for the company? No idea. I hope it was — I hear the tool is still maintained and no one has replaced it with a spreadsheet yet! But that’s a happy accident.

Was it better for me? Absolutely! It was a bit selfish, but it made an otherwise tedious task more fun and I learned some useful tricks.

So — if you wish you had more time to code… go code a bit more. Don’t let the idea of being more effective guilt you into giving it up. Your career is your career and you should enjoy it.

The HPC cluster as a reflection of values

Yesterday while I was cooking dinner, I happened to re-watch Bryan Cantrill’s talk on “Platform as a Reflection of Values”. (I watch a lot of tech talks while cooking or baking — I often have trouble focusing on a video unless I’m doing something with my hands, but if I know a recipe well I can often make it on autopilot.)

If you haven’t watched this talk before, I encourage you to check it out. Cantrill gave it in part to talk about why the node.js community and Joyent didn’t work well together, but I thought he had some good insights into how values get built into a technical artifact itself, as well as how the community around those artifacts will prioritize certain values.

While I was watching the talk (and chopping some vegetables), I started thinking about what values are most important in the “HPC cluster platform”.

Continue reading

Adam’s weekly update, 2022-12-04

What’s new

This week was really intense from a work perspective. Not “bad intense”, but the kind of week where every day was spent with such a level of focus that at 5 PM or so I found myself staring off into space and forgetting words. I think I got some good things accomplished, but my brain also felt like mush by the time the weekend came.

Continue reading

happy living close (-ish) to the metal

For various reasons, I’ve been doing a little bit of career introspection lately. One of the interesting realizations to come out of this is that, despite in practice doing mostly software work, I’ve been happiest when my work involved a strong awareness of the hardware I was running on.

Continue reading

The web services I self-host

Why self-host anything?

In a lot of ways, self-hosting web services is signing up for extra pain. Most useful web services are available in SaaS format these days, and most people don’t want to be a sysadmin just to use chat, email, or read the news.

In general, I decide to self-host a service if one of two things is true:

Continue reading

Interesting links I clicked this week

I watched several really interesting talks from SRECon22 Americas this week, and in particular I’d like to highlight:

  • Principled Performance Analytics, Narayan Desai and Brent Bryan from Google. Some interesting thoughts on quantitative analysis of live performance data for monitoring and observability purposes, moving past simple percentile analysis.
  • The ‘Success’ in SRE is Silent, Casey Rosenthal from Verica.io. Interesting thoughts here on the visibility of reliability, qualitative analysis of systems, and why regulation and certification might not be the right thing for web systems.
  • Building and Running a Diversity-focused Pre-internship program for SRE, from Andrew Ryan at Meta. Some good lessons learned from the first year of an early-career, internship-like program.
  • Taking the 737 to the Max, Nickolas Means from Sym. A really interesting analysis of the Boeing 737 Max failures from both a technical and cultural perspective, complete with some graph tracing to understand failure modes.

I also ran across some other articles that I’ve been actively recommending and sharing with friends and colleagues, including:

  • Plato’s Dashboards, Fred Hebert at Honeycomb. This article has some great analysis of how easily-measurable metrics are often poor proxies for the information we’re actually interested in, and discusses qualitative research methods as a way to gain more insight.
  • The End of Roe Will Bring About A Sea Change In The Encryption Debate, Riana Pfefferkorn from the Stanford Internet Observatory. You should absolutely go read this article, but to sum up: Law enforcement in states that ban abortion is now absolutely part of the threat model that encrypted messaging defends against. No one claiming to be a progressive should be arguing in favor of “exceptional access” or other law enforcement access to encryption.