Quick review: “Built”, by Roma Agrawal

Built: The Hidden Stories Behind Our Structures was a pleasure to read. Authored by Roma Agrawal, a structural engineer who worked on the Shard skyscraper in London, it’s truly a love letter to her profession, and to the structures it involves.

Throughout the book, Agrawal describes a wide variety of engineered structures around the world. From skyscrapers to bridges, from clay bricks to steel, her writing goes into detail on the materials and techniques used to build the constructed world we all walk through.

In each case, Agrawal finds fascinating stories to tell about the cultures and individuals who pioneered various techniques. And throughout the book, it’s clear how much enthusiasm and love she brings to her work.

As someone trained in physics and materials science, I was especially happy to find such an engaging read that talks about such properties as ductility, elasticity, toughness, compression, and tension. It’s a rare thing to find someone who can describe how fun these topics can be.

While Agrawal occasionally touches on her difficulties working in structural engineering, as someone who isn’t “traditionally” white and male, she doesn’t dwell on these concerns. To the extent I have a criticism of this work, it’s that she doesn’t spend more time on how she experiences being a professional engineer — including both the positive and negative aspects of the profession.

However, that’s a mild criticism, as that’s clearly not what the book is about. As I said above — this isn’t a memoir, it’s a love letter to her work. And to the extent it is that, it succeeds very well.

An unstructured rant on running long-lived software services

⁃ Be kind to your colleagues. Be kind to your users. Be kind to yourself. This is a long haul and you’ll all fuck up.

⁃ The natural environment for your code is production. It will run there longer than it does anywhere else. Design for prod first, and if possible, make your dev environment act like prod.

⁃ Legacy code is the only code worth caring about.

⁃ Users do weird stuff, but they usually have a very good reason, at least in their context. Learn from them.

⁃ It’s 2022, please do structured logging.
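If you’re starting from plain-text logs, the switch can be as small as a custom formatter. Here’s a minimal sketch using only Python’s standard library; the field names (“ts”, “level”, “msg”) are my own choice, not any standard:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Let callers attach arbitrary structured fields via `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("myservice")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request handled", extra={"fields": {"user": "alice", "status": 200}})
```

In practice you’d probably reach for a library like structlog instead, but the point is that every line becomes machine-parseable from day one.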

⁃ Contexts and tracing make everyone’s lives easier when it comes time to debug. At minimum, include a unique request id with every request and plumb it through the system.
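One low-ceremony way to plumb a request id through a Python service is `contextvars`, so deeply nested code can log the id without passing it as an argument everywhere. A sketch, with an illustrative id format and field name:

```python
import contextvars
import uuid

# The current request's id; "-" means "no request in flight".
request_id = contextvars.ContextVar("request_id", default="-")

def start_request():
    """Call at the edge of the system; return the id so you can echo
    it back to the client and match their reports to your logs."""
    rid = uuid.uuid4().hex
    request_id.set(rid)
    return rid

def log(msg):
    # Every log line carries the id, no matter how deep the call stack.
    print(f"request_id={request_id.get()} msg={msg}")

rid = start_request()
log("starting work")
```

With async frameworks, `contextvars` also does the right thing per-task, which is exactly what you want for tracing.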

⁃ Do your logging in a separate thread. It sucks to find a daemon blocked and hanging because of a full disk or a down syslog server.
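Python’s standard library supports this pattern directly with the `QueueHandler`/`QueueListener` pair: the request thread only enqueues records, and a dedicated listener thread does the slow I/O. A sketch:

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded; consider a bound plus a drop policy
queue_handler = logging.handlers.QueueHandler(log_queue)

# The real (potentially slow) handler runs in the listener's thread,
# so a full disk or dead syslog server blocks only that thread.
slow_handler = logging.StreamHandler()
listener = logging.handlers.QueueListener(log_queue, slow_handler)
listener.start()

log = logging.getLogger("daemon")
log.addHandler(queue_handler)
log.setLevel(logging.INFO)

log.info("this enqueues and returns immediately")
listener.stop()  # flushes any remaining records at shutdown
```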

⁃ Don’t page for individual machines going down. Do provide an easy or automated way for bad nodes to get thrown out of the system.

⁃ Be prepared for your automation to be the problem, and include circuit breakers or kill switches to stop it. I’ve seen health checks that started flagging every machine in the fleet as bad, whether it was healthy or not. We didn’t bring down prod because the code assumed if it flagged more than 15% of the fleet as bad, the problem was probably with the test, not the service.
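The guard in that story fits in a few lines. Here’s a sketch using the 15% threshold from the incident above; everything else is an illustrative name, not a real API:

```python
BAD_FRACTION_LIMIT = 0.15

def nodes_to_evict(fleet, is_healthy):
    """Return the nodes to remove, unless the check itself looks broken."""
    bad = [node for node in fleet if not is_healthy(node)]
    if len(bad) > BAD_FRACTION_LIMIT * len(fleet):
        # It's far more likely the health check is wrong than that 15%
        # of the fleet died at once: do nothing and page a human.
        raise RuntimeError(
            f"refusing to evict {len(bad)}/{len(fleet)} nodes; "
            "suspect the health check is broken"
        )
    return bad
```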

⁃ Make sure you have a way to know who your users are. If you allow anonymous access, you’ll discover in five years that a business-critical team you’ve never heard of is relying on you.

⁃ Make sure you have a way to turn off access for an individual machine, user, etc. If your system does anything more expensive than sending network requests, it will be possible for a single bad client to overwhelm a distributed system with thousands of servers. Turning off their access is easier than begging them to stop.
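The mechanism can be as simple as a blocklist checked at admission, before any expensive work happens. A sketch; the in-memory set here stands in for whatever config file or flag service you’d actually use:

```python
# Populated from config; in real life this is reloadable at runtime.
BLOCKED_CLIENTS = {"runaway-batch-job"}

class ClientBlockedError(Exception):
    pass

def admit(client_id):
    """Reject blocked clients before doing any expensive work."""
    if client_id in BLOCKED_CLIENTS:
        raise ClientBlockedError(f"client {client_id!r} is turned off")

def handle_request(client_id, work):
    admit(client_id)
    return work()
```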

⁃ If you don’t implement QoS early on, it will be hellish to add it later, and you will certainly need it if your system lasts long enough.
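At its simplest, QoS means tagging every request with a class and serving higher classes first, so batch traffic can’t starve interactive traffic. A minimal sketch; the class names are illustrative:

```python
import heapq
import itertools

PRIORITY = {"interactive": 0, "batch": 1}  # lower number = served first
_counter = itertools.count()               # FIFO tie-break within a class

class QosQueue:
    def __init__(self):
        self._heap = []

    def put(self, qos_class, request):
        heapq.heappush(
            self._heap, (PRIORITY[qos_class], next(_counter), request)
        )

    def get(self):
        return heapq.heappop(self._heap)[2]
```

The hard part isn’t the queue, it’s getting every client to send an honest class label, which is exactly why retrofitting it is hellish.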

⁃ If you provide a client library, and your system is internal only, have it send logs to the same system as your servers. This makes it so much easier to trace issues back to misbehaving clients.

⁃ Track the build time for every deployed server binary and monitor how old they are. If your CI process deploys daily, week-old binaries are a problem. Month-old binaries are a major incident.
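One way to do this is to stamp a build timestamp into the binary at CI time and alert on its age. A sketch, where `BUILD_TIME` is a placeholder for whatever your CI actually injects (an env var, a linker flag, a generated file):

```python
import time

BUILD_TIME = 1_700_000_000  # placeholder: stamped by CI at build time

def build_age_days(now=None):
    now = time.time() if now is None else now
    return (now - BUILD_TIME) / 86400

def check_build_age(now=None, warn_days=7, crit_days=30):
    """Map binary age onto the thresholds from the bullet above."""
    age = build_age_days(now)
    if age >= crit_days:
        return "critical"  # month-old binary: major incident
    if age >= warn_days:
        return "warning"   # week-old binary: a problem
    return "ok"
```

Exporting `build_age_days()` as a metric means your normal monitoring stack catches stuck deploys for free.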

⁃ If you can get away with it (internal services): track the age of client library builds and either refuse to support builds older than X, or just cut them off entirely. It sucks to support requests from year-old clients; force them to upgrade!

⁃ Despite all this, you will at some point start getting requests from an ancient software version, or requests that are otherwise malformed. Make sure these requests don’t break anything.

⁃ Backups are a pain, and the tooling is often bad, but I swear they will save you one day. Take the time to invest in them.

⁃ Your CI process should exercise your turnup process, your decommission process, and your backups workflow. Life will suck later if you discover one of these is broken.

⁃ Third-party services go down. Your service goes down too, but the outages probably won’t happen at the same time. Be prepared to either operate without them, or mirror them yourself.

⁃ Your users will never, ever care if you’re down because of a dependency. Every datacenter owned by AWS could be hit by a meteor at the same time, but your user will only ever ask “why doesn’t my service work?”

⁃ Have good human relationships with your software dependencies. Know the people who develop them, keep in touch with them, make sure you understand each other. This is especially true internally but also important with external deps. In the end, software is made of people.

⁃ If users don’t have personal buy-in to the security policy, they will find ways to work around it and complain about you for making their lives harder. Take the time to educate them, or you’ll be fighting them continuously.

A supportive job interview story

(adapted from an old lobste.rs comment)

My favorite interview ever was a systems interview that didn’t go as planned. This was for an SRE position, and while I expected the interview to be a distributed systems discussion, the interviewer instead wanted to talk kernel internals.

I was not at all prepared for this, and admitted it up front. The interviewer said something along the lines of, “well, why don’t we see how it goes anyway?”

He then proceeded to teach me a ton about how filesystem drivers work in Linux, in the form of leading me carefully through the interview question he was “asking” me. The interviewer was incredibly encouraging throughout, and we had a good discussion about why certain design decisions worked the way they did.

I ended the interview (a) convinced I had bombed it, but (b) having had an excellent time anyway and having learned a bunch of new things. I later learned the interviewer had recommended to hire me based on how our conversation had gone, though I didn’t end up taking the job for unrelated reasons having to do with relocation.

I’ve given a number of similar interviews since, on system design or general sysadmin skills. I’ve always tried to go into these thinking about both where I could learn, and where I could teach, and how either outcome would give the candidate a chance to shine.

Developing managed vs self-hosted software

I’ve done some work lately with teams that deliver their products in very different ways, and it has me thinking about how much our “best practices” depend on a product’s delivery and operations model. I’ve had a bunch of conversations about this tension.

On the one hand, some of the teams I’ve worked with build software services that are developed and operated by the same team, and where the customers (internal or external) directly make use of the operated service. These teams try to follow what I think of as “conventional” SaaS best practices:

  • Their development workflow prioritizes iteration speed above all else
  • They tend to deploy from HEAD, or close to it, in their source repository
    • In almost all cases, branches are short-lived for feature development
  • They’ve built good automated test suites and well-tuned CI/CD pipelines
  • Releases are very frequent
  • They make extensive use of observability tooling, often using third-party SaaS for this
  • Fast roll-back is prioritized over perfect testing ahead of time
  • While their user documentation is mostly good, their operations documentation tends to be “just good enough” to onboard new team members, and a lot of it lives in Slack

However, we also have plenty of customers who deploy our software to their own systems, whether in the cloud or on-premise. (Some of them don’t even connect to the Internet on a regular basis!) The development workflow for software aimed at these customers looks rather different:

  • Deploys are managed by the customer, and release cycles are longer
  • These teams do still have CI/CD and extensive automated tests… but they may also have explicit QA steps before releases
  • There tend to be lots of longer-lived version branches, and even “LTS” branches with their own roadmaps
  • Logging is prioritized over observability, because they can’t make assumptions about the customer tooling
  • They put a lot more effort into operational documentation, because most operators will not also be developers

From a developer perspective, of course, this all feels much more painful! The managed service use case feels much more comfortable to develop for, and most of the community tooling and best practices for web development seem to optimize for that model.

But from a sysadmin perspective, used to mostly operating third-party software, the constraints of self-hosted development are all very familiar. And even managed service teams often rely on third-party software developed using this kind of model, relying on LTS releases of Linux distributions and pinning major versions of dependencies.

The biggest challenge I’ve seen, however, is when a development team tries to target the same software at both use cases. As far as I can tell, it’s very difficult to simultaneously operate a reliable service that is being continuously developed and deployed, and to provide predictable and high-quality releases to self-hosted customers.

So far, I’ve seen this tension resolved in three different ways:

  • The internal service becomes “just another customer”, operating something close to the latest external release, resulting in a slower release cycle for the internal service
  • Fast development for the internal service gets prioritized, with external releases becoming less frequent and including bigger and bigger changes
  • Internal and external diverge completely, with separate development teams taking over (and often a name change for one of them)

I don’t really have a conclusion here, except that I don’t really love any of these results. /sigh

If you’re reading this and have run into similar tensions, how have you seen this resolved? Have you seen any success stories in deploying the same code internally and externally? Or alternatively — any interesting stories of failure to share? 😉 Feel free to send me an email, I’d be interested to hear from you.

A goal for the new year

Last year was, in many ways, a rather difficult year for me.

There were certainly a lot of good things — we moved into our house, we adopted a puppy, and I switched to an interesting new team at work. But there was also a lot of unpleasantness and stress, and I frequently felt like I couldn’t manage to take a breath for fear of letting some urgent thing go undone.

Some of that is unavoidable, of course. We’re entering the third year of a global pandemic, our political situation in the United States is infuriating and dangerous, and family and work will often pick the worst possible time to spring crises on us.

But in my case at least, a lot of the stress I felt came from how I reacted to things. I felt like I couldn’t slow down, couldn’t sit and think, even though doing so was often exactly what I needed to do. The urgency was often something I imposed on myself, not something that came from the situation.

So, to the extent I have a new year’s resolution, it’s to slow down. To spend more time thinking, and planning, and reading, and less time reacting to whatever I find most stressful in the moment. Overall, to reduce urgency.

We’ll see how it goes, of course. Some things are difficult to control.

Toy programs for learning a new language

It used to be that I’d get interested in a new programming language, but I wouldn’t have a project to use it for, and I’d have trouble knowing how to start. I have trouble really grasping a new language without building something in it, and “X by example” guides or working through a book don’t really do the job.

What’s helped me lately is to build an array of “standard” toy programs that I understand reasonably well, and that I can use to explore the new language and figure out how to do something real in it.

Right now, my toy program collection consists of:

  • A link shortening service, like bit.ly or tinyurl, along with an HTTP API for adding and removing links
  • A 2D diffusion simulation
  • A “system package inventory” program, that builds a list of all the RPMs/DEBs installed on a Linux machine and shoves them into a SQLite database

This is almost never what I’d call production-quality code. For example, when I’m writing these toy programs, I rarely write unit tests (until I start exploring the test libraries for the language!). But they’re still very valuable learning tools, and give me space to explore some very different use-cases.

I almost always write all three in a given language, but the order depends a lot on what I think the new language will be good for. For example, I’ll write the “system package inventory” program first if I think the new language might be handy for system administration tools. It’s a great way to see how well the language plays with a common Linux environment, how painful it is to use SQLite, and to get practice writing CLI tools in it. I’ll often augment the basic “scan and store” functionality with a CLI to do frequent queries, like “on what date was this package last upgraded”.
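The core of that toy really is small. Here’s a sketch of the Debian flavor, assuming `dpkg-query` is available; the schema and the exact command flags are my own illustrative choices:

```python
import sqlite3
import subprocess

def installed_debs():
    """Yield (name, version) for every installed Debian package."""
    out = subprocess.run(
        ["dpkg-query", "-W", "-f", "${Package}\t${Version}\n"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        name, version = line.split("\t", 1)
        yield name, version

def store(packages, db_path="packages.db"):
    """Upsert package rows; return the total row count."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS packages ("
        "  name TEXT PRIMARY KEY, version TEXT,"
        "  seen_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    con.executemany(
        "INSERT OR REPLACE INTO packages (name, version) VALUES (?, ?)",
        packages,
    )
    n = con.execute("SELECT COUNT(*) FROM packages").fetchone()[0]
    con.commit()
    con.close()
    return n
```

From there, the “when was this package last upgraded” query is just a matter of keeping history rows instead of upserting.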

On the other hand, if I think I’m going to use the new language for a bunch of numerical work, I’ll start with the diffusion simulation. When I write that, I often start with a naive implementation and then start playing with profilers and other performance tools, or try to parallelize the simulation. This is also a great excuse to dig into any plotting tools commonly used with the language.
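The naive implementation I start from is usually an explicit finite-difference update of the heat equation. A sketch in Python with NumPy, using periodic boundaries; all the parameters are illustrative:

```python
import numpy as np

def step(u, D=0.1, dt=0.1, dx=1.0):
    """One explicit Euler step of u_t = D * (u_xx + u_yy).
    Stable while D * dt / dx**2 <= 0.25."""
    lap = (
        np.roll(u, 1, 0) + np.roll(u, -1, 0) +
        np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u
    ) / dx**2                      # periodic-boundary Laplacian
    return u + D * dt * lap

u = np.zeros((64, 64))
u[32, 32] = 1.0                    # point source in the middle
for _ in range(100):
    u = step(u)
```

Once that works, it’s a natural base for profiling, parallelizing, or plotting experiments in the new language.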

These toy programs are also handy if I want to explore new ways to integrate a service into a larger production environment. For example, I might start with the link shortening service, deploying the service itself statelessly and persisting the list of links into a PostgreSQL DB. Then I start complicating things…

  • Let’s add logging!
  • And tracing!
  • It’s always a good idea to expose Prometheus metrics
  • And wouldn’t it be handy to support multiple database backends?
  • Now wrap it all in a Helm chart for handy deployment
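Stripped of all those complications, the core of the link shortener is tiny. A sketch with an in-memory dict standing in for the PostgreSQL backend; the names and the id scheme are my own:

```python
import secrets

class Shortener:
    """Minimal link-shortener core: add, resolve, remove."""

    def __init__(self):
        self._links = {}

    def add(self, url):
        """Store a URL and return its short id."""
        short = secrets.token_urlsafe(4)
        self._links[short] = url
        return short

    def resolve(self, short):
        return self._links.get(short)

    def remove(self, short):
        self._links.pop(short, None)
```

Everything in the list above layers on top of this without changing it, which is what makes it such a good integration testbed.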

I imagine I’m not the only person to have a standard collection of learning projects for new languages. If you do this too, what does your project list look like?

Trying to write more, with less pressure

I’ve been pretty bad at blogging for the past *mumble mumble years*, but it’s not for lack of writing.

The thing is, I like writing. I have a rather large drafts folder filled with work-in-progress posts, not to mention all the various brainstorming docs I have for work, D&D, and other writing tasks. Those WIPs are frequently five or ten pages long, with lots of little notes on extra bits I should add to avoid missing things.

Like plenty of other folks, my problem isn’t writing, it’s finishing things.

However, this blog is titled “thinking out loud”! I don’t need to write the definitive article on a given topic, or at least I don’t need to do it here. Instead, I want this to be a place where I can get thoughts out in front of people (and myself!) so I can make them better.

To that end, I’m setting a goal for 2022 to write here:

  • At least once every two weeks
  • With only light editing
  • At most two pages of text in a post
  • And being willing to delete anything I decide I hate! 😅

To make my life easier, I’m cheating a little: I’ve written four short posts in the past week, and set them to auto-publish every two weeks!

That should get me through February. In the meantime, I’ll keep writing — with any new posts either being added to the queue, or potentially posted live if I have a particularly hot take. With luck, this process will help me stick to my goal despite any temporary crises or fits of ennui, and keep my momentum up.

Happy New Year!

SRE to Solutions Architect

It’s been about two years since I joined NVIDIA as a Solutions Architect, which was a pretty big job change for me! Most of my previous work was in jobs that could fall under the heading of “site reliability engineering”, where I was actively responsible for the operations of computing systems, but my new job mostly has me helping customers design and build their own systems.

I’m finally starting to feel like I know what I’m doing at least 25% of the time, so I thought this would be a good time to reflect on the differences between these roles and what my past experience brings to the table for my (sort of) new job.


Sketching out HPC clusters at different scales

High-performance computing (HPC) clusters come in a variety of shapes and sizes, depending on the scale of the problems you’re working on, the number of different people using the cluster, and what kinds of resources they need to use.

However, it’s often not clear what kinds of differences separate the kind of cluster you might build for your small research team:

Note: do not use in production

From the kind of cluster that might serve a large laboratory with many different researchers:

The Trinity supercomputer at Los Alamos National Lab, also known as “that goddamn machine” when I used to get paged at 3am

There are lots of differences between a supercomputer and my toy Raspberry Pi cluster, but also a lot in common. From a management perspective, a big part of the difference is how many different specialized node types you might find in the larger system.


Handy utilities for every HPC cluster

I’ve built a lot of HPC clusters, and they’ve often looked very different from each other depending on the particular hardware and target applications. But I almost always find myself installing a few common tools on them, to make their management easier, so I thought I’d share the list.
