An unstructured rant on running long-lived software services

– Be kind to your colleagues. Be kind to your users. Be kind to yourself. This is a long haul and you’ll all fuck up.

⁃ The natural environment for your code is production. It will run there longer than it does anywhere else. Design for prod first, and if possible, make your dev environment act like prod.

⁃ Legacy code is the only code worth caring about.

⁃ Users do weird stuff, but they usually have a very good reason, at least in their context. Learn from them.

⁃ It’s 2022, please do structured logging.

⁃ Contexts and tracing make everyone’s lives easier when it comes time to debug. At minimum, include a unique request id with every request and plumb it through the system.

⁃ Do your logging in a separate thread. It sucks to find a daemon blocked and hanging because of a full disk or a down syslog server.

⁃ Don’t page for individual machines going down. Do provide an easy or automated way for bad nodes to get thrown out of the system.

– Be prepared for your automation to be the problem, and include circuit breakers or kill switches to stop it. I’ve seen health checks that started flagging every machine in the fleet as bad, whether it was healthy or not. We didn’t bring down prod because the code assumed if it flagged more than 15% of the fleet as bad, the problem was probably with the test, not the service.

⁃ Make sure you have a way to know who your users are. If you allow anonymous access, you’ll discover in five years that a business-critical team you’ve never heard of is relying on you.

⁃ Make sure you have a way to turn off access for an individual machine, user, etc. If your system does anything more expensive than sending network requests, it will be possible for a single bad client to overwhelm a distributed system with thousands of servers. Turning off their access is easier than begging them to stop.

⁃ If you don’t implement QOS early on, it will be hellish to add it later, and you will certainly need it if your system lasts long enough.

⁃ If you provide a client library, and your system is internal only, have it send logs to the same system as your servers. This will help trace issues back to misbehaving clients so much.

⁃ Track the build time for every deployed server binary and monitor how old they are. If your CI process deploys daily, week-old binaries are a problem. Month-old binaries are a major incident.

⁃ If you can get away with it (internal services): track the age of client library builds and either refuse to support builds older than X, or just cut them off entirely. It sucks to support requests from year-old clients, force them to upgrade!

⁃ Despite all this, you will at some point start getting requests from an ancient software version, or otherwise malformed. Make sure these requests don’t break anything.

⁃ Backups are a pain, and the tooling is often bad, but I swear they will save you one day. Take the time to invest in them.

⁃ Your CI process should exercise your turnup process, your decommission process, and your backups workflow. Life will suck later if you discover one of these is broken.

⁃ Third party services go down. Your service goes down too, but they probably won’t happen at the same time. Be prepared to either operate without them, or mirror them yourself

⁃ Your users will never, ever care if you’re down because of a dependency. Every datacenter owned by AWS could be hit by a meteor at the same time, but your user will only ever ask “why doesn’t my service work?”

⁃ Have good human relationships with your software dependencies. Know the people who develop them, keep in touch with them, make sure you understand each other. This is especially true internally but also important with external deps. In the end, software is made of people.

⁃ If users don’t have personal buy-in to the security policy, they will find ways to work around them and complain about you for making their lives harder. Take the time to educate them, or you’ll be fighting them continuously.