Quoting Charity Majors

Charity’s latest post, Bring back ops pride, is an excellent discussion (rant?) on the importance of operations for software systems, and on why it’s a bad idea to pretend operations isn’t a real concern, or to make conventional application teams do the work in addition to their regular job.

“Operations” is not a dirty word, a synonym for toil, or a title for people who can’t write code. May those who shit on ops get the operational outcomes they deserve.

You should absolutely go read the full piece, as well as Charity’s earlier post on the Honeycomb blog: You had one job: Why twenty years of DevOps has failed to do it.

Below find several pull quotes from the post itself, because there were just too many to choose from.


Quoting Nicholas Carlini

Because when the people training these models justify why they’re worth it, they appeal to pretty extreme outcomes. When Dario Amodei wrote his essay Machines of Loving Grace, he wrote that he sees the benefits as being extraordinary: “Reliable prevention and treatment of nearly all natural infectious disease … Elimination of most cancer … Prevention of Alzheimer’s … Improved treatment of most other ailments … Doubling of the human lifespan.” These are the benefits that the CEO of Anthropic uses to justify his belief that LLMs are worth it. If you think that these risks sound fanciful, then I might encourage you to consider what benefits you see LLMs as bringing, and then consider if you think the risks are worth it.

From Carlini’s recent talk/article on Are large language models worth it?

The entire article is well worth reading, but I was struck by this bit near the end. LLM researchers often dismiss (some of) the risks of these models as fanciful. But many of the benefits touted by the labs sound just as fanciful!

When we’re evaluating the worth of this research, it’s a good idea to be consistent about how realistic — or how “galaxy brain” — we want to be, with both risks and benefits.

Tailscale

There’s been some discussion on bsky of the usefulness of Tailscale, and I’ll just note here how very handy it is for running a personal homelab that includes cloud instances. It’s also great for having lab connectivity from a laptop or phone on the go!

Services I run over Tailscale, just for myself, include:

  • An RSS feed reader
  • A personal git forge
  • An IRC bouncer
  • A (poorly maintained) wiki
  • JupyterLab
  • Open WebUI for playing with local LLMs on a GPU workstation
  • SSH to a powerful workstation, hosted at home but without complex configs

And probably a few things I’ve forgotten! It’s really just very neat. Sure, I could do it all with manual WireGuard configs, but Tailscale makes the underlying primitive much more ergonomic.
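As a trivial but concrete example of those ergonomics, here is a minimal Python sketch that shells out to tailscale status --json to list the machines on a tailnet. The subcommand is real, but the JSON field names below (Peer, HostName, TailscaleIPs, Online) are from memory, so treat them as assumptions to verify against your own install:

```python
import json
import subprocess

def tailnet_peers():
    """Yield basic info for each peer reported by the tailscale CLI."""
    out = subprocess.run(
        ["tailscale", "status", "--json"],
        capture_output=True, check=True, text=True,
    ).stdout
    status = json.loads(out)
    # "Peer" maps node keys to per-peer records; field names are assumptions.
    for peer in (status.get("Peer") or {}).values():
        yield {
            "host": peer.get("HostName"),
            "ips": peer.get("TailscaleIPs") or [],
            "online": peer.get("Online"),
        }

if __name__ == "__main__":
    for p in tailnet_peers():
        print(f"{p['host']:30} {', '.join(p['ips']):40} online={p['online']}")
```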

Quoting antirez on AI

Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months.

Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched.

From Don’t fall into the anti-AI hype

Latency-critical Linux task scheduling for gaming

LWN has an excellent article up on the “latency-criticality aware virtual deadline” (LAVD) scheduler, from a talk at the Linux Plumbers Conference in December.

In particular, I appreciate the detailed discussion of using different profilers and performance-analysis tools at different levels to determine how to optimize scheduling for two key goals: providing high average FPS while keeping 99th-percentile frame times as low as possible, e.g. to prevent UI stuttering. Optimizing for battery usage is also important, as the Steam Deck was one of the main targets for this work.

The key finding that came out of his analysis is perhaps somewhat obvious: a single high-level action, such as moving a character on-screen and emitting a sound based on a key-press event, requires that many tasks work together. Some of the tasks are threads in the game process, but others are not because they are in the game engine, kernel, and device drivers; there are often 20 or 30 tasks in a chain that all need to collaborate. Finding tasks with a high waker or wakee frequency and prioritizing them is the basis of the LAVD scheduling policy.
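As a toy illustration of that idea (emphatically not the actual implementation, which is a sched_ext BPF scheduler living in the kernel), here is a hypothetical Python sketch that scores tasks by how often they appear as wakers or wakees, which is roughly the signal LAVD uses to spot latency-critical tasks; the task names and wake events are invented:

```python
from collections import Counter

# Invented wake events: (waker, wakee) pairs observed over some window.
wake_events = [
    ("input_irq", "game_input_thread"),
    ("game_input_thread", "game_render_thread"),
    ("game_render_thread", "compositor"),
    ("compositor", "gpu_driver"),
    ("game_input_thread", "audio_mixer"),
    ("audio_mixer", "audio_driver"),
]

waker_freq = Counter(waker for waker, _ in wake_events)
wakee_freq = Counter(wakee for _, wakee in wake_events)

def latency_criticality(task: str) -> int:
    # Tasks that frequently wake others, or are frequently woken, sit in
    # the middle of the long wake chains described in the talk.
    return waker_freq[task] + wakee_freq[task]

tasks = set(waker_freq) | set(wakee_freq)
for task in sorted(tasks, key=latency_criticality, reverse=True):
    print(f"{task:22} score={latency_criticality(task)}")
```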

As always with LWN there’s good coverage not only of the talk itself, but also the Q&A following the session and ideas from the audience on tooling and other improvements.

Phoronix also covered a different talk from the same conference (I think) on how Meta is using the LAVD scheduler as the basis for a new default scheduler used on their fleet.

I haven’t had a chance to watch this talk yet (video linked from the article), but I’m very interested in the idea that the same concepts might be useful to a hyperscaler as well as to a device like the Steam Deck.

“My Cousin Vinny” as an LLM benchmark

Mike Caulfield wrote a very thorough and quite entertaining article about posing the following question to ChatGPT:

What were Marisa Tomei’s most famous quotes from My Cousin Vinny and what was the context?

Depending on the model selected, the answers to this varied from hilariously wrong, to plausible-but-flawed, to accurate.

Interestingly, substantial test-time compute (“thinking time”) seems to be necessary to do a good job here, despite the easy availability online of famous quotes, plot summaries, and even the script. Meanwhile, the fast-response models available for free were prone to hallucinate.

At the same time I was struck by just how much reasoning time needed to be expended to get this task right. It’s possible that My Cousin Vinny is uniquely hard to parse, but I don’t think that is the case. I’ve tried this with a half dozen other films and the pattern seems to hold. If it’s true that a significant number of similar film-contextualization tasks are solvable with test-time compute but require extensive compute to get right, it seems to me this could be the basis of a number of useful benchmarks.
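If you wanted to sketch such a benchmark yourself, something like the following would be a starting point. It uses the OpenAI Python client; the model names are just examples, and the plausibility check is a placeholder (a real benchmark would grade accuracy and context, probably with a rubric or a human, rather than looking for one famous line):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

QUESTION = (
    "What were Marisa Tomei's most famous quotes from My Cousin Vinny "
    "and what was the context?"
)

# Example model names; swap in whatever fast vs. reasoning models you use.
MODELS = ["gpt-4o-mini", "o3-mini"]

def looks_plausible(answer: str) -> bool:
    # Placeholder: her best-known line is about her biological clock ticking.
    return "biological clock" in answer.lower()

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": QUESTION}],
    )
    answer = response.choices[0].message.content
    print(f"{model}: plausible={looks_plausible(answer)}")
```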

The full article is well worth reading, and not only because it discusses My Cousin Vinny in substantial detail (great movie).

On Friday deploys

This post from Charity Majors on Friday deploys is well worth reading.

In the past I’ve seen her comment on how deployments should be carried out fearlessly regardless of when, and I’ve often felt like saying “yeah, well, …”. Because of course I agree with that as a goal, but many real-world orgs and conditions make it challenging.

This most recent post talks about the situations when those freezes can make sense, even if they’re not ideal. In particular, I like the point that what really needs to be frozen is not deploys, but merges:

To a developer, ideally, the act of merging their changes back to main and those changes being deployed to production should feel like one singular atomic action, the faster the better, the less variance the better. You merge, it goes right out. You don’t want it to go out, you better not merge.

The worst of both worlds is when you let devs keep merging diffs, checking items off their todo lists, closing out tasks, for days or weeks. All these changes build up like a snowdrift over a pile of grenades. You aren’t going to find the grenades til you plow into the snowdrift on January 5th, and then you’ll find them with your face. Congrats!
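As a sketch of what freezing merges rather than deploys could look like mechanically, here is a hypothetical CI gate (dates and policy invented for illustration) that fails a required check during the freeze window, so changes wait in their branches instead of piling up unreleased on main:

```python
from datetime import date, datetime, timezone

# Invented holiday freeze window, for illustration only.
FREEZE_START = date(2025, 12, 19)
FREEZE_END = date(2026, 1, 5)

def merges_frozen(today: date) -> bool:
    return FREEZE_START <= today <= FREEZE_END

if __name__ == "__main__":
    today = datetime.now(timezone.utc).date()
    if merges_frozen(today):
        raise SystemExit(
            f"Merge freeze in effect until {FREEZE_END}: "
            "hold this change in its branch."
        )
    print("No freeze in effect; merge away.")
```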

Why generic software design advice is often useless

In You can’t design software you don’t work on, Sean Goedecke discusses why generic advice on the design of software systems is often unhelpful.

When you’re doing real work, concrete factors dominate generic factors. Having a clear understanding of what the code looks like right now is far, far more important than having a good grasp on general design patterns or principles.

This tracks with my experience not just of software systems, but also of systems with a hardware component (e.g. ML training clusters) or a facility component (e.g. datacenters). The specifics of your system absolutely dominate any general design guidance.

As the manager of a team that publishes reference architectures, I do think it’s helpful to clearly understand where your specific design differs from generic advice. If you’re going off the beaten path, you should know you’re doing that! And be able to plan for any additional validation that entails.

But relatedly, this is part of why I think that any generic advice should be based on some actually existing system. If you are telling someone they should follow a given principle, you should be able to point to an implementation that does follow that principle.

Or else you’re just speculating into the void. Which admittedly can be fun but is not nearly as valuable as speaking from experience.