A goal for the new year

Last year was, in many ways, a rather difficult year for me.

There were certainly a lot of good things — we moved into our house, we adopted a puppy, and I switched to an interesting new team at work. But there was also a lot of unpleasantness and stress, and I frequently felt like I couldn’t manage to take a breath for fear of letting some urgent thing go undone.

Some of that is unavoidable, of course. We’re entering the third year of a global pandemic, our political situation in the United States is infuriating and dangerous, and family and work will often pick the worst possible time to spring crises on us.

But in my case at least, a lot of the stress I felt came from how I reacted to things. I felt like I couldn’t slow down, couldn’t sit and think, even though doing so was often exactly what I needed to do. The urgency was often something I imposed on myself, not something that came from the situation.

So, to the extent I have a new year’s resolution, it’s to slow down. To spend more time thinking, and planning, and reading, and less time reacting to whatever I find most stressful in the moment. Overall, to reduce urgency.

We’ll see how it goes, of course. Some things are difficult to control.

Toy programs for learning a new language

It used to be that I’d get interested in a new programming language, but I wouldn’t have a new project to use it for and had trouble knowing how to start. I have trouble really grasping a new language without building something in it, and “X by example” or working through a book don’t really do the job.

What’s helped me lately is to build an array of “standard” toy programs that I understand reasonably well, and that I can use to explore the new language and figure out how to do something real in it.

Right now, my toy program collection consists of:

  • A link shortening service, like bit.ly or tinyurl, along with an HTTP API for adding and removing links
  • A 2D diffusion simulation
  • A “system package inventory” program that builds a list of all the RPMs/DEBs installed on a Linux machine and shoves them into a SQLite database

This is almost never what I’d call production-quality code. For example, when I’m writing these toy programs, I rarely write unit tests (until I start exploring the test libraries for the language!). But they’re still very valuable learning tools, and give me space to explore some very different use-cases.

I almost always write all three in a given language, but the order depends a lot on what I think the new language will be good for. For example, I’ll write the “system package inventory” program first if I think the new language might be handy for system administration tools. It’s a great way to see how well the language plays with a common Linux environment, how painful it is to use SQLite, and to get practice writing CLI tools in it. I’ll often augment the basic “scan and store” functionality with a CLI to do frequent queries, like “on what date was this package last upgraded”.
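
A minimal sketch of the “store and query” half of such an inventory program, in Python with the standard `sqlite3` module. The table layout and the function names (`open_inventory`, `record_scan`, `last_upgraded`) are assumptions made for illustration, and the scan itself (listing installed RPMs or DEBs) is left out. “Last upgraded” is approximated here as the most recent scan date on which a new version of the package first appeared.

```python
import sqlite3


def open_inventory(path=":memory:"):
    """Open (or create) the inventory database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS package_events (
               name    TEXT NOT NULL,
               version TEXT NOT NULL,
               seen_on TEXT NOT NULL,  -- ISO date of the scan that first saw this version
               PRIMARY KEY (name, version)
           )"""
    )
    return conn


def record_scan(conn, scan_date, packages):
    """Store (name, version) pairs from one scan.

    A version that was already recorded keeps its original date, so each row
    marks the first scan on which that version was seen.
    """
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO package_events (name, version, seen_on) "
            "VALUES (?, ?, ?)",
            [(name, version, scan_date) for name, version in packages],
        )


def last_upgraded(conn, name):
    """Return the latest date a new version of `name` appeared, or None."""
    row = conn.execute(
        "SELECT MAX(seen_on) FROM package_events WHERE name = ?", (name,)
    ).fetchone()
    return row[0]
```

ISO-formatted dates sort correctly as plain text, which is why `MAX(seen_on)` works without any date parsing. A CLI wrapper would only need to call `last_upgraded` with the package name given on the command line.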

On the other hand, if I think I’m going to use the new language for a bunch of numerical work, I’ll start with the diffusion simulation. When I write that, I often start with a naive implementation and then start playing with profilers and other performance tools, or try to parallelize the simulation. This is also a great excuse to dig into any plotting tools commonly used with the language.
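
A naive implementation of this kind might look like the following sketch, which uses plain Python lists and an explicit finite-difference update. The specific scheme, the fixed boundary values, and the parameter names (`alpha`, `size`, `steps`) are assumptions chosen for illustration, not details from the original program.

```python
def diffuse_step(grid, alpha):
    """Advance the grid by one explicit time step.

    alpha stands for D * dt / dx**2; this scheme is only stable in 2D
    when alpha <= 0.25. Boundary cells are left unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy, so every update reads the old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Five-point approximation of the Laplacian
            laplacian = (grid[i - 1][j] + grid[i + 1][j]
                         + grid[i][j - 1] + grid[i][j + 1]
                         - 4.0 * grid[i][j])
            new[i][j] = grid[i][j] + alpha * laplacian
    return new


def simulate(size=21, steps=50, alpha=0.2):
    """Start with a single hot cell in the middle and let it spread."""
    grid = [[0.0] * size for _ in range(size)]
    grid[size // 2][size // 2] = 100.0
    for _ in range(steps):
        grid = diffuse_step(grid, alpha)
    return grid
```

The doubly nested loop is deliberately slow, which makes it a reasonable target for a profiler, and the fact that each cell update only reads the previous grid makes it a natural candidate for parallelization.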

These toy programs are also handy if I want to explore new ways to integrate a service into a larger production environment. For example, I might start with the link shortening service, deploying the service itself statelessly and persisting the list of links into a PostgreSQL DB. Then I start complicating things…

  • Let’s add logging!
  • And tracing!
  • It’s always a good idea to expose Prometheus metrics
  • And wouldn’t it be handy to support multiple database backends?
  • Now wrap it all in a Helm chart for handy deployment
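
Before any of those complications, the core of a link shortener is small. The sketch below keeps links in an in-memory dictionary as a stand-in for the PostgreSQL table, and leaves out the HTTP layer entirely. The class name `LinkStore` and the scheme for generating short codes (the first seven hex characters of a SHA-256 hash) are assumptions made for illustration.

```python
import hashlib


class LinkStore:
    """In-memory mapping from short codes to URLs."""

    def __init__(self):
        self._links = {}

    def add(self, url):
        """Register a URL and return its short code."""
        # Hypothetical scheme: derive the code from a hash of the URL,
        # so adding the same URL twice returns the same code.
        code = hashlib.sha256(url.encode("utf-8")).hexdigest()[:7]
        existing = self._links.get(code)
        if existing is not None and existing != url:
            # Two different URLs produced the same truncated hash.
            raise ValueError(f"short code collision for {code}")
        self._links[code] = url
        return code

    def resolve(self, code):
        """Return the URL for a short code, or None if it is unknown."""
        return self._links.get(code)

    def remove(self, code):
        """Delete a short code; return True if it existed."""
        return self._links.pop(code, None) is not None
```

Keeping the storage behind three small methods is what makes the later steps tractable: swapping the dictionary for a database, or supporting several backends, only changes this one class.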

I imagine I’m not the only person to have a standard collection of learning projects for new languages. If you do this too, what does your project list look like?

Trying to write more, with less pressure

I’ve been pretty bad at blogging for the past *mumble mumble years*, but it’s not for lack of writing.

The thing is, I like writing. I have a rather large drafts folder filled with work-in-progress posts, not to mention all the various brainstorming docs I have for work, D&D, and other writing tasks. Those WIPs are frequently five or ten pages long, with lots of little notes on extra bits I should add to avoid missing things.

Like plenty of other folks, my problem isn’t writing, it’s finishing things.

However, this blog is titled “thinking out loud”! I don’t need to write the definitive article on a given topic, or at least I don’t need to do it here. Instead, I want this to be a place where I can get thoughts out in front of people (and myself!) so I can make them better.

To that end, I’m setting a goal for 2022 to write here:

  • At least once every two weeks
  • With only light editing
  • At most two pages of text in a post
  • And being willing to delete anything I decide I hate! 😅

To make my life easier, I’m cheating a little: I’ve written four short posts in the past week, and set them to auto-publish every two weeks!

That should get me through February. In the meantime, I’ll keep writing, with any new posts either being added to the queue or posted live if I have a particularly hot take. With luck, this process will help me stick to my goal despite any temporary crises or fits of ennui, and keep my momentum up.

Happy New Year!

SRE to Solutions Architect

It’s been about two years since I joined NVIDIA as a Solutions Architect, which was a pretty big job change for me! Most of my previous work was in jobs that could fall under the heading of “site reliability engineering”, where I was actively responsible for the operations of computing systems, but my new job mostly has me helping customers design and build their own systems.

I’m finally starting to feel like I know what I’m doing at least 25% of the time, so I thought this would be a good time to reflect on the differences between these roles and what my past experience brings to the table for my (sort of) new job.

Sketching out HPC clusters at different scales

High-performance computing (HPC) clusters come in a variety of shapes and sizes, depending on the scale of the problems you’re working on, the number of different people using the cluster, and what kinds of resources they need to use.

However, it’s often not clear what kinds of differences separate the kind of cluster you might build for your small research team:

[Image caption: “Note: do not use in production”]

From the kind of cluster that might serve a large laboratory with many different researchers:

[Image caption: “The Trinity supercomputer at Los Alamos National Lab, also known as ‘that goddamn machine’ when I used to get paged at 3am”]

There are lots of differences between a supercomputer and my toy Raspberry Pi cluster, but also a lot in common. From a management perspective, a big part of the difference is how many different specialized node types you might find in the larger system.

Handy utilities for every HPC cluster

I’ve built a lot of HPC clusters, and they’ve often looked very different from each other depending on the particular hardware and target applications. But I almost always find myself installing a few common tools on them, to make their management easier, so I thought I’d share the list.

My default technology choices

I’ve written several partial versions of this post in various emails and Slack posts, and finally decided I should just put it on the blog.

The tech landscape is complex and picking the right tool is hard, but the vast majority of problems can be solved in a “good enough” way using a wide variety of tools. The best choice is usually the one you know well already. So I tend to think most developers should have a “default tech stack” that they use for most things, only switching when the problem constraints or early experience dictate otherwise.

And here’s mine! This is the list of tools I usually start with, and use most frequently in production. I will frequently adjust some part of this list for any given project, but I find these are usually useful choices. I don’t expect any of these to be very surprising, but I think there’s some value in writing them down.

Some thoughts after reading Vincenti’s “What Engineers Know and How They Know It”

A few weeks ago I watched Hillel Wayne’s recent talk “Are we really engineers?”, where he looked at the idea of whether software engineers get to call themselves “engineers” or not. (Spoiler: the answer is yes!)

During the Q&A, Wayne mentioned that while he had seen a lot of “philosophy of science”, there didn’t seem to be much “philosophy of engineering” out there. I remembered noticing the same thing, and on Twitter I asked for book recommendations on the topic. The always-reliable Lorin Hochstein obliged, and a week later I had some reading to do!

Just as a disclaimer: this post is very much in the theme of “thinking out loud”, and got a little long. 🙂 This is mostly me discussing my experience of reading the book and some thoughts on software engineering I had after reading it. Very likely nothing here is at all original, and I am not an expert, but I wanted to get my ideas down in text after finishing the read. And having done so, I thought it might be worthwhile to share.

Ok, let’s dive in.

Questions your users will probably ask about the shared cluster

(Not intended to be exhaustive.)

On failures:

  • Why did my job fail?
    • Ok, I saw the error code, but what did that actually mean?
    • Can you make changes on the cluster itself so this will succeed next time?
  • What physical node did my job run on?
    • How can I make sure my retries don’t run on that node again?
    • Ok, I understand that the machines are identical, but can I just make sure to never run on that node ever again?
  • What can I do to prevent this failure from happening again?

On priorities or “quality of service” mechanisms:

  • Why is my job taking so long to run? (And/or, why are my requests being throttled?)
  • How can I increase my priority to avoid this throttling again?
  • Why is that other user’s job running with a higher priority than mine?
    • Can you please throttle their jobs which are getting an unfair priority?
  • I have a critical deadline! Can you please override the priority score so my job runs immediately?
  • Can you reserve some set of resources for my dedicated use for some amount of time?
  • Can you provide some guarantee that my jobs will always run within a specified time?
  • Can I get an interactive way to run on the machine? I don’t want to deal with writing a job script.
    • (This may be requested for a single node, or at any scale up to the full cluster!)

On the development environment:

  • Can you please provide package X on the system? (Where X may be something the admin team has never heard of)
  • Why doesn’t the cluster provide the newest version of X? (Or an updated version of some API)
    • When can I get access to the newest version?
  • Why did you upgrade the cluster so quickly? Now my workflow is broken.
    • Can you please go back to the version that works for my job?

Now, I’ll grant you that I can get a little snarky sometimes, but all of these questions may have some valid business reason!

Even seemingly obnoxious requests, like the user who shows up on a Friday asking for exclusive access to the full cluster, might actually turn out to be the most important thing to do in that context.

And valid or not — most of these are questions the users of any shared resource will eventually ask! I’ve run across too many cluster stacks that can’t actually inspect their priority system; don’t provide any tracing for failures; or can’t even tell the user which machines they ran on.

One way or another, you should probably be able to answer these kinds of questions, or you’ll have a lot of trouble operating your system over time.

Having trouble with fun in the time of COVID

I have a lot of friends who are struggling with focus at work given our current shelter-in-place conditions. I’m running into a little bit of this myself — I think my productivity is a little bit lower than usual, even though I’m used to working at home. But for the most part I’m doing ok getting work done.

Instead, what I’m failing at is relaxing. I’m finding it increasingly hard to focus, or enjoy myself at all, when I don’t have a clear “to-do” item in front of me.

  • I’m having trouble with any kind of fiction reading, which is usually not a problem at all.
  • I’ve been taking more walks, but I spend them either thinking about work or worrying about the state of the world in general.
  • I can occasionally fool myself into baking, but only if I think of it as “I need this loaf of bread” instead of “it’s fun to bake!”
  • While Leigh and I are slowly working our way through Star Trek: Deep Space Nine, I’ve watched it enough times that it’s as much background noise as media for me.

Instead, when I have any kind of downtime I just feel anxious. I stare at Twitter or the news, or just sit there and worry.

Needless to say this is not any good for work-life balance! I’ve been doing okay at not over-working, mostly thanks to Leigh and the cats, but I’m not sure hours of free-floating anxiety is all that much better.

Anyway. Not sure there’s a point to this, but that’s my current quarantine experience. If this sounds familiar, feel free to shoot me an email; I’m happy to chat! (Apparently I could use the distraction…)