High-performance computing (HPC) clusters come in a variety of shapes and sizes, depending on the scale of the problems you’re working on, the number of different people using the cluster, and what kinds of resources they need to use.
However, it’s often not clear what differences separate the kind of cluster you might build for your small research team from the kind that might serve a large laboratory with many different researchers.
There are lots of differences between a supercomputer and my toy Raspberry Pi cluster, but also a lot in common. From a management perspective, a big part of the difference is how many different specialized node types you might find in the larger system.
I’ve built a lot of HPC clusters, and they’ve often looked very different from each other depending on the particular hardware and target applications. But I almost always find myself installing a few common tools to make them easier to manage, so I thought I’d share the list.
I’ve written several partial versions of this post in various emails and Slack posts, and finally decided I should just put it on the blog.
The tech landscape is complex and picking the right tool is hard, but the vast majority of problems can be solved in a “good enough” way using a wide variety of tools. The best choice is usually the one you know well already. So I tend to think most developers should have a “default tech stack” that they use for most things, only switching when the problem constraints or early experience dictate otherwise.
And here’s mine! This is the list of tools I usually start with, and the ones I use most frequently in production. I’ll often swap out some part of it for a given project, but these are usually solid defaults. I don’t expect any of them to be very surprising, but I think there’s some value in writing them down.