I’ve built a lot of HPC clusters, and they’ve often looked very different from each other depending on the particular hardware and target applications. But I almost always find myself installing a few common tools on them, to make their management easier, so I thought I’d share the list.
ClusterShell: While in theory you don’t want to ssh into servers outside configuration management, sometimes it’s handy to just run the same command on a bunch of machines at once. (Hopefully doing something read-only!) ClusterShell provides the clush command for doing this, and comes with a useful Python library for doing the same from within your own automation.
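To make the "same command on a bunch of machines" idea concrete, here is a minimal standard-library sketch. It is not ClusterShell's API, which does far more. The names ssh_run and fan_out are hypothetical, and the sketch assumes key-based ssh access to each host. It runs one command on every host in parallel and groups hosts that returned identical output, so outliers stand out.

```python
import subprocess
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def ssh_run(host, command, timeout=30):
    """Run `command` on `host` over ssh and return its combined output.

    Assumes non-interactive (key-based) ssh access; BatchMode=yes makes ssh
    fail instead of prompting for a password.
    """
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return (result.stdout + result.stderr).strip()


def fan_out(hosts, command, runner=ssh_run, max_workers=32):
    """Run `command` on every host in parallel and group hosts by output.

    Returns a dict mapping each distinct output to the list of hosts that
    produced it, in the order the hosts were given.
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda h: runner(h, command), hosts))
    grouped = defaultdict(list)
    for host, output in zip(hosts, outputs):
        grouped[output].append(host)
    return dict(grouped)
```

Calling fan_out(["node001", "node002"], "uname -r") with those placeholder host names would return something like one kernel version mapped to both nodes, or two entries if they differ. The runner argument exists so the ssh call can be swapped out, for example in tests.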
Node Health Check: NHC does what it says: it checks the health of your nodes! It’s a simple little utility that reads a config file defining your checks, runs each check in order, and either exits successfully if everything passes or fails and tells you why. You can then hook NHC into your cluster manager of choice to automatically down nodes that don’t pass their tests.
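The run-in-order, pass-or-explain pattern described here can be sketched in a few lines of Python. This is an illustration of the idea only, not NHC's implementation or its config syntax. The check functions and run_checks are hypothetical names, and stopping at the first failure is an assumption of this sketch.

```python
import os
import shutil


def check_mounted(path):
    """Pass if `path` is a mount point."""
    return os.path.ismount(path), f"{path} is not mounted"


def check_free_space(path, min_bytes):
    """Pass if the filesystem holding `path` has at least `min_bytes` free."""
    free = shutil.disk_usage(path).free
    return free >= min_bytes, f"{path} has only {free} bytes free"


def run_checks(checks):
    """Run (name, callable) checks in order, stopping at the first failure.

    Each callable takes no arguments and returns (ok, reason).
    Returns (exit_code, message): 0 if every check passed, 1 otherwise.
    """
    for name, check in checks:
        ok, reason = check()
        if not ok:
            return 1, f"{name} failed: {reason}"
    return 0, "all checks passed"
```

A list of checks might look like [("scratch mounted", lambda: check_mounted("/scratch")), ("tmp space", lambda: check_free_space("/tmp", 10**9))], where the paths and thresholds are placeholders. The returned exit code and message are the kind of thing a cluster manager could use to decide whether to take a node out of service.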
Dnsmasq: It has many features, but my favorite is that you can configure it to read the contents of /etc/hosts and serve that over DNS. This comes in handy a lot, especially when the nodes in your cluster are on a separate network from your larger infra, or when you don’t want to pollute your larger DNS with a thousand compute node names.
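Since the hosts file becomes the source of truth in this setup, it helps to generate it rather than type it. Below is a small sketch that produces hosts-format lines for sequentially numbered compute nodes. The function name, the node prefix, and the addressing scheme are all assumptions for illustration, and a real cluster will likely need to skip addresses that are reserved on its network.

```python
import ipaddress


def hosts_lines(prefix, first_ip, count, width=3):
    """Return /etc/hosts-style lines for `count` sequentially numbered nodes.

    Nodes are named prefix + zero-padded index starting at 1, and addresses
    are assigned consecutively starting at `first_ip`.
    """
    start = ipaddress.ip_address(first_ip)
    return [
        f"{str(start + i)}\t{prefix}{i + 1:0{width}d}"
        for i in range(count)
    ]
```

For example, hosts_lines("node", "10.1.0.1", 3) yields lines for node001 through node003 at 10.1.0.1 through 10.1.0.3. Writing those lines into the hosts file that Dnsmasq reads would make the names resolvable without touching the wider DNS.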
Spack or EasyBuild: While they have some differences, each of these tools will automate the process of building and installing a full HPC application stack quickly and easily. This is very useful when you don’t want to look up the configure options for Open MPI yet again. My choice of which one to use will often depend on the particular situation, but they both work well.
Conman: This handy little service helps manage the serial consoles of large numbers of physical servers. When you give it a list of consoles (e.g., a list of BMCs that support IPMI), it will attach to each of them and tail the output to a file per machine. This is incredibly useful when machines are having trouble on their regular network interfaces and you want to see what’s going on. You can also use the conman command to open an interactive connection to any one of them.
Powerman: From the same folks who brought you Conman! Powerman provides a handy interface for out-of-band power control. It’s quite useful when you don’t actually want to start typing long ipmitool commands.