SRE to Solutions Architect

It’s been about two years since I joined NVIDIA as a Solutions Architect, which was a pretty big job change for me! Most of my previous work was in jobs that could fall under the heading of “site reliability engineering”, where I was actively responsible for the operations of computing systems, but my new job mostly has me helping customers design and build their own systems.

I’m finally starting to feel like I know what I’m doing at least 25% of the time 😉 so I thought this would be a good time to reflect on the differences between these roles and what my past experience brings to the table for my (sort of) new job.

(Just a note: I feel like job titles for ops folks are a fraught topic. My job titles have included things like “Production Engineer”, “HPC Cluster Administrator”, and “HPC/Cloud Systems Engineer”. I tend to self-identify more with the term “sysadmin”, but I’m using “SRE” as the most current term that captures the work I’ve spent a lot of my career doing, where I generally approached ops from a software engineering perspective. Feel free to substitute your job title of choice!)

I spent most of the past 10 years building and running large computing systems. With the exception of ~18 months working on backend storage for a fairly large website, I’ve mostly worked on large high-performance-computing (HPC) clusters. These systems are generally used by researchers and engineers to run simulations and data analysis. The teams I joined were generally responsible for building these clusters, keeping them running, helping the researchers who used them, and making sure they performed well.

In my day-to-day work in SRE (or whatever you call it), I mostly thought about problems like:

  • Are my team’s services operating reliably and predictably, according to our defined metrics?
    • Translated: What’s broken today?! 😉
  • Are our (internal) customers having a good qualitative experience?
  • For any current or recently-past incidents, how can we understand what went wrong and incorporate that into our future development?
  • What major features or other changes are we hoping to release soon? How can we be confident they’ll work correctly and reliably?
  • Are we expecting to have to turn up more capacity or new systems soon? Are we ready to do so?
  • What projects can I pursue to automate anything boring that I have to work on?

My role as a solutions architect is rather different, as I don’t actually have any services I’m responsible for keeping online. Instead, I’m generally working with external customers who use our products in their own production environments. Because I’m focused on HPC and supercomputing, my customers have generally purchased NVIDIA’s hardware products and are operating them in their own datacenters. I’m frequently talking to the SRE teams, but I’m not part of them myself.

In my daily work as a solutions architect, I’m thinking more about questions like:

  • Do my (external) customers have a good understanding of what our products are and how to use them?
    • This may include products they already use, or new products that they may be planning to deploy
  • What are their pain points, and how can I feed that back to the product teams?
    • And also: What new product developments can I provide proactive advice on before they make it to the customer?
  • What new customer deployments are coming up, and how can I help them go smoothly?
  • How are our customers doing running their current clusters, and are they feeling a lot of pain?
  • What tools can I develop, or what content can I write, to help all of the above go well?

On the one hand, I work on a lot of the same problems as a solutions architect as I did in SRE. I still spend a lot of time thinking about the scalability, performance, and reliability of HPC systems. I still care a lot about making sure the systems I help build are useful and usable for researchers.

On the other hand, I’m not so much on the pointy end of these problems anymore. My work is mostly focused on enabling others to run reliable systems, rather than being directly on the hook for them. And while I do help directly manage some internal lab clusters, those systems have very loose SLOs. So in practice I haven’t been on call in about two years.

I do think my experience in SRE has been really important in doing a good job in solutions architecture. I like to think I have a pretty good instinct for systems design at this point, and I can often help identify problems and bottlenecks in early stages. My troubleshooting skills from SRE work are incredibly helpful, as a lot of my work is helping customers understand what the heck is broken on their clusters. And I also find that it really helps our customers to have someone who “speaks the same language” as their SRE teams, especially because I feel like so many vendor relationships neglect reliability concerns in favor of features.

The transition has been really interesting, and I’m still conflicted about which kind of job I prefer. I don’t exactly miss being on call… but I do somewhat miss the more visceral feeling of understanding a running system really well through sheer continuous contact with it. However, I really love helping my customers build cool systems, and I like the satisfaction of helping many different teams do well, versus focusing tightly on a single service.

I’m really enjoying the solutions architect gig right now, but I also wouldn’t be surprised if I ended up doing SRE work directly again at some point.