
SREs on a Plane
What is site reliability engineering? Is there a difference between an SRE and DevOps? And is something like AI Ops just a flight of fancy? In this episode, Senior SRE Candace Sheremeta joins Red Hat CTO Chris Wright for a 10,000 ft view of the role of the SRE in the modern world of operations as we shift from static configurations to code as infrastructure. We'll cover how it's kind of like Passenger 57 but with root cause analysis. So join us to find out more about the people who keep our operations running smoothly and keep us airborne in the hybrid cloud.
Transcript
00:01 - Chris Wright
One of the things I love about my job is that I get to explore the tech and tools that make our lives easier. But easier doesn't always mean simpler. Take the history of flight. Circumnavigating the globe in 80 days isn't as impossible as it seemed back in Jules Verne's day, but air travel has transformed dramatically since then. We've gone from crazy contraptions with heavy flapping wings to sleek jets with autopilot and advanced systems. Likewise, in the world of operations, we're shifting from static configuration to code as infrastructure as we build cloud-based SaaS platforms and AI tools that people depend on every day to keep operations running smoothly. But as these systems get more complex, who keeps us airborne in the hybrid cloud?
00:50 - INTRO ANIMATION
01:00 - Chris Wright
Cloud computing, and more specifically a platform like Kubernetes, is like this idea of a journey around the globe. We have infinite variables to consider. It can seem easy to just hop on a plane and arrive at your destination, but you're not the one flying the plane. There's a pilot in the cockpit with thousands of logged hours, experience with weather patterns, weight distribution, navigation tools, and automated systems. This is essentially the role of the SRE. Site Reliability Engineering is a software development discipline that combines both systems and software engineering principles to optimize the reliability, efficiency, and scalability of complex systems. SRE teams use and build software to manage systems, solve problems, and automate operations tasks. So, to get an idea of the expertise of an SRE, let's talk to one.
01:59 - Candace Sheremeta
Hey Chris, how are you?
02:01 - Chris Wright
Hey, Candace. I've been really digging into the concept of SREs and I wanted to get your viewpoint and understand from your perspective, some of the details around Site Reliability Engineering. So, why don't you just start with the very basics. From your point of view, how would you describe what is an SRE and what do SREs do?
02:21 - Candace Sheremeta
Sure, so SRE is a Site Reliability Engineer and it's basically a DevOps position. So, we have two basic functions behind the role, the development side and the Ops side. So, for the Ops side, that's pretty easy. What we do is we pay attention to the alerts that are coming in from our cluster. So, anything that might be broken on the clusters, we get alerts for those. And then we also look at any sort of customer issues that may be coming in through tickets. And then our development work revolves around a couple of different things. So one, we do feature work where customers ask us to add certain features to the OpenShift Dedicated product and we work on those. And then, of course, we want to do development work to reduce Ops pain. And so, we do a lot of automation and things to help us lower the number of alerts we get on any given day, to help the customer issues that we see, automate those away, things like that.
03:20 - Chris Wright
What happens when something goes wrong? I mean, you're getting all these alerts, there's a lot happening. Can you walk us through that process?
03:30 - Candace Sheremeta
When things go really wrong we call them incidents. And we have an incident response process where basically we get a team of our SREs together who work to resolve the issue. And then we also have a post-mortem process. So, we write up an RCA doc, a Root Cause Analysis, and we go through in a meeting together, post-mortem review meeting, what went wrong, how long it took us to resolve certain things and we talk a lot about, we wanna create action items to either decrease time to resolution for the next time this thing goes wrong or completely automate the problem away altogether.
04:11 - Chris Wright
I think that's an awesome process and the post-mortem or retrospective view. I mean, that's applied in various sort of disciplines, development as well. I'm interested in that part of learning what went wrong, Root Cause Analysis. That itself can be difficult. I've worked on systems where an initial issue creates a flood of alerts and those all become distractions from the real issue.
04:34 - Candace Sheremeta
Absolutely.
04:36 - Chris Wright
I've also worked on systems where we've tried to work on that closed loop remediation. You talked about automating it so it never goes away. I can't help but think, can't we just write some code and AI that does all this magic for you?
04:51 - Candace Sheremeta
It would be really awesome if we could just insert an AI module into our clusters and have all the problems go away. But unfortunately, that's not really how it works. AI isn't magic. It's really humans writing code. And so, what we have to do as SREs is we have to have the problems happen and then look at the problems and say, "What kind of code can we write here to make these problems go away or to make these problems less bad when they happen again?"
05:20 - Chris Wright
I love that mindset of a little bit of experimentation, maybe in some manual work to figure out what's happening and then apply that through all of your experience, develop code, deploy an operator that can take that learning and automate that learning into helping you run production environments. It feels to me, and I know we've been experimenting with how this could work, that you could take that same concept and expand it out to a broader community. And thinking about the open source community development model, how do we apply that to operations?
05:55 - Candace Sheremeta
Absolutely, so that's what we mean when we were talking about operate first. Operate first is the idea that we are taking our own products and we are deploying them at some large scale in order to be able to give feedback to the developers about what it's like to work in operations with those products deployed at a large scale. So, we have a lot of initiatives that help that feedback loop with our OpenShift developers. For example, we are asking our developers to come in and do shadowing with the SRE team so that they can see what it's like to run their product at scale and they can see what it's like to be getting alerts from the clusters and they can see the sorts of requests that we get from customers and things like that. And they can take that information and that knowledge back to their teams and say, "You know, okay, this one alert is really noisy. Let's try to make sure this alert is less noisy for our SRE team. Or, our SRE team wasn't getting alerted at all about this one issue that we think is really important. So, maybe we should put in alerts for these sorts of issues."
07:12 - Chris Wright
I mean, I can say from my own personal experience, having lived more on the developer side, you don't always appreciate the challenges that you're introducing for operations in the code that you're writing. So, I can, I really love that notion of learning from one another and take it a step further, how we can change the systems and improve those systems, learning from the operations experience, informing the developers of the platform. So, this is a community effort. This is how we collaborate and this is the beauty of open source development. Really great conversation. I'm so glad to have an opportunity to learn from you more, a day in the life of an SRE. Thank you so much, Candace.
07:51 - Candace Sheremeta
Thank you so much for having me, it was lovely.
07:55 - Chris Wright
The history of flight is long and storied, and in the beginning it was 100% manual, but over time it's become more and more digitally automated. And now we have autopilot, which largely flies the plane. But when unexpected conditions arise, you're still trusting your life to a human expert. The pilot's feedback and the operations data are vital for ground teams, and this helps to continually optimize systems for safer and more efficient travel for all airlines and all passengers.
08:26 - OUTRO ANIMATION
About the show
Technically Speaking
What’s next for enterprise IT? No one has all the answers, but CTO Chris Wright knows the tech experts and industry leaders who are working on them.
