In talking to other tech companies about how they scope their SRE/DevOps roles, I've realized that the remit of SRE organizations differs substantially across the industry. Many SRE organizations limit their potential by hiring teams to do only the work that keeps their service(s) running, not the work that substantially improves them. Their teams seem stuck: too overwhelmed with the basics to get out of the rut and do more meaningful work.
What I'm dubbing 'Maslow's hierarchy of SRE needs' categorizes the state of a team into the following buckets:
+ physiological health - is the service functioning at all (e.g. not repeatedly hard-down/bleeding revenue)? is the pager quiet enough to get any other work done? are we learning from outages and resolving postmortem action items to avoid repeating the same outages?
+ maintain homeostasis - is it possible to carry out day to day operations (e.g. push code, tolerate machine failures) without excessive manual work? are people automating away manual work?
+ boundaries & objectives - do we have clear scopes for what we're responsible for (e.g. better to be responsible for one thing solidly than many things diffusely), and an agreed-upon SLA/standard that we aspire to achieve?
+ self-awareness - do we know, based on metrics, when we deviate from the standards so we can take corrective action? conversely, this also means we can ignore noise that isn't tied to these metrics, because our monitoring of the things we care about is solid.
+ self-actualization - freedom in time, trust, and ability dimensions to make substantial design improvements to the service (and measure the improvements!)
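One way to make the ordering concrete is a minimal sketch that walks the levels from the bottom up and reports the first one a team has not yet satisfied. The level names come from the list above; the function name `first_unmet_level`, the yes/no self-assessment format, and the example team are assumptions made purely for illustration, not part of any established tool.

```python
from typing import Dict, List, Optional

# Levels in order, most basic need first, named as in the list above.
LEVELS: List[str] = [
    "physiological health",
    "maintain homeostasis",
    "boundaries & objectives",
    "self-awareness",
    "self-actualization",
]


def first_unmet_level(met: Dict[str, bool]) -> Optional[str]:
    """Return the lowest level that is not satisfied, or None if all are.

    A level missing from `met` is treated as unmet.
    """
    for level in LEVELS:
        if not met.get(level, False):
            return level
    return None


# Hypothetical self-assessment: the team has metrics, but no agreed scope.
example_team = {
    "physiological health": True,
    "maintain homeostasis": True,
    "boundaries & objectives": False,
    "self-awareness": True,
    "self-actualization": False,
}

stuck_at = first_unmet_level(example_team)
print(stuck_at)
```

The walk stops at the first gap on purpose: in this framing, a higher level that looks satisfied does not count for much while a more basic one is still unmet.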
You don't get to the later stages of the hierarchy of needs without hiring both systems engineers and software engineers - SRE only works at its best if you have people with both skillsets collaborating. If all you're doing is giving people from pure sysadmin backgrounds a shiny devops title and no other support, you're not going to see results that are meaningfully different from the pure operational model of sysadmin work. If you struggle to name the exceptionally strong coders on your team, you're going to have a lot of trouble with the last step of actually getting core service-level improvements delivered (e.g. improving the service components themselves, instead of just rearranging their relationships). If you don't have a solid product dev-SRE relationship with clear boundaries, it's far too easy to slip into the trap of having all the operational work pushed onto SRE without effort put into reducing the total operational burden.
It's fairly easy to spot a well-functioning organization -- if it's primarily doing work in the self-actualization category, everything lower in the hierarchy is likely to be shipshape. If an organization is stuck earlier in the hierarchy, it needs a great deal of support to reach a fulfilled and functional state. That support takes many forms - upper management backing for principled "no"s and for enforcing good boundaries with product dev, hiring to ensure the team has the right breadth and depth of skillsets, and vision from the team itself to push towards more sophisticated work rather than becoming comfortable just doing operations.
What can you do as the leadership of an engineering organization if you're looking to make sure your SRE team grows to its full potential? First, hire people who are excited about the scaling/performance/reliability challenges that your product development generalists lack expertise in, not just people to do the grungy work you don't want to be doing. Second, make SRE's goal to change the service based on experience running it, rather than just keeping it running. Third, make sure a majority of your SRE team's time is actually spent developing projects and learning new things. Finally, empower your SRE team to take full ownership of the service, including backing their ability to say no to product development.
If you don't do these things, you'll have trouble attracting new talent[1], and your best site reliability engineers will eventually become bored and leave for somewhere they can enjoy self-actualization.
[1] For a potential external hire who wants to be doing work towards the latter steps in the hierarchy, it's a rather risky proposition to join a team that is currently stuck. The root causes of the stuckness are often opaque from outside the organization, and it is just as hard to assess from the outside whether there will be organizational support for making the necessary changes. There's always a great feeling of accomplishment from being empowered to fix a situation and doing so, but it's best to avoid situations where one is set up to fail from the beginning.
I'll also point out that when SWE lower management is directly rewarded for increasing their group's output in ways that decrease the robustness of the system (thus setting off the pager; ignoring the 'physiological health' needs of their SRE group), no amount of hopefulness and positive attitude on the SRE side will help resolve matters.