What is SRE? Everything you Need to Know

Share on:
Site Reliability Engineering

What is SRE?

DevOps has become a global phenomenon, empowering organizations worldwide with far greater levels of efficiency, collaboration, and reliability in creating and releasing code. However, it is not a simple fix; rather, it is made up of several elements and roles which are integral to being able to enjoy the full benefits of the methodology.

A practice that is often used to further optimize DevOps is ‘SRE’. SRE stands for ‘Site Reliability Engineering’, a way to guarantee the stability of the development (or ‘Dev’) environment. The concept of SRE is older than DevOps itself, having been originally pioneered by Google. They are often treated as being separate, with ‘DevOps vs. SRE’ being a common topic within the practitioner community.

The reality is that, while there are several differences between DevOps and SRE, there are also a fair few similarities. According to the man behind it, Ben Treynor, SRE is “what happens when a software engineer is tasked with what used to be called operations.”

In other words, it helps those working in development teams to take the considerations of operations into account. In practice, this helps these teams increase accountability, reliability, and innovation, balancing the drive for releases with the need for stability at the point of deployment.

SRE Foundation (SREF)’ is a qualification that quantifies Google’s concept, providing essential tools, principles, and best practices for site reliability engineers. It was developed with insight from practicing SRE specialists and equips candidates to create significant tangible value. SRE Practitioner (SREP) goes into greater detail on advanced SRE elements, such as chaos engineering, building secure and reliable systems, and implementing SRE practices.

So, what exactly is SRE, and how does it work?

How does SRE work, and how does it relate to DevOps?

Historically, development and operations (Dev and Ops) teams have struggled with setting boundaries. While the Dev side wants to keep churning out new ideas, features, and so on, Ops is more concerned with how releases are managed and deployed. In short, one side wants to shake things up, while the other wants to keep things stable.

However, developers cannot be expected to fully take on all responsibilities for operations – there simply aren’t enough hours in the day! Instead, deployment, configuration, monitoring, and other elements of operations must be overseen by a specialist.

This is where SRE comes in. SRE teams understand how code is managed and deployed and can provide this expertise to development teams, fulfilling many of the roles associated with Ops. In essence, they are able to govern developers according to a set formula based on aspects like product performance and system reliability, removing a great deal of confusion and competition within and between different teams.

A key part of how this works is the idea of a ‘Service Level Agreement (SLA)’ for each service. This defines how reliable the service must be for end-users and clients in order to be viable.

SREs will also establish an ‘Error Budget’: the maximum threshold for errors and outages for a service. If this is exceeded, or if the project consistently operates at a level below the quality level defined by the SLA, new releases will be frozen until the number of errors is reduced to an acceptable level. As well as establishing a greater level of clarity for measuring performance, this also gives site reliability engineers and developers a strong incentive to minimize potential errors and collaborate closely in order to do so.

Product performance is also an essential element of what ideas get green-lit. For example, if a team works according to an error budget of 0.1%, and no errors are made, this budget can then be spent elsewhere, such as on new project ideas.

Two other aspects of this are ‘Service Level Objectives (SLOs)’ and ‘Service Level Indicators (SLIs)’. An SLO is a defined individual metric that contributes toward an SLA. For example, a company could decide that average response times are key to a service’s reliability; and so, response time will be an SLO. An SLI, meanwhile, is a measurement of a service that can be compared with an SLO. The key difference is that an SLI is not what a metric needs to be but what it currently is.

SREs will typically spend about 50% of their time on operations work, with the rest spent on writing code and building systems to boost performance and operational efficiency. They are known for having the ability to move between teams, departments, and even businesses as necessary. This has the advantage of creating additional bridges through which different sides can share insight on production, products, coding, and so on. Development teams will typically spend about 5% of their time working in operations, keeping them more connected to products and how they are performing while also conditioning them to predict and avoid common operations issues in advance.

How can SRE help my business?

SRE training courses go into great detail on the tools, principles, and practices that help organizations release market-leading code efficiently and reliably. Successful candidates become qualified for SRE jobs within their company, allowing them to support developers as necessary while also investing time in major pipeline improvement. SRE courses also cover popular SRE tools, how to become a site reliability engineer, and so on.

Some of the biggest benefits of SRE training include:

  • Making computer systems and pipelines more stable, predictable, and scalable
  • Improved communication and understanding between development and operations teams
  • A proactive approach to finding and repairing problems
  • Clear success metrics for SRE and DevOps
  • A higher number of speedy, high-quality releases

It is worth noting that the Site Reliability Engineering Foundation (SREF) syllabus also covers how to measure SRE in the context of a wider business. For example, students will learn to track aspects like site health, as well as the financial costs of lost productivity or service downtime. This can be highly valuable to other departments, such as customer support, marketing, and sales, all of which will be able to see the value of increased reliability.

Is Studying SRE Worth the Cost?

Site reliability engineer salaries are often a large draw for potential candidates. SRE engineer jobs are challenging and highly rewarding, with a growing number of companies utilizing SRE practices as they look to make code pipelines more efficient, constructive, and reliable.

According to Payscale, the average base salary for a site reliability engineer is:

  • $76,000 to $158,000 in the US
  • £37,000 to £92,000 in the UK

It goes without saying that experience is vital for reaching high-level SRE jobs. However, SRE certification can certainly make a candidate stand out and help get their foot in the door. This is particularly true for larger companies that put greater faith in official qualifications, as well as candidates that rely on recruiters to unlock SRE job interviews.

Related course: