DevOps has become a global phenomenon, empowering organizations worldwide with far greater levels of efficiency, collaboration, and reliability in creating and releasing code. However, it is not a simple fix; it is made up of several elements and roles, each of which is integral to realizing the full benefits of the methodology.
A practice that is often used to further optimize DevOps is ‘site reliability engineering (SRE)’, which applies software engineering to operations work in order to keep services stable and reliable. The concept of SRE is older than DevOps itself, having been originally pioneered by Google. Even so, the two are often treated as being separate, with ‘DevOps vs. SRE’ being a common topic within the practitioner community.
The reality is that there are a fair few similarities between SRE and DevOps. According to the man behind it, Ben Treynor, SRE is “what happens when a software engineer is tasked with what used to be called operations.”
In other words, it helps those working in development teams to take the considerations of operations into account. In practice, this helps these teams increase accountability, reliability, and innovation, balancing the drive for releases with the need for stability at the point of deployment.
SRE Foundation (SREF) is a qualification that codifies Google's concept, providing essential tools, principles, and best practices for site reliability engineers. It was developed with insight from practicing SRE specialists and equips site reliability engineers to deliver tangible value.
So, what exactly is SRE, how does it work, and what is its relationship to DevOps?
How does SRE work, and how does it relate to DevOps?
Historically, development and operations (Dev and Ops) teams have struggled with setting boundaries. While the Dev side wants to keep churning out new ideas, features, and so on, Ops is more concerned with how releases are managed and deployed. In short, one side wants to shake things up, while the other wants to keep things stable.
However, developers cannot be expected to fully take on all responsibilities for operations - there simply aren’t enough hours in the day! Instead, deployment, configuration, monitoring, and other elements of operations must be overseen by a specialist.
This is where SRE comes in. SRE teams understand how code is managed and deployed and can provide this expertise to development teams, fulfilling many of the roles associated with Ops. In essence, they are able to govern developers according to a set formula based on aspects like product performance and system reliability, removing a great deal of confusion and competition within and between different teams.
A key part of how this works is the idea of a 'Service Level Agreement (SLA)' for each service. This defines how reliable the service must be for end-users and clients in order to be viable. SREs will also establish an ‘Error Budget’: the maximum threshold for errors and outages for a service. If this is exceeded, or if the project consistently operates at a level below the quality level defined by the SLA, new releases will be frozen until the number of errors is reduced to an acceptable level. As well as establishing a greater level of clarity for measuring performance, this also gives site reliability engineers and developers a strong incentive to minimize potential errors and collaborate closely in order to do so.
Product performance is also an essential element of which ideas get green-lit. For example, if a team works to an error budget of 0.1% and no errors occur, the unspent budget can then be spent elsewhere, such as on new project ideas.
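The error budget logic described above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `error_budget_report`, its parameters, and the request counts are all assumptions made for the example, and it treats the 0.1% budget as a share of total requests.

```python
def error_budget_report(total_requests: int,
                        failed_requests: int,
                        budget_fraction: float = 0.001) -> dict:
    """Summarize error budget usage for one measurement window.

    budget_fraction is the share of requests allowed to fail
    (0.001 corresponds to the 0.1% budget in the example above).
    """
    # Number of failures the budget tolerates in this window.
    allowed_failures = total_requests * budget_fraction
    # A positive value is unspent budget; a negative value is an overspend.
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        # Hypothetical policy: freeze new releases once the budget is exceeded.
        "freeze_releases": failed_requests > allowed_failures,
    }


if __name__ == "__main__":
    # Assumed figures: one million requests, 400 of which failed.
    report = error_budget_report(total_requests=1_000_000, failed_requests=400)
    print(report)
```

In this sketch, a window with 400 failures stays within budget and releases continue, while a window with 1,500 failures would trigger the freeze.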
Two other aspects of this are ‘Service Level Objectives (SLOs)’ and ‘Service Level Indicators (SLIs)’. An SLO is a defined target for an individual metric that contributes towards an SLA. For example, a company could decide that average response times are key to a service’s reliability, and so set a target response time as an SLO. An SLI, meanwhile, is a measurement of a service that can be compared with an SLO. The key difference is that an SLI is not what a metric needs to be but what it currently is.
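To make the distinction concrete, the short Python sketch below compares a measured SLI against an SLO target. The 300 ms target, the sample response times, and the function names are hypothetical values chosen for illustration, not figures from any real service.

```python
from statistics import mean

# Hypothetical SLO: average response time should not exceed 300 ms.
SLO_MAX_AVG_RESPONSE_MS = 300.0


def average_response_sli(response_times_ms: list[float]) -> float:
    """The SLI: what the metric currently is, measured from real requests."""
    return mean(response_times_ms)


def meets_slo(sli_value: float,
              slo_target: float = SLO_MAX_AVG_RESPONSE_MS) -> bool:
    """Compare the measured SLI with the SLO, i.e. what the metric needs to be."""
    return sli_value <= slo_target


if __name__ == "__main__":
    # Assumed sample of observed response times, in milliseconds.
    samples = [120.0, 250.0, 310.0, 180.0]
    sli = average_response_sli(samples)
    print(f"SLI (current average): {sli:.1f} ms, SLO met: {meets_slo(sli)}")
```

Note that one slow request (310 ms) does not breach the objective here, because the SLO in this sketch is defined on the average rather than on individual requests.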
SREs will typically spend about 50% of their time on operations work, with the rest spent on writing code and building systems to boost performance and operational efficiency. They are known for having the ability to move between teams, departments, and even businesses as necessary. This has the advantage of creating additional bridges through which different sides can share insight on production, products, coding, and so on. Development teams will typically spend about 5% of their time working in operations, keeping them more connected to products and how they are performing while also conditioning them to predict and avoid common operations issues in advance.
How can SREF help my business?
The SRE Foundation (SREF) certification teaches a number of set principles and practices designed to help organizations reliably and economically scale services. Students will be capable of freeing developers from a number of responsibilities while also enabling huge improvements in productivity and efficiency.
By having staff follow SRE principles, you can make your organization’s computer systems far more stable, predictable, and scalable - all essential elements of software engineering, development, and operations. You will also have vastly improved communication and understanding between development and operations teams. Site reliability engineers will be proactive in finding and repairing problems, with clear processes and metrics to provide context for their work. All of this will add up to a far larger proportion of releases being made speedily and successfully.
It is worth noting that SREF also covers how to track KPIs in the context of a wider business. For example, students will learn to track aspects like site health, as well as the financial costs of lost productivity or service downtime. This can be highly valuable to other departments, such as customer support, marketing, and sales, all of which will be able to see the value of increased reliability.
Studying SREF with Good e-Learning
SRE Foundation (SREF) is a standalone certification designed for both experienced and prospective site reliability engineers. These typically come from a number of backgrounds, though most tend to be system administrators or software engineers. The certification itself does not have any prerequisites, though it is highly advised that prospective students are at least familiar with SRE as a concept.
Good e-Learning is an award-winning online training provider, as well as a Trusted Education Partner of the DevOps Institute. Our course examines the evolution of SRE, as well as its direction for the future. Students are provided with a variety of tools, methods, and practices for optimizing stability and reliability, such as how to understand and track service level objectives (SLOs).
Good e-Learning offers a number of DevOps & SRE courses.
Each of our courses comes with a number of online training assets created alongside highly experienced site reliability engineers and DevOps practitioners. These include instructor-led videos, interactive slides, and several case studies to help students understand how to apply their new knowledge. We also offer a practice exam simulator, as well as technical and course support. Learners can even access the courses via our mobile app for learning on the move.