Most organizations want to move quickly, but aren’t willing to trade uptime or quality for speed. The desire for innovation puts pressure on developers to shorten their release cadence, which can leave errors undetected. When the worlds of development and operations collide, organizational boundaries can create roadblocks. Two common approaches to these roadblocks are DevOps and Site Reliability Engineering (SRE), both of which aim to deliver better software. While DevOps wants to reduce barriers to production, SRE looks for automation that can discover failure early.
In this post, we’ll discuss the role of the Site Reliability Engineer and how they mediate the differing needs in an organization to enable a smoother, more resilient release cycle.
What Is a Site Reliability Engineer?
Developers often see operations as a source of frustration and a hindrance. Operations sometimes see developers as too hasty with their releases, at the expense of system reliability. DevOps attempts to solve this by removing the barriers between the two groups. By contrast, Site Reliability Engineers (SREs) serve as the go-between for developers and operations. They look for opportunities to deliver services more efficiently and more reliably, in both cases through automation.
Their job is to foster a culture of collaboration to improve the release cycle. As Patrick Hill, a site reliability engineer for Atlassian, explains, SREs mediate the age-old power struggle between developers and operations teams by removing “the debate over what can be launched and when.”
Core Functions of an SRE
SREs stabilize the release cycle, enhance product support protocols, and achieve overall product reliability. Some common SRE activities are to:
- “Greenlight” or approve releases
- Implement monitoring and alerting systems
- Develop processes to optimize product support and on-call rotations
- Conduct post-incident reviews to identify opportunities for improvement in the release cycle
These activities can lead to improved software quality, something of particular interest to API development teams.
Benefits of an SRE Approach
This role accomplishes much more than serving as a middleman. The SRE brings several additional benefits:
Provide Visibility into System Health
Before you can improve reliability, you need to be able to measure it. SREs increase the observability of the system by implementing monitoring and alerting capabilities. They can then use this data to set service KPIs that track metrics such as system downtime.
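As a rough illustration of turning monitoring data into a KPI, the sketch below computes availability over a reporting window from a list of recorded outages. It is a minimal example under stated assumptions, not a prescribed method: the Outage record, the function names, and the 99.9% target are all hypothetical, and outages are assumed not to overlap one another.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """One period of downtime (hypothetical record shape, not from any specific tool)."""
    start: datetime
    end: datetime


def availability_percent(outages, window_start, window_end):
    """Return the percentage of the reporting window during which the service was up.

    Assumes the outages do not overlap each other.
    """
    window = window_end - window_start
    if window <= timedelta(0):
        raise ValueError("window_end must be after window_start")
    downtime = timedelta(0)
    for outage in outages:
        # Clip each outage to the window so partial overlaps count correctly.
        start = max(outage.start, window_start)
        end = min(outage.end, window_end)
        if end > start:
            downtime += end - start
    return 100.0 * (1 - downtime / window)


def breaches_target(availability, target_percent=99.9):
    """True when measured availability is below the target (99.9 is an assumed example)."""
    return availability < target_percent


if __name__ == "__main__":
    window_start = datetime(2024, 1, 1)
    window_end = datetime(2024, 1, 31)  # a 30-day reporting window
    outages = [Outage(datetime(2024, 1, 10, 12, 0), datetime(2024, 1, 10, 13, 12))]
    measured = availability_percent(outages, window_start, window_end)
    print(f"availability: {measured:.3f}%  breach: {breaches_target(measured)}")
```

In practice the outage records would come from whatever monitoring system the team already uses, and the target would be agreed with the business rather than hard-coded.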
Contribute to the Product Roadmap
When it comes to planning the product roadmap, the SRE plays a valuable role. They have the most insight into production environments and can offer a unique perspective on how reliability issues affect the business. Using this input, business leaders can make data-driven decisions to prioritize the roadmap.
Remove Roadblocks
Software delivery includes important workflows, such as code reviews, that sit between development and deployment. The SRE identifies the unnecessary roadblocks and sources of friction that keep operations and developers from working effectively.
Does Your Organization Need an SRE?
Site Reliability Engineers can benefit any organization. However, the size of your organization may influence how you build SRE functions into your development cycle.
Smaller Teams
Smaller teams with less complex systems may embed a dedicated SRE in the development team. The benefit of this approach is that the developers and the SRE can form a close-knit relationship that is free of the friction associated with organizational boundaries.
Enterprises
Larger teams with more complex systems may want to create a separate SRE team. Given the number of systems in large organizations, these team members can work cross-functionally throughout the organization. In that way, the SRE team can prevent the backlog that sometimes plagues larger organizations.
The SRE’s role as a diplomat between developers and operations can help increase efficiency in the release cycle. With proper monitoring and post-incident improvement, SREs can help any software—including your APIs—be more stable and reliable.