What is Site Reliability Engineering (SRE)? Short answer: it’s an idea pioneered by Google, with the goal of introducing higher reliability and accountability for software products while leaving ample room for innovation. Long answer: let’s start at the beginning.

The origins of SRE

According to Google’s definition, “SRE is what you get when you treat operations as if it’s a software problem”. The goal is to keep critical systems running despite outages, buggy code and human error. In Google’s particular case, these services include their search engine, Gmail, YouTube and other platforms boasting millions of users that have to be kept in working condition even during major natural disasters.

In many more traditional companies, system and network administration are the tools through which service stability is achieved and maintained. Google, however, chose to shift focus to software. They hired people from both software and system backgrounds to build a mixed team capable of sharing knowledge and learning together. They use software to solve problems that are traditionally resolved by hand and automate a lot of the work.

The difference between SRE and DevOps

Both DevOps and Site Reliability Engineering are popular, current disciplines. They are similar in their goals and in certain practices, but it’s not too difficult to draw a clear line between them. DevOps was born as a response to the lack of proper communication between developers and system administrators. Code would often be written without consideration for how it would run in its intended environment, and administrators would be left on their own to maintain the system. The result was a somewhat adversarial relationship between development and operations teams, which in turn skewed their priorities away from their companies’ business needs.
DevOps bridged that communication gap through new practices and culture. It’s important to keep in mind that the DevOps movement is not prescriptive: it defines what good cooperation between development and operations should look like, but not how to achieve it. Site Reliability Engineering shares the same philosophy, but is more practical in approach. It provides particular ways to measure and achieve reliability for software applications. To see the similarities between DevOps and SRE, take a look at how Google contrasts the pillars of DevOps against SRE practices.

In a nutshell, SRE improves upon DevOps by providing practical solutions rather than general goals and ideas. The principles of DevOps include breaking through organizational silos, accepting failure and learning from mistakes, implementing change gradually, taking advantage of automation, and measuring results. SRE provides the tools and practices necessary to achieve all this:

  • shared ownership between development and operations teams, achieved through a uniform set of tools and techniques,
  • introducing rules for managing the risk of new releases,
  • reducing the cost of failure so that teams can iterate rapidly,
  • automating as much as possible and shifting the team’s focus to the most valuable work,
  • using software to measure and manage system availability.

It’s obvious, therefore, that SRE and DevOps are not two competing approaches. Their shared goal is to break through walls preventing proper communication within organizations, and make software delivery faster, more efficient and more reliable.
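
To make "using software to measure and manage system availability" and "rules for managing the risk of new releases" a little more concrete, here is a minimal Python sketch. It assumes one common SRE practice that the list above doesn't spell out: the team agrees on an availability target and treats the failures permitted under that target as an "error budget" that releases may spend. The function names, the request counts and the 99.9% target are all hypothetical and chosen only for illustration.

```python
def availability(successful: int, total: int) -> float:
    """Fraction of requests that were served successfully."""
    if total == 0:
        return 1.0  # no traffic means nothing failed
    return successful / total


def error_budget_remaining(successful: int, total: int, target: float) -> float:
    """Share of the error budget still unspent (negative if overspent).

    `target` is the agreed availability target, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if total == 0:
        return 1.0  # no traffic, so none of the budget has been used
    allowed_failures = (1.0 - target) * total
    actual_failures = total - successful
    return 1.0 - actual_failures / allowed_failures


def release_allowed(successful: int, total: int, target: float) -> bool:
    """Hypothetical release rule: ship only while some budget is left."""
    return error_budget_remaining(successful, total, target) > 0.0


if __name__ == "__main__":
    # Assumed numbers: one million requests this month, 500 of them failed.
    print(availability(999_500, 1_000_000))
    print(error_budget_remaining(999_500, 1_000_000, 0.999))
    print(release_allowed(999_500, 1_000_000, 0.999))
```

A rule like this gives developers and operations a shared, measurable basis for deciding when to ship and when to focus on stability. Real setups would usually take these counts from a monitoring system rather than hard-coded values.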

What can SRE do for you?

Ecommerce businesses can benefit from SRE in obvious ways. Even a few minutes of downtime can mean huge losses when it happens during a major marketing campaign, for example. And the infrastructure of most online stores needs to support thousands of users browsing high-quality product photos, and maybe even videos or 3D animations. Their operations teams end up putting out fires, some of them caused by buggy code. With a Site Reliability Engineering team in place, many issues could be prevented, and others solved more quickly thanks to good communication and a systemic approach.
SaaS products are another major candidate. For them, uninterrupted uptime and a smooth user experience matter because their users’ ability to do their jobs may hang in the balance. SaaS applications often have to take a modular approach to increase product flexibility and support a number of different user groups, and as a result they often struggle with scaling and maintenance costs. With an SRE approach, in which developers and operations specialists join forces to push automation and optimization, many of these issues could be solved from the get-go.
Finally, Fintech and Medtech depend on the faultlessness of their products. When the stakes involve people’s lives or livelihoods, it’s obviously crucial to invest in the best possible approach. Among SRE’s many benefits are improved app security and a lower risk of serious incidents. Because SRE teams don’t maintain an artificial separation between software and infrastructure, they can build better security and reliability measures.

Is SRE the next big thing?

With the growing buzz around Site Reliability Engineering, the approach might seem like a passing fad. But many companies, particularly enterprise-level ones, are picking up SRE to lower risk and better allocate resources. The number of job openings for Site Reliability Engineers is a clear sign of SRE’s growing popularity. Apple, Twitter and Dropbox are among the employers who’ve jumped on the SRE bandwagon. So are Hulu, Netflix, Amazon and Heroku. These tech giants can hardly be accused of a lack of caution in dealing with their massive infrastructures.
Any company for which reliability is a core value should consider taking on the SRE approach. After all, it’s not an additional cost (likely the opposite!) and it’s not a new trend (Google began establishing SRE in the early 2000s). But like any framework for building and managing teams, SRE should be implemented in a way that fits the company’s individual needs and culture.

If you’re considering using SRE to boost your product’s maintainability while optimizing resource use, let us help. iRonin’s SRE and DevOps experts have the experience to help you build the best approach for your business.