Site Reliability Engineering (SRE) is a lesser-known methodology that expands on the DevOps approach. Pioneered by Google, SRE helps businesses lower operational costs, automate and monitor their infrastructure better, fix communication issues and speed up product development. Becoming a Site Reliability Engineer requires an eclectic mix of development and SysOps skills, as well as soft skills that improve communication. We’ve spoken to Daniel, our resident Site Reliability Engineer, about his work and SRE as a trend, to find out when a company should use the approach and what it takes to follow the SRE philosophy.

The difference between DevOps and SRE

To avoid confusion, let us define the two basic terms. Site Reliability Engineering is an approach meant to fully utilize the resources of a company that has software development at the core of its business. It means deploying software developers to solve operations problems - and doing so successfully. With SRE, the traditional divide between SysOps and development is a thing of the past: the approach gives companies practical solutions and tools to build a new product support culture.

DevOps is a methodology with very similar goals. It could be said that DevOps is part of SRE, as SRE uses many elements of the approach, such as heavy automation. The main difference between them is that DevOps is less concrete, a little less practical - it offers teams an ideology and goals to work towards, but doesn’t set them up with all the necessary tools.

How SRE can help your company grow

The SRE approach is meant to minimise the costs, time and effort involved in the software development process. It assumes that any solution used in one project should be reusable in future projects (so it should be modular), and not hardcoded for use in only one situation. SRE understands that mistakes happen - we need to learn from them to avoid repeating them, and in this way, turn them into value.
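To make the "modular, not hardcoded" idea more concrete, here is a minimal Python sketch. It is only an illustration, not a description of iRonin's actual tooling: the names `DeployConfig`, `build_config` and `DEFAULT_REPLICAS`, as well as the environment names and replica counts, are assumptions made up for this example. The point is that project and environment are passed in as parameters, so the same code can serve any project.

```python
from dataclasses import dataclass

# Hypothetical defaults per environment (illustrative values only).
DEFAULT_REPLICAS = {"staging": 1, "production": 3}


@dataclass(frozen=True)
class DeployConfig:
    project: str
    environment: str
    replicas: int

    @property
    def service_name(self) -> str:
        # Derived from parameters instead of being hardcoded per project.
        return f"{self.project}-{self.environment}"


def build_config(project: str, environment: str) -> DeployConfig:
    """Build a deployment config for any project and known environment."""
    if environment not in DEFAULT_REPLICAS:
        raise ValueError(f"unknown environment: {environment}")
    return DeployConfig(project, environment, DEFAULT_REPLICAS[environment])


# The same function is reused for two unrelated (made-up) projects.
shop = build_config("shop", "production")
blog = build_config("blog", "staging")
print(shop.service_name, shop.replicas)
print(blog.service_name, blog.replicas)
```

A hardcoded version would bake one project's name and settings into the script, and the next project would need a copy of it with different values. Here, a new project only needs a new function call.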

SRE relies on measuring and monitoring the entire project infrastructure, as well as each component within it. When something goes wrong, we should be able to see when and where it happened, and how to fix it. At iRonin, we have a fully automated and monitored environment, in accordance with SRE’s principles. All this means that we don’t need to constantly backtrack whenever we face an issue, and that our resources can be funnelled into growth.
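As a rough illustration of monitoring each component separately, the following Python sketch runs one check per component and records when the check ran, where the failure is and a hint about the cause. It is a simplified sketch under our own assumptions, not a real monitoring stack: `CheckResult`, `run_checks`, `failing` and the two example checks are hypothetical names, and the checks are stubs rather than real probes.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    component: str        # where the problem is
    healthy: bool
    checked_at: datetime  # when the check ran
    detail: str           # hint about the cause


def run_checks(checks: Dict[str, Callable[[], None]]) -> List[CheckResult]:
    """Run every check; a check signals failure by raising an exception."""
    results = []
    for name, check in checks.items():
        now = datetime.now(timezone.utc)
        try:
            check()
            results.append(CheckResult(name, True, now, "ok"))
        except Exception as exc:
            results.append(CheckResult(name, False, now, str(exc)))
    return results


def failing(results: List[CheckResult]) -> List[CheckResult]:
    """Keep only the components that failed their check."""
    return [r for r in results if not r.healthy]


# Stub checks standing in for real probes (assumptions for this example).
def check_database() -> None:
    pass  # pretend the database answered


def check_queue() -> None:
    raise RuntimeError("queue depth above threshold")


results = run_checks({"database": check_database, "queue": check_queue})
for r in failing(results):
    print(f"{r.checked_at.isoformat()} {r.component}: {r.detail}")
```

In a real setup, the stubs would be replaced by actual probes and the results would feed dashboards and alerts. The structure stays the same: every component reports on its own, so a failure points straight at its source.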

A Site Reliability Engineer often fulfils an advisory role: they come in as a consultant, help development teams choose the best tools for a job, lay down the foundations for operations support, and share their knowledge with the developers, who then use it to shape the product according to their needs. This mentor can work with several teams in a company and be there for them when particularly complex issues occur. This means the problems can be fixed quickly by an expert, while the development team enjoys a learning opportunity.

The problems that SRE solves

First of all, there’s no need to hire a separate DevOps or SRE engineer for every project team. You can have one SRE team that helps all development teams understand all internal processes integral to their work. The Site Reliability Engineer participates in the projects fully at first, but their ultimate goal is to empower developers to take control of the infrastructure and introduce necessary changes without having to rely on anyone else. This can have a huge positive impact on development time and it means a very efficient use of resources.

But doesn’t DevOps accomplish the same goals? Not entirely. The DevOps approach advocates having a DevOps expert for every development team, as a liaison between developers and operations experts. While this is a working solution (and a good one for many companies, as we’ll state again below), it’s simply less efficient and more expensive. Instead of using existing resources to the fullest (i.e. empowering developers to help themselves), the DevOps approach relies on adding a new member to the development team.

How to introduce SRE at your company, and why

While a great approach in many cases (we use it ourselves!), SRE is not always easy to bring into an established company structure. Some teams will push back against this sort of change, and others might lack the skills and experience to apply SRE properly. If you do decide to use the SRE approach, however, you’ll enjoy a number of advantages.

Getting started:

  • SRE requires a highly motivated, knowledge-hungry team of developers ― sometimes, the fact that there’s no dedicated DevOps expert in a team is a problem. Developers don’t always want to learn the SRE approach and the skills they need to use it, or they don’t have the time to do so as their projects keep them extremely busy. In these situations, they’re going to want a dedicated DevOps expert.
  • SRE needs both hard and soft skills ― it’s not easy to find people who are skilled DevOps experts as well as effective mentors and teachers. The communication and leadership skills required for the role of a Site Reliability Engineer make finding good candidates a challenge.

Advantages:

  • Automation ― there’s no need to waste time on manual tasks - at iRonin, we write our code once and it runs in many environments. We don’t need to redo the same tasks and can focus fully on giving each project our best.
  • Monitoring and analysis of infrastructure ― thanks to our SRE approach, we know our infrastructure well and understand why failures occur, so we can fix them faster and prevent them from occurring in the future.
  • Lower costs of delivering software ― there’s no need to have a separate DevOps expert for each project. In our case, we take the time to introduce the SRE approach to everyone, and then the duties are dispersed within the development team. It’s very efficient.
  • Breaking through silos ― SRE, like DevOps, removes communication issues between development and operations teams.
  • Fast delivery ― thanks to SRE, we can improve our speed of development and the productivity of iRonin’s development teams. Developers never need to wait for someone else to solve a problem for them, and they don’t waste time on communicating the issue - they simply deal with it themselves.

Should every company use SRE?

Short answer: no. As we mentioned above, sometimes DevOps is the better option, particularly if your company only has one project team. In that case, the effort of finding a Site Reliability Engineer might not be worth it. Companies that juggle several projects at once are the ones that can benefit the most from introducing SRE into their processes.

Companies always jump at the chance to lower their costs - and SRE has a lot of potential for that. For the people becoming Site Reliability Engineers, meanwhile, it’s an opportunity to develop professionally. Daniel, our Site Reliability Engineer, believes that nothing teaches you more than working on all kinds of different projects. He says that if you get stuck in one environment, you have to put serious effort into finding learning opportunities. With a wide variety of projects, on the other hand, learning is the natural outcome.

Daniel began his adventure with SRE because he was interested in the approach from a theoretical perspective. He had heard a lot about it, particularly from Google’s materials, and he wanted to know what SRE is and what it looks like in practice. Once he tried it, he found great enjoyment in it. He liked not being assigned to a single project and being able to work with several teams. He can interact with many different environments, which gives him room for growth and opportunities to learn.

Conclusions

While hiring a Site Reliability Engineer may not be the best solution for every business, it’s certainly a worthwhile approach at companies that have multiple software development processes at the core of their business. It’s also always a good idea to apply SRE’s principles of knowledge-sharing and high efficiency. We use SRE at iRonin and it serves us well. If you have doubts about whether SRE is the right approach for you, don’t hesitate to ask - we’ll be happy to help you find the answer.

Site Reliability Engineering helps companies reduce costs, shorten development time and use resources efficiently. iRonin uses this approach in our commercial projects, which greatly benefits our clients.