Site reliability engineering
Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to operations, with the goal of creating ultra-scalable and highly reliable software systems. Ben Treynor, founder of Google's Site Reliability Team, defines it as "what happens when a software engineer is tasked with what used to be called operations."[1]
History
Site reliability engineering was created at Google around 2003, when Ben Treynor was hired to lead a team of seven software engineers to run a production environment. The team was tasked with making Google's sites run smoothly, efficiently, and more reliably. Early on, the scale of Google's systems required the company to come up with new paradigms for managing systems larger than any that had existed before, while continuously introducing new features and maintaining a very high-quality end-user experience. The SRE footprint at Google is now larger than 1500 engineers, with each product supported by anywhere from one SRE to a small team of SREs. The SRE processes that have been honed over the years are being adopted by other, mainly large-scale, companies that are starting to implement this paradigm. Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, and Oracle have all put together SRE teams.
Roles
A site reliability engineer (SRE) will ideally spend up to 50% of their time on "ops"-related work such as handling issues, on-call duty, and manual intervention. Since the software system that an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal SRE candidate is a coder who also has operational and systems knowledge and likes to whittle down complex tasks.
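The 50% split can be checked with simple arithmetic. The following is a minimal sketch, assuming time is logged in hours per category; the category names, the `ops_share` and `exceeds_ops_cap` helpers, and the sample week are hypothetical illustrations, not part of any published SRE tooling.

```python
# Hypothetical category labels for "ops" work, taken from the examples in the text.
OPS_CATEGORIES = {"issues", "on-call", "manual intervention"}
OPS_CAP = 0.5  # the "up to 50%" ceiling on ops work described above


def ops_share(hours_by_category):
    """Return the fraction of logged hours spent on ops work."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    ops = sum(h for c, h in hours_by_category.items() if c in OPS_CATEGORIES)
    return ops / total


def exceeds_ops_cap(hours_by_category, cap=OPS_CAP):
    """True when ops work takes more than the allowed share of time."""
    return ops_share(hours_by_category) > cap


# Made-up sample week: 24 ops hours out of 40 logged hours.
week = {
    "issues": 10,
    "on-call": 8,
    "manual intervention": 6,
    "automation": 10,
    "new features": 6,
}

print(f"ops share: {ops_share(week):.0%}, over cap: {exceeds_ops_cap(week)}")
```

In this sample the ops share is above the ceiling, which would suggest shifting time back toward development tasks such as automation.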
DevOps vs SRE: DevOps is a practice, named around 2008, that encompasses automation of manual tasks, continuous integration, and continuous delivery. It applies to a wide range of companies, whereas SRE might be considered a subset of DevOps that calls for additional skill sets.
See also
- Cloud computing
- Data center
- High availability software
- Operations management
- Operations, administration and management
- Reliability engineering
- System administration
References
- [1] Are SRE the next data scientists?, TechCrunch, Mar 2, 2016, Donald Fischer
- General
- Site Reliability Engineering, O'Reilly Media, April 2016, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, ISBN 978-1-4919-2909-4
- The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2, Thomas Limoncelli, ISBN 032194318X