Site Reliability Engineering

Site Reliability Engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations."^[1]

Roles

A site reliability engineer (SRE) spends up to 50% of their time on "ops"-related work such as handling issues, on-call duty, and manual intervention. Because the software system that an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the remaining time, at least 50%, on development tasks such as new features, scaling, or automation. The ideal site reliability engineer candidate is either a software engineer with a good administration background or a highly skilled system administrator with knowledge of coding and automation.^[2]
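The 50% cap on ops work described above can be tracked with simple bookkeeping. The following is a minimal sketch in Python, assuming work is logged in hours under category labels; the labels, the constant, and the function names are hypothetical and are not taken from the cited sources.

```python
from typing import Mapping

# Hypothetical category labels for "ops" work, mirroring the examples in the text.
OPS_CATEGORIES = {"issues", "on-call", "manual intervention"}
OPS_CAP = 0.5  # ops work should take at most 50% of an SRE's time


def ops_fraction(hours_by_category: Mapping[str, float]) -> float:
    """Return the share of logged hours that went to ops work."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    ops = sum(h for c, h in hours_by_category.items() if c in OPS_CATEGORIES)
    return ops / total


def exceeds_ops_cap(hours_by_category: Mapping[str, float], cap: float = OPS_CAP) -> bool:
    """Return True when ops work takes more than the allowed share of time."""
    return ops_fraction(hours_by_category) > cap


if __name__ == "__main__":
    week = {"on-call": 12, "issues": 10, "automation": 18}
    print(ops_fraction(week), exceeds_ops_cap(week))
```

In this made-up week, 22 of 40 logged hours are ops work, so the check reports that the cap is exceeded.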

DevOps vs SRE

Coined around 2008, DevOps is a philosophy of cross-team empathy and business alignment. It is also associated with practices such as automating manual tasks, continuous integration, and continuous delivery. SRE and DevOps share the same foundational principles. Many view SRE (as cited in the Google SRE book) as a "specific implementation of DevOps with some idiosyncratic extensions." SREs, being developers themselves, naturally bring solutions that help remove the barriers between development teams and operations teams.

DevOps defines five key pillars of success:

Reduce organizational silos
Accept failure as normal
Implement gradual changes
Leverage tooling and automation
Measure everything

SRE satisfies the DevOps pillars as follows:^[3]

Reduce organizational silos
- SRE shares ownership with developers to create shared responsibility^[4]
- SREs use the same tools that developers use, and vice versa
Accept failure as normal
- SREs embrace risk^[5]
- SRE quantifies failure and availability in a prescriptive manner using Service Level Indicators (SLIs) and Service Level Objectives (SLOs)^[6]
- SRE mandates blameless post mortems^[7]
Implement gradual changes
- SRE encourages developers and product owners to move quickly by reducing the cost of failure^[5]
Leverage tooling and automation
- SREs have a charter to automate menial tasks (called "toil") away^[8]
Measure everything
- SRE defines prescriptive ways to measure values^[9]
- SRE fundamentally believes that systems operation is a software problem
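
The SLI and SLO terms in the list above can be made concrete with a little arithmetic. The following is a minimal sketch in Python, assuming an availability SLI defined as successful requests divided by total requests, and an error budget defined as one minus the SLO target. The function names and the sample numbers are illustrative and are not taken from the cited sources.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully (assumed SLI definition)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent.

    1.0 means the budget is untouched, 0.0 means it is fully spent,
    and a negative value means the SLO has been missed.
    """
    budget = 1.0 - slo_target  # allowed failure fraction
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    spent = 1.0 - sli  # observed failure fraction
    return 1.0 - spent / budget


if __name__ == "__main__":
    sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
    print(sli, error_budget_remaining(sli, slo_target=0.999))
```

With these sample numbers the measured availability is 99.95% against a 99.9% objective, so roughly half of the error budget remains. A team might use a value like this to decide whether to keep shipping changes or to slow down.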

References

[1] Fischer, Donald (2 March 2016). "Are SRE the next data scientists?". TechCrunch.
[2] Jones, Chris; Underwood, Todd; Nukala, Shylaja (June 2015). "Hiring Site Reliability Engineers". ;login: 40 (3): 35-39. https://www.usenix.org/system/files/login/articles/login_june_07_jones.pdf.
[3] Google Cloud Platform (1 March 2018). "What's the Difference Between DevOps and SRE? (class SRE implements DevOps)". https://www.youtube.com/watch?v=uTEL8Ff1Zvk.
[4] "Google - Site Reliability Engineering". https://landing.google.com/sre/book/chapters/communication-and-collaboration.html.
[5] "Google - Site Reliability Engineering". https://landing.google.com/sre/book/chapters/embracing-risk.html.
[6] "Google - Site Reliability Engineering". https://landing.google.com/sre/book/chapters/service-level-objectives.html.
[7] "Google - Site Reliability Engineering". https://landing.google.com/sre/book/chapters/postmortem-culture.html.
[8] "Google - Site Reliability Engineering". https://landing.google.com/sre/book/chapters/eliminating-toil.html.
[9] "Google - Site Reliability Engineering". https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html.

General

Site Reliability Engineering: How Google Runs Production Systems, O'Reilly Media, April 2016, Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, ISBN 978-1-491-92912-4
The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2, Thomas Limoncelli, ISBN 032194318X

External links

Google - Site Reliability Engineering interview with Ben Treynor
