Short description: Use of software engineering practices for IT
Site reliability engineering (SRE) is a set of principles and practices that applies aspects of software engineering to IT infrastructure and operations.[1] The goal of SRE is to create highly reliable and scalable software systems. SRE is closely related to DevOps but differs slightly from it.[2][3][4]
History
The field of site reliability engineering originated at Google with Ben Treynor Sloss,[5][6] who founded a site reliability team after joining the company in 2003.[7] In 2016, Google employed more than 1,000 site reliability engineers.[8] The concept subsequently spread into the broader software development industry, and other companies began to employ site reliability engineers.[9] The position is more common at larger web companies, as small companies often do not operate at a scale that would require dedicated SREs.[9] Organizations that have adopted the concept include Airbnb, Dropbox, IBM,[10] LinkedIn, Netflix,[8] and Wikimedia.[11] According to a 2021 report by the DevOps Institute, 22% of organizations in a survey of 2,000 respondents had adopted the SRE model.[12][13]
Definition
Site reliability engineering, as a job role, may be performed by individual contributors or by teams. Within a broader engineering organization, they are responsible for some combination of the following: system availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.[14] Site reliability engineers often have backgrounds in software engineering, system engineering, or system administration.[15] Focuses of SRE include automation, system design, and improvements to system resilience.[15]
Site reliability engineering, as a set of principles and practices, can be performed by anyone. Though everyone should contribute to good practices, as occurs in security engineering, a company may eventually hire specialists and engineers for the job.[citation needed]
Site reliability engineering has also been described as a specific implementation of DevOps, although they differ slightly. SRE focuses specifically on building reliable systems, whereas DevOps focuses more broadly.[2][3][4] Although they have different focuses, some companies have rebranded their operations teams to SRE teams with little meaningful change.[9]
Principles and practices
There have been multiple attempts to define a canonical list of site reliability engineering principles. Although consensus is lacking, most definitions include the following characteristics:[1][16]
Automation or elimination of anything repetitive in a cost-effective way.
Avoidance of pursuing more reliability than is strictly necessary. Defining what is necessary is a practice in itself (see the list of practices below).
Systems designed with a bias toward the reduction of risks to availability, latency, and efficiency.
Observability—as in, the ability to ask arbitrary questions about a system without having to know ahead of time what to ask.[17]
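The observability principle can be illustrated with a minimal sketch. It assumes that each request is recorded as a structured event carrying many fields, so that a question nobody planned for can be answered afterwards by filtering and grouping on any of them. The `count_by` helper, the `is_unhealthy` predicate, and the field names are hypothetical and are not taken from any particular tool.

```python
from collections import Counter
from typing import Any, Callable, Iterable

Event = dict[str, Any]

# Hypothetical structured events: one record per request, with fields
# captured up front even though no specific question was planned for them.
EVENTS: list[Event] = [
    {"route": "/checkout", "status": 500, "region": "eu-west", "build": "a1f3", "duration_ms": 912},
    {"route": "/checkout", "status": 200, "region": "us-east", "build": "9c2e", "duration_ms": 120},
    {"route": "/search", "status": 200, "region": "eu-west", "build": "a1f3", "duration_ms": 1340},
    {"route": "/search", "status": 200, "region": "us-east", "build": "9c2e", "duration_ms": 95},
    {"route": "/checkout", "status": 503, "region": "eu-west", "build": "a1f3", "duration_ms": 40},
]


def count_by(
    events: Iterable[Event],
    field: str,
    where: Callable[[Event], bool] = lambda event: True,
) -> Counter:
    """Count the events matching `where`, grouped by the value of `field`."""
    return Counter(event.get(field) for event in events if where(event))


def is_unhealthy(event: Event) -> bool:
    """Assumed definition of a bad request: a server error or a slow response."""
    return event["status"] >= 500 or event["duration_ms"] > 800


# The same data answers questions chosen only after the fact:
# are unhealthy requests concentrated in one build, region, or route?
for dimension in ("build", "region", "route"):
    print(dimension, dict(count_by(EVENTS, dimension, where=is_unhealthy)))
```

Because the grouping field is chosen at query time, no dashboard or counter has to be defined before the question arises. Predefined metrics, by contrast, can only answer the questions they were built for.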
Site reliability engineering practices also vary widely, but the following are commonly implemented, at least in part:
Toil management as the implementation of the first principle outlined above.
Defining and measuring reliability goals—SLIs, SLOs, and error budgets.
Non-Abstract Large Scale Systems Design (NALSD) with a focus on reliability.
Designing for and implementing observability.
Defining, testing, and running an incident management process.
Capacity planning.
Change and release management, including CI/CD.
Chaos engineering.
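The relationship between SLIs, SLOs, and error budgets can be shown with a short worked example. The sketch below assumes a request-based SLI (the fraction of good events in a measurement window) and treats the error budget as the number of bad events the SLO tolerates in that window. The function and field names are hypothetical, and other definitions (for example, time-based ones) are also in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloReport:
    sli: float               # measured fraction of good events in the window
    error_budget: float      # number of bad events the SLO allows in the window
    budget_remaining: float  # fraction of the budget left (negative if overspent)


def evaluate_slo(good_events: int, total_events: int, slo_target: float) -> SloReport:
    """Compare a request-based SLI against an SLO target for one window."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= good_events <= total_events:
        raise ValueError("good_events must be between 0 and total_events")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    sli = good_events / total_events
    allowed_bad = (1.0 - slo_target) * total_events  # the error budget
    actual_bad = total_events - good_events
    budget_remaining = 1.0 - actual_bad / allowed_bad
    return SloReport(sli, allowed_bad, budget_remaining)


# Example: a 99.9% target over one million requests tolerates about 1,000
# failures; 500 observed failures consume roughly half of that budget.
report = evaluate_slo(good_events=999_500, total_events=1_000_000, slo_target=0.999)
print(f"SLI: {report.sli:.4%}")
print(f"Error budget: {report.error_budget:.0f} bad events")
print(f"Budget remaining: {report.budget_remaining:.0%}")
```

A negative `budget_remaining` indicates that the window has already exceeded its budget. Some teams treat this as a signal to slow releases in favor of reliability work.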
Implementations
Site reliability engineering teams engage with other teams in their companies, and with SRE principles and practices, in various ways. The following is a high-level overview of common SRE team implementations:[18]
Kitchen Sink, a.k.a. “Everything SRE”
The scope of services or workflows covered is usually unbounded.
Infrastructure
These teams focus on the reliability of behind-the-scenes systems that help make other teams' jobs more efficient. They are often confused with "Platform" teams or "Platform Operations" teams. Infrastructure SRE teams may pair up with one or more platform engineering teams, but they differ in that Infrastructure SRE teams focus on performing most, if not all, of the work described in the principles and practices listed above. Platform teams tend to focus on building the platform, and while reliability is desirable, it is not their sole priority.
Tools
These teams focus on tools that measure, maintain, and improve system reliability, such as Nagios Core or Prometheus.
Product or application
These SRE teams are dedicated to a specific product or application. Large companies tend to staff several of them.
Embedded
These are usually solo SRE practitioners or pairs staffed within a software engineering team, where they apply most of the principles and practices described above.
Consulting
These teams consult on how to implement SRE principles and practices. They are usually staffed by experienced SREs who have worked on teams in one or several of the implementations above. SREs on external-facing consulting teams are often called "Customer Reliability Engineers". They rarely, if ever, change the customer's configuration or code.
Large companies that have adopted SRE tend to combine the implementations described above, often with multiple teams of the same implementation. For example, a company may have several Product/application SRE teams to meet the specific demands of individual products, together with an Infrastructure SRE team that pairs up with a Platform engineering group to meet the reliability goals of a platform shared by those products.
Industry
The USENIX organization has held an annual SREcon conference since 2014 for site reliability engineers in the industry and also holds regional conferences with similar themes.[19]
See also
Chaos engineering
Cloud computing
Data center
Disaster recovery
High availability software
Infrastructure as code
Operations, administration and management
Operations management
Reliability engineering
System administration
References
"Evaluating where your team lies on the SRE spectrum" (in en). https://cloud.google.com/blog/products/devops-sre/evaluating-where-your-team-lies-on-the-sre-spectrum/.
Beyer, Betsy, ed (2016). Site Reliability Engineering: How Google Runs Production Systems. Sebastopol, CA: O'Reilly Media. ISBN 978-1-4919-5118-7. OCLC 945577030. https://sre.google/sre-book/table-of-contents/.
Vargo, Seth; Fong-Jones, Liz (March 1, 2018). What's the Difference Between DevOps and SRE? (class SRE implements DevOps) (Video). Google.
"What is SRE? - SRE Explained - AWS" (in en-US). https://aws.amazon.com/what-is/sre/.
Hill, Patrick. "Love DevOps? Wait until you meet SRE" (in en). https://www.atlassian.com/incident-management/devops/sre.
"What is SRE?" (in en). https://www.redhat.com/en/topics/devops/what-is-sre.
Treynor, Ben (2014). "Keys to SRE". https://www.usenix.org/conference/srecon14/technical-sessions/presentation/keys-sre.
Fischer, Donald (March 2, 2016). "Are site reliability engineers the next data scientists?" (in en-US). https://social.techcrunch.com/2016/03/02/are-site-reliability-engineers-the-next-data-scientists/.
Gossett, Stephen (June 1, 2020). "What Is a Site Reliability Engineer? What Does an SRE Do?" (in en). https://builtin.com/software-engineering-perspectives/site-reliability-engineer.
"Site Reliability Engineering" (in en). IBM. November 12, 2020. https://www.ibm.com/cloud/learn/site-reliability-engineering.
"SRE - Wikitech" (in en). https://wikitech.wikimedia.org/wiki/SRE.
Oehrlich, Eveline (May 4, 2021). "What it takes to be a site reliability engineer" (in en). Micro Focus. https://techbeacon.com/enterprise-it/what-it-takes-be-site-reliability-engineer.
Treynor, Ben. "In Conversation" (Interview). Interviewed by Niall Murphy. Google Site Reliability Engineering.
"The 7 SRE Principles [And How to Put Them Into Practice]" (in en). https://www.blameless.com//blog/sre-principles.
"Learn about observability | Honeycomb" (in en). https://docs.honeycomb.io/getting-started/learning-about-observability/.
"SRE at Google: How to structure your SRE team" (in en). https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-organized-and-how-to-get-started/.
Further reading
Limoncelli, Tom; Chalup, Strata R.; Hogan, Christina J. (September 2014). The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services. 2. Upper Saddle River, NJ: Addison-Wesley. ISBN 978-0133478549. OCLC 891786231. https://www.worldcat.org/oclc/891786231.
Site Reliability Engineering: How Google Runs Production Systems. O'Reilly. 2016. ISBN 978-1491929124. https://archive.org/details/sitereliabilitye0000unse.
Blank-Edelman, David N., ed (2018). Seeking SRE: Conversations About Running Production Systems at Scale (1 ed.). Sebastopol, CA: O'Reilly. ISBN 978-1491978863. OCLC 1052565720. https://www.worldcat.org/oclc/1052565720.
Beyer, Betsy; Murphy, Niall; Kawahara, Kent; Rensin, David; Thorne, Stephen (2018). The Site Reliability Workbook: Practical Ways to Implement SRE. O'Reilly. ISBN 978-1492029502.
Welch, Nat (2018). Real-World SRE: The Survival Guide for Responding to a System Outage and Maximizing Uptime. Packt. ISBN 978-1788628884.
Adkins, Heather; Beyer, Betsy; Blankinship, Paul; Lewandowski, Piotr; Oprea, Ana; Stubblefield, Adam (2020). Building Secure and Reliable Systems: Best Practices for Designing, Implementing, and Maintaining Systems. O'Reilly. ISBN 978-1-4920-8312-2. OCLC 1129470292.
Rosenthal, Casey; Jones, Nora (2020). Chaos Engineering: System Resiliency in Practice. O'Reilly. ISBN 978-1492043867.
External links
Awesome Site Reliability Engineering resources list
How they SRE resources list
SRE Weekly weekly newsletter devoted to SRE
SRE at Google landing page for learning more about SRE in Google
Komodor K8s Reliability learning center with resources for SREs working with Kubernetes
Original source: https://en.wikipedia.org/wiki/Site reliability engineering.