Ronan 4ed03e7125
Add definitions for SLI, SLO, SLA, error budget and toil (#9077)
* add definitions for SLI, SLO, SLA, error budget and toil

* add credit

* Add credits section

* add google sre book under questions
2024-02-02 15:23:54 +02:00

Site Reliability Engineering

SRE Questions

What is an SLI (Service-Level Indicator)?

An SLI is a measurement used to assess the actual performance or reliability of a service. It serves as the basis for defining SLOs.

Examples:

  • Request latency
  • Processing throughput
  • Request failures per unit of time
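
As a rough illustration, the three example SLIs above could be computed from a batch of request records as follows. This is a minimal sketch: the `Request` record shape, the function names, and the nearest-rank percentile method are assumptions made for the example, not something prescribed by the source.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    latency_ms: float
    ok: bool


def latency_percentile(requests, pct):
    """Request latency SLI: nearest-rank percentile, in milliseconds.

    `pct` is an integer percentile such as 50 or 95.
    """
    ordered = sorted(r.latency_ms for r in requests)
    # Nearest-rank: the smallest sample with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def throughput_per_second(requests, window_seconds):
    """Processing throughput SLI: requests handled per second in the window."""
    return len(requests) / window_seconds


def failures_per_minute(requests, window_seconds):
    """Failure-rate SLI: failed requests per minute in the window."""
    failed = sum(1 for r in requests if not r.ok)
    return failed / (window_seconds / 60)
```

In practice these values usually come from a monitoring system rather than from raw records, but the arithmetic is the same.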

Read more: Google SRE Handbook


What is an SLO (Service-Level Objective)?

An SLO is a target value or range of values for a service level that is measured by an SLI.

Example: a 99% target, measured over 30 days, for a specific collection of SLIs.

It is also worth noting that the SLO marks the point at which the service is reliable enough: there is no requirement to be more reliable than the SLO demands, because doing so can delay the rollout of new features.
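
To make the relationship between an SLI and an SLO concrete, here is a small sketch that aggregates daily counts over a 30-day window and compares the resulting SLI against a 99% target. The function names and the `(good, total)` tuple layout are assumptions made for this example.

```python
def window_sli(daily_counts):
    """Availability SLI over a window.

    `daily_counts` is a list of (good_events, total_events) tuples,
    one per day (hypothetical layout). With no traffic at all the
    SLI is treated as 1.0, since nothing failed.
    """
    good = sum(g for g, _ in daily_counts)
    total = sum(t for _, t in daily_counts)
    return good / total if total else 1.0


def meets_slo(sli, slo_target):
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# 30 days at 995 good requests out of 1000 per day: SLI = 0.995
healthy_month = [(995, 1000)] * 30
# 29 ordinary days plus one bad day: SLI = 29610 / 30000 = 0.987
bad_day_month = [(990, 1000)] * 29 + [(900, 1000)]
```

Here `meets_slo(window_sli(healthy_month), 0.99)` holds, while the second month misses the target even though 29 of its 30 days look fine individually.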

Read more: Google SRE Handbook


What is an SLA (Service-Level Agreement)?

An SLA is a formal agreement between a service provider and its customers, specifying the expected service quality and the consequences of not meeting it.

SREs don't typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions.

Read more: Google SRE Handbook


What is an Error Budget?

An Error Budget represents the acceptable amount of downtime or errors a service can experience while still meeting its SLO.

An error budget is 1 minus the SLO of the service. A 99.9% SLO service has a 0.1% error budget.

If our service receives 1,000,000 requests in four weeks, a 99.9% availability SLO gives us a budget of 1,000 errors over that period.
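
The arithmetic above can be sketched in a few lines. The helper names are hypothetical, and `Fraction` is used here only to avoid floating-point rounding when subtracting values such as 99.9% from 100%.

```python
from fractions import Fraction


def error_budget_fraction(slo_percent):
    """Error budget as a fraction: 1 minus the SLO.

    `slo_percent` is passed as a string such as "99.9" so that it is
    parsed exactly rather than as a binary float.
    """
    return 1 - Fraction(slo_percent) / 100


def allowed_errors(total_requests, slo_percent):
    """Number of failed requests the budget permits, rounded down."""
    return int(total_requests * error_budget_fraction(slo_percent))


def budget_remaining(total_requests, slo_percent, observed_errors):
    """Errors still available in the period; negative means the budget is overspent."""
    return allowed_errors(total_requests, slo_percent) - observed_errors
```

With the numbers from the text, `allowed_errors(1_000_000, "99.9")` evaluates to 1000, and after 250 observed errors `budget_remaining` reports 750 left for the period.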

The error budget is a mechanism for balancing innovation and stability. If the SRE team cannot enforce the error budget, the whole system breaks down.

Read more: Google SRE Handbook


What is Toil?

Toil is the kind of work that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.

If you can automate a task, you should probably automate it.

Automation significantly reduces toil. Time invested in automation is work with lasting value: once built, it keeps paying off and scales with minimal adjustment as your system grows.

Read more: Google SRE Handbook