end of friday
This commit is contained in:
parent
46c836a284
commit
6e97d37535
71
Part_2.md
@ -253,10 +253,81 @@ An SLI is an indicator of how things are at some particular point in time. Are t
### Making the Most of Risk

#### Setting Error Budgets

In SRE, an error budget is a good thing, and one of the unifying principles between developers and operations.

_"A quantitative measurement shared between the product and SRE teams to balance innovation and stability"_

The process for defining an error budget:

1. Management buys into the internal SLO
1. The SLO is used to determine the amount of uptime expected for that particular quarter
1. A monitoring service measures the actual uptime
1. The difference between the expected SLO and the actual uptime is calculated
1. If there is sufficient time left within the error budget, the new release is pushed forward

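The steps above can be sketched as a simple release gate. This is a minimal illustration, not a prescribed implementation: the function names `remaining_budget` and `release_allowed` are hypothetical, and uptime is assumed to be reported as a fraction between 0 and 1.

```python
def remaining_budget(slo: float, measured_uptime: float) -> float:
    """Return the unspent error budget as a fraction of the period.

    slo and measured_uptime are fractions, e.g. 0.998 for 99.8%.
    A negative result means the budget has been exceeded.
    """
    budget = 1.0 - slo              # downtime the SLO permits
    spent = 1.0 - measured_uptime   # downtime actually observed
    return budget - spent


def release_allowed(slo: float, measured_uptime: float) -> bool:
    """Hypothetical gate: allow a release only while budget remains."""
    return remaining_budget(slo, measured_uptime) > 0


# Example values are assumptions chosen for illustration.
print(release_allowed(0.998, 0.9995))  # uptime above the SLO
print(release_allowed(0.998, 0.9970))  # uptime below the SLO
```

In practice the threshold for "sufficient" budget is a policy decision; the sketch simply treats any remaining budget as enough.
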
Error budgets may seem like risky business, so why not avoid risk at all costs? One of the key DevOps pillars is that failure is unavoidable, and therefore so is risk. SREs manage service reliability largely by managing risk.

- Balances innovation and reliability

  An error budget provides a common incentive that allows both product development and SREs to focus on finding the right balance between new features and service availability.

- Manages release velocity

  As long as the system's SLOs are met, releases can continue.

- Developers oversee their own risk

  When the error budget is large, product developers can take more risks because they have more budget to spend. When the budget is nearly drained, product developers themselves will push for more testing and slow down their release velocity, because they don't want to risk exhausting the error budget and stalling their launch completely.

- What happens if the error budget is exceeded:
  - Typically, releases are temporarily halted
  - System testing and development are expanded
  - Which will improve overall performance

**Error Budget**

Error budget = 100% - SLO value

e.g.

SLO == 99.8%

100% - 99.8% = 0.2%

How will the 0.2% be applied over time?

0.2% == 0.002

0.002 * 30 days/month * 24 hours/day * 60 minutes/hour = 86.4 minutes/month

> This may not seem like much, but this is actual downtime.

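The arithmetic above can be reproduced for any SLO with a few lines of Python. This is a minimal sketch; the function name `error_budget_minutes` and the 30-day default are assumptions carried over from the worked example, not part of any standard tool.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Return the allowed downtime in minutes for a period of `days` days.

    slo_percent is the SLO expressed as a percentage, e.g. 99.8.
    """
    budget_fraction = (100.0 - slo_percent) / 100.0  # e.g. 0.002 for 99.8%
    minutes_in_period = days * 24 * 60               # 43,200 for 30 days
    return budget_fraction * minutes_in_period


print(round(error_budget_minutes(99.8), 1))  # 86.4
```

Raising the SLO to 99.9% halves the budget to roughly 43.2 minutes per 30-day month, which shows how quickly each extra "nine" shrinks the room for failure.
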
What about global services?

- Time-based error budgets are not valid because downtime is almost never universal. Typically an outage will only occur in one part of a system at a time, so it's better to define availability in terms of request success rate
- This is referred to as aggregate availability, and the formula is a little different
- Availability = Successful Requests / Total Requests
- e.g. a system that serves 1M requests per day with 99.9% (three nines) availability can serve up to 1,000 errors and still hit its target for that day

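The request-based formula can be sketched the same way. The helper names `availability` and `allowed_errors` below are hypothetical, and the request counts are the illustrative figures from the example above rather than real measurements.

```python
def availability(successful_requests: int, total_requests: int) -> float:
    """Aggregate availability = successful requests / total requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


def allowed_errors(total_requests: int, slo_percent: float) -> int:
    """Number of failed requests the SLO tolerates for this request volume.

    round() is used to absorb floating-point noise in the percentage math.
    """
    return round(total_requests * (100.0 - slo_percent) / 100.0)


total = 1_000_000
budget = allowed_errors(total, 99.9)           # 1000 failed requests allowed
observed = availability(total - 400, total)    # assume 400 failures today
print(budget, observed >= 0.999)
```

Counting failed requests rather than minutes lets a partial outage in one region consume only its share of the budget.
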
##### Error budgets, what are they good for?

1. Releasing new features
   - Top use by the product team
1. Expected system changes
   - Rolling out enhancements; it's good to know you are covered should something go wrong
1. Inevitable failures in networks, etc.
1. Planned downtime
   - e.g. taking the entire system offline to implement a major upgrade
1. Risky experiments
1. Unforeseen circumstances (unknown unknowns), e.g. a global pandemic!

#### Defining and Reducing Toil

Toil: _"Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows"_

### Generating SRE Metrics

#### Monitoring Reliability