Commit 9f33d9d605 (parent 00c61824c0), Alex Soul, 2021-02-18 11:02:32 +00:00
Big Picture - What is SRE?
- Easier to review
- Easier to rollback
**Reduce costs of failures**
- Limited "canary" rollouts
- Impact fewest users
- Automate where possible for further cost reduction
- Leverage Tooling and automation
- Reduce manual tasks
- The heart of the CI/CD pipelines
- Fosters speed & consistency
**Automate this years job away**
-------------------------------------
- Measure Everything
- Critical gauge of success
- CI/CD needs full monitoring
- Synthetic, proactive monitoring
Valid - Known bad events are excluded from the SLI e.g. 400 http
* 3-5 per user journey
* Too many increase difficulty for operators
* Can lead to contradictions
<br>
2. Reduce complexity
* Not all metrics make good SLIs
* Increased response time
* Many false positives
<br>
3. Prioritize Journeys
* Select most valuable to users
* Identify user-centric events
<br>
4. Aggregate similar SLIs
* Collect data over time
* Turn into a rate, average, or percentile
<br>
5. Bucket to distinguish response classes
* Not all requests are the same
* Requesters may be human, background apps or bots
* Combine (or "bucket") for better SLIs
<br>
6. Collect data at load balancer
* Most efficient method
* Closer to the user's experience
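Step 4 above (aggregating raw measurements into a rate or percentile) can be sketched in a few lines. This is a minimal illustration with made-up request data; the variable names and thresholds are ours, not from the notes:

```python
import statistics

# Hypothetical raw measurements: (latency_ms, http_status) per request.
requests = [(120, 200), (95, 200), (310, 200), (88, 500), (150, 200),
            (200, 200), (75, 200), (430, 200), (105, 200), (99, 200)]

# Rate-style SLI: fraction of requests that succeeded (5xx = failure).
successes = sum(1 for _, status in requests if status < 500)
success_rate = successes / len(requests)

# Percentile-style SLI: 90th-percentile latency over the window.
latencies = [ms for ms, _ in requests]
p90 = statistics.quantiles(latencies, n=10)[-1]

print(f"success rate: {success_rate:.1%}")  # 90.0%
print(f"p90 latency: {p90:.0f} ms")
```

In practice these aggregates would be computed by your monitoring system over a sliding window rather than over a Python list.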
SLOs are tied directly to SLIs
- e.g. SLI <= SLO
or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9% (3x 9's), 99.99% (4x 9's)
SLI - Metrics over time which detail the health of a service
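The two comparison forms above (simple `SLI <= SLO` and the bounded variant) can be expressed directly. A minimal sketch; the function names and example values are illustrative, not from the notes:

```python
def meets_simple_slo(sli: float, slo: float) -> bool:
    """Simple form: the SLI must stay at or below the objective
    (e.g. an error-rate SLI)."""
    return sli <= slo

def meets_bounded_slo(sli: float, lower: float, upper: float) -> bool:
    """Bounded form: (lower bound <= SLI <= upper bound)."""
    return lower <= sli <= upper

# Illustrative values only:
print(meets_simple_slo(0.0008, 0.001))  # True: 0.08% errors vs 0.1% budget
print(meets_bounded_slo(180, 50, 200))  # True: 180 ms within [50, 200] ms
```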
### Understanding SLAs
_"We've determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLOs, and now we need to describe what happens if we don't maintain that availability with an SLA"_
_"An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain"_
- 4x 9's for Instance uptime in multiple zones
- 99.5% for uptime for a single instance
- 4x 9's for load balancing uptime
<br>
- If these aren't met, and the customer meets its obligations, then the customer could be eligible for financial credits
- Clear definitions for language
SLIs drive SLOs, SLOs inform the SLA
- How does the SLO inform the SLA; example:
We want our SLO to be at or under 200ms, and therefore we want to set our SLA at a higher value e.g. 300ms, and beyond that, "hair on fire ms". You want to set your SLA significantly different to the SLO because you don't want to be consistently setting your customers' hair on fire. Moreover, your SLAs should be routinely achievable because you are working towards your objective (SLO) constantly.
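The 200ms/300ms example can be sketched as a simple classifier. The thresholds come from the text; the function name and zone labels are made up for illustration:

```python
SLO_MS = 200  # internal objective from the example
SLA_MS = 300  # contractual threshold, deliberately looser than the SLO

def classify_latency(latency_ms: float) -> str:
    if latency_ms <= SLO_MS:
        return "within SLO"
    if latency_ms <= SLA_MS:
        return "SLO missed, SLA intact"  # eating margin, no penalty yet
    return "SLA breached"  # "hair on fire": consequences apply

print(classify_latency(150))  # within SLO
print(classify_latency(250))  # SLO missed, SLA intact
print(classify_latency(350))  # SLA breached
```

The gap between the two thresholds is the safety margin that keeps routine SLO misses from immediately becoming contractual breaches.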
To summarize:
SLA
- Incentivizes minimum level of service
- Looser than corresponding objectives
An SLI is an indicator of how things are at some particular point in time. Are things good or bad right now? If our SLI doesn't always confidently tell us that, then it's not a good SLI. An SLO asks whether that indicator has been showing what we want it to for most of the time period that we care about. Has the SLI met our objective? In particular, have things been the right amount of both good and bad? An SLA is our agreement of what happens if we don't meet our objective. What's our punishment for doing worse than we agreed we would? The SLI is like the speed of a car: it's travelling at some particular speed right now. The SLO is the speed limit at the upper end and the expected travel time for your trip at the lower end. You want to go approximately some speed over the course of your journey. The SLA is like getting a speeding ticket because you drove too fast, or drove too slowly.
### Making the Most of Risk
SLO == 99.8%
How will the 0.2% be applied over time?
0.2% == 0.002 (0.2% written as a decimal e.g. 1% == 0.01, 10% == 0.1 etc)
<br>
0.002 * 30 day/month * 24 hours/day * 60 minutes/hour = 86.4 minutes/month
> This may not seem like much, but this is actual downtime.
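The arithmetic above generalizes to any SLO. A small sketch; the helper name is ours, and it assumes a 30-day month as the text does:

```python
def monthly_downtime_minutes(slo: float, days_per_month: int = 30) -> float:
    """Allowed downtime per month for a time-based error budget."""
    error_budget = 1.0 - slo  # e.g. 1 - 0.998 = 0.002
    return error_budget * days_per_month * 24 * 60

print(round(monthly_downtime_minutes(0.998), 2))   # 86.4 minutes/month
print(round(monthly_downtime_minutes(0.999), 2))   # 43.2
print(round(monthly_downtime_minutes(0.9999), 2))  # 4.32 (4x 9's)
```

Note how quickly the budget shrinks: each extra 9 cuts the allowed downtime by a factor of ten.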
What about global services?
- Time-based error budgets are not valid because downtime is almost never universal. Typically an outage will only occur in one part of a system at a time, so it's:
- Better to define availability in terms of request success rate
- Referred to as aggregate availability, and the formula is a little different
- Availability = Successful Requests / Total Requests
- e.g. a system that serves 1M requests per day with 99.9% (3x 9's) availability can serve up to 1000 errors and still hit its target for that day
> 999000/1000000 = 0.999 == 99.9%
<br>
So, if we take A = S/T, then S = A*T, and change our target availability to 99.8%
<br>
100% - 99.8% = 0.2%, 0.2% (in decimal) == 0.002
<br>
Allowed errors (based on target availability) = 0.002 * 1,000,000 = 2,000 errors, so Successful Requests = 1,000,000 - 2,000 = 998,000
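The request-based budget above can be written as a helper too; this is a sketch matching the worked example, with a function name of our choosing:

```python
def allowed_errors(total_requests: int, target_availability: float) -> int:
    """Error budget in requests: (1 - A) * T, rearranged from A = S / T."""
    return round(total_requests * (1.0 - target_availability))

total = 1_000_000
for target in (0.999, 0.998):
    errors = allowed_errors(total, target)
    print(f"{target:.1%}: up to {errors} errors, "
          f"{total - errors} successful requests")
```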
##### Error budgets, what are they good for?
1. Releasing new features
- Top use by the product team
2. Expected system changes
- roll out enhancements, good to know you are covered should something go wrong
3. Inevitable failure in networks, etc
4. Planned downtime
1. e.g. take the entire system offline to implement a major upgrade
5. Risky experiments
6. Unforeseen circumstances (unknown unknowns) e.g. Global pandemic!
#### Defining and Reducing Toil
Why monitor?
2x Different Types of Monitoring
| White-Box | vs | Black-Box |
| -------------------------------------------------------- | --- | ------------------------------------------------------------- |
| * Metrics exposed by the internals of the system | | * Testing externally visible behaviour as a user would see it |
| * Focus on predicting problems that encroach on your SLO | | * Symptom-oriented, active problems |
| * Heavy use recommended | | * Moderate use, for critical issues |
| * Best for detecting imminent issues | | * Best for paging of incidents |
Metrics - Numerical measurements representing attributes and events