Typos, formatting
Part_2.md
@@ -319,19 +319,19 @@ So, if we take A = S/T, S = A*T and change our target availability to 99.8%

100% - 99.8% = 0.2%, 0.2% (in decimal) == 0.002

<br>

Successful Requests (really this is allowed errors based on the target availability): 0.002 * 1,000,000 = 2,000 errors, so 1,000,000 - 2,000 = 998,000 successful requests

<br>
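
The same calculation, as a quick sketch (the function and variable names are illustrative, not from the original notes):

```python
def error_budget(target_availability: float, total_requests: int) -> tuple[int, int]:
    """Return (allowed_errors, required_successes) for a request-based SLO."""
    budget_fraction = 1.0 - target_availability           # e.g. 1 - 0.998 = 0.002
    allowed_errors = int(budget_fraction * total_requests)
    return allowed_errors, total_requests - allowed_errors

# 99.8% availability over 1,000,000 requests -> (2000, 998000)
print(error_budget(0.998, 1_000_000))
```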
##### Error budgets, what are they good for?

1. Releasing new features
   1. Top use by the product team
2. Expected system changes
   1. Roll out enhancements, knowing you are covered should something go wrong
3. Inevitable failures in networks, etc.
4. Planned downtime
   1. e.g. take the entire system offline to implement a major upgrade
5. Risky experiments
6. Unforeseen circumstances (unknown unknowns), e.g. a global pandemic!

#### Defining and Reducing Toil
@@ -357,7 +357,7 @@ Toil Reduction Benefits

3x Top Tips for Reducing Toil

1. Identify toil - Make sure you're differentiating it from overhead or actual engineering
2. Estimate the time to automate - Make sure the benefits outweigh the cost (see the sketch after this list)
3. Measure everything, including context switching, e.g. the time it takes you to switch to a new task and become involved in it
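
For tip 2, a rough break-even check is usually enough to tell whether automation will pay for itself. A minimal sketch, with purely illustrative numbers and names (nothing here comes from the notes):

```python
def automation_pays_off(minutes_per_occurrence: float,
                        occurrences_per_month: float,
                        months_horizon: float,
                        hours_to_automate: float,
                        context_switch_minutes: float = 0.0) -> bool:
    """Compare the toil cost (including context switching) against the cost of automating it."""
    cost_per_occurrence = minutes_per_occurrence + context_switch_minutes
    toil_hours = cost_per_occurrence * occurrences_per_month * months_horizon / 60
    return toil_hours > hours_to_automate

# 15 min of toil plus 10 min of context switching, 20 times a month, over a year,
# versus roughly 40 hours of engineering to automate it away.
print(automation_pays_off(15, 20, 12, 40, context_switch_minutes=10))  # True -> worth automating
```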
@@ -407,7 +407,7 @@ Logging - Append-only record of events

* Inherent delay between when an event occurs and when it is visible in logs
* Logs can be processed with a batch system, interrogated with ad hoc queries and visualised with dashboards (see the sketch after this list)
* Use logs to find the root cause of an issue, as the information needed is often not available as a metric
* For non-time-sensitive reporting, generate detailed reports using log processing systems
* Logs will nearly always produce more accurate data than metrics
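
One way to keep logs easy to interrogate with ad hoc queries is to emit structured (e.g. JSON) events rather than free text. A minimal sketch using Python's standard library; the logger name and field names are just an example, not a prescribed schema:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")

def log_event(event: str, **fields) -> None:
    """Append one structured, machine-parseable event to the log stream."""
    record = {"ts": time.time(), "event": event, **fields}
    logger.info(json.dumps(record))

log_event("request_finished", status=500, latency_ms=812, user_region="eu-west")
# A batch job or ad hoc query tool can later filter on status, latency_ms, etc.
```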
#### Alerting Principles
@@ -434,10 +434,10 @@ How do you know when to set up an alert?

- Error Budget Burn Rate = (observed error rate over a set time window) / (100% - SLO)

- Example: If your SLO goal is 98%, then it's acceptable for 2% of the events measured by your SLO to fail before your SLO goal is missed. The burn rate tells you how fast you're consuming that error budget: a burn rate of > 1 indicates that if the currently measured error rate is sustained over any future compliance period, the service will be out of SLO for that period. To combat that, you need a burn-rate alerting policy which notifies you when your error budget is consumed faster than the threshold you defined, as measured over that alert's compliance period.

100% - 98% = 2% = 0.02 error budget

12,000 fails / 30 days = 400 fails per day

If the service handles, say, 25,000 requests per day, the observed error rate is 400 / 25,000 = 0.016, so BurnRate = 0.016 / 0.02 = 0.8 - Because this is < 1, we still have some room left in our error budget

400 fails / 24 hours = 16.67 fails/hour - You'd probably bring this down to a shorter window, e.g. a 5 minute period, to watch in an alerting policy
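
The same burn-rate arithmetic as a small sketch; the 25,000 requests/day traffic figure is an assumption carried over from the example above, not a measured value:

```python
def burn_rate(errors: int, total_events: int, slo: float) -> float:
    """How fast the error budget is being consumed; > 1 means the SLO will be
    missed if this error rate is sustained over the compliance period."""
    error_budget = 1.0 - slo                 # e.g. 1 - 0.98 = 0.02
    observed_error_rate = errors / total_events
    return observed_error_rate / error_budget

# 400 fails/day out of an assumed 25,000 requests/day against a 98% SLO.
print(burn_rate(400, 25_000, 0.98))  # ~0.8 -> the budget is not being exhausted
```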
Slow Burn Alerting Policy
@@ -516,10 +516,10 @@ Roles should include:

2. Established Command Post
   - The "post" could be a physical location or, more likely in a large company, a communication venue such as a Slack channel

<br>

3. Live Incident State Document
   - A shared document that reflects the current state of the incident, updated as necessary and retained for the postmortem

<br>

4. Clear, Real-time Handoff
   - If the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place

@@ -534,7 +534,7 @@ If yes to any of these questions:

Incident Management Best Practices

- Develop and document procedures
- Prioritize damage and restore service - Take care of the biggest issues first
- Trust team members - Give the team the autonomy they need without second-guessing
- If overwhelmed, get help
- Consider response alternatives

@@ -605,13 +605,13 @@ What a postmortem is not:

- Time to identify
- Time to act
- Time to resolve (see the timing sketch after this list)

<br>

- Recreate timeline
  - When and how was the incident reported?
  - When did the response start?
  - When and how did we make it better?
  - When was it over?

<br>

- Generate report
  - Report will be initiated by the incident commander
  - All participants need to add their own details on actions taken
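
For the time-to-identify/act/resolve figures, the reconstructed timeline gives you everything needed to compute them. A minimal sketch with hypothetical timestamps (none of these values come from the notes):

```python
from datetime import datetime

# Hypothetical milestones pulled from an incident timeline.
reported   = datetime(2024, 3, 1, 9, 0)    # when the incident was reported
identified = datetime(2024, 3, 1, 9, 20)   # when the cause was identified
acted      = datetime(2024, 3, 1, 9, 35)   # when mitigation started
resolved   = datetime(2024, 3, 1, 11, 5)   # when service was fully restored

time_to_identify = identified - reported
time_to_act      = acted - reported
time_to_resolve  = resolved - reported

print(time_to_identify, time_to_act, time_to_resolve)  # 0:20:00 0:35:00 2:05:00
```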
@@ -629,20 +629,20 @@ Production Meeting Collaboration

1. Upcoming production changes
   - Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on. This is called near-term horizon visibility

<br>

2. Metrics
   - Review current SLOs, even if they are in line. Track how latency figures, CPU utilization figures, etc. change over time

<br>

3. Outages
   - The big-picture portion of the meeting can be devoted to a synopsis of the postmortem or working on the process

<br>

4. Paging Events
   - The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?

<br>

5. Nonpaging Event `#1`
   - What events didn't get paged, but probably should have?

<br>

6. Nonpaging Event `#2` and `#3`
   - What events occurred that are not pageable and require attention? What events are not pageable and do not require attention?

####