Typos, formatting
parent 9f33d9d605
commit 753053326c
Part_2.md: 40 changes
@@ -319,19 +319,19 @@ So, if we take A = S/T, S = A*T and change our target availability to 99.8%
100% - 99.8% = 0.2%, 0.2% (in decimal) == 0.002
<br>
Allowed errors (based on target availability) = 0.002 * 1,000,000 = 2,000 errors; successful requests = 1,000,000 - 2,000 = 998,000
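As a rough sketch, the arithmetic above can be written in Python. The function name `error_budget` and its parameters are hypothetical, chosen only for illustration; `Decimal` is used so that a target such as 99.8% does not pick up floating-point noise.

```python
from decimal import Decimal
from typing import Tuple


def error_budget(total_requests: int, target_availability_pct: str) -> Tuple[int, int]:
    """Return (allowed_errors, required_successes) for an availability target.

    Hypothetical helper: target_availability_pct is a percentage string such as "99.8".
    """
    # 100% - 99.8% = 0.2%, which is 0.002 as a decimal fraction
    error_fraction = (Decimal(100) - Decimal(target_availability_pct)) / Decimal(100)
    # Truncate toward zero so the budget is never overstated
    allowed_errors = int(error_fraction * total_requests)
    return allowed_errors, total_requests - allowed_errors


allowed, successes = error_budget(1_000_000, "99.8")
print(allowed, successes)  # 2000 998000
```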
<br>
##### Error budgets, what are they good for?
1. Releasing new features
- Top use by the product team
1. Top use by the product team
2. Expected system changes
- roll out enhancements, good to know you are covered should something go wrong
1. roll out enhancements, good to know you are covered should something go wrong
3. Inevitable failure in networks, etc
4. Planned downtime
1. e.g. take the entire system offline to implement a major upgrae
1. e.g. take the entire system offline to implement a major upgrade
5. Risky experiments
6. Unforseen circumstances (unknown unknownes) e.g. Global pandemic!
6. Unforeseen circumstances (unknown unknowns) e.g. Global pandemic!
#### Defining and Reducing Toil
@@ -357,7 +357,7 @@ Toil Reduction Benefits
3x Top Tips for Reducing Toil
1. Identify toil - Make sure youu're differentiating it from overhead or actual engineering
1. Identify toil - Make sure you're differentiating it from overhead or actual engineering
2. Estimate the time to automate - Make sure the benefits outweigh the cost
3. Measure everything, including context switching, e.g. the time it takes you to switch to a new task and become involved in it
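Tip 2 can be made concrete with a back-of-the-envelope break-even estimate. This is a minimal sketch, not a prescribed method: the function name `break_even_weeks`, its parameters, and all the numbers are made up for illustration.

```python
def break_even_weeks(hours_to_automate: float,
                     manual_minutes_per_run: float,
                     runs_per_week: float) -> float:
    """Weeks until the time spent automating is repaid by toil no longer done by hand."""
    hours_saved_per_week = manual_minutes_per_run * runs_per_week / 60
    if hours_saved_per_week <= 0:
        raise ValueError("the task must cost some manual time for automation to pay off")
    return hours_to_automate / hours_saved_per_week


# Illustrative numbers only: a 30-minute manual task run 10 times a week,
# with an estimated 40 hours needed to automate it.
print(break_even_weeks(40, 30, 10))  # 8.0
```

If the task is expected to go away before the break-even point, automating it is probably not worth the cost. Measured context-switching time (tip 3) can be folded into `manual_minutes_per_run`.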
@@ -407,7 +407,7 @@ Logging - Append-only record of events
* Inherent delay between when an event occurs and when it is visible in logs
* Logs can be processed with a batch system, interrogated with ad hoc queries and visualised with dashboards
* Use logs to find the root cause of an issue, as the information needed is often not available as a metric
* For non-time-sensitive reporting, generate details reports using log processing systems
* For non-time-sensitive reporting, generate detailed reports using log processing systems
* Logs will nearly always produce more accurate data than metrics
#### Alerting Principles
@@ -516,10 +516,10 @@ Roles should include:
2. Established Command Post
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a Slack channel
<br>
3. Live Incident State Document
- A shared docuement that reflects the current state of the incident, updated as necessary and retained for postmortem
- A shared document that reflects the current state of the incident, updated as necessary and retained for postmortem
<br>
4. Clear, Real-time Handoff
- If the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place
@@ -534,7 +534,7 @@ If yes to any of these questions:
Incident Management Best Practices
- Develop and document procedures
- Prioritize damage and restore service - Take care of the buggest issues first
- Prioritize damage and restore service - Take care of the biggest issues first
- Trust team members - Give the team the autonomy they need without second-guessing
- If overwhelmed, get help
- Consider response alternatives
@@ -605,13 +605,13 @@ What a postmortem is not:
- Time to identify
- Time to act
- Time to resolve
<br>
- Recreate timeline
- When and how was the incident reported?
- When did the response start?
- When and how did we make it better?
- When was it over?
<br>
- Generate report
- Report will be initiated by the incident commander
- All participants need to add their own details on actions taken
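The timeline questions above map onto the three durations listed in the notes (time to identify, time to act, time to resolve). The sketch below assumes one possible set of definitions, which the notes themselves do not spell out: identify = fault start to report, act = report to response start, resolve = fault start to end. The class name, field names and timestamps are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    began_at: datetime             # assumption: when the fault actually started
    reported_at: datetime          # when the incident was reported
    response_started_at: datetime  # when the response started
    resolved_at: datetime          # when it was over

    def time_to_identify(self) -> timedelta:
        return self.reported_at - self.began_at

    def time_to_act(self) -> timedelta:
        return self.response_started_at - self.reported_at

    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.began_at


# Made-up timestamps for illustration
timeline = IncidentTimeline(
    began_at=datetime(2021, 3, 1, 10, 0),
    reported_at=datetime(2021, 3, 1, 10, 12),
    response_started_at=datetime(2021, 3, 1, 10, 20),
    resolved_at=datetime(2021, 3, 1, 11, 5),
)
print(timeline.time_to_identify())  # 0:12:00
print(timeline.time_to_act())       # 0:08:00
print(timeline.time_to_resolve())   # 1:05:00
```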
@@ -629,20 +629,20 @@ Production Meeting Collaboration
1. Upcoming production changes
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on. This is called near-term horizon visibility
<br>
2. Metrics
- Review current SLOs, even if they are in line. Track how latency figures, CPU utilization figures, etc. change over time
<br>
3. Outages
- The big picture portion of the meeting can be devoted to a synopsis of the postmortem or working on the process
<br>
4. Paging Events
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
5. Nonpaging Event #1
<br>
5. Nonpaging Event `#1`
- What events didn't get paged, but probably should have?
6. Nonpaging Event #2 and #3
<br>
6. Nonpaging Event `#2` and `#3`
- What events occurred that are not pageable and require attention? What events are not pageable and do not require attention?
####