Typos, formatting
parent
9f33d9d605
commit
753053326c
40
Part_2.md
@ -319,19 +319,19 @@ So, if we take A = S/T, S = A*T and change our target availability to 99.8%
|
|||||||
100% - 99.8% = 0.2%, 0.2% (in decimal) == 0.002
|
100% - 99.8% = 0.2%, 0.2% (in decimal) == 0.002
|
||||||
<br>
|
<br>
|
||||||
Successful Requests (Really this is allowed errors based on target availability) = 0.002 * 1,000,000 = 2000 Errors, 1,000,000 - 2000 = 998,000 Successful requests
|
Successful Requests (Really this is allowed errors based on target availability) = 0.002 * 1,000,000 = 2000 Errors, 1,000,000 - 2000 = 998,000 Successful requests
|
||||||
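The arithmetic above can be checked with a small sketch. The function name `error_budget` and its parameters are hypothetical and are not part of Part_2.md; the only inputs taken from the notes are the 99.8% target and the 1,000,000 total requests.

```python
def error_budget(total_requests: int, target_availability_pct: float) -> tuple[int, int]:
    """Return (allowed_errors, required_successes) for a target given in percent.

    Hypothetical helper for illustration; it mirrors S = A * T from the notes.
    """
    # Convert the target to basis points (hundredths of a percent) so the
    # remaining arithmetic is done on exact integers rather than floats.
    target_bp = round(target_availability_pct * 100)
    allowed_errors = total_requests * (10_000 - target_bp) // 10_000
    return allowed_errors, total_requests - allowed_errors


allowed, successes = error_budget(1_000_000, 99.8)
print(allowed, successes)  # 2000 998000
```

Working in basis points is a design choice made here to avoid float rounding surprises such as `1 - 0.998` not being exactly `0.002`.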
|
<br>
|
||||||
|
|
||||||
##### Error budgets, what are they good for?
|
##### Error budgets, what are they good for?
|
||||||
|
|
||||||
1. Releasing new features
|
1. Releasing new features
|
||||||
- Top use by the product team
|
1. Top use by the product team
|
||||||
2. Expected system changes
|
2. Expected system changes
|
||||||
- roll out enhancements, good to know you are covered should something go wrong
|
1. roll out enhancements, good to know you are covered should something go wrong
|
||||||
3. Inevitable failure in networks, etc
|
3. Inevitable failure in networks, etc
|
||||||
4. Planned downtime
|
4. Planned downtime
|
||||||
1. e.g. take the entire system offline to implement a major upgrae
|
1. e.g. take the entire system offline to implement a major upgrade
|
||||||
5. Risky experiments
|
5. Risky experiments
|
||||||
6. Unforeseen circumstances (unknown unknownes) e.g. Global pandemic!
|
6. Unforeseen circumstances (unknown unknowns) e.g. Global pandemic!
|
||||||
|
|
||||||
#### Defining and Reducing Toil
|
#### Defining and Reducing Toil
|
||||||
|
|
||||||
@ -357,7 +357,7 @@ Toil Reduction Benefits
|
|||||||
|
|
||||||
3x Top Tips for Reducing Toil
|
3x Top Tips for Reducing Toil
|
||||||
|
|
||||||
1. Identify toil - Make sure youu're differentiating it from overhead or actual engineering
|
1. Identify toil - Make sure you're differentiating it from overhead or actual engineering
|
||||||
2. Estimate the time to automate - Make sure the benefits outweigh the cost
|
2. Estimate the time to automate - Make sure the benefits outweigh the cost
|
||||||
3. Measure everything including context switching e.g. the time it takes you to switch to a new task and become involved in it
|
3. Measure everything including context switching e.g. the time it takes you to switch to a new task and become involved in it
|
||||||
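Tip 2 in the list above can be made concrete with a break-even estimate. This is a minimal sketch under stated assumptions: the function `automation_break_even_weeks` and the sample figures (15 minutes per run, 8 runs a week, 16 hours to automate) are hypothetical and do not come from the notes.

```python
def automation_break_even_weeks(
    manual_minutes_per_run: float,
    runs_per_week: float,
    automation_hours: float,
) -> float:
    """Weeks until the time saved equals the time spent automating.

    Hypothetical helper; it ignores maintenance cost and context switching,
    which tip 3 suggests measuring as well.
    """
    weekly_saving_hours = manual_minutes_per_run * runs_per_week / 60
    if weekly_saving_hours <= 0:
        raise ValueError("the manual task must take some time each week")
    return automation_hours / weekly_saving_hours


# Assumed example figures: 15 min per run, 8 runs per week, 16 h to automate.
weeks = automation_break_even_weeks(15, 8, 16)
print(weeks)  # 8.0
```

If the break-even point is further away than the task is expected to exist, the automation probably does not pay for itself.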
|
|
||||||
@ -407,7 +407,7 @@ Logging - Append-only record of events
|
|||||||
* Inherent delay between when an event occurs and when it is visible in logs
|
* Inherent delay between when an event occurs and when it is visible in logs
|
||||||
* Logs can be processed with a batch system, interrogated with ad hoc queries and visualised with dashboards
|
* Logs can be processed with a batch system, interrogated with ad hoc queries and visualised with dashboards
|
||||||
* Use logs to find the root cause of an issue, as the information needed is often not available as a metric
|
* Use logs to find the root cause of an issue, as the information needed is often not available as a metric
|
||||||
* For non-time-sensitive reporting, generate details reports using log processing systems
|
* For non-time-sensitive reporting, generate detailed reports using log processing systems
|
||||||
* Logs will nearly always produce more accurate data than metrics
|
* Logs will nearly always produce more accurate data than metrics
|
||||||
|
|
||||||
#### Alerting Principles
|
#### Alerting Principles
|
||||||
@ -516,10 +516,10 @@ Roles should include:
|
|||||||
|
|
||||||
2. Established Command Post
|
2. Established Command Post
|
||||||
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
|
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
|
||||||
|
<br>
|
||||||
3. Live Incident State Document
|
3. Live Incident State Document
|
||||||
- A shared docuement that reflects the current state of the incident, updated as necessary and retained for postmortem
|
- A shared document that reflects the current state of the incident, updated as necessary and retained for postmortem
|
||||||
|
<br>
|
||||||
4. Clear, Real-time Handoff
|
4. Clear, Real-time Handoff
|
||||||
- If the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place
|
- If the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place
|
||||||
|
|
||||||
@ -534,7 +534,7 @@ If yes to any of these questions:
|
|||||||
Incident Management Best Practices
|
Incident Management Best Practices
|
||||||
|
|
||||||
- Develop and document procedures
|
- Develop and document procedures
|
||||||
- Prioritize damage and restore service - Take care of the buggest issues first
|
- Prioritize damage and restore service - Take care of the biggest issues first
|
||||||
- Trust team members - Give team autonomy they need without second guessing
|
- Trust team members - Give team autonomy they need without second guessing
|
||||||
- If overwhelmed, get help
|
- If overwhelmed, get help
|
||||||
- Consider response alternatives
|
- Consider response alternatives
|
||||||
@ -605,13 +605,13 @@ What a postmortem is not:
|
|||||||
- Time to identify
|
- Time to identify
|
||||||
- Time to act
|
- Time to act
|
||||||
- Time to resolve
|
- Time to resolve
|
||||||
|
<br>
|
||||||
- Recreate timeline
|
- Recreate timeline
|
||||||
- When and how was the incident reported?
|
- When and how was the incident reported?
|
||||||
- When did the response start?
|
- When did the response start?
|
||||||
- When and how did we make it better?
|
- When and how did we make it better?
|
||||||
- When was it over?
|
- When was it over?
|
||||||
|
<br>
|
||||||
- Generate report
|
- Generate report
|
||||||
- Report will be initiated by the incident commander
|
- Report will be initiated by the incident commander
|
||||||
- All participants need to add their own details on actions taken
|
- All participants need to add their own details on actions taken
|
||||||
@ -629,20 +629,20 @@ Production Meeting Collaboration
|
|||||||
|
|
||||||
1. Upcoming production changes
|
1. Upcoming production changes
|
||||||
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on. This is called near-term horizon visibility
|
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on. This is called near-term horizon visibility
|
||||||
|
<br>
|
||||||
2. Metrics
|
2. Metrics
|
||||||
- Review current SLOs, even if they are in line. Track how latency figures, CPU utilization figures, etc.. change over time
|
- Review current SLOs, even if they are in line. Track how latency figures, CPU utilization figures, etc.. change over time
|
||||||
|
<br>
|
||||||
3. Outages
|
3. Outages
|
||||||
- The big picture portion of the meeting can be devoted to a synopsis of the postmortem or working on the process
|
- The big picture portion of the meeting can be devoted to a synopsis of the postmortem or working on the process
|
||||||
|
<br>
|
||||||
4. Paging Events
|
4. Paging Events
|
||||||
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
|
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
|
||||||
|
<br>
|
||||||
5. Nonpaging Event #1
|
5. Nonpaging Event `#1`
|
||||||
- What events didn't get paged, but probably should have?
|
- What events didn't get paged, but probably should have?
|
||||||
|
<br>
|
||||||
6. Nonpaging Event #2 and #3
|
6. Nonpaging Event `#2` and `#3`
|
||||||
- What events occurred that are not pageable and require attention? What events are not pageable and do not require attention?
|
- What events occurred that are not pageable and require attention? What events are not pageable and do not require attention?
|
||||||
|
|
||||||
####
|
####
|
||||||
|
|||||||