Toil: _"Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows"_
1. Manual - This characteristic extends to running a script, which, although it saves time, must still be kicked off by hand
2. Repetitive - If a task is repeated multiple times, not just once or twice, then the work is toil
3. Automatable - If a machine could perform the task just as well as a person, you can consider it toil
4. Tactical - Toil, by its very nature, is not proactive or strategy-driven. Rather, it is reactive and interrupt-driven; e.g. pager alerts
5. Devoid of enduring value - Tasks that add a permanent improvement to the service are not considered toil, but work that leaves the service in the same state afterwards is
6. Scales linearly as service grows - The best-designed service can grow by at least one order of magnitude without change; tasks that scale up with service size or traffic are toil
What Toil isn't: Email, Expense Reports, Commuting, Meetings - None of these is tied to a production service, which is the one qualification a task needs to be labelled "Toil"; they are instead "Overhead"
Toil Reduction Benefits
- Increased engineering time
- Higher team morale, lower burnout
- Increased process standardization
- Enhanced team technical skills (automation)
- Fewer human error outages
- Shorter incident response times
3x Top Tips for Reducing Toil
1. Identify toil - Make sure you're differentiating it from overhead or actual engineering
2. Estimate the time to automate - Make sure the benefits outweigh the cost (see the sketch after this list)
3. Measure everything, including context switching, e.g. the time it takes you to switch to a new task and become involved in it
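As a quick illustration of tip 2, here is a minimal payback calculation; the time figures are made up, not from the source material.

```python
# Hypothetical figures for a recurring manual task vs. the one-off cost to automate it.
minutes_per_run   = 30    # toil spent each time the task is done by hand
runs_per_month    = 8     # how often the task recurs
hours_to_automate = 10    # estimated one-off engineering effort

monthly_toil_hours = minutes_per_run * runs_per_month / 60   # 4.0 hours of toil per month
payback_months     = hours_to_automate / monthly_toil_hours  # 2.5 months to break even

print(f"Automation pays for itself after ~{payback_months:.1f} months of saved toil")
```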
**Risk is at the core of SRE**
### Generating SRE Metrics
#### Monitoring Reliability
The best way to measure everything is to monitor everything
_"Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times and server lifetimes"_
Why monitor?
- Analyzing long-term trends
- Comparing over time or across experiment groups
- Alerting (real-time)
- Exposing in dashboards
- Debugging
- Raw input for business analytics
- Security breach analysis
2x Different Types of Monitoring
| White-Box | vs | Black-Box |
| -- | -- | -- |
| * Metrics exposed by the internals of the system | | * Testing externally visible behaviour as a user would see it |
| * Focus on predicting problems that will encroach on your SLO | | * Symptom-oriented; represents active problems |
| * Heavy use recommended | | * Moderate use, for critical issues |
| * Best for detecting imminent issues | | * Best for paging on active incidents |
Metrics - Numerical measurements representing attributes and events
* GCP's Cloud Monitoring (formerly Stackdriver)
* Collects a large number of metrics from every service at Google
* Provides much less granular information, but in near real-time
* Alerts and dashboards typically use metrics
* Real-time nature means engineers are notified of problems rapidly
* Visualise the most critical data in dashboards on a landing page
Logging - Append-only record of events
* GCP's Cloud Logging (formerly Stackdriver Logging)
* Can contain large volumes of highly granular data
* Often difficult to sift through data to find what you're looking for
* Inherent delay between when an event occurs and when it is visible in logs
* Logs can be processed with a batch system, interrogated with ad hoc queries and visualised with dashboards
* Use logs to find the root cause of an issue, as the information needed is often not available as a metric
* For non-time-sensitive reporting, generate detailed reports using log-processing systems
* Logs will nearly always produce more accurate data than metrics
#### Alerting Principles
_"Alerts give timely awareness to problems in your cloud applications so you can resolve the problems quickly"_
1. Set up monitoring
- Conditions are continuously monitored
- Monitoring can track SLOs
- Can look for a missing metric
- Can watch for thresholds
2. Track metrics over time
- Track if condition persists for given amount of time
- Time window (due to technical constraints) is limited to less than 24 hours
3. Notify when the condition is met
- Incident created and displayed
- Alerts can be sent via: email, text message, apps (e.g. Slack), Pub/Sub, etc.
How do you know when to set up an alert?
- A key factor is how fast you're burning your error budget
- Error budget = 100% - SLO; burn rate = failures observed ÷ failures allowed by the error budget, over the same window
- Example: If your SLO goal is 98%, then it's acceptable for 2% of the events measured by your SLO to fail before your SLO goal is missed. The burn rate tells you how fast you're consuming that error budget: a burn rate of > 1 indicates that if the currently measured error rate is sustained over any future compliance period, the service will be out of SLO for that period. To combat that, you need a burn-rate alerting policy which notifies you when your error budget is consumed faster than the threshold you defined, as measured over that alert's compliance period.
100% - 98% = 2% error budget
If that 2% works out to 12,000 allowed failures per 30-day compliance period, that's 400 allowed failures per day
A measured burn rate of 0.8 is < 1, so there is still some room left in the error budget (the service is failing at 0.8x the 400 allowed failures per day)
400/24 hours ≈ 16.67 allowed failures/hour - scale this down further, e.g. to the 5-minute period your alert actually watches
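Here is a minimal sketch of that burn-rate arithmetic; the 98% SLO comes from the example above, while the 600,000 total events and 9,600 failures are made-up figures chosen only to reproduce the 0.8 burn rate.

```python
def burn_rate(slo: float, total_events: int, failed_events: int) -> float:
    """Burn rate = observed error rate divided by the error budget (1 - SLO)."""
    error_budget = 1.0 - slo                      # e.g. 1 - 0.98 = 0.02
    observed_error_rate = failed_events / total_events
    return observed_error_rate / error_budget

# Illustrative numbers: 600,000 events in the compliance period, 9,600 failures.
# Error budget = 2%, i.e. 12,000 allowed failures; 9,600/600,000 = 1.6% observed.
rate = burn_rate(slo=0.98, total_events=600_000, failed_events=9_600)
print(f"burn rate = {rate:.2f}")                  # 0.80 -> < 1, budget not exhausted
```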
Slow Burn Alerting Policy
- Warns that rate of consumption could exhaust error budget before the end of the compliance period
- Less urgent than fast-burn condition
- Requires longer lookback period (24 hours max)
- Determines how far back in time you're retrieving data
- Threshold should be slightly higher than the baseline
Fast Burn Alerting Policy
- Warns of a sudden, large change in consumption that, if uncorrected, will exhaust error budget quickly
- Shorter lookback period recommended (e.g. 1-2 hours, or even 1-2 minutes)
- Set threshold much higher than the baseline, e.g. 10x to avoid overly sensitive alerts
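The two policies above differ mainly in their lookback window and threshold. Here is a minimal sketch of that idea in plain Python; the policy class, the 1.5 slow-burn threshold and the example burn rates are illustrative assumptions, not a Cloud Monitoring API.

```python
from dataclasses import dataclass

@dataclass
class BurnRateAlertPolicy:
    name: str
    lookback_hours: float   # how far back the burn rate is averaged
    threshold: float        # alert when the averaged burn rate exceeds this

    def should_alert(self, burn_rate_over_lookback: float) -> bool:
        return burn_rate_over_lookback > self.threshold

# Two conditions for the same SLO, per the guidance above:
slow_burn = BurnRateAlertPolicy("slow-burn", lookback_hours=24, threshold=1.5)   # slightly above baseline
fast_burn = BurnRateAlertPolicy("fast-burn", lookback_hours=1,  threshold=10.0)  # ~10x baseline

print(slow_burn.should_alert(2.0))   # True  - a sustained overspend of the budget
print(fast_burn.should_alert(2.0))   # False - not a sudden, large spike
```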
Establishing an SLO Alerting Policy
1. Select SLO to monitor
- Choose the desired SLO; it's best to only monitor one SLO at a time
2. Construct a condition for alerting policy
- It's likely you'll have multiple conditions for every alerting policy, such as one for slow burn and another for fast burn
3. Identify notification channel
- Multiple notification channels are possible including email, SMS, pager, app, webhook, pub/sub
4. Provide documentation
- This is an optional but highly recommended step that provides your team with the information about the alert which can help them resolve the issue
5. Create alerting policy
- Bring all the pieces together to complete your alerting policy, either in the console or via the API (see the sketch below).
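A minimal sketch of step 5 done via the API, assuming the google-cloud-monitoring Python client (`monitoring_v3`); the project ID, SLO resource name, notification channel and runbook text are all placeholders.

```python
from datetime import timedelta
from google.cloud import monitoring_v3

# Placeholders - substitute your own project, SLO resource and notification channel.
PROJECT_ID = "my-project"
SLO_NAME = (f"projects/{PROJECT_ID}/services/my-service/"
            "serviceLevelObjectives/my-slo")
CHANNEL = f"projects/{PROJECT_ID}/notificationChannels/1234567890"

client = monitoring_v3.AlertPolicyServiceClient()

# One fast-burn condition; a slow-burn condition would be added alongside it.
fast_burn = monitoring_v3.AlertPolicy.Condition(
    display_name="Fast burn",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Burn rate of the SLO over a 1-hour lookback window.
        filter=f'select_slo_burn_rate("{SLO_NAME}", "3600s")',
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=10.0,              # ~10x baseline, per the guidance above
        duration=timedelta(minutes=5),     # condition must hold before an incident opens
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="SLO fast-burn alert",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[fast_burn],
    notification_channels=[CHANNEL],
    documentation=monitoring_v3.AlertPolicy.Documentation(
        content="Error budget is burning fast - see the runbook.",
        mime_type="text/markdown",
    ),
)

client.create_alert_policy(name=f"projects/{PROJECT_ID}", alert_policy=policy)
```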
#### Investigating SRE Tools
Dev Tools
- Kubernetes Engine - A managed production-ready environment for running containerized applications
- Container Registry - Single place for your team to securely manage container images used by Kubernetes
- Cloud Build - Automates the build process. A service that executes your builds in a series of steps, where each step is run in a container
- Cloud Source Repositories - Fully managed private Git repositories with integrations for CI, delivery and deployment
- Spinnaker for Google Cloud - Integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline and integrate security and compliance in the process
Ops Tools
- Cloud Monitoring - Tracking metrics, provides visibility into the performance, uptime and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from Google Cloud
- Cloud Debugger - Lets you inspect the state of a running application in real time, without stopping or slowing it down
- Cloud Trace - A distributed tracing system that collects latency data from your application and displays it in the console
- Cloud Profiler - Continuously gathers CPU usage and memory-allocation information from your production applications
Metrics are always displayed in an aggregate sense - how many events happened during this time period. A metric is not really about individual events but about status, like "the system was handling 1,200 connection requests per second at 12:05:58".
Both metrics and logging have latency: the time it takes for the data to flow from the part of the system that has the information to the monitoring system where it shows up. Why isn't that instantaneous? What happens in between? This is a good exercise to go through, because not only will you understand the measurement part of SRE much better, you'll also be able to debug things when your own measurements have trouble - what might have gone wrong and could be blocking the flow of information?
You can turn logs into metrics with filters: look for certain types of events, expose metrics showing how many you're getting, and sometimes even alert on those events. Log-based metrics will always have additional latency.
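A purely illustrative sketch of the log-to-metric idea - a filter over log lines feeding a per-minute counter, which is roughly what a log-based metric does for you server-side. The log format and filter here are invented, not Cloud Logging syntax.

```python
import re
from collections import Counter

# Hypothetical log lines: "<ISO timestamp> <SEVERITY> <message>"
logs = [
    "2020-10-26T13:01:02Z ERROR connection refused",
    "2020-10-26T13:01:05Z INFO request served",
    "2020-10-26T13:02:41Z ERROR connection refused",
]

# Filter: only ERROR entries, counted per minute - a counter metric derived from logs.
error_filter = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}).* ERROR ")
errors_per_minute = Counter(
    m.group(1) for line in logs if (m := error_filter.match(line))
)

print(errors_per_minute)   # Counter({'2020-10-26T13:01': 1, '2020-10-26T13:02': 1})
```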
### Reacting to Incidents
#### Handling Incident Response
Poorly managed incident response
The engineer on duty freaks out because a DC has gone down; the ops team decides to roll back to the previous version (they're just shooting from the hip). A third DC goes down. The VP gets angry customer calls, and it's her first knowledge of the problem. Simon (the know-it-all), even though he's not on-call and not communicating with anyone else, thinks he can fix the issue and rolls out his fix; more DCs go down. What's wrong here? They need to step back from trying to find the technical solution; there was poor communication to the VP; and freelancing agents shouldn't have access.
Well-managed incident response
Clair, the engineer on duty, receives the message and delegates Alex as incident commander. Alex brings in Ravi as part of the operations team. They don't fix the issue within their timezone and pass it on to a new incident commander. Along the way they've been contributing to an incident report, keeping the vice-president informed and able to sign off on some customer messaging for complete transparency. Simon is not needed and works on his side project.
Clair initiates the protocol and appoints an incident commander, who works with one operations team and passes control at the end of the day if the issue persists. The VP and all stakeholders are informed at the start of the incident and kept in the loop, so they can coordinate the public response. The freelance agent is not wanted, and is only called in if necessary.
1. Separation of Responsibilities
- Specific roles should be designated to team members, each with full autonomy in their role.
Roles should include:
- Incident commander
- Person in charge during incident, designating responsibilities and taking all roles not designated
- Operational Team
- Personnel designated to carry out the actual response to the incident; the only people authorized to take any action, e.g. rolling out fixes
- Communication Lead
- Public face of incident response, responsible for issuing updates to stakeholders
- Planning Lead
- Supports Ops team with long-term actions, such as filing bugs, arranging hand-off if necessary and tracking system changes
2. Established Command Post
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
3. Live Incident State Document
- A shared document that reflects the current state of the incident, updated as necessary and retained for the postmortem
4. Clear, Real-time Handoff
- If the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place
**3x Questions to Determine an Incident**
If the answer to any of these questions is yes, treat it as an incident:
1. Need another team?
2. Outage visible to users?
3. Unresolved after an hour?
Incident Management Best Practices
- Develop and document procedures
- Prioritize: stem the damage and restore service - Take care of the biggest issues first
- Trust team members - Give team autonomy they need without second guessing
- If overwhelmed, get help
- Consider response alternatives
- Practice procedure routinely
- Share the lead in all the roles - Rotate roles among team members, so everyone gains the experience they need to manage the incident
#### Managing Service Lifecycle
How does an SRE view the lifecycle of a service?
SRE Engagement Over Service Lifecycle **(Graph)**
**Architecture and Development (1st of 5 stages)**
- Peaks higher than at any other point during the lifecycle
- Implementing Best Practices with the Dev Team
- Recommend best infrastructure systems
- Co-design part of service with dev team
- Avoid costly re-designs
**Active Development**
- SREs begin productionizing the service
- Planning for capacity
- Adding resources for redundancy
- Planning for spike and overloads
- Implementing load balancing
- Adding monitoring, alerting and performance tuning that will become so important when developing their SLIs & SLOs
**Limited Availability (Alpha and Beta programmes)**
- Measure increasing performance load on the service (begin to measure and track SLIs)
- Evaluate reliability
- Define SLOs which will lead to SLA specifics
- Build capacity models
- Establish incident response procedures shared with the dev team, so everyone is on the same page and knows common tactics to take when problems arise
**General Availability (hopefully the longest stage)**
- Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs, making sure everything stays in accordance so the error budget is not exhausted and new features can be rolled out in a timely fashion
**Deprecation**
- SREs operate existing system
- Support the transition with the dev team
- Work with dev team on designing new system; adjust staffing accordingly
_"SRE principals aim to maximize the engineering velocity of developer teams while keeping products reliable"_
#### Ensuring Healthy Operations Collaboration
Post-mortem or Retrospective
What a postmortem is not:
- It's not a funeral
- It's not a party
**A postmortem is an investigation**
- Get metadata
- What systems were affected?
- What personnel were involved?
- Include machine-readable data (see the sketch after this list):
- Time to identify
- Time to act
- Time to resolve
- Recreate timeline
- When and how was the incident reported?
- When did the response start?
- When and how did we make it better?
- When was it over?
- Generate report
- Report will be initiated by the incident commander
- All participants need to add their own details on actions taken
- Was there anything done that needs to be rolled back?
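A minimal sketch of those machine-readable timings, computed from hypothetical timeline timestamps (the times below are invented for illustration).

```python
from datetime import datetime

# Hypothetical timeline entries for a single incident (UTC).
reported   = datetime(2020, 10, 26, 13, 14)   # when and how the incident was reported
identified = datetime(2020, 10, 26, 13, 25)   # root cause identified
acted      = datetime(2020, 10, 26, 13, 40)   # first mitigating action taken
resolved   = datetime(2020, 10, 26, 15, 5)    # incident declared over

metrics = {
    "time_to_identify": identified - reported,
    "time_to_act":      acted - reported,
    "time_to_resolve":  resolved - reported,
}
for name, delta in metrics.items():
    print(f"{name}: {delta}")   # e.g. time_to_resolve: 1:51:00
```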
**Remember: NO BLAME**
- No one is at fault
- No one will be shamed
- No one will be fired
- Everyone learns
Production Meeting Collaboration
1. Upcoming production changes
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on. This is called near-term horizon visibility
2. Metrics
- Review current SLOs, even if they are in line. Track how latency figures, CPU utilization figures, etc. change over time
3. Outages
- The big picture portion of the meeting can be devoted to a synopsis of the postmortem or working on the process
4. Paging Events
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
5. Nonpaging Event #1
- What events didn't get paged, but probably should have?
6. Nonpaging Event #2 and #3
- What events occurred that are not pageable but require attention? What events are not pageable and do not require attention?
[Class SRE Implements DevOps](https://www.youtube.com/playlist?list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj)