Big Picture - What is SRE?
5 Key Pillars of DevOps + SRE
- Reduce organisational silos
- Bridge teams together
- Increase communication
- Shared company vision
Share ownership
- Developers + Operations
- Implement same tooling
- share same techniques
- Accept failure as normal
- Try to anticipate, but
- Incidents are bound to occur
- Failures help the team learn
No-fault postmortems & SLOs
- No two failures the same (goal)
- Track incidents (SLIs)
- Map to objectives (SLOs)
- Implement gradual change
- Continuous change culture
- Small updates are better
- Easier to review
- Easier to rollback
Reduce costs of failures
- Limited "canary" rollouts
- Impact fewest users
- Automate where possible for further cost reduction
- Leverage Tooling and automation
- Reduce manual tasks
- The heart of CI/CD pipelines
- Fosters speed & consistency
Automate this year's job away
- Automation is a force multiplier
- Autonomous automation best
- Centralizes mistakes
- Measure Everything
- Critical gauge of success
- CI/CD needs full monitoring
- Synthetic, proactive monitoring
Measure toil and reliability
- Key to SLOs and SLAs
- Reduce toil (aka repetitive manual labour!) and increase engineering time
- Monitor all over time
Why "Reliability"
- Most important: does the product work?
- Reliability is the absence of errors
- Unstable service likely indicates a variety of issues
- Must attend to reliability all the time
class SRE implements DevOps - SRE is the "how" that implements DevOps, the "what"
Make better software, faster
Understanding SLIs
SRE breaks down into 3 distinct functions
- Define availability
- SLO:
- Determine level of availability
- SLI - Quantifiable measure of reliability; metrics over time, specific to a user journey, such as request/response, data processing or storage. Examples:
- Request latency - How long it takes to return a response to a request
- Failure Rate - The fraction of all requests received that fail: (unsuccessful requests / all requests)
- Batch throughput - Proportion of time where the data processing rate is greater than a threshold
- Plan in case of failure
- SLA
Each maps to a key component; SLO, SLI, SLA
What's a User Journey?
- Sequence of tasks central to user experience and crucial to service
- e.g. Online shopping journeys
- Product search
- Add to cart
- Checkout
Request/Response Journey:
- Availability - Proportion of valid requests served successfully
- Latency - Proportion of valid requests served faster than a threshold
- Quality - Proportion of valid requests served maintaining quality
None of these maps to a complete user journey on its own; each measures one part of it
Data processing journey - might include a different set of SLIs (see the sketch after this list):
- Freshness - Proportion of valid data updated more recently than a threshold
- Correctness - Proportion of valid data producing correct output
- Throughput - Proportion of time where the data processing rate is faster than a threshold
- Coverage - Proportion of valid data processed successfully
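As a concrete illustration of one of these (not from the course itself), a freshness SLI could be computed roughly as below; the record timestamps and one-hour threshold are made-up assumptions.

```python
from datetime import datetime, timedelta, timezone

def freshness_sli(last_updated: list[datetime], threshold: timedelta) -> float:
    """Proportion (%) of records updated more recently than the threshold."""
    now = datetime.now(timezone.utc)
    fresh = sum(1 for ts in last_updated if now - ts <= threshold)
    return fresh / len(last_updated) * 100 if last_updated else 100.0

# Hypothetical record timestamps: updated 1, 5, 20, 90 and 300 minutes ago
now = datetime.now(timezone.utc)
records = [now - timedelta(minutes=m) for m in (1, 5, 20, 90, 300)]
print(f"Freshness SLI = {freshness_sli(records, timedelta(hours=1)):.0f}%")  # 60%
```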
Google's 4 Golden Signals
- Latency - The time it takes for your service to fulfill a request
- Errors - The rate at which your service fails
- Traffic - How much demand is directed at your service
- Saturation - A measure of how close to fully utilized the service's resources are
Transparent SLIs are available within the GCP dashboard - APIs & Services
The SLI Equation:
SLI = (Good Events / Valid Events) * 100
Valid - known-invalid events are excluded from the SLI, e.g. HTTP 400 client errors
Bad SLI - Variance and overlap in metrics prior to and during outages are problematic; the graph shows up and down spikes during an outage
Good SLI - A stable signal with a strong correlation to outages is best; the graph is smooth
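A minimal sketch of the SLI equation and the "valid events" rule above; the sample requests and the 300 ms "good" cutoff are hypothetical.

```python
def compute_sli(good_events: int, valid_events: int) -> float:
    """SLI = (good events / valid events) * 100, expressed as a percentage."""
    if valid_events == 0:
        return 100.0  # no valid traffic, so nothing counts against the service
    return good_events / valid_events * 100


# Hypothetical request sample: (status_code, latency_ms)
requests = [(200, 120), (200, 340), (500, 90), (400, 50), (200, 180)]

# Known-invalid client requests (HTTP 400) are excluded from the valid set.
valid = [r for r in requests if r[0] != 400]
good = [r for r in valid if r[0] < 500 and r[1] < 300]  # served OK and under 300 ms

print(f"SLI = {compute_sli(len(good), len(valid)):.1f}%")  # 50.0% for this sample
```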
SLI Best Practices
- Limit number of SLIs
- 3-5 per user journey
- Too many increase difficulty for operators
- Can lead to contradictions
- Reduce complexity
- Not all metrics make good SLIs
- Increased response time
- Many false positives
- Prioritize Journeys
- Select most valuable to users
- Identify user-centric events
- Aggregate similar SLIs
- Collect data over time
- Turn into a rate, average, or percentile (see the sketch after this list)
- Bucket to distinguish response classes
- Not all requests are the same
- Requesters may be human, background apps or bots
- Combine (or "bucket") for better SLIs
- Collect data at load balancer
- Most efficient method
- Closer to the user's experience
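A rough illustration of "collect over time, turn into a percentile, and bucket by requester class"; the requester labels, latency samples and nearest-rank percentile helper are all assumptions for the sake of the example.

```python
from collections import defaultdict

# Hypothetical latency samples collected over a window: (requester_type, latency_ms)
samples = [("human", 110), ("human", 250), ("bot", 900),
           ("human", 180), ("background", 400), ("bot", 850)]

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Bucket by requester class so bots don't skew the human-facing SLI.
buckets: dict[str, list[float]] = defaultdict(list)
for requester, latency in samples:
    buckets[requester].append(latency)

for requester, latencies in buckets.items():
    print(f"{requester}: p95 = {percentile(latencies, 95)} ms")
```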
Understanding SLOs
"SLO's specify a target level for the reliability of your service"
The first rule of SLOs: 100% reliability is not a good objective
Why?
- Trying to reach 100%, 100% of the time, is very expensive in terms of resources
- Much more technically complex
- Users don't need 100% for the service to be acceptable (get close enough that users don't notice the difference; just reliable enough)
- Less than 100% leaves room for new features, as you have resources remaining to develop (error budgets)
SLOs are tied directly to SLIs
- Measured by SLI
- Can be a single target value or range of values
- e.g. SLI <= SLO or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9% (3x 9's), 99.99% (4x 9's)
SLI - Metrics over time which detail the health of a service
Site homepage latency < 300ms over the last 5 minutes at the 95th percentile
SLO - Agreed-upon bounds on how often SLIs must be met
The 95th-percentile homepage latency SLI will be met 99.9% of the time over the next year
SLO - It is critical that there is buy-in from across the organisation. Make sure all stakeholders agree on the SLO, so everyone is on the same team working towards the same goals: developers, contributors, project managers, SREs, the vice president.
Make your SLOs achievable
- Based on past performance
- Users' expectations are strongly tied to past performance
- If no historical data, you need to collect some
- Keep in mind: measurement ≠ user satisfaction, and you may need to adjust your SLOs accordingly
In addition to achievable SLOs, you might have some aspirational SLOs
- Typically higher than your achievable SLOs
- Set a reasonable target and begin measuring
- Compare user feedback to SLOs
Understanding SLAs
"We've determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLO's and now we need to describe what happens if we don't maintain that availability with an SLA"
"An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain"
- Should reliability fail, there are consequences
SLA Characteristics
- A business-level agreement
- SREs are not usually involved with drafting SLAs, except when setting up their SLIs and the corresponding SLOs
- Can be explicit or implicit
- Explicit contracts contain consequences
- Refund for services paid for
- Service cost reduction on sliding scales
- May be offered on a per service basis
41 SLAs in GCP
- Compute Engine
- 4x 9's for instance uptime in multiple zones
- 99.5% uptime for a single instance
- 4x 9's for load balancing uptime
- If these aren't met, and the customer meets its obligations, then the customer could be eligible for financial credits
- Clear definitions for language
SLIs drive SLOs; SLOs inform the SLA
- How does the SLO inform the SLA? An example:
We want our SLO to be at or under 200ms, so we set our SLA at a higher value, e.g. 300ms - and beyond that, "hair-on-fire ms". You want to set your SLA noticeably looser than the SLO because you don't want to be consistently setting your customers' hair on fire. Moreover, your SLAs should be routinely achievable, because you are constantly working towards your objective (SLO).
To summarize:
SLO:
- Internal targets that guide prioritization
- Represents desired user experience
- Missing objectives should have consequences
SLA
- Set level just enough to keep customers
- Incentivizes a minimum level of service
- Looser than corresponding objectives
An SLI is an indicator of how things are at some particular point in time: are things good or bad right now? If our SLI doesn't always confidently tell us that, then it's not a good SLI. An SLO asks whether that indicator has been showing what we want it to for most of the time period that we care about: has the SLI met our objective? In particular, have things been the right amount of both good and bad? An SLA is our agreement on what happens if we don't meet our objective: what's our punishment for doing worse than we agreed we would? The SLI is like the speed of a car - it's travelling at some particular speed right now. The SLO is the speed limit at the upper end and the expected travel time for your trip at the lower end - you want to go approximately some speed over the course of your journey. The SLA is like getting a speeding ticket because you drove too fast, or too slowly.
Making the Most of Risk
Setting Error Budgets
In SRE, an error budget is a good thing, and one of the unifying principles between developers and operations.
"A quantitative measurement shared between the product and SRE teans to balance innovation and stability"
The process for defining an error budget:
- Management buys into the internal SLO
- SLO used to determine the amount of uptime for that particular quarter
- Monitoring services measures actual uptime
- The difference between the expected SLO and the actual uptime is calculated
- Then, if there's sufficient time within the error budget the new release is pushed forward
Error budgets seem like risky business - why not avoid risk at all costs? One of the key DevOps pillars is that failure is unavoidable, and therefore so is risk. SREs manage service reliability largely by managing risk.
- Balances innovation and reliability - An error budget provides a common incentive that allows both product development and SREs to focus on finding the right balance between new features and service availability.
- Manages release velocity - As long as the system's SLOs are met, releases can continue.
- Developers oversee their own risk - When the error budget is large, product developers can take more risks because they have more time to spend. When the budget is nearly drained, product developers themselves will push for more testing and slow down their release velocity, because they don't want to risk exhausting the error budget and stalling their launch completely.
What happens if the error budget is exceeded:
- Typically releases temporarily halted
- Expansion in system testing and development
- Which will overall improve performance
Error Budget
Error budget = 100% - SLO value
e.g.
SLO = 99.8%, so 100% - 99.8% = 0.2%
How will the 0.2% be applied over time?
0.2% == 0.002 (0.2% written as a decimal e.g. 1% == 0.01, 10% = 0.1 etc)
0.002 * 30 day/month * 24 hours/day * 60 minutes/hour = 86.4 minutes/month
This may not seem like much, but this is actual downtime.
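The same downtime arithmetic as a small sketch, assuming a 30-day month as in the note above.

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime per period for a time-based error budget."""
    error_budget = (100 - slo_percent) / 100      # e.g. 99.8% -> 0.002
    return error_budget * days * 24 * 60          # minutes in the period

print(downtime_budget_minutes(99.8))    # ~86.4 minutes/month
print(downtime_budget_minutes(99.99))   # ~4.3 minutes/month
```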
What about global services?
- Time-based error budgets are not valid because downtime is almost never universal. Typically an outage will only affect one part of a system at a time, so it is:
- Better to define availability in terms of request success rate
- Referred to as aggregate availability, and the formula is a little different
- Availability = Successful Requests / Total Requests
- e.g. a system that serves 1M requests per day with 99.9% (3x 9's) availability can serve up to 1,000 errors and still hit its target for that day
999000/1000000 = 0.999 == 99.9%
So, if we take A = S/T, then S = A*T. Changing our target availability to 99.8%:
100% - 99.8% = 0.2%, and 0.2% as a decimal is 0.002
Allowed errors = 0.002 * 1,000,000 = 2,000, so Successful Requests = 1,000,000 - 2,000 = 998,000
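The request-based (aggregate availability) budget above, sketched as a helper; the 1M-request figure comes from the example, the loop over two targets is just for comparison.

```python
def allowed_errors(total_requests: int, target_availability: float) -> int:
    """Request-based error budget: failures you can serve and still hit target."""
    return round(total_requests * (1 - target_availability))

total = 1_000_000
for target in (0.999, 0.998):
    errors = allowed_errors(total, target)
    print(f"target {target:.1%}: up to {errors} errors, "
          f"{total - errors} successful requests required")
```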
Error budgets, what are they good for?
- Releasing new features
- Top use by the product team
- Expected system changes
- Roll out enhancements; good to know you are covered should something go wrong
- Inevitable failure in networks, etc
- Planned downtime
- e.g. take the entire system offline to implement a major upgrade
- Risky experiments
- Unforeseen circumstances (unknown unknowns), e.g. a global pandemic!
Defining and Reducing Toil
Toil: "Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows"
- Manual - This characteristic extends to include the running of a script, which, although it saves time, must still be run by hand
- Repetitive - If a task is repeated multiple times, not just once or twice, then the work is toil
- Automatable - Should the task be done by a machine just as well as by a person, you can consider it toil
- Tactical - Toil, by its very nature, is not proactive or strategy driven. Rather, it is reactive and interrupt-driven; e.g. pager alerts
- Devoid of enduring value - Tasks that contribute to adding a permanent improvement to the service are not considered toil, but work that does not change the state, is.
- Scales linearly as service grows - The best-designed services can grow by at least one order of magnitude without added work; tasks that scale up with service size or traffic are toil
What toil isn't: email, expense reports, commuting, meetings - none of these has the one qualification a task needs to be labelled "toil": being tied to a production service. These are instead "overhead".
Toil Reduction Benefits
- Increased engineering time
- Higher team morale, lower burnout
- Increased process standardization
- Enhanced team technical skills (automation)
- Fewer human error outages
- Shorter incident response times
3x Top Tips for Reducing Toil
- Identify toil - Make sure you're differentiating it from overhead or actual engineering
- Estimate the time to automate - Make sure the benefits outweigh the cost (see the sketch after this list)
- Measure everything including context switching e.g. the time it takes you to switch to a new task and become involved in it
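A back-of-the-envelope way to do the "estimate the time to automate" check; all of the numbers here are hypothetical.

```python
def automation_pays_off(minutes_per_run: float, runs_per_month: float,
                        months_horizon: int, automation_hours: float) -> bool:
    """Rough check: does time saved over the horizon exceed the build cost?"""
    saved_hours = minutes_per_run * runs_per_month * months_horizon / 60
    return saved_hours > automation_hours

# e.g. a 15-minute manual task run 20 times a month, judged over a year,
# versus an estimated 24 hours of engineering to automate it
print(automation_pays_off(15, 20, 12, 24))  # True: 60 hours saved vs 24 spent
```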
Risk is at the core of SRE
Generating SRE Metrics
Monitoring Reliability
The best way to measure everything is to monitor everything
"Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times and server lifetimes"
Why monitor?
- Analyzing long-term trends
- Comparing over time or between experiment groups
- Alerting (real-time)
- Exposing in dashboards
- Debugging
- Raw input for business analytics
- Security breach analysis
2x Different Types of Monitoring
| White-Box | Black-Box |
|---|---|
| Metrics exposed by the internals of the system | Testing externally visible behaviour as a user would see it |
| Focus on predicting problems that encroach on your SLO | Symptom-oriented; focused on active problems |
| Heavy use recommended | Moderate use, for critical issues |
| Best for detecting imminent issues | Best for paging on incidents |
Metrics - Numerical measurements representing attributes and events
- GCP's Cloud Monitoring (formerly Stackdriver)
- Collects a large number of metrics from every service at Google
- Provides much less granular information, but in near real-time
- Alerts and dashboards typically use metrics
- Real-time nature means engineers are notified of problems rapidly
- Visualise the most critical data in dashboards on the landing page
Logging - Append-only record of events
- GCP's Cloud Logging (formerly Stackdriver Logging)
- Can contain large volumes of highly granular data
- Often difficult to sift through data to find what you're looking for
- Inherent delay between when an event occurs and when it is visible in logs
- Logs can be processed with a batch system, interrogated with ad hoc queries and visualised with dashboards
- Use logs to find the root cause of an issue, as the information needed is often not available as a metric
- For non-time-sensitive reporting, generate detailed reports using log processing systems
- Logs will nearly always produce more accurate data than metrics
Alerting Principles
"Alerts give timely awareness to problems in your cloud applications so you can resolve the problems quickly"
- Set up monitoring
- Conditions are continuously monitored
- Monitoring can track SLOs
- Can look for a missing metric
- Can watch for thresholds
- Track metrics over time
- Track if condition persists for given amount of time
- Time window (due to technical constraints) less than 24 hours
- Notify when the condition threshold is passed
- Incident created and displayed
- Alerts can be sent via: Email, Text message, apps (slack), pub/sub etc
How do you know when to setup an alert?
- A key factor is how fast you're burning your error budget
- Error budget burn rate = how fast you consume the error budget: the observed failure rate divided by the failure rate your SLO allows (error budget = 100% - SLO); see the sketch below
- Example: If your SLO goal is 98%, then it's acceptable for 2% of the events measured by your SLO to fail before your SLO goal is missed. The burn rate tells you how fast you're consuming that error budget. A burn rate of > 1 indicates that, if the currently measured error rate is sustained over any future compliance period, the service will be out of SLO for that period. To combat that, you need a burn-rate alerting policy which notifies you when your error budget is consumed faster than the threshold you defined, as measured over that alert's compliance period.
100% - 98% = 2% error budget
If that budget works out to 12,000 allowed failures over a 30-day compliance period, that is 400 allowed failures per day; consuming 80% of that daily allowance is a burn rate of 0.8 - because this is < 1, there is still room left in the error budget (400x allowed fails per day)
400/24 hours ≈ 16.7 allowed fails/hour - you might bring this down to a 5-minute period that you'd want to watch in an alerting policy
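A sketch of the burn-rate idea: observed failures measured against the budgeted allowance for the same window. The event volume (20,000/day) and observed failure count are illustrative assumptions, chosen so the numbers line up with the 400/day and 0.8 figures above.

```python
def burn_rate(observed_failures: float, allowed_failures: float) -> float:
    """> 1 means the error budget is being consumed faster than it accrues."""
    return observed_failures / allowed_failures

# SLO 98% with a hypothetical 20,000 events/day -> 400 allowed failures per day
allowed_per_day = 0.02 * 20_000
print(burn_rate(observed_failures=320, allowed_failures=allowed_per_day))  # 0.8
```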
Slow Burn Alerting Policy
- Warns that rate of consumption could exhaust error budget before the end of the compliance period
- Less urgent than fast-burn condition
- Requires longer lookback period (24 hours max)
- Determines how far back in time you're retrieving data
- Threshold should be slightly higher than the baseline
Fast Burn Alerting Policy
- Warns of a sudden, large change in consumption that, if uncorrected, will exhaust error budget quickly
- Shorter lookback period recommended (e.g. 1-2 hours, or even 1-2 minutes)
- Set threshold much higher than the baseline, e.g. 10x to avoid overly sensitive alerts
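One way to picture the two policies: the same burn-rate signal checked over two lookback windows with different thresholds. The 10x fast-burn threshold follows the note above; the 1.5 slow-burn threshold and the sample values are assumptions.

```python
def classify_burn(burn_rate_24h: float, burn_rate_1h: float,
                  slow_threshold: float = 1.5, fast_threshold: float = 10.0) -> str:
    """Fast-burn checks a short window with a high threshold; slow-burn the reverse."""
    if burn_rate_1h >= fast_threshold:
        return "fast burn: page immediately"
    if burn_rate_24h >= slow_threshold:
        return "slow burn: budget will run out before the period ends"
    return "ok"

print(classify_burn(burn_rate_24h=2.0, burn_rate_1h=1.2))   # slow burn
print(classify_burn(burn_rate_24h=3.0, burn_rate_1h=14.0))  # fast burn
```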
Establishing an SLO alerting Policy
- Select SLO to monitor
- Choose the desired SLO; it's best to monitor only one SLO at a time
- Construct a condition for alerting policy
- It's likely you'll have multiple conditions for every alerting policy, such as one for slow burn and another for fast burn
- Identify notification channel
- Multiple notification channels are possible including email, SMS, pager, app, webhook, pub/sub
- Provide documentation
- This is an optional but highly recommended step that provides your team with the information about the alert which can help them resolve the issue
- Create alerting policy
- Bring all the pieces together to complete your alerting policy in either the console or via the API.
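The five steps above, gathered into a single (hypothetical) policy description. This only mirrors the shape of an SLO alerting policy; the field names are not any specific API.

```python
from dataclasses import dataclass, field

@dataclass
class SloAlertPolicy:
    """Hypothetical container for the pieces of an SLO alerting policy."""
    slo_name: str
    conditions: list[dict] = field(default_factory=list)   # slow-burn, fast-burn, ...
    notification_channels: list[str] = field(default_factory=list)
    documentation: str = ""

policy = SloAlertPolicy(
    slo_name="homepage-latency-99.9",
    conditions=[
        {"type": "slow_burn", "lookback_hours": 24, "threshold": 1.5},
        {"type": "fast_burn", "lookback_hours": 1, "threshold": 10.0},
    ],
    notification_channels=["email:oncall@example.com", "sms:+0000000000"],
    documentation="Homepage latency SLO is burning; see the runbook for rollback steps.",
)
print(policy.slo_name, len(policy.conditions), "conditions")
```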
Investigating SRE Tools
Dev Tools
-
Kubernetes Engine - A managed production-ready environment for running containerized applications
-
Container Registry - Single place for your team to securely manage container images used by kubernetes
-
Cloud Build - Automates the build process; a service that executes your builds in a series of steps, where each step is run in a container
-
Cloud Source Repositories - Fully managed private Git repositories with integrations for CI, delivery and deployment
-
Spinnaker for Google Cloud - Integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline and integrate security and compliance in the process
Ops Tools
- Cloud Monitoring - Tracking metrics, provides visibility into the performance, uptime and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from Google Cloud
- Cloud Debugger - Lets you inspect the state of a running application in real time, without stopping or slowing it down
- Cloud Trace - A distributed tracing system that collects latency data from your application and displays it in the console
- Cloud profiler - Continuously gathers CPU usage and memory-allocation information from your production applications
Metrics are always displayed in an aggregate sense - how many events happened during this time period. A metric data point is not really about individual events but about status, like "the system was handling 1,200 connection requests per second at 12:05:58".

Consider the latency of metrics and logging information: the time it takes for data to flow from the part of the system that has the information to the monitoring system where you can see it show up. Why isn't that instantaneous? What happens in between? This is a good exercise to go through, because not only will you understand the measurement part of SRE much better, you'll also be able to debug things when your own measurements have trouble - what might have gone wrong and could be blocking the flow of information?

You can turn logs into metrics with filters - looking for certain types of events and then having metrics that show how many you're getting, sometimes even alerting on those events. Log-based metrics will always have additional latency.
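A toy version of "turn logs into metrics with filters": scan an append-only log, count events that match a filter, and expose the count as a metric. The log lines and the metric name are invented for the example.

```python
import re

log_lines = [
    '2024-05-01T12:05:58Z GET /checkout 500 requests=1200',
    '2024-05-01T12:05:59Z GET /home 200 requests=1187',
    '2024-05-01T12:06:01Z GET /checkout 503 requests=1210',
]

# Filter: any 5xx status code becomes a counted event.
error_filter = re.compile(r'\s5\d\d\s')
error_count = sum(1 for line in log_lines if error_filter.search(line))

print(f"log-based metric server_error_count = {error_count}")  # 2
```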
Reacting to Incidents
Handling Incident Response
Poorly managed incident response
The engineer on duty freaks out because a data centre has gone down; the ops team decides to roll back to the previous version (shooting from the hip). A third DC goes down. The VP gets angry customer calls, and that's her first knowledge of the problem. Simon (the know-it-all), even though he's not on call and not communicating with anyone else, thinks he can fix the issue and rolls out his own fix - more DCs go down. What's wrong here? The team needs to step back from trying to find the technical solution first; there is poor communication to the VP, and freelancing agents shouldn't have access.
Well-managed incident response
Clair, the engineer on duty, receives the alert and delegates Alex as incident commander. Alex brings in Ravi as part of the operations team. They don't fix the issue within their time zone, so they pass it on to a new incident commander. Along the way, they've been contributing to an incident report, keeping the vice president informed and able to sign off on customer messaging for complete transparency. Simon is not needed and works on his side project.
Initiate the protocol and appoint an incident commander. Work with one operations team and pass control at the end of the day if the issue persists. The VP and all stakeholders are informed at the start of the incident and kept in the loop, so they can coordinate the public response. The freelance agent isn't wanted and is only called in if necessary.
- Separation of Responsibilities
- Specific roles should be designated to team members, each with full autonomy in their role.
Roles should include:
- Incident commander
- Person in charge during incident, designating responsibilities and taking all roles not designated
- Operational Team
- Personnel designated with actual responses to incident. Only people that are authorized to take any action, e.g. rolling out fixes
- Communication Lead
- Public face of incident response, responsible for issuing updates to stakeholders
- Planning Lead
- Supports Ops team with long-term actions, such as filing bugs, arranging hand-off if necessary and tracking system changes
- Established Command Post
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
- Live Incident State Document
- A shared document that reflects the current state of the incident, updated as necessary and retained for postmortem
- Clear, Real-time Handoff
- If the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place
3x Questions to Determine an Incident
If the answer to any of these questions is yes, declare an incident:
- Need another team?
- Outage visible to users?
- Unresolved after an hour?
Incident Management Best Practices
- Develop and document procedures
- Prioritize damage and restore service - Take care of the biggest issues first
- Trust team members - Give team autonomy they need without second guessing
- If overwhelmed, get help
- Consider response alternatives
- Practice procedure routinely
- Share the lead in all the roles - Rotate roles among team members, so everyone gains the experience they need to manage the incident
Managing Service Lifecycle
How does an SRE view the lifecycle of a service?
SRE Engagement Over Service Lifecycle (Graph)
Architecture and Development (1st of 5 stages)
- Peaks higher than at any other point during the lifecycle
- Implementing Best Practices with the Dev Team
- Recommend best infrastructure systems
- Co-design part of service with dev team
- Avoid costly re-designs
Active Development
- SREs begin productionizing the service
- Planning for capacity
- Adding resources for redundancy
- Planning for spike and overloads
- Implementing load balancing
- Adding monitoring, alerting and performance tuning that will become so important when developing their SLIs & SLOs
Limited Availability (Alpha and Beta programmes)
- Measure increasing performance load on the service (begin to measure and track their SLIs)
- Evaluate reliability
- Define SLOs which will lead to SLA specifics
- Build capacity models
- Establish incident responses shared with dev team so everyone is on the same page and knows common tactics to take when problems arise
General Availability (hopefully the longest stage)
- Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs and making sure that everything is in accordance so the error budget is not exhausted and new features can be rolled out in a timely fashion
Deprecation
- SREs operate existing system
- Support the transition with the dev team
- Work with dev team on designing new system; adjust staffing accordingly
"SRE principals aim to maximize the engineering velocity of developer teams while keeping products reliable"
Ensuring Healthy Operations Collaboration
Post-mortem or Retrospective
What a postmortem is not:
- It's not a funeral
- It's not a party
A postmortem is an investigation
- Get metadata
- What systems were affected?
- What personnel were involved?
- Include machine readable data:
- Time to identify
- Time to act
- Time to resolve
- Recreate timeline
- When and how was the incident reported?
- When did the response start?
- When and how did we make it better?
- When was it over?
- Generate report
- Report will be initiated by the incident commander
- All participants need to add their own details on actions taken
- Was there anything done that needs to be rolled back?
Remember: NO BLAME
- No one is at fault
- No one will be shamed
- No one will be fired
- Everyone learns
Production Meeting Collaboration
- Upcoming production changes
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on. This is called near-term horizon visibility
- Metrics
- Review current SLOs, even if they are in line. Track how latency figures, CPU utilization figures, etc.. change over time
- Outages
- The big picture portion of the meeting can be devoted to a synopsis of the postmortem or working on the process
- Paging Events
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
- Nonpaging Events
- What events didn't get paged, but probably should have?
- What events occurred that are not pageable but require attention?
- What events are not pageable and do not require attention?