Big Picture - What is SRE?
5 Key Pillars of DevOps + SRE
- Reduce organisational silos
- Bridge teams together
- Increase communication
- Shared company vision
Share ownership
- Developers + Operations
- Implement same tooling
- share same techniques
- Accept failure as normal
- Try to anticipate, but
- Incidents are bound to occur
- Failures help the team learn
No-fault postmortems & SLOs
- No two failures the same (goal)
- Track incidents (SLIs)
- Map to objectives (SLOs)
- Implement gradual change
- Continuous change culture
- Small updates are better
- Easier to review
- Easier to rollback
Reduce costs of failures
- Limited "canary" rollouts
- Impact fewest users
- Automate where possible for further cost reduction
- Leverage Tooling and automation
- Reduce manual tasks
- The heart of CI/CD pipelines
- Fosters speed & consistency
Automate this year's job away
- Automation is a force multiplier
- Autonomous automation best
- Centralizes mistakes
- Measure Everything
- Critical gauge of success
- CI/CD needs full monitoring
- Synthetic, proactive monitoring
Measure toil and reliability
- Key to SLOs and SLAs
- Reduce toil (aka repetitive manual labour!) and increase engineering time
- Monitor all over time
Why "Reliability"
- Most important: does the product work?
- Reliability is the absence of errors
- Unstable service likely indicates a variety of issues
- Must attend to reliability all the time
class SRE implements DevOps - SRE is the "how" that implements DevOps, the "what"
Make better software, faster
Understanding SLIs
SRE breaks down into 3 distinct functions
- Define availability
- SLO:
- Determine level of availability
- SLI - Quantifiable measure of reliability; metrics over time, specific to a user journey, such as request/response, data processing or storage. Examples:
- Request latency - How long it takes to return a response to a request
- Failure Rate - The fraction of all requests received that fail: (unsuccessful requests / all requests)
- Batch throughput - Proportion of time where the data processing rate is greater than a threshold
- Plan in case of failure
- SLA
Each maps to a key component; SLO, SLI, SLA
What's a User Journey?
- Sequence of tasks central to user experience and crucial to service
- e.g. Online shopping journeys
- Product search
- Add to cart
- Checkout
Request/Response Journey:
- Availability - Proportion of valid requests served successfully
- Latency - Proportion of valid requests served faster than a threshold
- Quality - Proportion of valid requests served maintaining quality
None of these maps to a complete user journey on its own; each measures one part of it
Data processing journey - might include a different set of SLIs (see the sketch after this list):
- Freshness - Proportion of valid data updated more recently than a threshold
- Correctness - Proportion of valid data producing correct output
- Throughput - Proportion of time where the data processing rate is faster than a threshold
- Coverage - Proportion of valid data processed successfully
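As a concrete illustration of one of these (not from the course itself), a freshness SLI could be computed roughly as below; the record timestamps and one-hour threshold are made-up assumptions.

```python
from datetime import datetime, timedelta, timezone

def freshness_sli(last_updated: list[datetime], threshold: timedelta) -> float:
    """Proportion (%) of records updated more recently than the threshold."""
    now = datetime.now(timezone.utc)
    fresh = sum(1 for ts in last_updated if now - ts <= threshold)
    return fresh / len(last_updated) * 100 if last_updated else 100.0

# Hypothetical record timestamps: updated 1, 5, 20, 90 and 300 minutes ago
now = datetime.now(timezone.utc)
records = [now - timedelta(minutes=m) for m in (1, 5, 20, 90, 300)]
print(f"Freshness SLI = {freshness_sli(records, timedelta(hours=1)):.0f}%")  # 60%
```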
Google's 4 Golden Signals
- Latency - The time it takes for your service to fulfill a request
- Errors - The rate at which your service fails
- Traffic - How much demand is directed at your service
- Saturation - A measure of how close to fully utilized the service's resources are
Transparent SLIs are available within the GCP dashboard - APIs & Services
The SLI Equation:
SLI = (Good Events / Valid Events) * 100
Valid - known-invalid events are excluded from the SLI, e.g. HTTP 400 client errors
Bad SLI - Variance and overlap in metrics prior to and during outages are problematic; the graph shows up and down spikes during an outage
Good SLI - A stable signal with a strong correlation to outages is best; the graph is smooth
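A minimal sketch of the SLI equation and the "valid events" rule above; the sample requests and the 300 ms "good" cutoff are hypothetical.

```python
def compute_sli(good_events: int, valid_events: int) -> float:
    """SLI = (good events / valid events) * 100, expressed as a percentage."""
    if valid_events == 0:
        return 100.0  # no valid traffic, so nothing counts against the service
    return good_events / valid_events * 100


# Hypothetical request sample: (status_code, latency_ms)
requests = [(200, 120), (200, 340), (500, 90), (400, 50), (200, 180)]

# Known-invalid client requests (HTTP 400) are excluded from the valid set.
valid = [r for r in requests if r[0] != 400]
good = [r for r in valid if r[0] < 500 and r[1] < 300]  # served OK and under 300 ms

print(f"SLI = {compute_sli(len(good), len(valid)):.1f}%")  # 50.0% for this sample
```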
SLI Best Practices
- Limit number of SLIs
- 3-5 per user journey
- Too many increase difficulty for operators
- Can lead to contradictions
- Reduce complexity
- Not all metrics make good SLIs
- Increased response time
- Many false positives
- Prioritize Journeys
- Select most valuable to users
- Identify user-centric events
- Aggregate similar SLIs
- Collect data over time
- Turn into a rate, average, or percentile (see the sketch after this list)
- Bucket to distinguish response classes
- Not all requests are the same
- Requesters may be human, background apps or bots
- Combine (or "bucket") for better SLIs
- Collect data at load balancer
- Most efficient method
- Closer to the user's experience
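A rough illustration of "collect over time, turn into a percentile, and bucket by requester class"; the requester labels, latency samples and nearest-rank percentile helper are all assumptions for the sake of the example.

```python
from collections import defaultdict

# Hypothetical latency samples collected over a window: (requester_type, latency_ms)
samples = [("human", 110), ("human", 250), ("bot", 900),
           ("human", 180), ("background", 400), ("bot", 850)]

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    rank = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[rank]

# Bucket by requester class so bots don't skew the human-facing SLI.
buckets: dict[str, list[float]] = defaultdict(list)
for requester, latency in samples:
    buckets[requester].append(latency)

for requester, latencies in buckets.items():
    print(f"{requester}: p95 = {percentile(latencies, 95)} ms")
```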
Understanding SLOs
"SLO's specify a target level for the reliability of your service"
The first rule of SLOs: 100% reliability is not a good objective
Why?
- Trying to reach 100%, 100% of the time, is very expensive in terms of resources
- Much more technically complex
- Users don't need 100% for the service to be acceptable (get close enough that users don't notice the difference; just reliable enough)
- Less than 100% leaves room for new features, as you have resources remaining to develop (error budgets)
SLOs are tied directly to SLIs
- Measured by SLI
- Can be a single target value or range of values
- e.g. SLI <= SLO or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9% (3x 9's), 99.99% (4x 9's)
SLI - Metrics over time which detail the health of a service
Site homepage latency < 300ms over the last 5 minutes at the 95th percentile
SLO - Agreed-upon bounds on how often SLIs must be met
The 95th-percentile homepage latency SLI will be met 99.9% of the time over the next year
SLO - It is critical that there is buy-in from across the organisation. Make sure all stakeholders agree on the SLO, so everyone is on the same team working towards the same goals: developers, contributors, project managers, SREs, the vice president.
Make your SLOs achievable
- Based on past performance
- Users' expectations are strongly tied to past performance
- If no historical data, you need to collect some
- Keep in mind: measurement ≠ user satisfaction, and you may need to adjust your SLOs accordingly
In addition to achievable SLOs, you might have some aspirational SLOs
- Typically higher than your achievable SLOs
- Set a reasonable target and begin measuring
- Compare user feedback to SLOs
Understanding SLAs
"We've determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLO's and now we need to describe what happens if we don't maintain that availability with an SLA"
"An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain"
- Should reliability fail, there are consequences
SLA Characteristics
- A business-level agreement
- SREs are not usually involved with drafting SLAs, except when setting up their SLIs and the corresponding SLOs
- Can be explicit or implicit
- Explicit contracts contain consequences
- Refund for services paid for
- Service cost reduction on sliding scales
- May be offered on a per service basis
41 SLAs in GCP
- Compute Engine
- 4x 9's for instance uptime in multiple zones
- 99.5% uptime for a single instance
- 4x 9's for load balancing uptime
- If these aren't met, and the customer meets its obligations, then the customer could be eligible for financial credits
- Clear definitions for language
SLIs drive SLOs; SLOs inform the SLA
- How does the SLO inform the SLA? An example:
We want our SLO to be at or under 200ms, so we set our SLA at a higher value, e.g. 300ms - and beyond that, "hair-on-fire ms". You want to set your SLA noticeably looser than the SLO because you don't want to be consistently setting your customers' hair on fire. Moreover, your SLAs should be routinely achievable, because you are constantly working towards your objective (SLO).
To summarize:
SLO:
- Internal targets that guide prioritization
- Represents desired user experience
- Missing objectives should have consequences
SLA
- Set level just enough to keep customers
- Incentivizes a minimum level of service
- Looser than corresponding objectives
An SLI is an indicator of how things are at some particular point in time: are things good or bad right now? If our SLI doesn't always confidently tell us that, then it's not a good SLI. An SLO asks whether that indicator has been showing what we want it to for most of the time period that we care about: has the SLI met our objective? In particular, have things been the right amount of both good and bad? An SLA is our agreement on what happens if we don't meet our objective: what's our punishment for doing worse than we agreed we would? The SLI is like the speed of a car - it's travelling at some particular speed right now. The SLO is the speed limit at the upper end and the expected travel time for your trip at the lower end - you want to go approximately some speed over the course of your journey. The SLA is like getting a speeding ticket because you drove too fast, or too slowly.
Making the Most of Risk
Setting Error Budgets
In SRE, an error budget is a good thing, and one of the unifying principles between developers and operations.
"A quantitative measurement shared between the product and SRE teans to balance innovation and stability"
The process for defining an error budget:
- Management buys into the internal SLO
- SLO used to determine the amount of uptime for that particular quarter
- Monitoring services measures actual uptime
- The difference between the expected SLO and the actual uptime is calculated
- Then, if there's sufficient time within the error budget the new release is pushed forward
Error budgets seem like risky business - why not avoid risk at all costs? One of the key DevOps pillars is that failure is unavoidable, and therefore so is risk. SREs manage service reliability largely by managing risk.
- Balances innovation and reliability - An error budget provides a common incentive that allows both product development and SREs to focus on finding the right balance between new features and service availability.
- Manages release velocity - As long as the system's SLOs are met, releases can continue.
- Developers oversee their own risk - When the error budget is large, product developers can take more risks because they have more time to spend. When the budget is nearly drained, product developers themselves will push for more testing and slow down their release velocity, because they don't want to risk exhausting the error budget and stalling their launch completely.
What happens if the error budget is exceeded:
- Typically releases temporarily halted
- Expansion in system testing and development
- Which will overall improve performance
Error Budget
Error budget = 100% - SLO value
e.g.
SLO = 99.8%, so 100% - 99.8% = 0.2%
How will the 0.2% be applied over time?
0.2% == 0.002 (0.2% written as a decimal e.g. 1% == 0.01, 10% = 0.1 etc)
0.002 * 30 day/month * 24 hours/day * 60 minutes/hour = 86.4 minutes/month
This may not seem like much, but this is actual downtime.
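The same downtime arithmetic as a small sketch, assuming a 30-day month as in the note above.

```python
def downtime_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime per period for a time-based error budget."""
    error_budget = (100 - slo_percent) / 100      # e.g. 99.8% -> 0.002
    return error_budget * days * 24 * 60          # minutes in the period

print(downtime_budget_minutes(99.8))    # ~86.4 minutes/month
print(downtime_budget_minutes(99.99))   # ~4.3 minutes/month
```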
What about global services?
- Time-based error budgets are not valid because downtime is almost never universal. Typically an outage will only affect one part of a system at a time, so it is:
- Better to define availability in terms of request success rate
- Referred to as aggregate availability, and the formula is a little different
- Availability = Successful Requests / Total Requests
- e.g. a system that serves 1M requests per day with 99.9% (3x 9's) availability can serve up to 1,000 errors and still hit its target for that day
999000/1000000 = 0.999 == 99.9%
So, if we take A = S/T, then S = A*T. Changing our target availability to 99.8%:
100% - 99.8% = 0.2%, and 0.2% as a decimal is 0.002
Allowed errors = 0.002 * 1,000,000 = 2,000, so Successful Requests = 1,000,000 - 2,000 = 998,000
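The request-based (aggregate availability) budget above, sketched as a helper; the 1M-request figure comes from the example, the loop over two targets is just for comparison.

```python
def allowed_errors(total_requests: int, target_availability: float) -> int:
    """Request-based error budget: failures you can serve and still hit target."""
    return round(total_requests * (1 - target_availability))

total = 1_000_000
for target in (0.999, 0.998):
    errors = allowed_errors(total, target)
    print(f"target {target:.1%}: up to {errors} errors, "
          f"{total - errors} successful requests required")
```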
Error budgets, what are they good for?
- Releasing new features
- Top use by the product team
- Expected system changes
- Roll out enhancements; good to know you are covered should something go wrong
- Inevitable failure in networks, etc
- Planned downtime
- e.g. take the entire system offline to implement a major upgrade
- Risky experiments
- Unforeseen circumstances (unknown unknowns), e.g. a global pandemic!
Defining and Reducing Toil
Toil: "Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows"
- Manual - This characteristic extends to include the running of a script, which, although it saves time, must still be run by hand
- Repetitive - If a task is repeated multiple times, not just once or twice, then the work is toil
- Automatable - Should the task be done by a machine just as well as by a person, you can consider it toil
- Tactical - Toil, by its very nature, is not proactive or strategy driven. Rather, it is reactive and interrupt-driven; e.g. pager alerts
- Devoid of enduring value - Tasks that contribute to adding a permanent improvement to the service are not considered toil, but work that does not change the state, is.
- Scales linearly as service grows - The best-designed services can grow by at least one order of magnitude without added work; tasks that scale up with service size or traffic are toil
What toil isn't: email, expense reports, commuting, meetings - none of these has the one qualification a task needs to be labelled "toil": being tied to a production service. These are instead "overhead".
Toil Reduction Benefits
- Increased engineering time
- Higher team morale, lower burnout
- Increased process standardization
- Enhanced team technical skills (automation)
- Fewer human error outages
- Shorter incident response times
3x Top Tips for Reducing Toil
- Identify toil - Make sure you're differentiating it from overhead or actual engineering
- Estimate the time to automate - Make sure the benefits outweigh the cost (see the sketch after this list)
- Measure everything including context switching e.g. the time it takes you to switch to a new task and become involved in it
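A back-of-the-envelope way to do the "estimate the time to automate" check; all of the numbers here are hypothetical.

```python
def automation_pays_off(minutes_per_run: float, runs_per_month: float,
                        months_horizon: int, automation_hours: float) -> bool:
    """Rough check: does time saved over the horizon exceed the build cost?"""
    saved_hours = minutes_per_run * runs_per_month * months_horizon / 60
    return saved_hours > automation_hours

# e.g. a 15-minute manual task run 20 times a month, judged over a year,
# versus an estimated 24 hours of engineering to automate it
print(automation_pays_off(15, 20, 12, 24))  # True: 60 hours saved vs 24 spent
```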
Risk is at the core of SRE
Generating SRE Metrics
Monitoring Reliability
The best way to measure everything is to monitor everything
"Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times and server lifetimes"
Why monitor?
- Analyzing long-term trends
- Comparing over time or between experiment groups
- Alerting (real-time)
- Exposing in dashboards
- Debugging
- Raw input for business analytics
- Security breach analysis
2x Different Types of Monitoring
| White-Box | Black-Box |
|---|---|
| Metrics exposed by the internals of the system | Testing externally visible behaviour as a user would see it |
| Focus on predicting problems that encroach on your SLO | Symptom-oriented; focused on active problems |
| Heavy use recommended | Moderate use, for critical issues |
| Best for detecting imminent issues | Best for paging on incidents |
Metrics - Numerical measurements representing attributes and events
- GCP's Cloud Monitoring (formerly Stackdriver)
- Collects a large number of metrics from every service at Google
- Provides much less granular information, but in near real-time
- Alerts and dashboards typically use metrics
- Real-time nature means engineers are notified of problems rapidly
- Visualise the most critical data in dashboards on the landing page
Logging - Append-only record of events
- GCP's Cloud Logging (formerly Stackdriver Logging)
- Can contain large volumes of highly granular data
- Often difficult to sift through data to find what you're looking for
- Inherent delay between when an event occurs and when it is visible in logs
- Logs can be processed with a batch system, interrogated with ad hoc queries and visualised with dashboards
- Use logs to find the root cause of an issue, as the information needed is often not available as a metric
- For non-time-sensitive reporting, generate detailed reports using log processing systems
- Logs will nearly always produce more accurate data than metrics
Alerting Principles
"Alerts give timely awareness to problems in your cloud applications so you can resolve the problems quickly"
- Set up monitoring
- Conditions are continuously monitored
- Monitoring can track SLOs
- Can look for a missing metric
- Can watch for thresholds
- Track metrics over time
- Track if condition persists for given amount of time
- Time window (due to technical constraints) less than 24 hours
- Notify when the condition threshold is passed
- Incident created and displayed
- Alerts can be sent via: Email, Text message, apps (slack), pub/sub etc
How do you know when to setup an alert?
- A key factor is how fast you're burning your error budget
- Error budget burn rate = how fast you consume the error budget: the observed failure rate divided by the failure rate your SLO allows (error budget = 100% - SLO); see the sketch below
- Example: If your SLO goal is 98%, then it's acceptable for 2% of the events measured by your SLO to fail before your SLO goal is missed. The burn rate tells you how fast you're consuming that error budget. A burn rate of > 1 indicates that, if the currently measured error rate is sustained over any future compliance period, the service will be out of SLO for that period. To combat that, you need a burn-rate alerting policy which notifies you when your error budget is consumed faster than the threshold you defined, as measured over that alert's compliance period.
100% - 98% = 2% error budget
If that budget works out to 12,000 allowed failures over a 30-day compliance period, that is 400 allowed failures per day; consuming 80% of that daily allowance is a burn rate of 0.8 - because this is < 1, there is still room left in the error budget (400x allowed fails per day)
400/24 hours ≈ 16.7 allowed fails/hour - you might bring this down to a 5-minute period that you'd want to watch in an alerting policy
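A sketch of the burn-rate idea: observed failures measured against the budgeted allowance for the same window. The event volume (20,000/day) and observed failure count are illustrative assumptions, chosen so the numbers line up with the 400/day and 0.8 figures above.

```python
def burn_rate(observed_failures: float, allowed_failures: float) -> float:
    """> 1 means the error budget is being consumed faster than it accrues."""
    return observed_failures / allowed_failures

# SLO 98% with a hypothetical 20,000 events/day -> 400 allowed failures per day
allowed_per_day = 0.02 * 20_000
print(burn_rate(observed_failures=320, allowed_failures=allowed_per_day))  # 0.8
```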
Slow Burn Alerting Policy
- Warns that rate of consumption could exhaust error budget before the end of the compliance period
- Less urgent than fast-burn condition
- Requires longer lookback period (24 hours max)
- Determines how far back in time you're retrieving data
- Threshold should be slightly higher than the baseline
Fast Burn Alerting Policy
- Warns of a sudden, large change in consumption that, if uncorrected, will exhaust error budget quickly
- Shorter lookback period recommended (e.g. 1-2 hours, or even 1-2 minutes)
- Set threshold much higher than the baseline, e.g. 10x to avoid overly sensitive alerts
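One way to picture the two policies: the same burn-rate signal checked over two lookback windows with different thresholds. The 10x fast-burn threshold follows the note above; the 1.5 slow-burn threshold and the sample values are assumptions.

```python
def classify_burn(burn_rate_24h: float, burn_rate_1h: float,
                  slow_threshold: float = 1.5, fast_threshold: float = 10.0) -> str:
    """Fast-burn checks a short window with a high threshold; slow-burn the reverse."""
    if burn_rate_1h >= fast_threshold:
        return "fast burn: page immediately"
    if burn_rate_24h >= slow_threshold:
        return "slow burn: budget will run out before the period ends"
    return "ok"

print(classify_burn(burn_rate_24h=2.0, burn_rate_1h=1.2))   # slow burn
print(classify_burn(burn_rate_24h=3.0, burn_rate_1h=14.0))  # fast burn
```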
Establishing an SLO alerting Policy
- Select SLO to monitor
- Choose the desired SLO; it's best to monitor only one SLO at a time
- Construct a condition for alerting policy
- It's likely you'll have multiple conditions for every alerting policy, such as one for slow burn and another for fast burn
- Identify notification channel
- Multiple notification channels are possible including email, SMS, pager, app, webhook, pub/sub
- Provide documentation
- This is an optional but highly recommended step that provides your team with the information about the alert which can help them resolve the issue
- Create alerting policy
- Bring all the pieces together to complete your alerting policy in either the console or via the API.
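The five steps above, gathered into a single (hypothetical) policy description. This only mirrors the shape of an SLO alerting policy; the field names are not any specific API.

```python
from dataclasses import dataclass, field

@dataclass
class SloAlertPolicy:
    """Hypothetical container for the pieces of an SLO alerting policy."""
    slo_name: str
    conditions: list[dict] = field(default_factory=list)   # slow-burn, fast-burn, ...
    notification_channels: list[str] = field(default_factory=list)
    documentation: str = ""

policy = SloAlertPolicy(
    slo_name="homepage-latency-99.9",
    conditions=[
        {"type": "slow_burn", "lookback_hours": 24, "threshold": 1.5},
        {"type": "fast_burn", "lookback_hours": 1, "threshold": 10.0},
    ],
    notification_channels=["email:oncall@example.com", "sms:+0000000000"],
    documentation="Homepage latency SLO is burning; see the runbook for rollback steps.",
)
print(policy.slo_name, len(policy.conditions), "conditions")
```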
Investigating SRE Tools
Dev Tools
-
Kubernetes Engine - A managed production-ready environment for running containerized applications
-
Container Registry - Single place for your team to securely manage container images used by kubernetes
-
Cloud Build - Automates the build process; a service that executes your builds in a series of steps, where each step is run in a container
-
Cloud Source Repositories - Fully managed private Git repositories with integrations for CI, delivery and deployment
-
Spinnaker for Google Cloud - Integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline and integrate security and compliance in the process
Ops Tools
- Cloud Monitoring - Tracking metrics, provides visibility into the performance, uptime and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from Google Cloud
- Cloud Debugger - Lets you inspect the state of a running application in real time, without stopping or slowing it down
- Cloud Trace - A distributed tracing system that collects latency data from your application and displays it in the console
- Cloud profiler - Continuously gathers CPU usage and memory-allocation information from your production applications
Metrics are always displayed in an aggregate sense - how many events happened during this time period. A metric data point is not really about individual events but about status, like "the system was handling 1,200 connection requests per second at 12:05:58".

Consider the latency of metrics and logging information: the time it takes for data to flow from the part of the system that has the information to the monitoring system where you can see it show up. Why isn't that instantaneous? What happens in between? This is a good exercise to go through, because not only will you understand the measurement part of SRE much better, you'll also be able to debug things when your own measurements have trouble - what might have gone wrong and could be blocking the flow of information?

You can turn logs into metrics with filters - looking for certain types of events and then having metrics that show how many you're getting, sometimes even alerting on those events. Log-based metrics will always have additional latency.
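A toy version of "turn logs into metrics with filters": scan an append-only log, count events that match a filter, and expose the count as a metric. The log lines and the metric name are invented for the example.

```python
import re

log_lines = [
    '2024-05-01T12:05:58Z GET /checkout 500 requests=1200',
    '2024-05-01T12:05:59Z GET /home 200 requests=1187',
    '2024-05-01T12:06:01Z GET /checkout 503 requests=1210',
]

# Filter: any 5xx status code becomes a counted event.
error_filter = re.compile(r'\s5\d\d\s')
error_count = sum(1 for line in log_lines if error_filter.search(line))

print(f"log-based metric server_error_count = {error_count}")  # 2
```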
Reacting to Incidents
Handling Incident Response
Poorly managed incident response
The engineer on duty freaks out because a data centre has gone down; the ops team decides to roll back to the previous version (shooting from the hip). A third DC goes down. The VP gets angry customer calls, and that's her first knowledge of the problem. Simon (the know-it-all), even though he's not on call and not communicating with anyone else, thinks he can fix the issue and rolls out his own fix - more DCs go down. What's wrong here? The team needs to step back from trying to find the technical solution first; there is poor communication to the VP, and freelancing agents shouldn't have access.
Well-managed incident response
Clair, the engineer on duty, receives the alert and delegates Alex as incident commander. Alex brings in Ravi as part of the operations team. They don't fix the issue within their time zone, so they pass it on to a new incident commander. Along the way, they've been contributing to an incident report, keeping the vice president informed and able to sign off on customer messaging for complete transparency. Simon is not needed and works on his side project.
Initiate the protocol and appoint an incident commander. Work with one operations team and pass control at the end of the day if the issue persists. The VP and all stakeholders are informed at the start of the incident and kept in the loop, so they can coordinate the public response. The freelance agent isn't wanted and is only called in if necessary.
- Separation of Responsibilities
- Specific roles should be designated to team members, each with full autonomy in their role.
Roles should include:
- Incident commander
- Person in charge during incident, designating responsibilities and taking all roles not designated
- Operational Team
- Personnel designated with actual responses to incident. Only people that are authorized to take any action, e.g. rolling out fixes
- Communication Lead
- Public face of incident response, responsible for issuing updates to stakeholders
- Planning Lead
- Supports Ops team with long-term actions, such as filing bugs, arranging hand-off if necessary and tracking system changes
- Established Command Post
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
- Live Incident State Document
- A shared document that reflects the current state of the incident, updated as necessary and retained for postmortem
- Clear, Real-time Handoff
- If the day is ending and the issue remains unresolved, an explicit handoff to another incident commander must take place
3x Questions to Determine an Incident
If the answer to any of these questions is yes, declare an incident:
- Need another team?
- Outage visible to users?
- Unresolved after an hour?
Incident Management Best Practices
- Develop and document procedures
- Prioritize damage and restore service - Take care of the biggest issues first
- Trust team members - Give team autonomy they need without second guessing
- If overwhelmed, get help
- Consider response alternatives
- Practice procedure routinely
- Share the lead in all the roles - Rotate roles among team members, so everyone gains the experience they need to manage the incident
Managing Service Lifecycle
How does an SRE view the lifecycle of a service?
SRE Engagement Over Service Lifecycle (Graph)
Architecture and Development (1st of 5 stages)
- Peaks higher than at any other point during the lifecycle
- Implementing Best Practices with the Dev Team
- Recommend best infrastructure systems
- Co-design part of service with dev team
- Avoid costly re-designs
Active Development
- SREs begin productionizing the service
- Planning for capacity
- Adding resources for redundancy
- Planning for spike and overloads
- Implementing load balancing
- Adding monitoring, alerting and performance tuning that will become so important when developing their SLIs & SLOs
Limited Availability (Alpha and Beta programmes)
- Measure increasing performance load on the service (begin to measure and track their SLIs)
- Evaluate reliability
- Define SLOs which will lead to SLA specifics
- Build capacity models
- Establish incident responses shared with dev team so everyone is on the same page and knows common tactics to take when problems arise
General Availability (hopefully the longest stage)
- Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs and making sure that everything is in accordance so the error budget is not exhausted and new features can be rolled out in a timely fashion
Deprecation
- SREs operate existing system
- Support the transition with the dev team
- Work with dev team on designing new system; adjust staffing accordingly
"SRE principals aim to maximize the engineering velocity of developer teams while keeping products reliable"
Ensuring Healthy Operations Collaboration
Post-mortem or Retrospective
What a postmortem is not:
- It's not a funeral
- It's not a party
A postmortem is an investigation
- Get metadata
- What systems were affected?
- What personnel were involved?
- Include machine readable data:
- Time to identify
- Time to act
- Time to resolve
- Recreate timeline
- When and how was the incident reported?
- When did the response start?
- When and how did we make it better?
- When was it over?
- Generate report
- Report will be initiated by the incident commander
- All participants need to add their own details on actions taken
- Was there anything done that needs to be rolled back?
Remember: NO BLAME
- No one is at fault
- No one will be shamed
- No one will be fired
- Everyone learns
Production Meeting Collaboration
- Upcoming production changes
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on. This is called near-term horizon visibility
- Metrics
- Review current SLOs, even if they are in line. Track how latency figures, CPU utilization figures, etc.. change over time
- Outages
- The big picture portion of the meeting can be devoted to a synopsis of the postmortem or working on the process
- Paging Events
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
- Nonpaging Events
- What events didn't get paged, but probably should have?
- What events occurred that are not pageable but require attention?
- What events are not pageable and do not require attention?