- Reduce toil (i.e. repetitive manual labour!) and increase time spent on engineering
- Monitor everything over time
-------------------------------------
Why "Reliability"
- Most important: does the product work?
- Reliability is the absence of errors
- Unstable service likely indicates a variety of issues
- Must attend to reliability all the time
class SRE implements DevOps: SRE is the "how" that implements the "what" of DevOps
> Make better software, faster
### Understanding SLIs
SRE breaks down into 3 distinct functions:
1. Define availability
    1. SLO
2. Determine the level of availability
    1. SLI - A quantifiable measure of reliability: metrics over time, specific to a user journey such as request/response, data processing or storage. Examples:
        1. Request latency - How long it takes to return a response to a request
        2. Failure rate - The fraction of all requests received that fail: (unsuccessful requests / all requests)
        3. Batch throughput - Proportion of time in which the data processing rate is faster than a threshold
3. Plan in case of failure
    1. SLA
> Each maps to a key component; SLO, SLI, SLA
What's a User Journey?
* Sequence of tasks central to user experience and crucial to service
* e.g. Online shopping journeys
* Product search
* Add to cart
* Checkout
Request/Response Journey:
* Availability - Proportion of valid requests served successfully
* Latency - Proportion of valid requests served faster than a threshold
* Quality - Proportion of valid requests served maintaining quality
> None of these maps to a user journey on its own; rather, they are all components of one
Data processing journey: Might include a different set of SLIs
* Freshness - Proportion of valid data updated more recently than a threshold
* Correctness - Proportion of valid data producing correct output
* Throughput - Proportion of time where the data processing rate is faster than a threshold
* Coverage - Proportion of valid data processed successfully
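As a rough illustration of how a couple of these data-processing SLIs could be computed, here is a minimal Python sketch; the record layout and the one-hour freshness threshold are assumptions for the example, not taken from any particular pipeline.
```
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
freshness_threshold = timedelta(hours=1)     # assumed threshold for "recently updated"

# Hypothetical pipeline records: (last_updated, processed_successfully)
records = [
    (now - timedelta(minutes=10), True),
    (now - timedelta(minutes=50), True),
    (now - timedelta(hours=3), False),
    (now - timedelta(minutes=5), True),
]

freshness = sum(1 for updated, _ in records if now - updated < freshness_threshold) / len(records)
coverage = sum(1 for _, ok in records if ok) / len(records)
print(f"Freshness SLI: {freshness:.0%}, Coverage SLI: {coverage:.0%}")
```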
Google's 4 Golden Signals
* Latency - The time it takes for your service to fulfill a request
* Errors - The rate at which your service fails
* Traffic - How much demand is directed at your service
* Saturation - A measure of how close to fully utilized the service's resources are
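A minimal sketch of how the four golden signals might be derived from a batch of request records; the record shape, the nearest-rank percentile helper and the CPU reading are illustrative assumptions, not a real monitoring API.
```
def percentile(values, pct):
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

# Hypothetical one-minute batch of request records: (latency_ms, succeeded)
batch = [(120, True), (95, True), (300, False), (150, True), (80, True)]
cpu_utilisation = 0.72                      # assumed reading from the infrastructure layer

latency_p95 = percentile([lat for lat, _ in batch], 95)          # Latency
error_rate = sum(1 for _, ok in batch if not ok) / len(batch)    # Errors
traffic_qps = len(batch) / 60                                    # Traffic (requests per second)
saturation = cpu_utilisation                                     # Saturation (proxy: busiest resource)

print(f"p95 latency={latency_p95}ms, errors={error_rate:.0%}, "
      f"traffic={traffic_qps:.2f} qps, saturation={saturation:.0%}")
```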
Transparent SLIs are available within the GCP dashboard under APIs & Services
**The SLI Equation:**
SLI = (Good Events / Valid Events) * 100
Valid - Known-bad events are excluded from the SLI's denominator, e.g. HTTP 400 responses caused by client error
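A minimal sketch of the SLI equation in Python, assuming a simple request log where 4xx responses are treated as invalid (client-caused) events and "good" means a successful response served in under 300 ms; the data and thresholds are invented for illustration.
```
def compute_sli(good_events: int, valid_events: int) -> float:
    """SLI = (good events / valid events) * 100, expressed as a percentage."""
    if valid_events == 0:
        raise ValueError("need at least one valid event to compute an SLI")
    return good_events / valid_events * 100

# Invented request log: (HTTP status, latency in ms)
requests = [(200, 120), (200, 310), (500, 95), (404, 40), (200, 180)]

# 4xx responses are client errors, so they are excluded from the valid events.
valid = [r for r in requests if not (400 <= r[0] < 500)]
# "Good" here means a successful response served in under 300 ms.
good = [r for r in valid if r[0] < 400 and r[1] < 300]

print(f"Latency/availability SLI: {compute_sli(len(good), len(valid)):.1f}%")
```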
**Bad SLI** - Variance and overlap in metrics prior to and during outages are problematic; graph contains up and down spikes during an outage
**Good SLI** - Stable signal with a strong correlation to outage is best; graph is smooth.
SLI - Metrics over time which detail the health of a service
```
Site homepage latency: requests < 300ms over the last 5 minutes @ 95th percentile
```
SLO - Agreed-upon bounds on how often SLIs must be met
```
The 95th-percentile homepage latency SLI will be met 99.9% of the time over the next year
```
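A sketch of how such an SLO might be evaluated over a compliance period, assuming you already have one pass/fail SLI result per 5-minute window; the sample numbers are invented.
```
def slo_compliance(sli_met_per_window: list) -> float:
    """Percentage of measurement windows in which the SLI target was met."""
    return sum(sli_met_per_window) / len(sli_met_per_window) * 100

# One boolean per 5-minute window: did 95th-percentile homepage latency stay under 300 ms?
windows = [True] * 998 + [False] * 2      # an invented year-to-date sample

slo_target = 99.9                         # "met 99.9% of the time"
compliance = slo_compliance(windows)
status = "within SLO" if compliance >= slo_target else "SLO missed"
print(f"Compliance: {compliance:.2f}% (target {slo_target}%) -> {status}")
```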
SLO - It is critical that there is buy-in from across the organisation. Make sure all stakeholders agree on the SLO, so everyone is on the same team working towards the same goals: developers, contributors, project managers, SREs, the vice president
#### Make your SLOs achievable
- Based on past performance
- Users' expectations are strongly tied to past performance
- If there is no historical data, you need to collect some
- Keep in mind: the measurement does not necessarily equal user satisfaction, and you may need to adjust your SLOs accordingly
#### In addition to achievable SLOs, you might have some aspirational SLOs
_"We've determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLO's and now we need to describe what happens if we don't maintain that availability with an SLA"_
We want our SLO to be at or under 200ms, so we set our SLA at a higher value, e.g. 300ms, with anything beyond that being "hair on fire ms". You want to set your SLA meaningfully looser than the SLO because you don't want to be consistently setting your customers' hair on fire. Moreover, your SLAs should be routinely achievable, because you are constantly working towards your objective (the SLO).
An SLI is an indicator of how things are at some particular point in time: are things good or bad right now? If our SLI doesn't always confidently tell us that, then it's not a good SLI. An SLO asks whether that indicator has been showing what we want it to for most of the time period that we care about: has the SLI met our objective? In particular, have things been the right amount of both good and bad? An SLA is our agreement on what happens if we don't meet our objective: what's our punishment for doing worse than we agreed we would? The SLI is like the speed of a car: it's travelling at some particular speed right now. The SLO is the speed limit at the upper end, and the expected travel time for your trip at the lower end; you want to go approximately some speed over the course of your journey. The SLA is like getting a speeding ticket because you drove too fast, or drove too slowly.
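To make the 200ms SLO / 300ms SLA example above concrete, here is a small hypothetical helper that classifies a single latency measurement against those two thresholds; the function and its labels are illustrative, not part of any standard tooling.
```
SLO_MS = 200   # internal objective from the example above
SLA_MS = 300   # external agreement, deliberately looser than the SLO

def classify_latency(latency_ms: float) -> str:
    """Where does one measurement sit relative to the objective and the agreement?"""
    if latency_ms <= SLO_MS:
        return "within SLO"
    if latency_ms <= SLA_MS:
        return "SLO missed, SLA still intact (burning error budget)"
    return "SLA breached"

for latency in (150, 250, 450):
    print(f"{latency} ms -> {classify_latency(latency)}")
```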
In SRE, an error budget is a good thing, and one of the unifying principles between developers and operations.
_"A quantitative measurement shared between the product and SRE teams to balance innovation and stability"_
The process for defining an error budget (a worked sketch follows the list):
1. Management buys into the internal SLO
2. The SLO is used to determine the expected amount of uptime for that particular quarter
3. The monitoring service measures actual uptime
4. The difference between the expected (SLO) uptime and the actual uptime is calculated
5. Then, if there's sufficient room left in the error budget, the new release is pushed forward
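A worked sketch of that arithmetic for a hypothetical time-based 99.9% availability SLO over a quarter; the downtime figure is invented.
```
# Illustrative quarterly error budget for a time-based 99.9% availability SLO.
slo = 0.999
quarter_minutes = 90 * 24 * 60                       # ~90-day compliance period

error_budget_minutes = (1 - slo) * quarter_minutes   # about 130 minutes of allowed downtime
actual_downtime_minutes = 42                         # hypothetical figure from monitoring

remaining = error_budget_minutes - actual_downtime_minutes
print(f"Budget: {error_budget_minutes:.0f} min, used: {actual_downtime_minutes} min, "
      f"remaining: {remaining:.0f} min")
print("OK to push the release" if remaining > 0 else "Freeze releases and focus on reliability")
```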
Error budgets seem like risky business, so why not avoid risk at all costs? One of the key DevOps pillars is that failure is unavoidable, and therefore so is risk. SREs manage service reliability largely by managing risk.
- Balances innovation and reliability
An error budget provides a common incentive that allows both product development and SREs to focus on finding the right balance between new features and service availability.
- Manages release velocity
As long as the system's SLOs are met, releases can continue.
- Developers oversee own risk
When the error budget is large, product developers can take more risks because they have more time to spend. When the budget is nearly drained, product developers themselves will push for more testing and slow down their release velocity, because they don't want to risk exhausting the error budget and stalling their launch completely.
- Time-based error budgets are rarely valid because downtime is almost never universal. Typically an outage will only affect one part of a system at a time, so it's better to measure the error budget in terms of events (e.g. the proportion of failed requests) rather than elapsed time.
Toil: _"Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows"_
1. Manual - This characteristic extends to the running of a script, which, although it saves time, must still be run by hand
2. Repetitive - If a task is repeated multiple times, not just once or twice, then the work is toil
3. Automatable - If a machine could do the task just as well as a person, consider it toil
4. Tactical - Toil, by its very nature, is not proactive or strategy-driven. Rather, it is reactive and interrupt-driven, e.g. pager alerts
5. Devoid of enduring value - Tasks that add a permanent improvement to the service are not considered toil; work that does not permanently change the state of the service is
6. Scales linearly as the service grows - A well-designed service can grow by at least one order of magnitude without added work; tasks whose load scales with service size or traffic are toil
What toil isn't: email, expense reports, commuting, meetings. None of these meets the key qualification a task needs to be labelled "toil": being tied to a production service. These are instead "overhead".
The best way to measure everything is to monitor everything
_"Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times and server lifetimes"_
_"Alerts give timely awareness to problems in your cloud applications so you can resolve the problems quickly"_
1. Set up monitoring
- Conditions are continuously monitored
- Monitoring can track SLOs
- Can look for a missing metric
- Can watch for thresholds
2. Track metrics over time
- Track whether the condition persists for a given amount of time
- The time window must (due to technical constraints) be less than 24 hours
3. Notify when the condition is breached
- An incident is created and displayed
- Alerts can be sent via email, text message, apps (e.g. Slack), Pub/Sub, etc. (see the sketch below)
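A minimal sketch of those three steps (watch a condition, require it to persist over a time window, then notify); the metric reader and notification callables are hypothetical placeholders, not a real monitoring API.
```
import time

THRESHOLD = 0.02          # alert if the error rate exceeds 2%
PERSIST_SECONDS = 300     # the condition must hold for 5 minutes before an incident is raised

def watch_metric(read_error_rate, notify, poll_seconds=30):
    """Poll a metric and notify only once the breach has persisted for the whole window."""
    breach_started = None
    while True:                                 # runs as a long-lived watcher
        rate = read_error_rate()                # hypothetical callable: float, or None if the metric is missing
        if rate is None or rate > THRESHOLD:    # a missing metric is also treated as a breach
            if breach_started is None:
                breach_started = time.time()
            elif time.time() - breach_started >= PERSIST_SECONDS:
                notify(f"Error rate {rate} above {THRESHOLD:.0%} for {PERSIST_SECONDS}s")
                breach_started = None           # reset once the incident has been created
        else:
            breach_started = None
        time.sleep(poll_seconds)
```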
How do you know when to set up an alert?
- A key factor is how fast you're burning your error budget
- Error budget burn rate = (observed error rate over a set time) / (100% - SLO)
- Example: If your SLO goal is 98%, then it's acceptable for 2% of the events measured by your SLO to fail before your SLO goal is missed. The burn rate tells you how fast you're consuming that error budget. A burn rate of > 1 indicates that, if the currently measured error rate is sustained over any future compliance period, the service will be out of SLO for that period. To combat that, you need a burn-rate alerting policy, which will notify you when your error budget is consumed faster than the threshold you defined, as measured over that alert's compliance period.
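A sketch of the burn-rate arithmetic using the 98% SLO example above; the observed error rates are invented for illustration.
```
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    error_budget = 1 - slo              # e.g. 2% of events may fail under a 98% SLO
    return observed_error_rate / error_budget

slo = 0.98
for observed in (0.01, 0.02, 0.06):
    rate = burn_rate(observed, slo)
    verdict = "will blow the budget if sustained" if rate > 1 else "sustainable"
    print(f"error rate {observed:.0%} -> burn rate {rate:.1f} ({verdict})")
```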
Dev Tools
- Kubernetes Engine - A managed, production-ready environment for running containerized applications
- Container Registry - A single place for your team to securely manage container images used by Kubernetes
- Cloud Build - Automates the build process. A service that executes your builds as a series of steps, where each step is run in a container
- Cloud Source Repositories - Fully managed private Git repositories with integrations for CI, delivery and deployment
- Spinnaker for Google Cloud - Integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline and integrate security and compliance in the process
Ops Tools
- Cloud Monitoring - Tracking metrics, provides visibility into the performance, uptime and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from Google Cloud
- Cloud Debugger - Lets you inspect the state of a running application in real time, without stopping or slowing it down
- Cloud Trace - A distributed tracing system that collects latency data from your application and displays it in the console
- Cloud Profiler - Continuously gathers CPU usage and memory-allocation information from your production applications
Metrics are always displayed in an aggregate sense: how many events happened during this time period. A metric is not really about individual events but about status, e.g. "the system was handling 1200 connection requests per second at 12:05:58". Consider the latency of metrics and logging information: the time it takes for data to flow from the part of the system that has the information to the monitoring system where you can see it show up. Why isn't that instantaneous? What happens in between? This is a good exercise to go through, because not only will you understand the measurement part of SRE much better, you'll also be able to debug things when your own measurements have trouble. What might have gone wrong and could be blocking the flow of information?

You can turn logs into metrics with filters: look for certain types of events, produce metrics that show how many you're getting, and sometimes even alert on those events. Log-based metrics will always have additional latency.
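A toy sketch of turning logs into a metric with a filter, counting ERROR lines per minute; the log format and sample lines are assumptions for the example.
```
from collections import Counter

# Hypothetical application log lines: "<ISO timestamp> <LEVEL> <message>"
log_lines = [
    "2024-05-01T12:05:01Z ERROR connection reset",
    "2024-05-01T12:05:14Z INFO request served",
    "2024-05-01T12:06:02Z ERROR connection reset",
    "2024-05-01T12:06:40Z ERROR upstream timeout",
]

# Filter: keep only ERROR events, then count per minute -> a log-based metric.
errors_per_minute = Counter(
    line.split()[0][:16]          # timestamp truncated to the minute bucket
    for line in log_lines
    if " ERROR " in line
)
print(errors_per_minute)          # e.g. Counter({'2024-05-01T12:06': 2, '2024-05-01T12:05': 1})
```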
The engineer on duty panics because a datacenter has gone down, and the ops team decides to roll back to the previous version (they're just shooting from the hip). A third DC goes down; the VP gets angry customer calls, which is her first knowledge of the problem. Simon (the know-it-all), even though he's not on-call and not communicating with anyone else, thinks he can fix the issue and rolls out his fix; more DCs go down. What's wrong here? The team needs to step back from trying to find the technical solution immediately; there was poor communication to the VP, and freelancing agents shouldn't have access.
Well-managed incident response
Clair, the engineer on duty, receives the message and delegates Alex as incident commander. Alex brings in Ravi as part of the operations team. They don't fix the issue within their timezone, so they pass it on to a new incident commander. Along the way they've been contributing to an incident report, keeping the vice-president informed and able to sign off on customer messaging for complete transparency. Simon is not needed and works on his side project.
The on-duty engineer initiates the protocol and appoints an incident commander, who works with one operations team and passes control at the end of the day if the issue persists. The VP and all stakeholders are informed at the start of the incident and kept in the loop, so they can coordinate the public response. The freelance agent is not wanted and is only called in if necessary.
1. Separation of Responsibilities
- Specific roles should be designated to team members, each with full autonomy in their role.
Roles should include:
- Incident commander
- Person in charge during incident, designating responsibilities and taking all roles not designated
- Operational Team
- The personnel responding to the incident; the only people authorized to take any action, e.g. rolling out fixes
- Communication Lead
- Public face of the incident response, responsible for issuing updates to stakeholders
- Planning Lead
- Supports Ops team with long-term actions, such as filing bugs, arranging hand-off if necessary and tracking system changes
2. Established Command Post
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
**Architecture and Development (1st of 5 stages)**
- SRE engagement peaks higher than at any other point during the lifecycle
- Implementing Best Practices with the Dev Team
- Recommend best infrastructure systems
- Co-design part of service with dev team
- Avoid costly re-designs
**Active Development**
- SREs begin productionizing the service
- Planning for capacity
- Adding resources for redundancy
- Planning for spikes and overloads
- Implementing load balancing
- Adding the monitoring, alerting and performance tuning that will become so important when developing their SLIs & SLOs
**Limited Availability (Alpha and Beta programmes)**
- Measure increasing performance load on the service (begin to measure and track their SLIs)
- Evaluate reliability
- Define SLOs which will lead to SLA specifics
- Build capacity models
- Establish incident responses shared with dev team so everyone is on the same page and knows common tactics to take when problems arise
**General Availability (hopefully the longest stage)**
- Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs, making sure everything stays in accordance so the error budget is not exhausted and new features can be rolled out in a timely fashion
**Deprecation**
- SREs operate existing system
- Support the transition with the dev team
- Work with dev team on designing new system; adjust staffing accordingly
_"SRE principals aim to maximize the engineering velocity of developer teams while keeping products reliable"_
- Report will be initiated by the incident commander
- All participants need to add their own details on actions taken
- Was there anything done that needs to be rolled back?
**Remember: NO BLAME**
- No one is at fault
- No one will be shamed
- No one will be fired
- Everyone learns
Production Meeting Collaboration
1. Upcoming production changes
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on (a sketch of such a record follows below). This is called near-term horizon visibility
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
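As a sketch of the kind of record that near-term horizon visibility might track for each upcoming change; the field names and sample entry are assumptions, not a standard schema.
```
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ProductionChange:
    """One entry in the upcoming-changes list reviewed at the production meeting."""
    description: str
    start_time: datetime
    duration: timedelta
    expected_effect: str
    owner: str = "unassigned"

upcoming = [
    ProductionChange(
        description="Roll out v2.3 of the checkout service",
        start_time=datetime(2024, 5, 2, 9, 0),
        duration=timedelta(hours=2),
        expected_effect="Brief elevated latency during instance restarts",
        owner="checkout-oncall",
    ),
]
```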