- Reduce toil (i.e. repetitive manual labour!) and increase time spent on engineering
- Monitor everything over time
-------------------------------------
Why "Reliability"
- Most important: does the product work?
- Reliability is the absence of errors
- Unstable service likely indicates a variety of issues
- Must attend to reliability all the time
class SRE implements DevOps: SRE is the "how" that implements the "what" of DevOps
> Make better software, faster
### Understanding SLIs
SRE breaks down into 3 distinct functions:
1. Define availability
    1. SLO
2. Determine the level of availability
    1. SLI - A quantifiable measure of reliability: metrics over time, specific to a user journey such as request/response, data processing or storage. Examples:
        1. Request latency - How long it takes to return a response to a request
        2. Failure rate - The fraction of all requests received that fail: (unsuccessful requests / all requests)
        3. Batch throughput - Proportion of time in which the data processing rate is faster than a threshold
3. Plan in case of failure
    1. SLA
> Each maps to a key component; SLO, SLI, SLA
What's a User Journey?
* Sequence of tasks central to user experience and crucial to service
* e.g. Online shopping journeys
* Product search
* Add to cart
* Checkout
Request/Response Journey:
* Availability - Proportion of valid requests served successfully
* Latency - Proportion of valid requests served faster than a threshold
* Quality - Proportion of valid requests served maintaining quality
> None of these maps to a user journey on its own; rather, they are all components of one
Data processing journey: Might include a different set of SLIs
* Freshness - Proportion of valid data updated more recently than a threshold
* Correctness - Proportion of valid data producing correct output
* Throughput - Proportion of time where the data processing rate is faster than a threshold
* Coverage - Proportion of valid data processed successfully
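As a rough illustration of how a couple of these data-processing SLIs could be computed, here is a minimal Python sketch; the record layout and the one-hour freshness threshold are assumptions for the example, not taken from any particular pipeline.
```
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc)
freshness_threshold = timedelta(hours=1)     # assumed threshold for "recently updated"

# Hypothetical pipeline records: (last_updated, processed_successfully)
records = [
    (now - timedelta(minutes=10), True),
    (now - timedelta(minutes=50), True),
    (now - timedelta(hours=3), False),
    (now - timedelta(minutes=5), True),
]

freshness = sum(1 for updated, _ in records if now - updated < freshness_threshold) / len(records)
coverage = sum(1 for _, ok in records if ok) / len(records)
print(f"Freshness SLI: {freshness:.0%}, Coverage SLI: {coverage:.0%}")
```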
Google's 4 Golden Signals
* Latency - The time it takes for your service to fulfill a request
* Errors - The rate at which your service fails
* Traffic - How much demand is directed at your service
* Saturation - A measure of how close to fully utilized the service's resources are
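A minimal sketch of how the four golden signals might be derived from a batch of request records; the record shape, the nearest-rank percentile helper and the CPU reading are illustrative assumptions, not a real monitoring API.
```
def percentile(values, pct):
    """Nearest-rank percentile; good enough for a sketch."""
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[idx]

# Hypothetical one-minute batch of request records: (latency_ms, succeeded)
batch = [(120, True), (95, True), (300, False), (150, True), (80, True)]
cpu_utilisation = 0.72                      # assumed reading from the infrastructure layer

latency_p95 = percentile([lat for lat, _ in batch], 95)          # Latency
error_rate = sum(1 for _, ok in batch if not ok) / len(batch)    # Errors
traffic_qps = len(batch) / 60                                    # Traffic (requests per second)
saturation = cpu_utilisation                                     # Saturation (proxy: busiest resource)

print(f"p95 latency={latency_p95}ms, errors={error_rate:.0%}, "
      f"traffic={traffic_qps:.2f} qps, saturation={saturation:.0%}")
```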
Transparent SLIs are available within the GCP dashboard under APIs & Services
**The SLI Equation:**
SLI = (Good Events / Valid Events) * 100
Valid - Known-bad events are excluded from the SLI's denominator, e.g. HTTP 400 responses caused by client error
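A minimal sketch of the SLI equation in Python, assuming a simple request log where 4xx responses are treated as invalid (client-caused) events and "good" means a successful response served in under 300 ms; the data and thresholds are invented for illustration.
```
def compute_sli(good_events: int, valid_events: int) -> float:
    """SLI = (good events / valid events) * 100, expressed as a percentage."""
    if valid_events == 0:
        raise ValueError("need at least one valid event to compute an SLI")
    return good_events / valid_events * 100

# Invented request log: (HTTP status, latency in ms)
requests = [(200, 120), (200, 310), (500, 95), (404, 40), (200, 180)]

# 4xx responses are client errors, so they are excluded from the valid events.
valid = [r for r in requests if not (400 <= r[0] < 500)]
# "Good" here means a successful response served in under 300 ms.
good = [r for r in valid if r[0] < 400 and r[1] < 300]

print(f"Latency/availability SLI: {compute_sli(len(good), len(valid)):.1f}%")
```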
**Bad SLI** - Variance and overlap in metrics prior to and during outages are problematic; graph contains up and down spikes during an outage
**Good SLI** - Stable signal with a strong correlation to outage is best; graph is smooth.
SLI - Metrics over time which detail the health of a service
```
Site homepage latency: requests < 300ms over the last 5 minutes @ 95th percentile
```
SLO - Agreed-upon bounds on how often SLIs must be met
```
The 95th-percentile homepage latency SLI will be met 99.9% of the time over the next year
```
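A sketch of how such an SLO might be evaluated over a compliance period, assuming you already have one pass/fail SLI result per 5-minute window; the sample numbers are invented.
```
def slo_compliance(sli_met_per_window: list) -> float:
    """Percentage of measurement windows in which the SLI target was met."""
    return sum(sli_met_per_window) / len(sli_met_per_window) * 100

# One boolean per 5-minute window: did 95th-percentile homepage latency stay under 300 ms?
windows = [True] * 998 + [False] * 2      # an invented year-to-date sample

slo_target = 99.9                         # "met 99.9% of the time"
compliance = slo_compliance(windows)
status = "within SLO" if compliance >= slo_target else "SLO missed"
print(f"Compliance: {compliance:.2f}% (target {slo_target}%) -> {status}")
```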
SLO - It is critical that there is buy-in from across the organisation. Make sure all stakeholders agree on the SLO, so everyone is on the same team working towards the same goals: developers, contributors, project managers, SREs, the vice president
#### Make your SLOs achievable
- Based on past performance
- Users' expectations are strongly tied to past performance
- If there is no historical data, you need to collect some
- Keep in mind: the measurement does not necessarily equal user satisfaction, and you may need to adjust your SLOs accordingly
#### In addition to achievable SLOs, you might have some aspirational SLOs
_"We've determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLO's and now we need to describe what happens if we don't maintain that availability with an SLA"_
We want our SLO to be at or under 200ms, so we set our SLA at a higher value, e.g. 300ms, with anything beyond that being "hair on fire ms". You want to set your SLA meaningfully looser than the SLO because you don't want to be consistently setting your customers' hair on fire. Moreover, your SLAs should be routinely achievable, because you are constantly working towards your objective (the SLO).
An SLI is an indicator of how things are at some particular point in time: are things good or bad right now? If our SLI doesn't always confidently tell us that, then it's not a good SLI. An SLO asks whether that indicator has been showing what we want it to for most of the time period that we care about: has the SLI met our objective? In particular, have things been the right amount of both good and bad? An SLA is our agreement on what happens if we don't meet our objective: what's our punishment for doing worse than we agreed we would? The SLI is like the speed of a car: it's travelling at some particular speed right now. The SLO is the speed limit at the upper end, and the expected travel time for your trip at the lower end; you want to go approximately some speed over the course of your journey. The SLA is like getting a speeding ticket because you drove too fast, or drove too slowly.
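To make the 200ms SLO / 300ms SLA example above concrete, here is a small hypothetical helper that classifies a single latency measurement against those two thresholds; the function and its labels are illustrative, not part of any standard tooling.
```
SLO_MS = 200   # internal objective from the example above
SLA_MS = 300   # external agreement, deliberately looser than the SLO

def classify_latency(latency_ms: float) -> str:
    """Where does one measurement sit relative to the objective and the agreement?"""
    if latency_ms <= SLO_MS:
        return "within SLO"
    if latency_ms <= SLA_MS:
        return "SLO missed, SLA still intact (burning error budget)"
    return "SLA breached"

for latency in (150, 250, 450):
    print(f"{latency} ms -> {classify_latency(latency)}")
```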
In SRE, an error budget is a good thing, and one of the unifying principles between developers and operations.
_"A quantitative measurement shared between the product and SRE teams to balance innovation and stability"_
The process for defining an error budget (a worked sketch follows the list):
1. Management buys into the internal SLO
2. The SLO is used to determine the expected amount of uptime for that particular quarter
3. The monitoring service measures actual uptime
4. The difference between the expected (SLO) uptime and the actual uptime is calculated
5. Then, if there's sufficient room left in the error budget, the new release is pushed forward
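A worked sketch of that arithmetic for a hypothetical time-based 99.9% availability SLO over a quarter; the downtime figure is invented.
```
# Illustrative quarterly error budget for a time-based 99.9% availability SLO.
slo = 0.999
quarter_minutes = 90 * 24 * 60                       # ~90-day compliance period

error_budget_minutes = (1 - slo) * quarter_minutes   # about 130 minutes of allowed downtime
actual_downtime_minutes = 42                         # hypothetical figure from monitoring

remaining = error_budget_minutes - actual_downtime_minutes
print(f"Budget: {error_budget_minutes:.0f} min, used: {actual_downtime_minutes} min, "
      f"remaining: {remaining:.0f} min")
print("OK to push the release" if remaining > 0 else "Freeze releases and focus on reliability")
```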
Error budgets seem like risky business, so why not avoid risk at all costs? One of the key DevOps pillars is that failure is unavoidable, and therefore so is risk. SREs manage service reliability largely by managing risk.
- Balances innovation and reliability
An error budget provides a common incentive that allows both product development and SREs to focus on finding the right balance between new features and service availability.
- Manages release velocity
As long as the system's SLOs are met, releases can continue.
- Developers oversee own risk
When the error budget is large, product developers can take more risks because they have more time to spend. When the budget is nearly drained, product developers themselves will push for more testing and slow down their release velocity, because they don't want to risk exhausting the error budget and stalling their launch completely.
- Time-based error budgets are rarely valid because downtime is almost never universal. Typically an outage will only affect one part of a system at a time, so it's better to measure the error budget in terms of events (e.g. the proportion of failed requests) rather than elapsed time.
Toil: _"Work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows"_
1. Manual - This characteristic extends to the running of a script, which, although it saves time, must still be run by hand
2. Repetitive - If a task is repeated multiple times, not just once or twice, then the work is toil
3. Automatable - If a machine could do the task just as well as a person, consider it toil
4. Tactical - Toil, by its very nature, is not proactive or strategy-driven. Rather, it is reactive and interrupt-driven, e.g. pager alerts
5. Devoid of enduring value - Tasks that add a permanent improvement to the service are not considered toil; work that does not permanently change the state of the service is
6. Scales linearly as the service grows - A well-designed service can grow by at least one order of magnitude without added work; tasks whose load scales with service size or traffic are toil
What toil isn't: email, expense reports, commuting, meetings. None of these meets the key qualification a task needs to be labelled "toil": being tied to a production service. These are instead "overhead".
The best way to measure everything is to monitor everything
_"Collecting, processing, aggregating and displaying real-time quantitative data about a system, such as query counts and types, error counts and types, processing times and server lifetimes"_
_"Alerts give timely awareness to problems in your cloud applications so you can resolve the problems quickly"_
1. Set up monitoring
- Conditions are continuously monitored
- Monitoring can track SLOs
- Can look for a missing metric
- Can watch for thresholds
2. Track metrics over time
- Track whether the condition persists for a given amount of time
- The time window must (due to technical constraints) be less than 24 hours
3. Notify when the condition is breached
- An incident is created and displayed
- Alerts can be sent via email, text message, apps (e.g. Slack), Pub/Sub, etc. (see the sketch below)
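A minimal sketch of those three steps (watch a condition, require it to persist over a time window, then notify); the metric reader and notification callables are hypothetical placeholders, not a real monitoring API.
```
import time

THRESHOLD = 0.02          # alert if the error rate exceeds 2%
PERSIST_SECONDS = 300     # the condition must hold for 5 minutes before an incident is raised

def watch_metric(read_error_rate, notify, poll_seconds=30):
    """Poll a metric and notify only once the breach has persisted for the whole window."""
    breach_started = None
    while True:                                 # runs as a long-lived watcher
        rate = read_error_rate()                # hypothetical callable: float, or None if the metric is missing
        if rate is None or rate > THRESHOLD:    # a missing metric is also treated as a breach
            if breach_started is None:
                breach_started = time.time()
            elif time.time() - breach_started >= PERSIST_SECONDS:
                notify(f"Error rate {rate} above {THRESHOLD:.0%} for {PERSIST_SECONDS}s")
                breach_started = None           # reset once the incident has been created
        else:
            breach_started = None
        time.sleep(poll_seconds)
```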
How do you know when to set up an alert?
- A key factor is how fast you're burning your error budget
- Error budget burn rate = (observed error rate over a set time) / (100% - SLO)
- Example: If your SLO goal is 98%, then it's acceptable for 2% of the events measured by your SLO to fail before your SLO goal is missed. The burn rate tells you how fast you're consuming that error budget. A burn rate of > 1 indicates that, if the currently measured error rate is sustained over any future compliance period, the service will be out of SLO for that period. To combat that, you need a burn-rate alerting policy, which will notify you when your error budget is consumed faster than the threshold you defined, as measured over that alert's compliance period.
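A sketch of the burn-rate arithmetic using the 98% SLO example above; the observed error rates are invented for illustration.
```
def burn_rate(observed_error_rate: float, slo: float) -> float:
    """How many times faster than allowed the error budget is being consumed."""
    error_budget = 1 - slo              # e.g. 2% of events may fail under a 98% SLO
    return observed_error_rate / error_budget

slo = 0.98
for observed in (0.01, 0.02, 0.06):
    rate = burn_rate(observed, slo)
    verdict = "will blow the budget if sustained" if rate > 1 else "sustainable"
    print(f"error rate {observed:.0%} -> burn rate {rate:.1f} ({verdict})")
```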
Dev Tools
- Kubernetes Engine - A managed, production-ready environment for running containerized applications
- Container Registry - A single place for your team to securely manage container images used by Kubernetes
- Cloud Build - Automates the build process. A service that executes your builds as a series of steps, where each step is run in a container
- Cloud Source Repositories - Fully managed private Git repositories with integrations for CI, delivery and deployment
- Spinnaker for Google Cloud - Integrates Spinnaker with other GCP services, allowing you to extend your CI/CD pipeline and integrate security and compliance in the process
Ops Tools
- Cloud Monitoring - Tracking metrics, provides visibility into the performance, uptime and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from Google Cloud
- Cloud Debugger - Lets you inspect the state of a running application in real time, without stopping or slowing it down
- Cloud Trace - A distributed tracing system that collects latency data from your application and displays it in the console
- Cloud Profiler - Continuously gathers CPU usage and memory-allocation information from your production applications
Metrics are always displayed in an aggregate sense: how many events happened during this time period. A metric is not really about individual events but about status, e.g. "the system was handling 1200 connection requests per second at 12:05:58". Consider the latency of metrics and logging information: the time it takes for data to flow from the part of the system that has the information to the monitoring system where you can see it show up. Why isn't that instantaneous? What happens in between? This is a good exercise to go through, because not only will you understand the measurement part of SRE much better, you'll also be able to debug things when your own measurements have trouble. What might have gone wrong and could be blocking the flow of information?

You can turn logs into metrics with filters: look for certain types of events, produce metrics that show how many you're getting, and sometimes even alert on those events. Log-based metrics will always have additional latency.
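A toy sketch of turning logs into a metric with a filter, counting ERROR lines per minute; the log format and sample lines are assumptions for the example.
```
from collections import Counter

# Hypothetical application log lines: "<ISO timestamp> <LEVEL> <message>"
log_lines = [
    "2024-05-01T12:05:01Z ERROR connection reset",
    "2024-05-01T12:05:14Z INFO request served",
    "2024-05-01T12:06:02Z ERROR connection reset",
    "2024-05-01T12:06:40Z ERROR upstream timeout",
]

# Filter: keep only ERROR events, then count per minute -> a log-based metric.
errors_per_minute = Counter(
    line.split()[0][:16]          # timestamp truncated to the minute bucket
    for line in log_lines
    if " ERROR " in line
)
print(errors_per_minute)          # e.g. Counter({'2024-05-01T12:06': 2, '2024-05-01T12:05': 1})
```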
The engineer on duty panics because a datacenter has gone down, and the ops team decides to roll back to the previous version (they're just shooting from the hip). A third DC goes down; the VP gets angry customer calls, which is her first knowledge of the problem. Simon (the know-it-all), even though he's not on-call and not communicating with anyone else, thinks he can fix the issue and rolls out his fix; more DCs go down. What's wrong here? The team needs to step back from trying to find the technical solution immediately; there was poor communication to the VP, and freelancing agents shouldn't have access.
Well-managed incident response
Clair, the engineer on duty, receives the message and delegates Alex as incident commander. Alex brings in Ravi as part of the operations team. They don't fix the issue within their timezone, so they pass it on to a new incident commander. Along the way they've been contributing to an incident report, keeping the vice-president informed and able to sign off on customer messaging for complete transparency. Simon is not needed and works on his side project.
The on-duty engineer initiates the protocol and appoints an incident commander, who works with one operations team and passes control at the end of the day if the issue persists. The VP and all stakeholders are informed at the start of the incident and kept in the loop, so they can coordinate the public response. The freelance agent is not wanted and is only called in if necessary.
1. Separation of Responsibilities
- Specific roles should be designated to team members, each with full autonomy in their role.
Roles should include:
- Incident commander
- Person in charge during incident, designating responsibilities and taking all roles not designated
- Operational Team
- The personnel responding to the incident; the only people authorized to take any action, e.g. rolling out fixes
- Communication Lead
- Public face of the incident response, responsible for issuing updates to stakeholders
- Planning Lead
- Supports Ops team with long-term actions, such as filing bugs, arranging hand-off if necessary and tracking system changes
2. Established Command Post
- The "post" could be a physical location or, more likely in a large company, a communication venue such as a slack channel
**Architecture and Development (1st of 5 stages)**
- SRE engagement peaks higher than at any other point during the lifecycle
- Implementing Best Practices with the Dev Team
- Recommend best infrastructure systems
- Co-design part of service with dev team
- Avoid costly re-designs
**Active Development**
- SREs begin productionizing the service
- Planning for capacity
- Adding resources for redundancy
- Planning for spikes and overloads
- Implementing load balancing
- Adding the monitoring, alerting and performance tuning that will become so important when developing their SLIs & SLOs
**Limited Availability (Alpha and Beta programmes)**
- Measure increasing performance load on the service (begin to measure and track their SLIs)
- Evaluate reliability
- Define SLOs which will lead to SLA specifics
- Build capacity models
- Establish incident responses shared with dev team so everyone is on the same page and knows common tactics to take when problems arise
**General Availability (hopefully the longest stage)**
- Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs, making sure everything stays in accordance so the error budget is not exhausted and new features can be rolled out in a timely fashion
**Deprecation**
- SREs operate existing system
- Support the transition with the dev team
- Work with dev team on designing new system; adjust staffing accordingly
_"SRE principals aim to maximize the engineering velocity of developer teams while keeping products reliable"_
- Report will be initiated by the incident commander
- All participants need to add their own details on actions taken
- Was there anything done that needs to be rolled back?
**Remember: NO BLAME**
- No one is at fault
- No one will be shamed
- No one will be fired
- Everyone learns
Production Meeting Collaboration
1. Upcoming production changes
- Default to enabling change, which requires tracking the useful properties of that change: start time, duration, expected effect and so on (a sketch of such a record follows below). This is called near-term horizon visibility
- The tactical view: the list of pages, who was paged, what happened then, and so on. Two primary questions: should that alert have paged the way it did, and should it have paged at all?
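As a sketch of the kind of record that near-term horizon visibility might track for each upcoming change; the field names and sample entry are assumptions, not a standard schema.
```
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ProductionChange:
    """One entry in the upcoming-changes list reviewed at the production meeting."""
    description: str
    start_time: datetime
    duration: timedelta
    expected_effect: str
    owner: str = "unassigned"

upcoming = [
    ProductionChange(
        description="Roll out v2.3 of the checkout service",
        start_time=datetime(2024, 5, 2, 9, 0),
        duration=timedelta(hours=2),
        expected_effect="Brief elevated latency during instance restarts",
        owner="checkout-oncall",
    ),
]
```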