## Big Picture - What is SRE?
### 5 Key Pillars of DevOps + SRE
- Reduce organisational silos
- Bridge teams together
- Increase communication
- Shared company vision
**Share ownership**
- Developers + Operations
- Implement same tooling
- share same techniques
-------------------------------------
- Accept failure as normal
- Try to anticipate, but
- Incidents are bound to occur
- Failures help the team learn
**No-fault post mortems & SLOs**
- No two failures the same (goal)
- Track incidents (SLIs)
- Map to objectives (SLOs)
--------------------------------------
- Implement gradual change
- Continuous change culture
- Small updates are better
- Easier to review
- Easier to rollback
**Reduce costs of failures**
- Limited "canary" rollouts
- Impact fewest users
- Automate where possible for further cost reduction
-------------------------------------
- Leverage Tooling and automation
- Reduce manual tasks
- The heart of CI/CD pipelines
- Fosters speed & consistency
**Automate this years job away**
- Automation is a force multiplier
- Autonomous automation best
- Centralizes mistakes
-------------------------------------
- Measure Everything
- Critical gauge of success
- CI/CD needs full monitoring
- Synthetic, proactive monitoring
**Measure toil and reliability**
- Key to SLOs and SLAs
- Reduce toil (aka repetitive manual labour!) to free up engineering time
- Monitor all over time
-------------------------------------
Why "Reliability"
- Most important: does the product work?
- Reliability is the absence of errors
- Unstable service likely indicates a variety of issues
- Must attend to reliability all the time
`class SRE implements DevOps` - SRE is the "how" that implements DevOps, the "what"
> Make better software, faster
### Understanding SLIs
SRE breaks down into 3 distinct functions:
1. Define availability
    1. SLO
2. Determine level of availability
    1. SLI - A quantifiable measure of reliability: metrics over time, specific to a user journey such as request/response, data processing or storage. Examples:
        1. Request latency - How long it takes to return a response to a request
        2. Failure rate - A fraction of all requests received: (unsuccessful requests / all requests)
        3. Batch throughput - Proportion of time the data processing rate exceeds a threshold
3. Plan in case of failure
    1. SLA
> Each maps to a key component; SLO, SLI, SLA
What's a User Journey?
* Sequence of tasks central to user experience and crucial to service
* e.g. Online shopping journeys
* Product search
* Add to cart
* Checkout
Request/Response Journey:
* Availability - Proportion of valid requests served successfully
* Latency - Proportion of valid requests served faster than a threshold
* Quality - Proportion of valid requests served maintaining quality
> None of these is a complete user journey on its own; each measures one aspect of a journey
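The request/response SLIs above are all "proportion of valid requests" measures, so they can be computed the same way from request records. A minimal sketch (the `Request` record and function names are illustrative, not from any particular monitoring library):

```python
from dataclasses import dataclass

@dataclass
class Request:
    ok: bool           # was the request served successfully?
    latency_ms: float  # time taken to respond

def availability_sli(requests):
    """Availability: proportion of valid requests served successfully (as %)."""
    return 100 * sum(r.ok for r in requests) / len(requests)

def latency_sli(requests, threshold_ms):
    """Latency: proportion of valid requests served faster than a threshold (as %)."""
    return 100 * sum(r.latency_ms < threshold_ms for r in requests) / len(requests)

reqs = [Request(True, 120), Request(True, 450), Request(False, 90), Request(True, 200)]
print(availability_sli(reqs))   # 75.0 (3 of 4 succeeded)
print(latency_sli(reqs, 300))   # 75.0 (3 of 4 under 300 ms)
```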
Data processing journey: Might include a different set of SLIs
* Freshness - Proportion of valid data updated more recently than a threshold
* Correctness - Proportion of valid data producing correct output
* Throughput - Proportion of time where the data processing rate is faster than a threshold
* Coverage - Proportion of valid data processed successfully
Google's 4 Golden Signals
* Latency - The time it takes for your service to fulfill a request
* Errors - The rate at which your service fails
* Traffic - How much demand is directed at your service
* Saturation - A measure of how close to fully utilized the service's resources are
Transparent SLIs are available within the GCP Dashboard, under APIs & Services
**The SLI Equation:**
SLI = (Good Events / Valid Events) * 100
Valid - Known bad events are excluded from the SLI e.g. HTTP 400 (client error) responses
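The equation can be sketched directly in code. Here, 4xx client errors are treated as invalid and excluded from the denominator, while 5xx responses count as valid-but-bad (the `sli` function name and status-code policy are illustrative assumptions):

```python
def sli(status_codes):
    """SLI = (good events / valid events) * 100.

    Known-bad events (4xx client errors, the caller's fault) are excluded
    from the denominator; 2xx/3xx are good; 5xx are valid but bad.
    """
    valid = [c for c in status_codes if not (400 <= c < 500)]
    good = [c for c in valid if c < 400]
    return 100 * len(good) / len(valid)

# Two 200s, one 404 (excluded entirely), one 500 (valid, but bad)
print(sli([200, 200, 404, 500]))  # ~66.7: 2 good of 3 valid
```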
**Bad SLI** - Variance and overlap in metrics prior to and during outages are problematic; graph contains up and down spikes during an outage
**Good SLI** - Stable signal with a strong correlation to outage is best; graph is smooth.
#### SLI Best Practices
1. Limit number of SLIs
* 3-5 per user journey
* Too many increase difficulty for operators
* Can lead to contradictions
2. Reduce complexity
* Not all metrics make good SLIs
* Increased response time
* Many false positives
3. Prioritize Journeys
* Select most valuable to users
* Identify user-centric events
4. Aggregate similar SLIs
* Collect data over time
* Turn into a rate, average, or percentile
5. Bucket to distinguish response classes
* Not all requests are the same
* Requesters may be human, background apps or bots
* Combine (or "bucket") for better SLIs
6. Collect data at load balancer
* Most efficient method
* Closer to the user's experience
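Best practices 4 and 5 (aggregate into percentiles; bucket by requester class) can be sketched together. The nearest-rank percentile method and the human/bot bucketing below are illustrative choices, not a prescribed implementation:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ranked = sorted(samples)
    k = max(1, math.ceil(pct / 100 * len(ranked)))  # 1-indexed rank
    return ranked[k - 1]

def bucketed_p95(requests):
    """Group latencies by requester class ('bucket'), then compute p95 per bucket,
    since humans, background apps, and bots have different expectations."""
    buckets = {}
    for source, latency_ms in requests:
        buckets.setdefault(source, []).append(latency_ms)
    return {source: percentile(vals, 95) for source, vals in buckets.items()}

reqs = [("human", 120), ("human", 300), ("bot", 40), ("bot", 900), ("human", 180)]
print(bucketed_p95(reqs))  # {'human': 300, 'bot': 900}
```

Mixing the bot traffic into the human bucket would have dragged the single p95 to 900 ms, hiding the fact that human-facing latency is fine.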
### Understanding SLOs
"SLOs specify a target level for the reliability of your service"
The first rule of SLOs: 100% reliability is not a good objective
Why?
- Trying to reach 100%, 100% of the time, is very expensive in terms of resources
- Much more technically complex
- Users don't need 100% to be satisfied; get close enough that they don't notice the difference, just reliable enough
- Less than 100% leaves room for new features, as you have resources remaining to develop (error budgets)
SLOs are tied directly to SLIs
- Measured by SLI
- Can be a single target value or range of values
- e.g. SLI <= SLO
or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9% (3x 9's), 99.99% (4x 9's)
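The gap between an SLO and 100% is the error budget, and it translates directly into allowed unreliability per window. A quick calculation over a 30-day month (the function name is illustrative):

```python
def error_budget_minutes(slo_pct, window_days=30):
    """Minutes of allowed unreliability per window for a given SLO percentage."""
    total_minutes = window_days * 24 * 60   # 43,200 minutes in 30 days
    return total_minutes * (100 - slo_pct) / 100

for slo in (99.5, 99.9, 99.99):
    print(f"{slo}% -> {error_budget_minutes(slo):.1f} min/month")
# 99.5%  -> 216.0 min/month
# 99.9%  -> 43.2 min/month
# 99.99% -> 4.3 min/month
```

This is why each extra nine is so expensive: the budget for mistakes (and for shipping new features) shrinks by an order of magnitude.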
SLI - Metrics over time which detail the health of a service
```
Site homepage latency requests < 300ms over last 5 minutes at the 95th percentile
```
SLO - Agreed-upon bounds on how often SLIs must be met
```
The 95th-percentile homepage latency SLI will be met 99.9% of the time over the next year
```
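Putting the two definitions together: the SLI is evaluated per 5-minute window, and the SLO asks what fraction of windows met the target. A minimal compliance check (names and the 1,000-window sample are illustrative):

```python
def slo_met(window_results, target_pct):
    """Return (met?, compliance %) given one bool per measurement window
    recording whether the SLI target held in that window."""
    compliance = 100 * sum(window_results) / len(window_results)
    return compliance >= target_pct, compliance

# One bool per 5-minute window: did p95 homepage latency stay under 300 ms?
windows = [True] * 998 + [False] * 2
met, pct = slo_met(windows, 99.9)
print(met, pct)  # False 99.8 - two bad windows in 1,000 already misses a 99.9% SLO
```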
SLO - Buy-in from across the organisation is critical: make sure all stakeholders (developers, contributors, project managers, SREs, vice presidents) agree on the SLO, so everyone is on the same team working towards the same goals.
#### Make your SLOs achievable
- Based on past performance
- Users expectations are strongly tied to past performance
- If no historical data, you need to collect some
- Keep in mind: measurement ≠ user satisfaction, and you may need to adjust your SLOs accordingly
#### In addition to achievable SLOs, you might have some aspirational SLOs
- Typically higher than your achievable SLOs
- Set a reasonable target and begin measuring
- Compare user feedback to SLOs
### Understanding SLAs
_"We've determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLOs; now we need to describe what happens if we don't maintain that availability, with an SLA"_
_"An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain"_
- Should reliability fail, there are consequences
#### SLA Characteristics
- A business-level agreement
- SREs are not usually involved in drafting SLAs, beyond setting up the SLIs and corresponding SLOs that feed into them
- Can be explicit or implicit
- Explicit contracts contain consequences
- Refund for services paid for
- Service cost reduction on sliding scales
- May be offered on a per service basis
41 SLAs in GCP
- Compute Engine
- 4x 9's for Instance uptime in multiple zones
- 99.5% for uptime for a single instance
- 4x 9's for load balancing uptime
- If these aren't met, and the customer meets its own obligations, the customer may be eligible for financial credits
- Clear definitions for language
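A sliding-scale credit schedule like the one above can be sketched as a simple tier lookup. The tiers below are purely illustrative, not Google's actual SLA schedule:

```python
def sla_credit_pct(monthly_uptime_pct):
    """Financial credit (as % of the monthly bill) owed for missed uptime.
    Tier boundaries here are made up for illustration only."""
    if monthly_uptime_pct >= 99.99:
        return 0    # SLA met, no credit owed
    if monthly_uptime_pct >= 99.0:
        return 10
    if monthly_uptime_pct >= 95.0:
        return 25
    return 50

print(sla_credit_pct(99.995))  # 0  - within the 4x 9's target
print(sla_credit_pct(99.5))    # 10 - missed, smallest credit tier
```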
SLIs drive SLOs, and SLOs inform the SLA
- How does the SLO inform the SLA; example: