
Big Picture - What is SRE?

5 Key Pillars of DevOps + SRE

  • Reduce organisational silos
    • Bridge teams together
    • Increase communication
    • Shared company vision

Share ownership

  • Developers + Operations
  • Implement same tooling
  • Share the same techniques

  • Accept failure as normal
    • Try to anticipate, but
    • Incidents are bound to occur
    • Failures help the team learn

No-fault post-mortems & SLOs

  • No two failures the same (goal)
  • Track incidents (SLIs)
  • Map to objectives (SLOs)


  • Implement gradual change
    • Continuous change culture
    • Small updates are better
    • Easier to review
    • Easier to rollback

Reduce the cost of failures

  • Limited "canary" rollouts
  • Impact fewest users
  • Automate where possible for further cost reduction

  • Leverage Tooling and automation
    • Reduce manual tasks
    • The heart of CI/CD pipelines
    • Fosters speed & consistency

Automate this year's job away

  • Automation is a force multiplier
  • Autonomous automation is best
  • Centralizes mistakes (a bug is fixed once, in one place)

  • Measure Everything
    • Critical gauge of success
    • CI/CD needs full monitoring
    • Synthetic, proactive monitoring

Measure toil and reliability

  • Key to SLOs and SLAs
  • Reduce toil (i.e. repetitive manual labour) and increase engineering time
  • Monitor everything over time

Why "Reliability"

  • Most important: does the product work?
  • Reliability is the absence of errors
  • An unstable service likely indicates a variety of underlying issues
  • Must attend to reliability all the time

class SRE implements DevOps - SRE is the "how" that implements the "what" defined by DevOps
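That slogan is often written as `class SRE implements interface DevOps`. A minimal Python sketch of the idea (the method names are mine, one per pillar; not real library code):

```python
from abc import ABC, abstractmethod

class DevOps(ABC):
    """The 'what': the five pillars, stated as an interface."""
    @abstractmethod
    def reduce_silos(self): ...
    @abstractmethod
    def accept_failure_as_normal(self): ...
    @abstractmethod
    def implement_gradual_change(self): ...
    @abstractmethod
    def leverage_automation(self): ...
    @abstractmethod
    def measure_everything(self): ...

class SRE(DevOps):
    """The 'how': concrete practices behind each pillar."""
    def reduce_silos(self):
        return "shared ownership and shared tooling"
    def accept_failure_as_normal(self):
        return "blameless post-mortems, SLOs, error budgets"
    def implement_gradual_change(self):
        return "small, reviewable canary rollouts"
    def leverage_automation(self):
        return "automate this year's job away"
    def measure_everything(self):
        return "SLIs for reliability, metrics for toil"

print(SRE().accept_failure_as_normal())
```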

Make better software, faster

Understanding SLIs

SRE breaks down into 3 distinct functions

  1. Define availability
    1. SLO
  2. Determine level of availability
    1. SLI - Quantifiable measure of reliability; metrics over time, specific to a user journey, such as request/response, data processing or storage. Examples (a sketch follows below):
      1. Request latency - How long it takes to return a response to a request
      2. Failure rate - The fraction of all requests received that fail: (unsuccessful requests / all requests)
      3. Batch throughput - Proportion of time during which the data processing rate exceeds a threshold
  3. Plan in case of failure
    1. SLA

Each maps to a key component; SLO, SLI, SLA
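A minimal sketch of the first two example SLIs above, using made-up sample data:

```python
# Hypothetical request samples: latency in ms and HTTP status.
latencies_ms = [120, 95, 310, 180, 2050, 150]
statuses = [200, 200, 500, 200, 503, 200]

# Failure rate: unsuccessful requests / all requests.
failure_rate = sum(1 for s in statuses if s >= 500) / len(statuses)
print(f"failure rate: {failure_rate:.1%}")  # 33.3%

# Request latency SLI: proportion of requests served under 300 ms.
fast = sum(1 for ms in latencies_ms if ms < 300)
print(f"latency SLI: {fast / len(latencies_ms):.1%}")  # 66.7%
```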

What's a User Journey?

  • Sequence of tasks central to user experience and crucial to service
    • e.g. Online shopping journeys
      • Product search
      • Add to cart
      • Checkout

Request/Response Journey:

  • Availability - Proportion of valid requests served successfully
  • Latency - Proportion of valid requests served faster than a threshold
  • Quality - Proportion of valid requests served maintaining quality

None of these maps to a complete user journey on its own; each measures one part of it

Data processing journey: Might include a different set of SLIs (see the freshness sketch after the list)

  • Freshness - Proportion of valid data updated more recently than a threshold
  • Correctness - Proportion of valid data producing correct output
  • Throughput - Proportion of time where the data processing rate is faster than a threshold
  • Coverage - Proportion of valid data processed successfully
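A minimal freshness sketch under assumed data (the record shape and the 10-minute threshold are mine):

```python
import time

now = time.time()
# Hypothetical records with last-update timestamps (epoch seconds).
records = [
    {"id": 1, "updated_at": now - 30},    # 30 s old
    {"id": 2, "updated_at": now - 7200},  # 2 h old
    {"id": 3, "updated_at": now - 90},    # 90 s old
]

FRESHNESS_THRESHOLD_S = 600  # "recent" means newer than 10 minutes

fresh = sum(1 for r in records
            if now - r["updated_at"] < FRESHNESS_THRESHOLD_S)
print(f"freshness SLI: {fresh / len(records):.1%}")  # 66.7%
```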

Google's 4 Golden Signals

  • Latency - The time it takes for your service to fulfill a request
  • Errors - The rate at which your service fails
  • Traffic - How much demand is directed at your service
  • Saturation - A measure of how close to fully utilized the service's resources are
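A rough sketch of deriving the first three signals from a request log (the log shape is assumed; saturation usually comes from resource metrics instead):

```python
# Hypothetical request log: (timestamp_s, latency_ms, status).
requests = [(0, 120, 200), (1, 95, 500), (1, 310, 200), (2, 45, 200)]
window_s = 3

latencies = [ms for _, ms, _ in requests]
errors = sum(1 for _, _, status in requests if status >= 500)

print(f"latency (avg): {sum(latencies) / len(latencies):.1f} ms")
print(f"errors (rate): {errors / len(requests):.0%}")
print(f"traffic (qps): {len(requests) / window_s:.2f}")
# Saturation comes from resource metrics (CPU, memory, queue depth),
# not from the request log, e.g. cpu_in_use / cpu_available.
```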

Transparent SLIs are surfaced in the GCP dashboard under APIs & Services

The SLI Equation:

SLI = (Good Events / Valid Events) * 100

Valid - Known-bad events are excluded from the SLI, e.g. HTTP 400 client errors
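A worked example of the equation with made-up counts, excluding known-bad 4xx responses from the valid set:

```python
# Hypothetical counts for one measurement window.
total_requests = 10_000
client_errors_4xx = 400  # known-bad: the caller's fault, excluded
server_errors_5xx = 50   # bad events that count against the service

valid_events = total_requests - client_errors_4xx  # 9,600
good_events = valid_events - server_errors_5xx     # 9,550

sli = good_events / valid_events * 100
print(f"SLI = {sli:.2f}%")  # 99.48%
```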

Bad SLI - Variance and overlap in metrics prior to and during outages are problematic; the graph contains up and down spikes during an outage.

Good SLI - A stable signal with a strong correlation to outages is best; the graph is smooth.

SLI Best Practices

  1. Limit number of SLIs

    • 3-5 per user journey
    • Too many increase the difficulty for operators
    • Can lead to contradictions
  2. Reduce complexity

    • Not all metrics make good SLIs
    • Complexity increases response time
    • Many false positives
  3. Prioritize Journeys

    • Select most valuable to users
    • Identify user-centric events
  4. Aggregate similar SLIs

    • Collect data over time
    • Turn it into a rate, average, or percentile (see the sketch after this list)
  5. Bucket to distinguish response classes

    • Not all requests are the same
    • Requesters may be humans, background apps or bots
    • Combine (or "bucket") similar classes for better SLIs
  6. Collect data at load balancer

    • Most efficient method
    • Closer to the user's experience
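A sketch of practices 4 and 5 together: bucket raw samples by requester class, then aggregate each bucket into a percentile (the data and class names are made up):

```python
from statistics import quantiles

# Hypothetical raw latency samples tagged by requester class.
samples = [
    {"ms": 120, "requester": "human"},
    {"ms": 95, "requester": "bot"},
    {"ms": 310, "requester": "human"},
    {"ms": 45, "requester": "bot"},
    {"ms": 180, "requester": "human"},
    {"ms": 2050, "requester": "human"},
]

# Bucket by requester class...
buckets: dict[str, list[int]] = {}
for s in samples:
    buckets.setdefault(s["requester"], []).append(s["ms"])

# ...then aggregate each bucket into a 95th-percentile figure
# rather than alerting on individual samples.
for requester, values in buckets.items():
    p95 = quantiles(values, n=100)[94]
    print(f"{requester}: p95 = {p95:.0f} ms ({len(values)} samples)")
```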

Understanding SLOs

"SLO's specify a target level for the reliability of your service"

The first rule of SLOs: 100% reliability is not a good objective

Why?

  • Trying to reach 100%, 100% of the time, is very expensive in terms of resources
  • Much more technically complex
  • Users don't need 100% for the service to be acceptable (get close enough that users don't notice the difference; just reliable enough)
  • Less than 100% leaves room for new features, as you have resources remaining to develop (error budgets)

SLOs are tied directly to SLIs

  • Measured by SLI
  • Can be a single target value or range of values
    • e.g. SLI <= SLO or
    • (lower bound <= SLI <= upper bound) = SLO
    • Common SLOs: 99.5%, 99.9% (3x 9's), 99.99% (4x 9's) - see the arithmetic below
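The arithmetic behind those targets: each extra nine cuts the allowed downtime (the error budget) by roughly a factor of ten. A quick check over a 30-day window:

```python
WINDOW_MIN = 30 * 24 * 60  # 43,200 minutes in a 30-day window

for slo in (99.5, 99.9, 99.99):
    budget_min = WINDOW_MIN * (1 - slo / 100)
    print(f"{slo}% SLO -> {budget_min:.1f} min downtime allowed")
# 99.5%  -> 216.0 min
# 99.9%  -> 43.2 min
# 99.99% -> 4.3 min
```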

SLI - Metrics over time which detail the health of a service

Site homepage latency requests < 300ms over the last 5 minutes at the 95th percentile

SLO - Agreed-upon bounds on how often SLIs must be met

The 95th-percentile homepage latency SLI will succeed 99.9% of the time over the next year
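A sketch of checking that pairing over a window, with made-up per-5-minute measurements:

```python
SLO_TARGET = 0.999  # 99.9% of windows must meet the SLI

# Hypothetical: did homepage p95 latency stay under 300 ms in
# each 5-minute window? Here, 10 misses out of 10,000 windows.
windows_met = [True] * 9_990 + [False] * 10

compliance = sum(windows_met) / len(windows_met)
print(f"compliance: {compliance:.2%}")         # 99.90%
print(f"SLO met: {compliance >= SLO_TARGET}")  # True
```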

SLO - It's critical that there is buy-in from across the organisation. Make sure all stakeholders agree on the SLO, so everyone is on the same team working towards the same goals: developers, contributors, project managers, SREs, the vice president.

Make your SLOs achievable

  • Based on past performance
    • Users expectations are strongly tied to past performance
  • If no historical data, you need to collect some
  • Keep in mind: measurement does not equal user satisfaction, and you may need to adjust your SLOs accordingly

In addition to achievable SLOs, you might have some aspirational SLOs

  • Typically higher than your achievable SLOs
  • Set a reasonable target and begin measuring
  • Compare user feedback to SLOs

Understanding SLAs

"We've the determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLO's and now we need to describe what happens if we don't maintain that availability with an SLA"

"An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain"

  • Should reliability fail, there are consequences

SLA Characteristics

  • A business-level agreement
    • SREs are not usually involved in drafting SLAs, except for setting up the underlying SLIs and corresponding SLOs
  • Can be explicit or implicit
  • Explicit contracts contain consequences
    • Refund for services paid for
    • Service cost reduction on sliding scales
    • May be offered on a per service basis

41 SLAs in GCP

  • Compute Engine

    • 4x 9's for Instance uptime in multiple zones
    • 99.5% for uptime for a single instance
    • 4x 9's for load balancing uptime
  • If these aren't met, and the customer meets its own obligations, then the customer may be eligible for financial credits

  • Clear definitions for language

SLIs drive SLOs; SLOs inform the SLA

  • How does the SLO inform the SLA? An example:

We want our SLO to be at or under 200ms, so we set our SLA at a higher value, e.g. 300ms; anything beyond that is "hair on fire" territory. You want the SLA set significantly looser than the SLO, because you don't want to be consistently setting your customers' hair on fire. Moreover, your SLAs should be routinely achievable, because you are constantly working towards your objective (the SLO).
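A toy illustration of that gap (the thresholds come from the example above; the classifications are mine):

```python
SLO_MS = 200  # internal objective: aim here constantly
SLA_MS = 300  # external promise: looser, with real consequences

def classify(latency_ms: float) -> str:
    if latency_ms <= SLO_MS:
        return "within SLO"
    if latency_ms <= SLA_MS:
        return "SLO missed, SLA intact: prioritise reliability work"
    return "SLA breached: contractual consequences (e.g. credits)"

for ms in (150, 250, 450):
    print(f"{ms} ms -> {classify(ms)}")
```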

To summarize:

SLO:

  • Internal targets that guide prioritization
  • Represents desired user experience
  • Missing objectives should have consequences

SLA

  • Set the level just high enough to keep customers
  • Incentivizes a minimum level of service
  • Looser than corresponding objectives

An SLI is an indicator of how things are at some particular point in time: are things good or bad right now? If our SLI doesn't always confidently tell us that, it's not a good SLI. An SLO asks whether that indicator has been showing what we want it to for most of the time period we care about: has the SLI met our objective? In particular, have things been the right amount of both good and bad? An SLA is our agreement about what happens if we don't meet our objective: what's our punishment for doing worse than we agreed we would?

The SLI is like the speed of a car: it's travelling at some particular speed right now. The SLO is the speed limit at the upper end, and the expected travel time for your trip at the lower end: you want to go approximately some speed over the course of your journey. The SLA is like getting a ticket for driving too fast, or too slowly.

Making the Most of Risk

Setting Error Budgets

Defining and Reducing Toil

Generating SRE Metrics

Monitoring Reliability

Alerting Principles

Investigating SRE Tools

Reacting to Incidents

Handling Incident Response

Managing Service Lifecycle

Ensuring Healthy Operations Collaboration