
Big Picture - What is SRE?

5 Key Pillars of DevOps + SRE

  • Reduce organisational silos
    • Bridge teams together
    • Increase communication
    • Shared company vision

Share ownership

  • Developers + Operations
  • Implement same tooling
  • Share the same techniques

  • Accept failure as normal
    • Try to anticipate, but
    • Incidents are bound to occur
    • Failures help the team learn

No-fault post-mortems & SLOs

  • No two failures the same (goal)
  • Track incidents (SLIs)
  • Map to objectives (SLOs)


  • Implement gradual change
    • Continuous change culture
    • Small updates are better
    • Easier to review
    • Easier to rollback

Reduce the cost of failures

  • Limited "canary" rollouts
  • Impact fewest users
  • Automate where possible for further cost reduction

  • Leverage Tooling and automation
    • Reduce manual tasks
    • The heart of CI/CD pipelines
    • Fosters speed & consistency

Automate this year's job away

  • Automation is a force multiplier
  • Autonomous automation is best
  • Centralizes mistakes (a bug is fixed once, in one place)

  • Measure Everything
    • Critical gauge of success
    • CI/CD needs full monitoring
    • Synthetic, proactive monitoring

Measure toil and reliability

  • Key to SLOs and SLAs
  • Reduce toil (i.e. repetitive manual labour) and increase engineering time
  • Monitor everything over time

Why "Reliability"

  • Most important: does the product work?
  • Reliability is the absence of errors
  • An unstable service likely indicates a variety of underlying issues
  • Must attend to reliability all the time

class SRE implements DevOps - SRE is the "how" that implements the "what" defined by DevOps
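That slogan is often written as `class SRE implements interface DevOps`. A minimal Python sketch of the idea (the method names are mine, one per pillar; not real library code):

```python
from abc import ABC, abstractmethod

class DevOps(ABC):
    """The 'what': the five pillars, stated as an interface."""
    @abstractmethod
    def reduce_silos(self): ...
    @abstractmethod
    def accept_failure_as_normal(self): ...
    @abstractmethod
    def implement_gradual_change(self): ...
    @abstractmethod
    def leverage_automation(self): ...
    @abstractmethod
    def measure_everything(self): ...

class SRE(DevOps):
    """The 'how': concrete practices behind each pillar."""
    def reduce_silos(self):
        return "shared ownership and shared tooling"
    def accept_failure_as_normal(self):
        return "blameless post-mortems, SLOs, error budgets"
    def implement_gradual_change(self):
        return "small, reviewable canary rollouts"
    def leverage_automation(self):
        return "automate this year's job away"
    def measure_everything(self):
        return "SLIs for reliability, metrics for toil"

print(SRE().accept_failure_as_normal())
```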

Make better software, faster

Understanding SLIs

SRE breaks down into 3 distinct functions

  1. Define availability
    1. SLO
  2. Determine level of availability
    1. SLI - Quantifiable measure of reliability; metrics over time, specific to a user journey, such as request/response, data processing or storage. Examples (a sketch follows below):
      1. Request latency - How long it takes to return a response to a request
      2. Failure rate - The fraction of all requests received that fail: (unsuccessful requests / all requests)
      3. Batch throughput - Proportion of time during which the data processing rate exceeds a threshold
  3. Plan in case of failure
    1. SLA

Each maps to a key component; SLO, SLI, SLA
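A minimal sketch of the first two example SLIs above, using made-up sample data:

```python
# Hypothetical request samples: latency in ms and HTTP status.
latencies_ms = [120, 95, 310, 180, 2050, 150]
statuses = [200, 200, 500, 200, 503, 200]

# Failure rate: unsuccessful requests / all requests.
failure_rate = sum(1 for s in statuses if s >= 500) / len(statuses)
print(f"failure rate: {failure_rate:.1%}")  # 33.3%

# Request latency SLI: proportion of requests served under 300 ms.
fast = sum(1 for ms in latencies_ms if ms < 300)
print(f"latency SLI: {fast / len(latencies_ms):.1%}")  # 66.7%
```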

What's a User Journey?

  • Sequence of tasks central to user experience and crucial to service
    • e.g. Online shopping journeys
      • Product search
      • Add to cart
      • Checkout

Request/Response Journey:

  • Availability - Proportion of valid requests served successfully
  • Latency - Proportion of valid requests served faster than a threshold
  • Quality - Proportion of valid requests served maintaining quality

None of these maps to a complete user journey on its own; each measures one part of it

Data processing journey: Might include a different set of SLIs (see the freshness sketch after the list)

  • Freshness - Proportion of valid data updated more recently than a threshold
  • Correctness - Proportion of valid data producing correct output
  • Throughput - Proportion of time where the data processing rate is faster than a threshold
  • Coverage - Proportion of valid data processed successfully
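A minimal freshness sketch under assumed data (the record shape and the 10-minute threshold are mine):

```python
import time

now = time.time()
# Hypothetical records with last-update timestamps (epoch seconds).
records = [
    {"id": 1, "updated_at": now - 30},    # 30 s old
    {"id": 2, "updated_at": now - 7200},  # 2 h old
    {"id": 3, "updated_at": now - 90},    # 90 s old
]

FRESHNESS_THRESHOLD_S = 600  # "recent" means newer than 10 minutes

fresh = sum(1 for r in records
            if now - r["updated_at"] < FRESHNESS_THRESHOLD_S)
print(f"freshness SLI: {fresh / len(records):.1%}")  # 66.7%
```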

Google's 4 Golden Signals

  • Latency - The time it takes for your service to fulfill a request
  • Errors - The rate at which your service fails
  • Traffic - How much demand is directed at your service
  • Saturation - A measure of how close to fully utilized the service's resources are
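A rough sketch of deriving the first three signals from a request log (the log shape is assumed; saturation usually comes from resource metrics instead):

```python
# Hypothetical request log: (timestamp_s, latency_ms, status).
requests = [(0, 120, 200), (1, 95, 500), (1, 310, 200), (2, 45, 200)]
window_s = 3

latencies = [ms for _, ms, _ in requests]
errors = sum(1 for _, _, status in requests if status >= 500)

print(f"latency (avg): {sum(latencies) / len(latencies):.1f} ms")
print(f"errors (rate): {errors / len(requests):.0%}")
print(f"traffic (qps): {len(requests) / window_s:.2f}")
# Saturation comes from resource metrics (CPU, memory, queue depth),
# not from the request log, e.g. cpu_in_use / cpu_available.
```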

Transparent SLIs are surfaced in the GCP dashboard under APIs & Services

The SLI Equation:

SLI = (Good Events / Valid Events) * 100

Valid - Known-bad events are excluded from the SLI, e.g. HTTP 400 client errors
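A worked example of the equation with made-up counts, excluding known-bad 4xx responses from the valid set:

```python
# Hypothetical counts for one measurement window.
total_requests = 10_000
client_errors_4xx = 400  # known-bad: the caller's fault, excluded
server_errors_5xx = 50   # bad events that count against the service

valid_events = total_requests - client_errors_4xx  # 9,600
good_events = valid_events - server_errors_5xx     # 9,550

sli = good_events / valid_events * 100
print(f"SLI = {sli:.2f}%")  # 99.48%
```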

Bad SLI - Variance and overlap in metrics prior to and during outages are problematic; the graph contains up and down spikes during an outage.

Good SLI - A stable signal with a strong correlation to outages is best; the graph is smooth.

SLI Best Practices

  1. Limit number of SLIs

    • 3-5 per user journey
    • Too many increase the difficulty for operators
    • Can lead to contradictions
  2. Reduce complexity

    • Not all metrics make good SLIs
    • Complexity increases response time
    • Many false positives
  3. Prioritize Journeys

    • Select most valuable to users
    • Identify user-centric events
  4. Aggregate similar SLIs

    • Collect data over time
    • Turn it into a rate, average, or percentile (see the sketch after this list)
  5. Bucket to distinguish response classes

    • Not all requests are the same
    • Requesters may be humans, background apps or bots
    • Combine (or "bucket") similar classes for better SLIs
  6. Collect data at load balancer

    • Most efficient method
    • Closer to the user's experience
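A sketch of practices 4 and 5 together: bucket raw samples by requester class, then aggregate each bucket into a percentile (the data and class names are made up):

```python
from statistics import quantiles

# Hypothetical raw latency samples tagged by requester class.
samples = [
    {"ms": 120, "requester": "human"},
    {"ms": 95, "requester": "bot"},
    {"ms": 310, "requester": "human"},
    {"ms": 45, "requester": "bot"},
    {"ms": 180, "requester": "human"},
    {"ms": 2050, "requester": "human"},
]

# Bucket by requester class...
buckets: dict[str, list[int]] = {}
for s in samples:
    buckets.setdefault(s["requester"], []).append(s["ms"])

# ...then aggregate each bucket into a 95th-percentile figure
# rather than alerting on individual samples.
for requester, values in buckets.items():
    p95 = quantiles(values, n=100)[94]
    print(f"{requester}: p95 = {p95:.0f} ms ({len(values)} samples)")
```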

Understanding SLOs

"SLO's specify a target level for the reliability of your service"

The first rule of SLOs: 100% reliability is not a good objective

Why?

  • Trying to reach 100%, 100% of the time, is very expensive in terms of resources
  • Much more technically complex
  • Users don't need 100% for the service to be acceptable (get close enough that users don't notice the difference; just reliable enough)
  • Less than 100% leaves room for new features, as you have resources remaining to develop (error budgets)

SLOs are tied directly to SLIs

  • Measured by SLI
  • Can be a single target value or range of values
    • e.g. SLI <= SLO or
    • (lower bound <= SLI <= upper bound) = SLO
    • Common SLOs: 99.5%, 99.9% (3x 9's), 99.99% (4x 9's) - see the arithmetic below
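The arithmetic behind those targets: each extra nine cuts the allowed downtime (the error budget) by roughly a factor of ten. A quick check over a 30-day window:

```python
WINDOW_MIN = 30 * 24 * 60  # 43,200 minutes in a 30-day window

for slo in (99.5, 99.9, 99.99):
    budget_min = WINDOW_MIN * (1 - slo / 100)
    print(f"{slo}% SLO -> {budget_min:.1f} min downtime allowed")
# 99.5%  -> 216.0 min
# 99.9%  -> 43.2 min
# 99.99% -> 4.3 min
```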

SLI - Metrics over time which detail the health of a service

Site homepage latency requests < 300ms over the last 5 minutes at the 95th percentile

SLO - Agreed-upon bounds on how often SLIs must be met

The 95th-percentile homepage latency SLI will succeed 99.9% of the time over the next year
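A sketch of checking that pairing over a window, with made-up per-5-minute measurements:

```python
SLO_TARGET = 0.999  # 99.9% of windows must meet the SLI

# Hypothetical: did homepage p95 latency stay under 300 ms in
# each 5-minute window? Here, 10 misses out of 10,000 windows.
windows_met = [True] * 9_990 + [False] * 10

compliance = sum(windows_met) / len(windows_met)
print(f"compliance: {compliance:.2%}")         # 99.90%
print(f"SLO met: {compliance >= SLO_TARGET}")  # True
```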

SLO - It's critical that there is buy-in from across the organisation. Make sure all stakeholders agree on the SLO, so everyone is on the same team working towards the same goals: developers, contributors, project managers, SREs, the vice president.

Make your SLOs achievable

  • Based on past performance
    • Users expectations are strongly tied to past performance
  • If no historical data, you need to collect some
  • Keep in mind: measurement does not equal user satisfaction, and you may need to adjust your SLOs accordingly

In addition to achievable SLOs, you might have some aspirational SLOs

  • Typically higher than your achievable SLOs
  • Set a reasonable target and begin measuring
  • Compare user feedback to SLOs

Understanding SLAs

"We've the determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLO's and now we need to describe what happens if we don't maintain that availability with an SLA"

"An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain"

  • Should reliability fail, there are consequences

SLA Characteristics

  • A business-level agreement
    • SREs are not usually involved in drafting SLAs, except for setting up the underlying SLIs and corresponding SLOs
  • Can be explicit or implicit
  • Explicit contracts contain consequences
    • Refund for services paid for
    • Service cost reduction on sliding scales
    • May be offered on a per service basis

41 SLAs in GCP

  • Compute Engine

    • 4x 9's for Instance uptime in multiple zones
    • 99.5% for uptime for a single instance
    • 4x 9's for load balancing uptime
  • If these aren't met, and the customer meets its own obligations, then the customer may be eligible for financial credits

  • Clear definitions for language

SLIs drive SLOs; SLOs inform the SLA

  • How does the SLO inform the SLA? An example:

We want our SLO to be at or under 200ms, so we set our SLA at a higher value, e.g. 300ms; anything beyond that is "hair on fire" territory. You want the SLA set significantly looser than the SLO, because you don't want to be consistently setting your customers' hair on fire. Moreover, your SLAs should be routinely achievable, because you are constantly working towards your objective (the SLO).
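A toy illustration of that gap (the thresholds come from the example above; the classifications are mine):

```python
SLO_MS = 200  # internal objective: aim here constantly
SLA_MS = 300  # external promise: looser, with real consequences

def classify(latency_ms: float) -> str:
    if latency_ms <= SLO_MS:
        return "within SLO"
    if latency_ms <= SLA_MS:
        return "SLO missed, SLA intact: prioritise reliability work"
    return "SLA breached: contractual consequences (e.g. credits)"

for ms in (150, 250, 450):
    print(f"{ms} ms -> {classify(ms)}")
```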

To summarize:

SLO:

  • Internal targets that guide prioritization
  • Represents desired user experience
  • Missing objectives should have consequences

SLA

  • Set the level just high enough to keep customers
  • Incentivizes a minimum level of service
  • Looser than corresponding objectives

An SLI is an indicator of how things are at some particular point in time: are things good or bad right now? If our SLI doesn't always confidently tell us that, it's not a good SLI. An SLO asks whether that indicator has been showing what we want it to for most of the time period we care about: has the SLI met our objective? In particular, have things been the right amount of both good and bad? An SLA is our agreement about what happens if we don't meet our objective: what's our punishment for doing worse than we agreed we would?

The SLI is like the speed of a car: it's travelling at some particular speed right now. The SLO is the speed limit at the upper end, and the expected travel time for your trip at the lower end: you want to go approximately some speed over the course of your journey. The SLA is like getting a ticket for driving too fast, or too slowly.

Making the Most of Risk

Setting Error Budgets

Defining and Reducing Toil

Generating SRE Metrics

Monitoring Reliability

Alerting Principles

Investigating SRE Tools

Reacting to Incidents

Handling Incident Response

Managing Service Lifecycle

Ensuring Healthy Operations Collaboration