gpcdesre/qwiklabs_1.md
2021-02-15 17:03:31 +00:00


Introduction to SRE

  1. Reliability is the most important feature
  2. Users, not monitoring, decide reliability
  3. Well-engineered...
    1. software = 99.9%
    2. operations = 99.99%
    3. business = 99.999%

Each additional "9" improves reliability by 10x but, as a rough rule of thumb, costs the business 10x more

Best to think of downtime or reliability in the inverse, e.g. how much downtime is permissible within a given timespan (the error budget)

28-day error budgets:

  • 99.9% = ~40 minutes - human-manageable, e.g. humans see alerts, respond, and fix them
  • 99.99% = ~4 minutes - the system needs to detect and self-heal complete outages, because there isn't enough time to loop in a human
  • 99.999% = ~24 seconds - restrict the rate of change; the margin for error is tiny, and you'd probably need to rebuild the monitoring system from the ground up, because metrics aren't available at that timescale
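As a rough check of those numbers, here is a minimal sketch (the helper name is my own, not from the course) that converts an availability target into the downtime an error budget permits over a 28-day window:

```python
from datetime import timedelta

def error_budget(slo: float, window_days: int = 28) -> timedelta:
    """Downtime permitted by an availability target (e.g. 0.999) over the window."""
    window = timedelta(days=window_days)
    return window * (1 - slo)

for slo in (0.999, 0.9999, 0.99999):
    budget = error_budget(slo)
    print(f"{slo:.3%} -> {budget.total_seconds() / 60:.1f} minutes")
```

Each extra "9" shrinks the budget tenfold: ~40 minutes, ~4 minutes, ~24 seconds.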

The reliability of a system is its most important feature

SLO: Service Level Objective

  • If reliability is a feature, when do you prioritize it vs other features?

    • Useful for Product owners and Execs
    • Setting a target gives all parts of the organisation the ability to determine whether the system is reliable or not
    • Acknowledging that a specific quantity of unreliability is acceptable provides a budget for failure that can be spent on developing and launching new features. The remaining budget provides a signal to feed into your planning cycles, ensuring work to improve reliability is prioritized
    • Everyone must agree that the target accurately represents the desired experience of your users
  • A problem with building new features quickly is that there's often a strong negative correlation between development velocity and system reliability

    • A missed reliability target signals when too many things users care about have been broken by excessive development velocity
      • The main question SLOs can help development teams answer is: when moving fast and breaking things, how fast is too fast?
    • If everyone agrees the SLO represents the point at which you are no longer meeting the expectations of your users, then, broadly speaking, being well within SLO is a signal that you can move faster without causing those users pain. Conversely, burning most of (or, in the worst cases, multiples of) your error budget means you have to lift your foot off the accelerator
      • If you can afford to pay the reliability cost of a specific risk from your budget, you don't have to spend engineering effort mitigating or eliminating that risk

How SLOs help you balance operational and project work:

  • For those whose main task is to ensure that a system is operating reliably, a major concern is that operational work (firefighting, incident response, repetitive maintenance and upkeep tasks) crowds out the project work so vital to ensuring the service doesn't catch fire in the first place. Getting caught in this catch-22 leads to burnout, pager fatigue and demoralization, but if you know you're meeting your reliability targets, you can break out of this cycle of reactive response
  • The main question SLOs can help operations teams answer is: what is the right level of reliability for the system you support?
  • A common root cause for operational overload is that there is disagreement over the answer to this question or that it has never been discussed in the first place
  • Development organizations prioritize building new features over improving the reliability of past ones and often there is little executive support for changing these priorities because the cost of unreliability is not immediately obvious
  • If your SLOs have executive backing and development teams have committed to meeting them, they turn drawn-out arguments about prioritization into data-driven decisions.
    • Even better, an SLO can drive short-term operational response as well as long-term prioritization. If your system starts burning its error budget at an elevated rate over a short time horizon, that is a strong signal that users are having a bad time and someone should investigate. On the other hand, if there's no obvious harm to your users, it may be that you can ignore a low rate of errors and other negative operational signals and get back to your project work
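The "elevated burn rate" signal above can be sketched numerically. This is an illustrative calculation only (the function name and the 2x alerting threshold are my assumptions, not the course's tooling):

```python
def burn_rate(errors: int, total: int, slo: float) -> float:
    """How fast the error budget is burning: 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    observed_error_ratio = errors / total
    allowed_error_ratio = 1 - slo
    return observed_error_ratio / allowed_error_ratio

# Over the last hour: 60 errors out of 10,000 requests against a 99.9% SLO.
rate = burn_rate(errors=60, total=10_000, slo=0.999)
print(f"burn rate: {rate:.1f}x")
if rate > 2.0:  # assumed threshold for "elevated"
    print("elevated burn -> someone should investigate")
```

A sustained 6x burn rate would exhaust a 28-day budget in under 5 days, which is why a short-horizon spike is worth paging on.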

Making SLO's work for your organisation

  • For SLOs to work, all parts of the business must agree that they are an accurate measure of user experience, and must use them as the primary driver for decision making
  • Being out of SLO must have concrete, well-documented consequences that redirect engineering effort within the organization towards making reliability improvements. This in turn requires strong executive support for operations teams to enforce those consequences
    • Creating a sense of shared ownership is crucial: developers feel they have a shared responsibility to make the service reliable, and the operations team feels responsible for helping new features reach users as quickly as possible

Targeting Reliability

  • Three principles:
    • What to promise and to whom
    • Which metrics to measure, i.e. what makes your service "good"
    • How much reliability is "good enough"

SLO's vs SLA's

SLA: Service Level Agreement - Agreements with customers about the reliability of your service.

  • It has to have consequences if it's violated, otherwise there's no point in making one
    • e.g. partial refunds or extra service credits

SLO: Service Level Objectives - Thresholds that catch an issue before it breaches the SLA

  • SLOs should always be stronger than your SLAs, because the customer is almost always impacted before the SLA is breached
  • Violating SLA costs money!

SLA is an external promise, an SLO is an internal promise to meet customer expectations

Keep in mind that when you do violate your SLOs, it suddenly becomes really important not to have more outages. You'll want to take steps to remove risk from your service: slowing down the rate of change to the system by doing fewer pushes, and devoting engineering and automation effort to reducing and eliminating areas of risk.

What is a "reliable" service?

  • The Happiness Test
    • Users perceive a service to be unreliable when it fails to meet their expectations whatever those may be
      • Users whose expectations have not been met tend to get grumpy
      • If your service is performing exactly at its target SLOs, your average user would be happy with that performance; if it were any less reliable, you'd no longer be meeting their expectations
      • The challenge is quantifying and measuring the happiness of your customers, since you can't do this directly
      • Make sure you're thinking about all groups of your customers: people using mobile apps versus folks with a desktop browser, or those on a completely different continent or in a different market altogether

Measuring Reliability (e.g. Netflix)

  • Time to start playing
  • No interruptions or issues with playback
  1. Time to start playing = latency; the SLI (Service Level Indicator) is request latency
  2. Error rate = ratio of errors (or successes) to total requests, OR errors (or successes) relative to throughput (amount of data transmitted per second)

SLI pros, cons and tradeoffs

SLI = good events / valid events, expressed as a proportion of all valid events

How do you set SLO's for your SLI's?

  • An SLO is just a target that you get to pick; once you've decided on that target, you measure the performance of your SLIs against it over a period of time
  • Example: say our target SLO is "99% of requests will be served within 300ms over the last four weeks". When we measure our SLI, we see that only 95% of requests were served within 300ms in the past four weeks, thereby missing our target SLO
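The worked example above can be sketched in a few lines. The latency data is made up for illustration; the point is that the SLI is a good-events/valid-events ratio compared against the chosen target:

```python
def latency_sli(latencies_ms, threshold_ms=300):
    """Proportion of requests served within the threshold (good / valid events)."""
    good = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return good / len(latencies_ms)

# 95 fast requests and 5 slow ones over the window (illustrative data).
latencies = [120] * 95 + [450] * 5
sli = latency_sli(latencies)
slo_target = 0.99

print(f"SLI = {sli:.0%}, target = {slo_target:.0%}")
print("SLO met" if sli >= slo_target else "SLO missed")  # prints "SLO missed"
```

With a 95% SLI against a 99% target, the service has blown its error budget fivefold for the window.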

Edge Cases

  • e.g. Black Friday