# Part_1.md

Class SRE implements DevOps
BOPT
- (B)usiness - External forces; software development and the value stream
- (O)rganizational - Internal forces; a team deciding to structure itself using DevOps, and maybe more specifically SRE
- (P)rocess/Techniques - Human considerations; helps everyone on the team work together
- (T)echnology/Tools - Nuts and bolts; the specific tools used to implement CI/CD
Google's certifications are tied to a job class analysis.
- 2hr exam
PCDEs (Professional Cloud DevOps Engineers) are responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using GCP to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.
### What is the business of Software Development?
- Alignment - Operations with Development
- Software is always an investment (not always!)
  - ROI - Add business value
    - Sales/Marketing - Attract clients
    - Client support
    - Supplier integration
    - Internal automation
  - Direct costs
    - Initial development
    - Operations
    - Maintenance (Dev)
    - Enhancements (Dev)
  - Indirect costs
VALUE/COST = ROI (Get as much value as possible for as little cost as we can)
* A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has... Shipping is a feature. A really important feature. Your product must have it! (Joel Spolsky, co-founder of Stack Overflow)
* The fundamental unit of software development is a code _change_
* Every change has:
* Value
* Cost
* Risk!
* Everyone is on the team, the team needs to work together, and integrating work from multiple people is key (hence CI, or CI/CD)
### High Level Development Process Data Flow
ASK → Feedback → TRIAGE → Idea on Backlog / Story → CODE → Code Change → APPROVE → Codebase → BUILD → Build (n.) incl. unit tests → DELIVER → Deployable Build → DEPLOY → Running System
DevOps is all about structuring the business so that developers are just as responsible for what goes wrong in production as operations people are. Software development is a team sport.
DevOps is a structure that naturally leads to smaller and smaller changes. Devs find ways (better automated testing etc.) to shrink the impact of each change they make, so the potential negative impact of any one change is also smaller.
### What is Operations?
* Setting things up, initially
* Securing things
* Deploying new versions of the software/system
* Scaling to meet demand
* Patching infrastructure
* Backing up
* Addressing outages
* Recovering from backup
| Dev | Ops |
|--|--|
| Like buying the machine | Like running the machine |
| Judged by features | Judged by availability |
| - Not by quality | - Regardless of system quality |
### What is a DevOps Engineer?
* Just another (newer) name for operations/sysadmin?
* Someone responsible for CI/CD?
* A dev that does ops?
* An Ops person that does dev - like scripting?
* An Ops person that does dev - more than scripting? (Google's closest definition)
* A myth?
Trainer says that all these definitions are wrong.
* DevOps is not a person
* DevOps is a way to structure a team
* Shared responsibility for all of:
* Developing changes to their system
* Operating their system
* Ensuring quality of their system
* Managing risk (together!)
### What is an SRE?
* "What happens when a software engineer is tasked with what used to be called operations"
* Benjamin Treynor Sloss, founder of the Google SRE team
* Develop software to automate tasks all throughout the software development cycle
* Not just Ops
* Not just CI/CD
* Definitely also includes quality management
* An intentional development risk manager
* The true subject of Google's "Professional Cloud DevOps Engineer" certification
* Hint: That's this one
* Not going to be fully defined in this lesson
### What are the common problems? / What are their solutions?
Scale
| Problem | Solution |
| -- | -- |
| More users than expected | Architect service for scale |
| Bad actors ( e.g. DDoS) | Design scaling into ops, too |
| Bad handling | Build in protections |
| Bad design / Assumptions | Quality control / assurance |
| Intermittent failures | Code reviews |
| Uncommon events (corner cases) | Automated testing |
| Bad failure handling | Gradual rollouts |
| Code changes | Automated CI/CD (not manual steps) |
| Config changes | Progressive rollouts e.g. canary releases (groups of users) |
| Infrastructure changes | Timely monitoring |
| | Quick response (automatic) |
| | Safe rollbacks (automatic) |
| | Minimizing impact |
Tensions
* It's all about tradeoffs
* 100% is always the wrong availability target (for us, at least; not for e.g. medical systems such as pacemakers)
* Base decisions on data
* Hope is not a strategy
### [Exam Guide Walkthrough]( https://cloud.google.com/certification/guides/cloud-devops-engineer) - SRE
* Applying site reliability engineering principles to a service
Balance
1.1 Balance change, velocity & reliability of the service
- Need to understand how to discover SLIs (Service Level Indicators - availability, latency, etc.)
- SLI - Make sure everyone agrees on the same definition of reliability and, relatedly, performance
- Communicate meaningful information
- SLO - What we've agreed upon - internal targets
- Define SLOs and understand SLAs
- Agree to consequences of not meeting the error budget
- Construct feedback loops to decide what to build next (Understand the whole development cycle)
- Toil automation
Management
1.2 Manage service life cycle
- Manage a service (e.g. introduce a new service, deploy it, maintain it, and retire it)
- Plan for capacity (e.g. quotas and limits management - automatic/elastic scaling)
Culture
1.3 Ensure healthy communication and collaboration for operations
- Prevent burnout (e.g. set up automation processes to prevent burnout)
- Foster a learning culture
- Foster a culture of blamelessness (it's always the team's responsibility - things go wrong)
### Exam Guide Walkthrough - CI/CD
Design
2.1 Design CI/CD pipelines
- Immutable artifacts with Container Registry
- Artifact repositories with Container Registry
- Deployment strategies with Cloud Builder, Spinnaker
- Deployment to hybrid & multi-cloud environments with Anthos, Spinnaker, K8s
- Artifact versioning strategy with Cloud Build, Container Registry
- CI/CD pipeline triggers with Cloud Source Repositories, Cloud Build GitHub App, Cloud Pub/Sub
- Testing a new version with Spinnaker
- Configure deployment processes (e.g. approval flows)
Implement
2.2 Implement CI/CD pipelines
- CI with Cloud Build
- CD with Cloud Build
- Open source tooling (e.g. Jenkins, Spinnaker, Gitlab, Concourse)
- Auditing and tracing of deployments (e.g. CSR, Cloud Build, Cloud Audit Logs)
Config
2.3 Manage configuration and secrets
- Secure storage methods
- Secret rotation and config changes
IaC
2.4 Manage IaC (Infrastructure as Code)
- Terraform / Cloud Deployment Manager
- Infrastructure code versioning
- Make infrastructure changes safer
- Immutable architecture (Creating new resources to replace old ones - Big fan)
Tooling
2.5 Deploy CI/CD Tooling
- Centralised tools vs. multiple tools (single vs multi-tenant)
- Security of CI/CD tooling
Environments
2.6. Manage different development environments (e.g. staging, production, etc)
- Decide on the number of environments and their purpose
- Create environments dynamically per feature branch with GKE (namespaces), Cloud Deployment Manager
- Local development environments with Docker, Cloud Code, Skaffold
Pipeline Security
2.7. Secure the deployment pipeline
- Vulnerability scanning/analysis with Container Registry
- Binary authorisation (the cluster only allows approved binaries to be deployed to it)
- IAM policies per environment (least privilege)
### Exam Guide Walkthrough - Ops
Monitoring & Logging
3. Implementing service monitoring strategies
3.1 Manage application logs - Fluentd etc.
3.2 Manage application metrics with Stackdriver Monitoring (since rebranded as Cloud Monitoring)
3.3 Manage the Stackdriver Monitoring platform - Alerting, SLIs, SLOs, integrations with Grafana, setup with Terraform, sending to other tools e.g. Datadog, Splunk
3.4 Manage the Stackdriver Logging platform - Turning logs into metrics
3.5 Implementing logging and monitoring access controls - IAM/Security
4. Optimizing service performance
4.1 Identify service performance issues
4.2 Debug application code
4.3 Optimize resource utilisation
5. Manage Service Incidents
5.1 Coordinate roles & implement communication channels during a service incident
5.2 Investigate incident symptoms impacting users with Stackdriver IRM
5.3 Mitigate incident impact on users
5.4 Resolve issues (e.g. Cloud Build, Jenkins)
5.5 Document the issue in a postmortem (5 Whys)
*Teamwork*

# Part_2.md

Big Picture - What is SRE?
### 5 Key Pillars of DevOps + SRE
- Reduce organisational silos
- Bridge teams together
- Increase communication
- Shared company vision
**Share ownership**
- Developers + Operations
- Implement the same tooling
- Share the same techniques
-------------------------------------
- Accept failure as normal
- Try to anticipate, but
- Incidents are bound to occur
- Failures help the team learn
**No-fault post mortems & SLOs**
- No two failures the same (goal)
- Track incidents (SLIs)
- Map to objectives (SLOs)
--------------------------------------
- Implement gradual change
- Continuous change culture
- Small updates are better
- Easier to review
- Easier to rollback
**Reduce costs of failures**
- Limited "canary" rollouts (see the sketch at the end of this block)
- Impact fewest users
- Automate where possible for further cost reduction
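As a minimal sketch (not from the course) of how a limited canary rollout keeps the blast radius small, the snippet below deterministically buckets users between the stable and canary versions by hashing the user ID; the `route_version` helper and the 5% split are illustrative assumptions.
```python
import hashlib

CANARY_PERCENT = 5  # illustrative: send ~5% of users to the canary release

def route_version(user_id: str) -> str:
    """Deterministically bucket a user into 'canary' or 'stable'.

    Hashing the user ID keeps each user on the same version across
    requests, so a bad canary only ever impacts the same small group.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # 0-99
    return "canary" if bucket < CANARY_PERCENT else "stable"

if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    canary_users = sum(route_version(u) == "canary" for u in users)
    print(f"{canary_users} of {len(users)} users routed to the canary")
```
In practice a load balancer or service mesh would usually do this split, but the idea is the same: ramp `CANARY_PERCENT` up gradually and roll back if the canary's SLIs degrade.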
-------------------------------------
- Leverage Tooling and automation
- Reduce manual tasks
- The heart of the CI/CD pipelines
- Fosters speed & consistency
**Automate this years job away**
- Automation is a force multiplier
- Autonomous automation is best
- Centralizes mistakes
-------------------------------------
- Measure Everything
- Critical gauge of success
- CI/CD needs full monitoring
- Synthetic, proactive monitoring
**Measure toil and reliability**
- Key to SLOs and SLAs
- Reduce toil (aka repetitive manual labour!), increase engineering work
- Monitor all over time
-------------------------------------
Why "Reliability"
- Most important: does the product work?
- Reliability is the absence of errors
- An unstable service likely indicates a variety of underlying issues
- Must attend to reliability all the time
(class SRE) implements (DevOps): SRE is the "how", DevOps is the "what"
> Make better software, faster
### Understanding SLIs
SRE breaks down into 3 distinct functions
1. Define availability
    1. SLO
2. Determine level of availability
    1. SLI - Quantifiable measure of reliability; metrics over time, specific to a user journey, such as request/response, data processing or storage. Examples:
        1. Request latency - How long it takes to return a response to a request
        2. Failure rate - A fraction of all requests received: (unsuccessful requests / all requests)
        3. Batch throughput - Proportion of time where the data processing rate is faster than a threshold
3. Plan in case of failure
    1. SLA
> Each maps to a key component; SLO, SLI, SLA
What's a User Journey?
* Sequence of tasks central to user experience and crucial to service
* e.g. Online shopping journeys
* Product search
* Add to cart
* Checkout
Request/Response Journey:
* Availability - Proportion of valid requests served successfully
* Latency - Proportion of valid requests served faster than a threshold
* Quality - Proportion of valid requests served maintaining quality
> None of these maps to a specific user journey on its own, but they are all components of one
Data processing journey: Might include a different set of SLIs
* Freshness - Proportion of valid data updated more recently than a threshold
* Correctness - Proportion of valid data producing correct output
* Throughput - Proportion of time where the data processing rate is faster than a threshold
* Coverage - Proportion of valid data processed successfully
Google's 4 Golden Signals
* Latency - The time it takes for your service to fulfill a request
* Errors - The rate at which your service fails
* Traffic - How much demand is directed at your service
* Saturation - A measure of how close to fully utilized the service's resources are
Transparent SLIs within the GCP dashboard - APIs & Services
**The SLI Equation:**
SLI = (Good Events / Valid Events) * 100
Valid - Known-bad events are excluded from the SLI, e.g. HTTP 400 responses
**Bad SLI** - Variance and overlap in metrics prior to and during outages are problematic; graph contains up and down spikes during an outage
**Good SLI** - Stable signal with a strong correlation to outage is best; graph is smooth.
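As a rough illustration of the SLI equation above (my own example, not from the course), the snippet below computes an availability SLI from request counts, excluding known-bad requests such as HTTP 400 client errors from the set of valid events; the counts are made up.
```python
def availability_sli(total_requests: int,
                     known_bad_requests: int,
                     failed_requests: int) -> float:
    """SLI = (good events / valid events) * 100.

    Known-bad events (e.g. malformed client requests) are excluded from
    the valid set, so they neither help nor hurt the SLI.
    """
    valid = total_requests - known_bad_requests
    good = valid - failed_requests
    return (good / valid) * 100

# Made-up numbers: 100,000 requests, 2,000 client errors, 150 server-side failures
print(f"Availability SLI: {availability_sli(100_000, 2_000, 150):.3f}%")
```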
#### SLI Best Practices
1. Limit number of SLIs
* 3-5 per user journey
* Too many increase difficulty for operators
* Can lead to contradictions
2. Reduce complexity
* Not all metrics make good SLIs
* Increased response time
* Many false positives
3. Prioritize Journeys
* Select most valuable to users
* Identify user-centric events
4. Aggregate similar SLIs
* Collect data over time
* Turn into a rate, average, or percentile (see the sketch after this list)
5. Bucket to distinguish response classes
* Not all requests are the same
* Requesters may be human, background apps or bots
* Combine (or "bucket") for better SLIs
6. Collect data at load balancer
* Most efficient method
* Closer to the user's experience
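To make point 4 above concrete (this sketch is mine, not the trainer's), raw latency samples can be aggregated into a 95th-percentile figure and a threshold-based SLI, matching the 300ms homepage example later in these notes; the samples are randomly generated.
```python
import random

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_sli(samples: list[float], threshold_ms: float) -> float:
    """Proportion of requests served faster than the threshold, as a %."""
    good = sum(1 for s in samples if s < threshold_ms)
    return good / len(samples) * 100

# Made-up latency samples for one 5-minute window
random.seed(1)
window = [random.gauss(180, 60) for _ in range(10_000)]
print(f"p95 latency: {percentile(window, 95):.0f} ms")
print(f"SLI (< 300 ms): {latency_sli(window, 300):.2f}%")
```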
### Understanding SLOs
"SLO's specify a target level for the reliability of your service"
The First rule of SLOs: 100 % reliability is not a good objective
Why?
- Trying to reach 100%, 100% of the time, is very expensive in terms of resources
- Much more technically complex
- Users don't need 100% for the service to be acceptable (get close enough that users don't notice the difference; just reliable enough)
- Less than 100% leaves room for new features, as you have resources remaining to develop (error budgets)
SLOs are tied directly to SLIs
- Measured by SLI
- Can be a single target value or range of values
- e.g. SLI <= SLO, or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9% (3x 9's), 99.99% (4x 9's)
SLI - Metrics over time which detail the health of a service
```
Site homepage request latency < 300ms over the last 5 minutes @ 95th percentile
```
SLO - Agreed-upon bounds on how often SLIs must be met
```
The 95th percentile homepage latency SLI will succeed 99.9% of the time over the next year
```
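To tie the SLO back to the error budgets mentioned earlier (a hedged sketch of my own, with made-up traffic numbers), anything below 100% implies a budget of allowed failures, and you can track how much of it has been spent:
```python
def error_budget_report(slo: float, total_requests: int, failed_requests: int) -> None:
    """Show how much of the error budget implied by an SLO has been spent.

    slo is a fraction, e.g. 0.999 for "three nines".
    """
    budget_fraction = 1 - slo                       # e.g. 0.001
    allowed_failures = budget_fraction * total_requests
    spent = failed_requests / allowed_failures
    print(f"SLO {slo:.3%}: {allowed_failures:.0f} failures allowed, "
          f"{failed_requests} observed ({spent:.0%} of budget used)")

# A made-up month of traffic against a 99.9% availability SLO
error_budget_report(slo=0.999, total_requests=5_000_000, failed_requests=3_200)
```
When the budget is nearly spent, the agreed consequence is typically to slow feature work and focus on reliability until the budget recovers.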
SLOs - It is critical that there is buy-in from across the organisation. Make sure all stakeholders agree on the SLO: everyone on the same team working towards the same goals - developers, contributors, project managers, SREs, the Vice President.
#### Make your SLOs achievable
- Based on past performance
- Users' expectations are strongly tied to past performance
- If no historical data, you need to collect some
- Keep in mind: measurement != user satisfaction, and you may need to adjust your SLOs accordingly
#### In addition to achievable SLOs, you might have some aspirational SLOs
- Typically higher than your achievable SLOs
- Set a reasonable target and begin measuring
- Compare user feedback to SLOs
### Understanding SLAs
_"We've the determined the level of availability with our SLIs and declared what target of availability we want to reach with our SLO's and now we need to describe what happens if we don't maintain that availability with an SLA"_
_"An explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain"_
- Should reliability fail, there are consequences
#### SLA Characteristics
- A business-level agreement
- SREs are not usually involved in drafting SLAs, except to help define the SLIs and corresponding SLOs they contain
- Can be explicit or implied (implicit)
- Explicit contracts contain consequences
- Refund for services paid for
- Service cost reduction on sliding scales
- May be offered on a per service basis
41 SLAs in GCP
- Compute Engine
- 4x 9's for Instance uptime in multiple zones
- 99.5% uptime for a single instance
- 4x 9's for load balancing uptime
- If these aren't met, and the customer meets its own obligations, then the customer could be eligible for financial credits
- Clear definitions for language
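To see how an availability target translates into the uptime language SLAs use (an illustration of mine, not Google's actual SLA terms), the snippet converts common targets into the downtime they allow per 30-day month:
```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assume a 30-day month

for target in (0.995, 0.999, 0.9995, 0.9999):
    allowed_min = (1 - target) * MINUTES_PER_MONTH
    print(f"{target:.2%} uptime -> about {allowed_min:.0f} minutes of downtime per month")
```
If measured uptime falls below a tier like these, that is when the sliding-scale financial credits described above would come into play.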
SLIs drive SLOs, SLOs inform the SLA
- How does the SLO inform the SLA? Example: