Class SRE implements DevOps
BOPT
- (B)usiness - External forces: software development and the value stream
- (O)rganizational - Internal forces: a team deciding to structure itself using DevOps, and maybe more specifically SRE
- (P)rocess/Techniques - Human considerations: helps everyone on the team to work together
- (T)echnology/Tools - Nuts and bolts: specific tools to implement CI/CD
Google's certifications are tied to a job class analysis.
- 2hr exam
Professional Cloud DevOps Engineers (PCDEs) are responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using GCP to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.
Useful Links
1. Medium Blog
2. Exam Portal - Book Exam
3. Sample Questions
4. More Links and Resources from [1]
5. Google SRE
6. Google SRE

https://myblockchainexperts.org/gcpfreepracticequestions/
What is the business of Software Development?
- Alignment - Operations with Development
- Software is always an investment (not always!)
- ROI - Add business value
- Sales/Marketing: Attract clients
- Client support
- Supplier integration
- Internal Automation
- Direct Costs
- Initial Development
- Operations
- Maintenance (Dev)
- Enhancements (Dev)
- Indirect Costs
- ROI - Add business value
VALUE/COST = ROI (Get as much value as possible for as little cost as we can)
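As a worked illustration of the value/cost ratio above, the sketch below ranks a few candidate changes. The change names and numbers are invented for illustration, and `Change`, `value_cost_ratio`, and `rank_changes` are hypothetical helpers, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str      # hypothetical label for a backlog item
    value: float   # estimated business value, in any consistent unit
    cost: float    # estimated cost, in the same unit


def value_cost_ratio(change: Change) -> float:
    """Return value divided by cost, the ratio these notes call ROI."""
    if change.cost <= 0:
        raise ValueError("cost must be positive")
    return change.value / change.cost


def rank_changes(changes: list[Change]) -> list[Change]:
    """Order changes from the highest to the lowest value/cost ratio."""
    return sorted(changes, key=value_cost_ratio, reverse=True)


if __name__ == "__main__":
    # Made-up backlog: a cheap fix can beat a large feature on this ratio.
    backlog = [
        Change("checkout redesign", value=90.0, cost=60.0),
        Change("fix login bug", value=40.0, cost=5.0),
        Change("internal report", value=10.0, cost=10.0),
    ]
    for change in rank_changes(backlog):
        print(f"{change.name}: {value_cost_ratio(change):.1f}")
```

The ratio ignores risk, which the notes list as the third property of every change, so it is only a starting point for prioritisation.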
- "A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has... Shipping is a feature. A really important feature. Your product must have it!" (Joel Spolsky, co-founder of Stack Overflow)
- The fundamental unit of software development is a code change
- Every change has:
  - Value
  - Cost
  - Risk!
- Every person is on the team and the team needs to work together, so integrating work from multiple people is key (hence CI or CI/CD)
High Level Development Process Data Flow
ASK > Feedback > TRIAGE > Idea on Backlog / Story > CODE > Code Change > APPROVE > Codebase > BUILD > Build (n.), inc. unit tests > DELIVER > Deployable Build > DEPLOY > Running System
DevOps is all about structuring the business so that developers are just as responsible for things that go wrong in production as operations people are. Software development is a team sport.
DevOps is a structure that naturally leads to smaller and smaller changes. Devs find ways (better automated testing, etc.) to shrink the impact of each thing they do, making code changes smaller and smaller so that the potential negative impact of each one is also smaller.
What is Operations?
- Setting things up, initially
- Securing things
- Deploying new versions of the software/system
- Scaling to meet demand
- Patching infrastructure
- Backing up
- Address outages
- Recovering from backup
| Dev | Ops |
|---|---|
| Like buying the machine | Like running the machine |
| Judged by features | Judged by availability |
| - Not by quality | - Regardless of system quality |
What is a DevOps Engineer?
- just another (newer) name for operations/sysadmin
- responsible for CI/CD
- A dev that does ops?
- An Ops person that does dev - like scripting?
- An Ops person that does dev - more than scripting? (Google's closest definition)
- A myth?
Trainer says that all these definitions are wrong.
- DevOps is not a person
- DevOps is a way to structure a team
- Shared responsibility for all of:
- Developing changes to their system
- Operating their system
- Ensuring quality of their system
- Managing risk (together!)
What is a SRE?
- "What happens when a software engineer is tasked with what used to be called operations"
  - Benjamin Treynor Sloss, founder of Google's SRE team
- Develops software to automate tasks throughout the software development cycle
  - Not just Ops
  - Not just CI/CD
  - Definitely also includes quality management
- An intentional development risk manager
- The true subject of Google's "Professional Cloud DevOps Engineer" certification
  - Hint: That's this one
- Not going to be fully defined in this lesson
What are the common problems? / What are their solutions?
Scale
| Problem | Solution |
|---|---|
| More users than expected | Architect service for scale |
| Bad actors ( e.g. DDoS) | Design scaling into ops, too |
| Bad handling | Build in protections |
| Bad design / Assumptions | Quality control / assurance |
| Intermittent failures | Code reviews |
| Uncommon events (corner cases) | Automated testing |
| Bad failure handling | Gradual rollouts |
| Code changes | Automated CI/CD (not manual steps) |
| Config changes | Progressive rollouts e.g. canary releases (groups of users) |
| Infrastructure changes | Timely monitoring |
| | Quick response (automatic) |
| | Safe rollbacks (automatic) |
| | Minimizing impact |
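The progressive rollout idea in the table (canary releases to groups of users) can be sketched as a stable bucketing function. This is a minimal illustration, not how any particular Google Cloud product implements it; `in_canary`, the `salt` value, and the choice of 100 buckets are all assumptions made for the example.

```python
import hashlib


def in_canary(user_id: str, percent: int, salt: str = "release-1") -> bool:
    """Return True if this user falls inside the canary group.

    Hashing the user ID gives each user a stable bucket from 0 to 99,
    so the same user stays in (or out of) the canary across requests.
    The salt is a hypothetical per-release label that reshuffles buckets.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(1000)]
    for percent in (1, 10, 50, 100):
        exposed = sum(in_canary(user, percent) for user in users)
        print(f"{percent}% rollout reaches {exposed} of {len(users)} users")
```

Because a user's bucket never changes for a given salt, raising the percentage only adds users to the canary; nobody who already has the new version is moved back to the old one.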
Tensions
- It's all about tradeoffs
- 100% is always the wrong availability target (for us; not for e.g. medical systems such as pacemakers)
- Base decisions on data
- Hope is not a strategy
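To put numbers on the availability tradeoff, the sketch below converts an availability target into the downtime it permits. The function name and the 30-day window are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_percent: float, window_days: int = 30) -> float:
    """Return the minutes of downtime a target permits over the window."""
    if not 0 < availability_percent <= 100:
        raise ValueError("availability_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Each extra nine cuts the allowance by a factor of ten.
    for target in (99.0, 99.9, 99.99, 100.0):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

At 100% the allowance is zero, which leaves no room for any change that carries risk.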
Exam Guide Walkthrough - SRE
- Applying site reliability engineering principles to a service
Balance 1.1 Balance change, velocity & reliability of the service
- Need to understand how to discover SLIs (Service Level Indicators: availability, latency, etc.)
- SLI - Make sure everyone agrees on the same definition of reliability and, relatedly, performance
- Communicates meaningful information
- SLO - What we've agreed upon; internal targets
- Define SLOs and understand SLAs
- Agree to consequences of not meeting the error budget
- Construct feedback loops to decide what to build next (Understand the whole development cycle)
- Toil automation
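The relationship between an SLI, an SLO, and the error budget can be sketched with request counts. This is a minimal example under stated assumptions: the SLI is request-based availability, and both function names are hypothetical.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the measured SLI: the fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= good_requests <= total_requests:
        raise ValueError("good_requests must be between 0 and total_requests")
    return good_requests / total_requests


def error_budget_remaining(slo_target: float, good_requests: int, total_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    The budget is the share of requests the SLO allows to fail.
    A negative result means the budget has been overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    availability_sli(good_requests, total_requests)  # reuse the input checks
    allowed_failures = (1 - slo_target) * total_requests
    actual_failures = total_requests - good_requests
    return 1 - actual_failures / allowed_failures


if __name__ == "__main__":
    # Made-up month: 1,000,000 requests, 250 failures, 99.9% SLO.
    sli = availability_sli(999_750, 1_000_000)
    remaining = error_budget_remaining(0.999, 999_750, 1_000_000)
    print(f"SLI {sli:.5f}, error budget remaining {remaining:.0%}")
```

The agreed consequences mentioned above would typically hang off this number, for example slowing feature releases once the remaining budget drops below zero.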
Management 1.2 Manage service life cycle
- Manage a service (e.g. introduce a new service, deploy it, maintain and retire it)
- Plan for capacity (e.g. quotas and limits management - automatic/elastic scaling)
Culture 1.3 Ensure healthy communication and collaboration for operations
- Prevent burnout (e.g. set up automation processes to prevent burnout)
- Foster a learning culture
- Foster a culture of blamelessness (things go wrong; it is always the team's responsibility)
Exam Guide Walkthrough - CI/CD
Design 2.1 Design CI/CD pipelines
- Immutable artifacts with Container Registry
- Artifact repositories with Container Registry
- Deployment strategies with Cloud Build, Spinnaker
- Deployment to hybrid & multi-cloud environments with Anthos, Spinnaker, K8s
- Artifact versioning strategy with Cloud Build, Container Registry
- CI/CD pipeline triggers with Cloud Source Repositories, Cloud Build GitHub App, Cloud Pub/Sub
- Testing a new version with Spinnaker
- Configure deployment processes (e.g. approval flows)
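One common artifact versioning strategy is to tag each image with the commit that produced it, so a running container can always be traced back to its source. The sketch below only builds the reference string; `image_reference` is a hypothetical helper, and the project and image names are placeholders.

```python
import re


def image_reference(project_id: str, image: str, commit_sha: str) -> str:
    """Build a Container Registry image reference tagged with a short commit SHA."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError("commit_sha must be 7 to 40 lowercase hex characters")
    short_sha = commit_sha[:7]  # assumption: seven characters are unique enough
    return f"gcr.io/{project_id}/{image}:{short_sha}"


if __name__ == "__main__":
    # Placeholder values, not a real project or commit.
    print(image_reference("my-project", "web", "0123456789abcdef0123456789abcdef01234567"))
```

Tagging by commit rather than reusing a tag such as `latest` keeps each artifact immutable in practice: a given tag always refers to the same build.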
Implement 2.2 Implement CI/CD pipelines
- CI with Cloud Build
- CD with Cloud Build
- Open source tooling (e.g. Jenkins, Spinnaker, GitLab, Concourse)
- Auditing and tracing of deployments (e.g. CSR, Cloud Build, Cloud Audit Logs)
Config 2.3 Manage configuration and secrets
- Secure storage methods
- Secret rotation and config changes
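Secret rotation usually starts from a policy such as "no secret version older than N days". The check below is a minimal sketch of that policy only; `needs_rotation` is a hypothetical helper and does not call any secret storage service.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def needs_rotation(created_at: datetime, max_age: timedelta,
                   now: Optional[datetime] = None) -> bool:
    """Return True when a secret version has reached the allowed age.

    Both datetimes are assumed to be timezone-aware; mixing naive and
    aware values raises TypeError in Python.
    """
    current = now if now is not None else datetime.now(timezone.utc)
    return current - created_at >= max_age


if __name__ == "__main__":
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    checked = datetime(2024, 4, 15, tzinfo=timezone.utc)
    # Hypothetical 90-day rotation policy.
    print(needs_rotation(created, timedelta(days=90), now=checked))
```

A scheduled job could run a check like this against each secret's creation time and trigger the rotation and the matching config change together.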
IaC 2.4 Manage infrastructure as code (IaC)
- Terraform / Cloud Deployment Manager
- Infrastructure code versioning
- Make infrastructure changes safer
- Immutable architecture (Creating new resources to replace old ones - Big fan)
Tooling 2.5 Deploy CI/CD Tooling
- Centralised tools vs. multiple tools (single vs multi-tenant)
- Security of CI/CD tooling
Environments 2.6. Manage different development environments (e.g. staging, production, etc)
- Decide on the number of environments and their purpose
- Create envs dynamically per feature branch with GKE (namespaces), Cloud Deployment Manager
- Local development environments with Docker, Cloud Code, Skaffold
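Creating an environment per feature branch with GKE namespaces needs a namespace name derived from the branch name. Kubernetes namespace names must be lowercase alphanumerics or `-`, start and end with an alphanumeric, and be at most 63 characters. The sketch below applies those rules; the function name and the `feat` prefix are assumptions for illustration.

```python
import re


def namespace_for_branch(branch: str, prefix: str = "feat") -> str:
    """Turn a Git branch name into a valid Kubernetes namespace name."""
    # Replace every run of disallowed characters with a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    if not slug:
        raise ValueError("branch name contains no usable characters")
    # Truncate to the 63-character limit, then drop any trailing hyphen.
    return f"{prefix}-{slug}"[:63].rstrip("-")


if __name__ == "__main__":
    print(namespace_for_branch("feature/Add_Login!"))
```

Two long branch names that share their first characters can collide after truncation; appending a short hash of the full branch name is one way around that.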
Pipeline Security 2.7. Secure the deployment pipeline
- Vulnerability scanning/analysis with Container Registry
- Binary Authorization (the cluster only allows approved binaries to be deployed to it)
- IAM policies per environment (least privilege)
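The idea behind binary authorisation, that a cluster admits only approved binaries, can be illustrated with a digest allowlist. This is a conceptual sketch only: it is not the Binary Authorization API, and `is_deploy_allowed` and the digests are hypothetical.

```python
def is_deploy_allowed(image_ref: str, approved_digests: set[str]) -> bool:
    """Allow only images pinned by digest whose digest has been approved."""
    _, separator, digest = image_ref.partition("@")
    if not separator:
        # Tag-only references are mutable, so they are rejected outright.
        return False
    return digest in approved_digests


if __name__ == "__main__":
    approved = {"sha256:aaa111"}  # placeholder digest, shortened for readability
    print(is_deploy_allowed("gcr.io/my-project/web@sha256:aaa111", approved))
    print(is_deploy_allowed("gcr.io/my-project/web:latest", approved))
```

Checking digests rather than tags matters because a tag can be pointed at a different image after approval, while a digest identifies one exact build.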
Exam Guide Walkthrough - Ops
Monitoring & Logging 3. Implementing service monitoring strategies
- 3.1 Manage application logs - Fluentd etc.
- 3.2 Manage application metrics with Stackdriver Monitoring (Stackdriver has since been renamed; it is now Cloud Monitoring)
- 3.3 Manage Stackdriver Monitoring platform - Alerting, SLIs, SLOs, integrations with Grafana, setup with Terraform, sending to other tools e.g. Datadog, Splunk
- 3.4 Manage Stackdriver Logging platform - Turning logging into metrics
- 3.5 Implementing logging and monitoring access controls - IAM/Security
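Turning logging into metrics (3.4) means counting log entries that match a filter and treating the count as a time series. The sketch below shows the idea locally on plain text lines; the log format (`SEVERITY message`) and the function name are assumptions, and it does not use the Cloud Logging API.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed line format: an uppercase severity, whitespace, then the message.
LOG_LINE = re.compile(r"^(?P<severity>[A-Z]+)\s+(?P<message>.*)$")


def count_by_severity(lines: Iterable[str]) -> Counter:
    """Count log lines per severity, skipping lines that do not match."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("severity")] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "ERROR database timeout",
        "INFO request served",
        "ERROR upstream unavailable",
        "not a log line",
    ]
    print(count_by_severity(sample))
```

A count like the number of `ERROR` lines per minute is the kind of derived metric an alerting policy or an SLI can then be built on.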
Optimizing service performance
- 4.1 Identify service performance issues
- 4.2 Debug application code
- 4.3 Optimize resource utilisation
Manage Service Incidents
- 5.1 Coordinate roles & implement communication channels during a service incident
- 5.2 Investigate incident symptoms impacting users with Stackdriver IRM
- 5.3 Mitigate incident impact on users
- 5.4 Resolve issues (e.g. Cloud Build, Jenkins)
- 5.5 Document issue in a postmortem (5 Whys)
Teamwork