gpcdesre/Part_1.md at main

Alex Soul 1bec7162af Linux Academy takedown by CC big brains

2021-02-09 17:36:03 +00:00

Class SRE implements DevOps

BOPT
- (B)usiness - External forces; software development and the value stream
- (O)rganizational - Internal forces; a team deciding to structure itself using DevOps, and maybe more specifically SRE
- (P)rocess/Techniques - Human considerations; helps everyone on the team work together
- (T)echnology/Tools - Nuts and bolts; specific tools to implement CI/CD

Google's certifications are tied to a job class analysis.

2hr exam

Professional Cloud DevOps Engineers (PCDEs) are responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using GCP to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.

Useful Links

1 - Medium Blog
2 - Exam Portal - Book Exam
3 - Sample Questions
4 - More Links and Resources from [1]
5 - Google SRE
6 - Google SRE
https://myblockchainexperts.org/gcpfreepracticequestions/

What is the business of Software Development?

Alignment - Operations with Development
Software is always an investment (not always!)
- ROI - Add business value
  - Sales/Marketing - Attract clients
  - Client support
  - Supplier integration
  - Internal Automation
- Direct Costs
  - Initial Development
  - Operations
  - Maintenance (Dev)
  - Enhancements (Dev)
- Indirect Costs

VALUE/COST = ROI (Get as much value as possible for as little cost as we can)
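As a rough illustration of the ratio above, the sketch below ranks two made-up changes by value over cost. The function name `value_to_cost` and all figures are hypothetical. Finance texts usually define ROI as (value - cost) / cost, which ranks the options in the same order.

```python
def value_to_cost(value: float, cost: float) -> float:
    """Return the value/cost ratio that these notes call ROI."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return value / cost


# Hypothetical changes: (name, estimated value, estimated total cost)
changes = [
    ("full UI redesign", 90_000, 60_000),
    ("self-service password reset", 40_000, 10_000),
]

# Highest ratio first: the most value for the least cost.
ranked = sorted(changes, key=lambda c: value_to_cost(c[1], c[2]), reverse=True)

for name, value, cost in ranked:
    print(f"{name}: {value_to_cost(value, cost):.1f}")
```

The smaller change comes out ahead here even though its absolute value is lower, which is the point of looking at the ratio rather than value alone.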

"A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has... Shipping is a feature. A really important feature. Your product must have it!" (Joel Spolsky, co-founder of Stack Overflow)
The fundamental unit of software development is a code change
Every change has:
- Value
- Cost
- Risk!
Every person is on the team, and the team needs to work together. Integrating work from multiple people is key (hence CI, or CI/CD).

High Level Development Process Data Flow

Feedback > TRIAGE > Idea on Backlog / Story > CODE > Code Change > ASK APPROVE > Codebase > BUILD > Build (n.) incl. unit tests > DELIVER > Deployable Build > DEPLOY > Running System
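The BUILD > DELIVER > DEPLOY part of this flow can be pictured as an ordered list of stages in which a failure stops everything after it. The following is only a minimal sketch; `run_pipeline` and the stage functions are hypothetical and stand in for real build tooling.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a function that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that passed.
    """
    passed = []
    for name, step in stages:
        if not step():
            break  # a failed stage blocks everything downstream
        passed.append(name)
    return passed


# Hypothetical stages: here the unit tests inside BUILD fail.
stages = [
    ("BUILD", lambda: False),
    ("DELIVER", lambda: True),
    ("DEPLOY", lambda: True),
]
print(run_pipeline(stages))
```

Because a broken build never reaches DELIVER or DEPLOY, the running system only ever sees changes that passed every earlier stage.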

DevOps is about structuring the business so that developers are just as responsible for what goes wrong in production as operations people are. Software development is a team sport.

DevOps is a structure that naturally leads to smaller and smaller changes. Devs find ways (better automated testing etc.) to shrink the impact of each thing they do, making code changes smaller and smaller so that the potential negative impact is also smaller.

What is Operations?

Setting things up, initially
Securing things
Deploying new versions of the software/system
Scaling to meet demand
Patching infrastructure
Backing up
Addressing outages
Recovering from backup

Dev	Ops
Like buying the machine	Like running the machine
Judged by features	Judged by availability
- Not by quality	- Regardless of system quality

What is a DevOps Engineer?

Just another (newer) name for operations/sysadmin
Responsible for CI/CD
A dev that does ops?
An Ops person that does dev - like scripting?
An Ops person that does dev - more than scripting? (Google's closest definition)
A myth?

Trainer says that all these definitions are wrong.

DevOps is not a person
DevOps is a way to structure a team
Shared responsibility for all of:
- Developing changes to their system
- Operating their system
- Ensuring quality of their system
- Managing risk (together!)

What is a SRE?

"What happens when a software engineer is tasked with what used to be called operations"
- Benjamin Treynor Sloss, founder of Google's SRE team
Develop software to automate tasks all throughout the software development cycle
- Not just Ops
- Not just CI/CD
- Definitely also includes quality management
An intentional development risk manager
The true subject of Google's "Professional Cloud DevOps Engineer" certification
- Hint: That's this one
Not going to be fully defined in this lesson

What are the common problems? / What are their solutions?

Scale

Problem	Solution
More users than expected	Architect service for scale
Bad actors (e.g. DDoS)	Design scaling into ops, too
Bad handling	Build in protections
Bad design / Assumptions	Quality control / assurance
Intermittent failures	Code reviews
Uncommon events (corner cases)	Automated testing
Bad failure handling	Gradual rollouts
Code changes	Automated CI/CD (not manual steps)
Config changes	Progressive rollouts e.g. canary releases (groups of users)
Infrastructure changes	Timely monitoring
	Quick response (automatic)
	Safe rollbacks (automatic)
	Minimizing impact
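A progressive rollout to groups of users, as in the canary release entry above, needs a stable way to decide which users see the new version. One common approach, sketched here under assumptions, is to hash the user ID into a fixed number of buckets. The function names and the choice of SHA-256 are illustrative and are not tied to any particular GCP product.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in the range [0, buckets)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def in_canary(user_id: str, percent: int) -> bool:
    """True if this user should get the new version at this rollout percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return bucket(user_id) < percent


# Hypothetical user IDs, widening the rollout step by step.
users = [f"user-{n}" for n in range(1000)]
for percent in (1, 10, 50, 100):
    served = sum(in_canary(u, percent) for u in users)
    print(f"{percent:>3}% rollout -> {served} of {len(users)} users on the new version")
```

Because the bucket depends only on the user ID, a user who is in the canary at 10% stays in it at 50%, so nobody flips back and forth between versions as the rollout widens.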

Tensions

It's all about tradeoffs
100% is always the wrong availability target (for us; not for e.g. medical systems such as pacemakers)
Base decisions on data
Hope is not a strategy

Exam Guide Walkthrough - SRE

Applying site reliability engineering principles to a service

Balance 1.1 Balance change, velocity & reliability of the service

Need to understand how to discover SLIs (Service Level Indicators - availability, latency, etc.)
- SLI - Makes sure everyone agrees on the same definition of reliability and, relatedly, performance
- Communicates meaningful information
- SLO - What we've agreed upon - internal targets
Define SLOs and understand SLAs
Agree to consequences of not meeting the error budget
Construct feedback loops to decide what to build next (Understand the whole development cycle)
Toil automation
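The relationship between an SLI, an SLO and the error budget in the list above can be made concrete with a little arithmetic. This is a minimal sketch that assumes a request-based availability SLI (good requests / total requests). The function names and the numbers are made up for illustration.

```python
def availability_sli(good: int, total: int) -> float:
    """Measured availability: the fraction of requests that were good."""
    if total == 0:
        return 1.0  # no traffic, nothing failed
    return good / total


def error_budget(slo: float, total: int) -> float:
    """Number of bad requests the SLO allows over this many requests."""
    if not 0.0 < slo < 1.0:
        # A 100% target leaves no budget at all, so it is rejected here.
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * total


def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget(slo, total)
    bad = total - good
    return (budget - bad) / budget


# Hypothetical 28-day window: 1,000,000 requests against a 99.9% SLO.
total, good, slo = 1_000_000, 999_400, 0.999
print(f"SLI: {availability_sli(good, total):.4%}")
print(f"Allowed bad requests: {error_budget(slo, total):.0f}")
print(f"Budget remaining: {budget_remaining(slo, good, total):.0%}")
```

A negative result from `budget_remaining` is the signal that the agreed consequences, such as pausing feature releases in favour of reliability work, should kick in.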

Management 1.2 Manage service life cycle

Manage a service (e.g. introduce a new service, deploy it, maintain and retire it)
Plan for capacity (e.g. quotas and limits management - automatic/elastic scaling)

Culture 1.3 Ensure healthy communication and collaboration for operations

Prevent burnout (e.g. set up automation processes to prevent burnout)
Foster a learning culture
Foster a culture of blamelessness (always the team's responsibility - things go wrong)

Exam Guide Walkthrough - CI/CD

Design 2.1 Design CI/CD pipelines

Immutable artifacts with Container Registry
Artifact repositories with Container Registry
Deployment strategies with Cloud Build, Spinnaker
Deployment to hybrid & multi-cloud environments with Anthos, Spinnaker, K8s
Artifact versioning strategy with Cloud Build, Container Registry
CI/CD pipeline triggers with Cloud Source Repositories, Cloud Build GitHub App, Cloud Pub/Sub
Testing a new version with Spinnaker
Configure deployment processes (e.g. approval flows)
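One simple artifact versioning strategy that fits the "immutable artifacts" item above is to tag every image with the commit it was built from rather than reusing a moving tag such as `latest`. The helper below is a sketch. The `gcr.io/PROJECT/IMAGE:TAG` shape is the usual Container Registry form, while the function name, project ID and image name are hypothetical.

```python
import re


def image_ref(project: str, image: str, commit_sha: str) -> str:
    """Build an image reference tagged with the short commit SHA."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit_sha):
        raise ValueError("commit_sha must be 7 to 40 lowercase hex characters")
    short_sha = commit_sha[:7]
    return f"gcr.io/{project}/{image}:{short_sha}"


# Hypothetical project and image names.
print(image_ref("my-project", "web-frontend", "1bec7162af"))
```

Each build then produces a distinct, traceable tag, so a rollback is just a deploy of an earlier tag.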

Implement 2.2 Implement CI/CD pipelines

CI with Cloud Build
CD with Cloud Build
Open source tooling (e.g. Jenkins, Spinnaker, GitLab, Concourse)
Auditing and tracing of deployments (e.g. CSR, Cloud Build, Cloud Audit Logs)

Config 2.3 Manage configuration and secrets

Secure storage methods
Secret rotation and config changes

IaC 2.4 Manage infrastructure as code (IaC)

Terraform / Cloud Deployment Manager
Infrastructure code versioning
Make infrastructure changes safer
Immutable architecture (Creating new resources to replace old ones - Big fan)

Tooling 2.5 Deploy CI/CD Tooling

Centralised tools vs. multiple tools (single vs multi-tenant)
Security of CI/CD tooling

Environments 2.6 Manage different development environments (e.g. staging, production, etc.)

Decide on the number of environments and their purpose
Create envs dynamically per feature branch with GKE (namespaces), Cloud Deployment Manager
Local development environments with Docker, Cloud Code, Skaffold
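Creating an environment per feature branch in GKE namespaces needs a namespace name derived from the branch name. Kubernetes namespace names must be valid DNS labels: lowercase letters, digits and `-`, at most 63 characters, starting and ending with a letter or digit. The sketch below shows one way to do the mapping. The function name and the `feat` prefix are assumptions made for the example.

```python
import re

MAX_LABEL_LENGTH = 63  # DNS label limit that applies to namespace names


def namespace_for_branch(branch: str, prefix: str = "feat") -> str:
    """Turn a Git branch name into a valid Kubernetes namespace name."""
    # Replace every run of disallowed characters with a single dash.
    slug = re.sub(r"[^a-z0-9]+", "-", branch.lower()).strip("-")
    name = f"{prefix}-{slug}" if slug else prefix
    # Truncate, then make sure the name does not end with a dash.
    return name[:MAX_LABEL_LENGTH].rstrip("-")


print(namespace_for_branch("feature/ABC-123_Add Login"))
```

Two long branch names that share their first 58 slug characters would collide after truncation. A real setup might append a short hash of the full branch name to avoid that.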

Pipeline Security 2.7 Secure the deployment pipeline

Vulnerability scanning/analysis with Container Registry
Binary Authorization (the cluster only allows approved binaries to be deployed to it)
IAM policies per environment (least privilege)

Exam Guide Walkthrough - Ops

Monitoring & Logging 3. Implementing service monitoring strategies

3.1 Manage application logs - fluentd etc.
3.2 Manage application metrics with Stackdriver Monitoring (Stackdriver has since been renamed; it is now Cloud Monitoring)
3.3 Manage Stackdriver Monitoring platform - alerting, SLIs, SLOs, integrations with Grafana, setup with Terraform, sending to other tools e.g. Datadog, Splunk
3.4 Manage Stackdriver Logging platform - turning logging into metrics
3.5 Implement logging and monitoring access controls - IAM/security
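Turning logging into metrics (3.4) comes down to counting log entries that match a filter, grouped by time window. The sketch below does this in plain Python on a made-up log format (`TIMESTAMP SEVERITY message`). It illustrates the idea only and does not use the Stackdriver API.

```python
from collections import Counter
from typing import Iterable


def errors_per_minute(lines: Iterable[str]) -> Counter:
    """Count ERROR entries per minute from 'TIMESTAMP SEVERITY message' lines."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 2:
            continue  # skip lines that do not match the assumed format
        timestamp, severity = parts[0], parts[1]
        if severity == "ERROR":
            # Keep "YYYY-MM-DDTHH:MM" so entries group by minute.
            counts[timestamp[:16]] += 1
    return counts


# Hypothetical application log lines.
sample_logs = [
    "2021-02-09T17:36:03Z INFO request served",
    "2021-02-09T17:36:41Z ERROR upstream timeout",
    "2021-02-09T17:36:59Z ERROR upstream timeout",
    "2021-02-09T17:37:10Z ERROR database unavailable",
]
for minute, count in sorted(errors_per_minute(sample_logs).items()):
    print(minute, count)
```

A per-minute error count like this is the kind of series an alerting policy or an SLI can then be built on.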

Optimizing service performance

4.1 Identify service performance issues
4.2 Debug application code
4.3 Optimize resource utilisation

Manage Service Incidents

5.1 Coordinate roles & implement communication channels during a service incident
5.2 Investigate incident symptoms impacting users with Stackdriver IRM
5.3 Mitigate incident impact on users
5.4 Resolve issues (e.g. Cloud Build, Jenkins)
5.5 Document issue in a postmortem (5 Whys)

Teamwork
