Started first section of Course 4
parent c8a0bcabef
commit c51d10402b
281
Part_4.md
Normal file
@@ -0,0 +1,281 @@
Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
===============================================================================================

### Introduction

#### About the Course and Learning Path

Make better software, faster

#### Milestone: Getting Started

### Understanding Operations in Context

#### Section Introduction
#### What Is Ops?

- GCP Defined
  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_

- Logging Management
  - Gather logs, metrics, and traces everywhere
  - Audit, Platform, User logs
  - Export, Discard, Ingest

- Error Reporting
  - So much data: how do you pick out the important indicators?
  - A centralized error management interface that shows current & past errors
  - Identify your app's top or new errors at a glance, in a dedicated dashboard
  - Shows error details such as time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace

- Across-the-board Monitoring
  - Dashboards for built-in and customizable visualisations
- Monitoring Features
  - Visual Dashboards
  - Health Monitoring
    - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
  - Service Monitoring
    - Set and monitor Service Level Objectives (SLOs), and alert your teams as needed

- SRE Tracking
  - Monitoring is critical for SRE
  - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
  - Pinpoint SLIs and develop an SLO on top of them

- Operational Management
  - Debugging
    - Inspects the state of your application at any code location in production without stopping or slowing down requests
  - Latency Management
    - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
  - Performance Management
    - Offers continuous profiling of resource consumption in your production applications along with cost management
  - Security Management
    - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

**What is Ops: Key Takeaways**

- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems quickly

Scratch:

- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management
#### Clarifying the Stackdriver/Operations Connection

- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops the name

**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called Stackdriver `<service>`)

"Stackdriver" lives on - in the exam only

Integration + Upgrades

- Complete UI Integrations
  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
- Dashboard API
  - New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
  - Logs can now be retained for up to **10 years**, and you control the retention period
- Metrics Enhancement
  - In Cloud Monitoring, metrics are kept up to 24 months, and metric writes now support 10-second granularity (write out metrics every 10 seconds)
- Advanced Alert Routing
  - Alerts can now be routed to independent systems that support Cloud Pub/Sub
#### Operations and SRE: How Do They Relate?

- Lots of exam questions on SRE

What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (Founder, Google SRE Team)

**Pillars of DevOps**

- Accept failure as normal:
  - Try to anticipate, but...
  - Incidents are bound to occur
  - Failures help the team learn

- No-fault postmortems & SLOs:
  - No two failures are the same
  - Track incidents (SLIs)
  - Map to objectives (SLOs)

- Implement gradual change:
  - Small updates are better
  - Easier to review
  - Easier to roll back

- Reduce costs of failures:
  - Limited "canary" rollouts
  - Impact fewest users
  - Automate where possible

- Measure everything:
  - Critical gauge of success
  - CI/CD needs full monitoring
  - Synthetic, proactive monitoring

- Measure toil and reliability:
  - Key to SLOs and SLAs
  - Reduce toil, up engineering
  - Monitor all over time
<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLI: _"A carefully defined <u>quantitative</u> measure of some aspect of the level of service that is provided"_

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:
- Request Latency: How long it takes to return a response to a request
- Failure Rate: A fraction of all requests received: (unsuccessful requests / all requests)
- Batch Throughput: Proportion of time the data processing rate is greater than a threshold

**Commit to Memory - Google's 4x Golden Signals!**

- Latency
  - The time it takes for your service to fulfill a request
- Errors
  - The rate at which your service fails
- Traffic
  - How much demand is directed at your service
- Saturation
  - A measure of how close to fully utilized the service's resources are

> **LETS**
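The example SLIs above (request latency, failure rate, batch throughput) can be computed from raw measurements. The following is only a minimal sketch: the `Request` record, its field names, and the function names are hypothetical and are not part of Cloud Monitoring, and the latency percentile uses the simple nearest-rank method as an assumption.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    latency_ms: float
    ok: bool


def failure_rate(requests):
    """Failure rate SLI: unsuccessful requests / all requests."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def latency_percentile(requests, percentile):
    """Request latency SLI: nearest-rank percentile of latency in ms."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = math.ceil(percentile / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


def throughput_sli(rates, threshold):
    """Batch throughput SLI: proportion of samples with rate > threshold."""
    if not rates:
        return 0.0
    return sum(1 for rate in rates if rate > threshold) / len(rates)


# Made-up sample data, purely for illustration.
sample = [
    Request(120, True),
    Request(250, True),
    Request(900, False),
    Request(180, True),
]
print(failure_rate(sample))                             # 0.25
print(latency_percentile(sample, 95))                   # 900
print(throughput_sli([50, 80, 120, 40], threshold=60))  # 0.5
```

Latency and failure rate here correspond to the "Latency" and "Errors" golden signals; traffic and saturation would need request counts per time window and resource utilization data, which this sketch does not model.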
<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook

SLOs are tied to your SLIs
- Measured by SLIs
- Can be a single target value or range of values
  - SLI <= SLO
  - or
  - (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9%, 99.99% (four 9s)

SLI - Metrics over time that detail the health of a service
- example: `Site homepage request latency < 300ms over last 5 minutes @ 95th percentile`

SLO - Agreed-upon bounds for how often SLIs must be met
- example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
- Measure and track SLIs (Measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with dev team

General Availability Phase
- After Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs

**Ops & SRE: Key Takeaways**
- SRE: Operations from a software engineer's perspective
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- SLIs are lower than SLO - or - in-between upper and lower bound
- SREs are most active in limited availability and general availability phases
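The SLO comparisons in this section (`SLI <= SLO`, or a lower and upper bound) and the common availability targets can be worked through as simple arithmetic. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the single-target check follows the notes' convention for SLIs where lower is better (such as latency), and the downtime figures assume a 30-day window.

```python
def meets_slo(sli, target=None, lower=None, upper=None):
    """Check an SLI against a single target (SLI <= SLO) or a range."""
    if target is not None:
        return sli <= target
    if lower is None or upper is None:
        raise ValueError("provide a target or both lower and upper bounds")
    return lower <= sli <= upper


def allowed_downtime_minutes(slo_percent, window_days=30):
    """Downtime allowed by an availability SLO over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Latency SLI of 280 ms against a 300 ms target.
print(meets_slo(280, target=300))            # True
# SLI outside an assumed acceptable range.
print(meets_slo(0.5, lower=0.1, upper=0.4))  # False

# Allowed downtime per 30 days for the common SLOs listed above.
for slo in (99.5, 99.9, 99.99):
    print(slo, round(allowed_downtime_minutes(slo), 2))
```

Over 30 days (43,200 minutes), 99.5% allows about 216 minutes of downtime, 99.9% about 43.2 minutes, and 99.99% about 4.32 minutes, which shows how quickly each extra 9 shrinks the room for failure.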
#### Operation Services at a Glance

#### Section Review

#### Milestone: The Weight of the World (Teamwork, Not Superheroes)
### Monitoring Your Operations

#### Section Introduction

#### Cloud Monitoring Concepts

#### Monitoring Workspaces Concepts

#### Monitoring Workspaces

#### Perspective: Workspaces in Context

#### What Are Metrics?

#### Exploring Workspace and Metrics

#### Monitoring Agent Concepts

#### Installing the Monitoring Agent

#### Collecting Monitoring Agent Metrics

#### Integration with Monitoring API

#### Create Dashboards with Command Line

#### GKE Metrics

#### Perspective: What's Up, Doc?

#### Uptime Checks

#### Establishing Human-Actionable and Automated Alerts

#### Section Review

#### Milestone: Spies Everywhere! (Check Those Vitals!)

#### Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring

### Logging Activities

#### Section Introduction

#### Cloud Logging Fundamentals

#### Log Types and Mechanics

#### Cloud Logging Tour

#### Logging Agent Concepts

#### Install Logging Agent and Collect Agent Logs

#### Logging Filters

#### Hands-On with Advanced Filters

#### VPC Flow Logs

#### Firewall Logs

#### VPC Flow Logs and Firewall Logs Demo

#### Routing and Exporting Logs

#### Export Logs to BigQuery

#### Logs-Based Metrics

#### Section Review

#### Milestone: Let the Record Show

#### Hands-On Lab: Install and Configure Logging Agent on Google Cloud

### SRE and Alerting Policies

#### SLOs and Alerting Strategy

#### Service Monitoring

#### Milestone: Come Together, Right Now, SRE

### Optimize Performance with Trace/Profiler

#### Section Introduction

#### What the Services Do and Why They Matter

#### Tracking Latency with Cloud Trace

#### Accessing the Cloud Trace APIs

#### Setting Up Your App with Cloud Profiler

#### Analyzing Cloud Profiler Data

#### Section Review

#### Milestone: It All Adds Up!

#### Hands-On Lab: Discovering Latency with Google Cloud Trace

### Identifying Application Errors with Debug/Error Reporting

#### Section Introduction

#### Troubleshooting with Cloud Debugger

#### Establishing Error Reporting for Your App

#### Managing Errors and Handling Notifications

#### Section Review

#### Milestone: Come Together - Reprise (Debug Is De Solution)

#### Hands-On Lab: Correcting Code with Cloud Debugger

### Course Conclusion

#### Milestone: Are We There, Yet?

#### Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep