From c51d10402ba4fee2843f86525826632ee885a2e9 Mon Sep 17 00:00:00 2001
From: Alex Soul
Date: Wed, 3 Feb 2021 16:54:24 +0000
Subject: [PATCH] Started first section of Course 4

---
 Part_4.md | 281 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 281 insertions(+)
 create mode 100644 Part_4.md

diff --git a/Part_4.md b/Part_4.md
new file mode 100644
index 0000000..151be45
--- /dev/null
+++ b/Part_4.md
@@ -0,0 +1,281 @@
+Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
+===============================================================================================
+
+### Introduction
+
+#### About the Course and Learning Path
+
+Make better software, faster
+
+#### Milestone: Getting Started
+
+### Understanding Operations in Context
+
+#### Section Introduction
+
+#### What Is Ops?
+
+- GCP Defined
+
+  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_
+
+  - Logging Management
+    - Gather logs, metrics, and traces everywhere
+    - Audit, Platform, User logs
+    - Export, Discard, Ingest
+
+  - Error Reporting
+    - So much data: how do you pick out the important indicators?
+    - A centralized error management interface that shows current & past errors
+    - Identify your app's top or new errors at a glance, in a dedicated dashboard
+    - Shows error details such as time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace
+
+
+  - Across-the-board Monitoring
+    - Dashboards for built-in and customizable visualisations
+    - Monitoring Features
+      - Visual Dashboards
+      - Health Monitoring
+        - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
+      - Service Monitoring
+        - Set and monitor Service Level Objectives (SLOs), and alert your teams as needed
+
+  - SRE Tracking
+    - Monitoring is critical for SRE
+    - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
+    - Pinpoint SLIs and develop an SLO on top of them
+
+  - Operational Management
+    - Debugging
+      - Inspects the state of your application at any code location in production without stopping or slowing down requests
+    - Latency Management
+      - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
+    - Performance Management
+      - Offers continuous profiling of resource consumption in your production applications along with cost management
+    - Security Management
+      - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud
+
+**What is Ops: Key Takeaways**
+
+- Ops Defined: Watch, learn and fix
+- Primary services: Monitoring and Logging
+- Monitoring dashboards for all metrics, including health and services (SLOs)
+- Logs can be exported, discarded, or ingested
+- SRE depends on ops
+- Error alerting pinpoints problems quickly
+
+Scratch:
+- Metric query and tracing analysis
+- Establish performance and reliability indicators
+- Trigger alerts and error reporting
+- Logging Features
+- Error Reporting
+- SRE Tracking (SLI/SLO)
+- Performance Management
+
+
+#### Clarifying the Stackdriver/Operations Connection
+
+- 2012 - Stackdriver Created
+- 2014 - Stackdriver Acquired by Google
+- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
+- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops the name
+
+**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called Stackdriver)
+"Stackdriver" lives on - in the exam only
+
+Integration + Upgrades
+
+- Complete UI Integrations
+  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
+- Dashboard API
+  - New API added to allow creation and sharing of dashboards across projects
+- Log Retention Increased
+  - Logs can now be retained for up to **10 years**, and you control the retention period
+- Metrics Enhancement
+  - In Cloud Monitoring, metrics are kept for up to 24 months, and write granularity has been increased to 10 seconds (metrics can be written every 10 seconds)
+- Advanced Alert Routing
+  - Alerts can now be routed to independent systems that support Cloud Pub/Sub
+
+#### Operations and SRE: How Do They Relate?
+
+- Lots of questions in the exam on SRE
+
+What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (founder of the Google SRE team)
+
+**Pillars of DevOps**
+
+- Accept failure as normal:
+  - Try to anticipate, but...
+  - Incidents are bound to occur
+  - Failures help the team learn
+
+- No-fault postmortems & SLOs:
+  - No two failures are the same
+  - Track incidents (SLIs)
+  - Map to Objectives (SLOs)
+
+- Implement gradual change:
+  - Small updates are better
+  - Easier to review
+  - Easier to roll back
+
+- Reduce costs of failures:
+  - Limited "canary" rollouts
+  - Impact fewest users
+  - Automate where possible
+
+- Measure everything:
+  - Critical gauge of success
+  - CI/CD needs full monitoring
+  - Synthetic, proactive monitoring
+
+- Measure toil and reliability:
+  - Key to SLOs and SLAs
+  - Reduce toil, up engineering
+  - Monitor all over time
+
+
+SLI: _"A carefully defined quantitative measure of some aspect of the level of service that is provided"_
+
+SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing
+
+Example SLIs:
+- Request Latency: How long it takes to return a response to a request
+- Failure Rate: A fraction of all requests received (unsuccessful requests / all requests)
+- Batch Throughput: Proportion of time the data processing rate is above a threshold
+
+**Commit to Memory - Google's 4x Golden Signals!**
+
+- Latency
+  - The time it takes for your service to fulfill a request
+- Errors
+  - The rate at which your service fails
+- Traffic
+  - How much demand is directed at your service
+- Saturation
+  - A measure of how close to fully utilized the service's resources are
+
+> **LETS**
+
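The example SLIs above reduce to simple ratios over a window of observations. The following is a minimal sketch of two of them, assuming a hypothetical `Request` record with a latency and a success flag; the record shape, the function names, and the sample window are illustrative only and are not part of the course material or of any Google Cloud API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    latency_ms: float
    ok: bool


def failure_rate(requests: Sequence[Request]) -> float:
    """Failure-rate SLI: unsuccessful requests / all requests."""
    if not requests:
        return 0.0  # assumption: an empty window counts as no failures
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def fraction_faster_than(requests: Sequence[Request], threshold_ms: float) -> float:
    """Latency SLI: share of requests answered in under threshold_ms."""
    if not requests:
        return 1.0  # assumption: an empty window counts as fully within target
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


# Made-up sample window of four requests.
window = [
    Request(120.0, True),
    Request(250.0, True),
    Request(310.0, True),
    Request(90.0, False),
]
print(failure_rate(window))                 # 0.25
print(fraction_faster_than(window, 300.0))  # 0.75
```

In practice these ratios would be computed by the monitoring system over a rolling window rather than by hand; the sketch only shows the arithmetic behind the definitions.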
+
+SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook
+
+SLOs are tied to your SLIs
+- Measured by SLI
+- Can be a single target value or range of values
+- SLI <= SLO
+- or
+- (lower bound <= SLI <= upper bound) = SLO
+- Common SLOs: 99.5%, 99.9%, 99.99% (4x 9s)
+
+SLI - Metrics over time which detail the health of a service
+  - example: `Site homepage latency requests < 300ms over last 5 minutes @ 95th percentile`
+
+SLO - Agreed-upon bounds on how often SLIs must be met
+  - example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
+
+Phases of Service Lifetime
+
+SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
+- Measure and track SLIs (measuring increasing performance)
+- Evaluate reliability
+- Define SLOs
+- Build capacity models
+- Establish incident response, shared with dev team
+
+General Availability Phase
+- After Production Readiness Review passed
+- SREs handle majority of op work
+- Incident responses
+- Track operational load and SLOs
+
+**Ops & SRE: Key Takeaways**
+- SRE: Operations as handled by a software engineer
+- Many shared pillars between DevOps/SRE
+- SLIs are quantitative metrics over time
+- Remember the 4x Google Golden Signals (LETS)
+- SLOs are a target objective for reliability
+- SLIs are lower than the SLO - or - in between the upper and lower bound
+- SREs are most active in limited availability and general availability phases
+
+
+#### Operation Services at a Glance
+
+#### Section Review
+
+#### Milestone: The Weight of the World (Teamwork, Not Superheroes)
+
+
+Monitoring Your Operations
+Section Introduction
+Cloud Monitoring Concepts
+Monitoring Workspaces Concepts
+Monitoring Workspaces
+Perspective: Workspaces in Context
+What Are Metrics?
+Exploring Workspace and Metrics
+Monitoring Agent Concepts
+Installing the Monitoring Agent
+Collecting Monitoring Agent Metrics
+Integration with Monitoring API
+Create Dashboards with Command Line
+GKE Metrics
+Perspective: What's Up, Doc?
+Uptime Checks
+Establishing Human-Actionable and Automated Alerts
+Section Review
+Milestone: Spies Everywhere! (Check Those Vitals!)
+Hands-On Lab:
+Install and Configure Monitoring Agent with Google Cloud Monitoring
+Logging Activities
+Section Introduction
+Cloud Logging Fundamentals
+Log Types and Mechanics
+Cloud Logging Tour
+Logging Agent Concepts
+Install Logging Agent and Collect Agent Logs
+Logging Filters
+Hands-On with Advanced Filters
+VPC Flow Logs
+Firewall Logs
+VPC Flow Logs and Firewall Logs Demo
+Routing and Exporting Logs
+Export Logs to BigQuery
+Logs-Based Metrics
+Section Review
+Milestone: Let the Record Show
+Hands-On Lab:
+Install and Configure Logging Agent on Google Cloud
+SRE and Alerting Policies
+SLOs and Alerting Strategy
+Service Monitoring
+Milestone: Come Together, Right Now, SRE
+Optimize Performance with Trace/Profiler
+Section Introduction
+What the Services Do and Why They Matter
+Tracking Latency with Cloud Trace
+Accessing the Cloud Trace APIs
+Setting Up Your App with Cloud Profiler
+Analyzing Cloud Profiler Data
+Section Review
+Milestone: It All Adds Up!
+Hands-On Lab:
+Discovering Latency with Google Cloud Trace
+Identifying Application Errors with Debug/Error Reporting
+Section Introduction
+Troubleshooting with Cloud Debugger
+Establishing Error Reporting for Your App
+Managing Errors and Handling Notifications
+Section Review
+Milestone: Come Together - Reprise (Debug Is De Solution)
+Hands-On Lab:
+Correcting Code with Cloud Debugger
+Course Conclusion
+Milestone: Are We There, Yet?
+landscape
+Practice Exam / Quiz:
+Google Certified Professional Cloud DevOps Engineer Exam Prep
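Returning to the SLO bounds and common targets listed in the Operations and SRE notes, the sketch below shows the two checks as plain arithmetic: whether an SLI sits within its SLO bound(s), and how many minutes per year a given percentage target leaves for misses. The function names and the non-leap-year constant are assumptions made for illustration; nothing here comes from the course or from a Google Cloud API.

```python
from typing import Optional

MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def meets_slo(sli: float, upper: float, lower: Optional[float] = None) -> bool:
    """True when SLI <= upper, or lower <= SLI <= upper when a range is given."""
    if lower is None:
        return sli <= upper
    return lower <= sli <= upper


def allowed_miss_minutes(target_percent: float,
                         period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Minutes in the period during which the SLI may miss its target."""
    return period_minutes * (100.0 - target_percent) / 100.0


# Single upper bound: a 240 ms latency SLI against a 300 ms objective.
print(meets_slo(240.0, upper=300.0))

# Range form: the SLI must sit between a lower and an upper bound.
print(meets_slo(0.7, upper=0.9, lower=0.5))

# Yearly allowance for the common targets 99.5%, 99.9%, 99.99%.
for target in (99.5, 99.9, 99.99):
    print(target, round(allowed_miss_minutes(target), 1))
```

Each extra nine cuts the yearly allowance by a factor of ten, which is why the step from 99.9% to 99.99% is much harder to meet than the numbers suggest.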