Started first section of Course 4

2021-02-03 16:54:24 +00:00 · 2021-02-03 16:54:24 +00:00 · c51d10402b
commit c51d10402b
parent c8a0bcabef
1 changed files with 281 additions and 0 deletions
--- a/Part_4.md
+++ b/Part_4.md
@@ -0,0 +1,281 @@
+Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
+===============================================================================================
+
+### Introduction 
+
+#### About the Course and Learning Path
+
+Make better software, faster
+
+#### Milestone: Getting Started
+
+### Understanding Operations in Context
+
+#### Section Introduction
+
+#### What Is Ops?
+
+- GCP Defined
+
+  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_
+
+  - Logging Management  
+    - Gather Logs, metrics and traces everywhere
+      - Audit, Platform, User logs
+        - Export, Discard, Ingest
+
+  - Error Reporting
+    - So much data: how do you pick out the important indicators?
+      - A centralized error management interface that shows current & past errors
+      - Identify your app's top or new errors at a glance, in a dedicated dashboard
+      - Shows error details such as time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace
+
+
+  - Across-the-board Monitoring
+    - Dashboards for built-in and customizable visualisations
+      - Monitoring Features
+        - Visual Dashboards
+    - Health Monitoring
+      - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
+    - Service Monitoring
+      - Set and monitor Service Level Objectives (SLOs), and alert your teams as needed
+
+  - SRE Tracking
+    - Monitoring is critical for SRE
+      - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
+        - Pinpoint SLIs and develop an SLO on top of them
+
+  - Operational Management
+    - Debugging
+      - Inspects the state of your application at any code location in production without stopping or slowing down requests
+    - Latency Management
+      - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
+    - Performance Management
+      - Offers continuous profiling of resource consumption in your production applications along with cost management
+    - Security Management
+      - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud
+
+**What is Ops: Key Takeaways**
+
+- Ops Defined: Watch, learn and fix
+- Primary services: Monitoring and Logging
+- Monitoring dashboards for all metrics, including health and services (SLOs)
+- Logs can be exported, discarded, or ingested
+- SRE depends on ops
+- Error alerting pinpoints problems, quickly
+
+Scratch:
+- Metric query and tracing analysis
+- Establish performance and reliability indicators
+- Trigger alerts and error reporting
+- Logging Features
+- Error Reporting
+- SRE Tracking (SLI/SLO)
+- Performance Management
+
+
+#### Clarifying the Stackdriver/Operations Connection
+
+- 2012 - Stackdriver Created
+- 2014 - Stackdriver Acquired by Google
+- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
+- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops name
+
+**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called Stackdriver `<service>`)
+
+"Stackdriver" lives on - in the exam only
+
+Integration + Upgrades
+
+- Complete UI Integrations
+  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
+- Dashboard API
+  - New API added to allow creation and sharing of dashboards across projects
+- Log Retention Increased
+  - Logs can now be retained for up to **10 years** and you have control over the time specified
+- Metrics Enhancement
+  - In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
+- Advanced Alert Routing
+  - Alerts can now be routed to independent systems that support Cloud Pub/Sub
+
+#### Operations and SRE: How Do They Relate?
+
+- Lots of questions on SRE in the exam
+
+What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (Founder Google SRE Team)
+
+**Pillars of DevOps**
+
+- Accept failure as normal:
+  - Try to anticipate, but...
+  - Incidents bound to occur
+  - Failures help team learn
+
+- No-fault postmortems & SLOs:
+  - No two failures the same
+  - Track incidents (SLIs)
+  - Map to Objectives (SLOs)
+
+- Implement gradual change:
+  - Small updates are better
+  - Easier to review
+  - Easier to rollback
+
+- Reduce costs of failures:
+  - Limited "canary" rollouts
+  - Impact fewest users
+  - Automate where possible
+
+- Measure everything:
+  - Critical gauge of success
+  - CI/CD needs full monitoring
+  - Synthetic, proactive monitoring
+
+- Measure toil and reliability:
+  - Key to SLOs and SLAs
+  - Reduce toil, up engineering
+  - Monitor all over time
+
+<hr style="height:2px;border-width:0;color:gray;background-color:gray">
+
+SLI: _"A carefully defined <u>quantitative</u> measure of some aspect of the level of service that is provided"_
+
+SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing
+
+Example SLIs:
+- Request Latency: How long it takes to return a response to a request
+- Failure Rate: A fraction of all requests received: (unsuccessful requests / all requests)
+- Batch Throughput: Proportion of time the data processing rate is greater than a threshold
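
The three example SLIs above can be sketched in a few lines of Python. This is only an illustration: the record layout, the sample values, and the 100 rows/s threshold are made up for the example and do not come from the course or from any Google Cloud API.

```python
from statistics import mean

# Hypothetical request records: (latency in ms, request succeeded?)
requests = [(120, True), (340, True), (95, False), (210, True), (180, True)]

# Hypothetical batch samples: rows processed per second, one sample per minute
throughput_samples = [150, 90, 130, 110, 80, 140]
THROUGHPUT_THRESHOLD = 100  # rows/s, illustrative value only


def request_latency_ms(records):
    """Mean time taken to return a response, in milliseconds."""
    return mean(latency for latency, _ in records)


def failure_rate(records):
    """Unsuccessful requests divided by all requests."""
    failed = sum(1 for _, ok in records if not ok)
    return failed / len(records)


def batch_throughput_sli(samples, threshold):
    """Proportion of samples where the processing rate exceeds the threshold."""
    above = sum(1 for rate in samples if rate > threshold)
    return above / len(samples)


print(request_latency_ms(requests))
print(failure_rate(requests))
print(batch_throughput_sli(throughput_samples, THROUGHPUT_THRESHOLD))
```

In practice these values would be read from a monitoring backend rather than from in-memory lists; the point here is only the shape of each calculation.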
+
+**Commit to Memory - Google's 4x Golden Signals!**
+
+- Latency
+  - The time it takes for your service to fulfill a request
+- Errors    
+  - The rate at which your service fails
+- Traffic
+  - How much demand is directed at your service
+- Saturation
+  - A measure of how close to fully utilized the service's resources are
+
+> **LETS**
+
+<hr style="height:2px;border-width:0;color:gray;background-color:gray">
+
+SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook
+
+SLOs are tied to your SLIs
+- Measured by SLI
+- Can be a single target value or range of values
+- SLIs <= SLO
+- or
+- (lower bound <= SLI <= upper bound) = SLO
+- Common SLOs: 99.5%, 99.9%, 99.99% (four nines)
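
A rough Python sketch of the two SLO shapes listed above (single target or range), plus the downtime each common SLO would leave if it is read as an availability target. The function names and the 30-day period are assumptions chosen for illustration, not part of the course material.

```python
def meets_slo(sli, target=None, lower=None, upper=None):
    """Return True when the SLI satisfies a single-target or range SLO.

    Follows the convention in the notes: SLI <= target,
    or lower bound <= SLI <= upper bound.
    """
    if target is not None:
        return sli <= target
    if lower is None or upper is None:
        raise ValueError("give either target, or both lower and upper")
    return lower <= sli <= upper


def allowed_downtime_minutes(slo_percent, period_days=30):
    """Minutes of unavailability an availability SLO leaves over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


print(meets_slo(250, target=300))            # latency SLI against a 300 ms target
print(meets_slo(0.5, lower=0.1, upper=0.4))  # SLI outside an assumed range
for slo in (99.5, 99.9, 99.99):
    print(slo, round(allowed_downtime_minutes(slo), 2))
```

The downtime figures are plain arithmetic on the percentages; a different measurement period changes them proportionally.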
+
+SLI - Metrics over time which detail the health of a service
+  - example: `Site homepage request latency < 300ms over last 5 minutes @ 95th percentile`
+
+SLO - Agreed-upon bounds on how often SLIs must be met
+  - example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
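
One way to picture how the SLI and SLO examples above fit together is the Python sketch below. It assumes a nearest-rank percentile and three hand-written 5-minute windows of latency samples; both are illustrative choices, not how Cloud Monitoring computes these values.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def window_is_good(latencies_ms, threshold_ms=300, pct=95):
    """SLI for one window: is the pct-th percentile latency under the threshold?"""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_compliance(windows, threshold_ms=300, pct=95):
    """Fraction of windows in which the SLI was met."""
    good = sum(1 for w in windows if window_is_good(w, threshold_ms, pct))
    return good / len(windows)


# Hypothetical homepage latency samples (ms) for three 5-minute windows
windows = [
    [120, 150, 180, 200, 210, 220, 230, 240, 250, 290],
    [100, 110, 130, 140, 160, 170, 190, 260, 280, 450],
    [90, 95, 100, 105, 110, 115, 120, 125, 130, 135],
]

compliance = slo_compliance(windows)
print(compliance)           # fraction of good windows
print(compliance >= 0.999)  # compared against the 99.9% SLO from the example
```

The SLI answers "was this window healthy?", while the SLO answers "how often must that be true?"; keeping the two functions separate mirrors that split.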
+
+Phases of Service Lifetime
+
+SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
+- Measure and track SLIs (Measuring increasing performance)
+- Evaluate reliability
+- Define SLOs
+- Build capacity models
+- Establish incident response, shared with dev team
+
+General Availability Phase
+- After Production Readiness Review passed
+- SREs handle majority of op work
+- Incident responses
+- Track operational load and SLOs
+
+**Ops & SRE: Key Takeaways**
+- SRE: Operations from a software engineer
+- Many shared pillars between DevOps/SRE
+- SLIs are quantitative metrics over time
+- Remember the 4x Google Golden Signals (LETS)
+- SLOs are a target objective for reliability
+- SLIs are lower than the SLO - or - in between the upper and lower bounds
+- SREs are most active in limited availability and general availability phases
+
+
+#### Operation Services at a Glance
+
+#### Section Review
+
+#### Milestone: The Weight of the World (Teamwork, Not Superheroes)
+
+
+Monitoring Your Operations
+Section Introduction
+Cloud Monitoring Concepts
+Monitoring Workspaces Concepts
+Monitoring Workspaces
+Perspective: Workspaces in Context
+What Are Metrics?
+Exploring Workspace and Metrics
+Monitoring Agent Concepts
+Installing the Monitoring Agent
+Collecting Monitoring Agent Metrics
+Integration with Monitoring API
+Create Dashboards with Command Line
+GKE Metrics
+Perspective: What's Up, Doc?
+Uptime Checks
+Establishing Human-Actionable and Automated Alerts
+Section Review
+Milestone: Spies Everywhere! (Check Those Vitals!)
+Hands-On Lab:
+Install and Configure Monitoring Agent with Google Cloud Monitoring
+Logging Activities
+Section Introduction
+Cloud Logging Fundamentals
+Log Types and Mechanics
+Cloud Logging Tour
+Logging Agent Concepts
+Install Logging Agent and Collect Agent Logs
+Logging Filters
+Hands-On with Advanced Filters
+VPC Flow Logs
+Firewall Logs
+VPC Flow Logs and Firewall Logs Demo
+Routing and Exporting Logs
+Export Logs to BigQuery
+Logs-Based Metrics
+Section Review
+Milestone: Let the Record Show
+Hands-On Lab:
+Install and Configure Logging Agent on Google Cloud
+SRE and Alerting Policies
+SLOs and Alerting Strategy
+Service Monitoring
+Milestone: Come Together, Right Now, SRE
+Optimize Performance with Trace/Profiler
+Section Introduction
+What the Services Do and Why They Matter
+Tracking Latency with Cloud Trace
+Accessing the Cloud Trace APIs
+Setting Up Your App with Cloud Profiler
+Analyzing Cloud Profiler Data
+Section Review
+Milestone: It All Adds Up!
+Hands-On Lab:
+Discovering Latency with Google Cloud Trace
+Identifying Application Errors with Debug/Error Reporting
+Section Introduction
+Troubleshooting with Cloud Debugger
+Establishing Error Reporting for Your App
+Managing Errors and Handling Notifications
+Section Review
+Milestone: Come Together - Reprise (Debug Is De Solution)
+Hands-On Lab:
+Correcting Code with Cloud Debugger
+Course Conclusion
+Milestone: Are We There, Yet?
+Practice Exam / Quiz:
+Google Certified Professional Cloud DevOps Engineer Exam Prep