- _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_
- Logging Management
- Gather Logs, metrics and traces everywhere
- Audit, Platform, User logs
- Export, Discard, Ingest
- Error Reporting
- So much data: how do you pick out the important indicators?
- A centralized error management interface that shows current & past errors
- Identify your app's top or new errors at a glance, in a dedicated dashboard
- Shows error details such as a time chart, occurrences, affected user count, first- and last-seen dates, and a cleaned exception stack trace
- Across-the-board Monitoring
- Dashboards for built-in and customizable visualisations
- Monitoring Features
- Visual Dashboards
- Health Monitoring
- Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
- Service Monitoring
- Set and monitor Service Level Objectives (SLOs), and alert your teams as needed when they are at risk
- SRE Tracking
- Monitoring is critical for SRE
- Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
- Pinpoint SLIs and develop an SLO on top of them
- Operational Management
- Debugging
- Inspects the state of your application at any code location in production without stopping or slowing down requests
- Latency Management
- Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
- Performance Management
- Offers continuous profiling of resource consumption in your production applications along with cost management
- Security Management
- With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud
**What is Ops: Key Takeaways**
- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems, quickly
Scratch:
- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management
#### Clarifying the Stackdriver/Operations Connection
- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops name
**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called Stackdriver <service>)
"StackDriver" lives on - in the exam only
Integration + Upgrades
- Complete UI Integrations
- All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
- Dashboard API
- New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
- Logs can now be retained for up to **10 years** and you have control over the time specified
- Metrics Enhancement
- In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
- Advanced Alert Routing
- Alerts can now be routed to independent systems that support Cloud Pub/Sub
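To make the Pub/Sub routing idea concrete, here is a minimal sketch of what an independent system might do when it receives an alert through a Pub/Sub push subscription. It assumes the usual push envelope (base64-encoded payload under `message.data`) and an alert body containing an `incident` object; the field names `policy_name`, `state` and `summary`, and all sample values, are assumptions for illustration and should be checked against a real notification.

```python
import base64
import json


def decode_alert(push_envelope: dict) -> dict:
    """Extract a few alert fields from a Pub/Sub push envelope.

    Assumes the payload is base64-encoded JSON under message.data and
    that the alert body carries an "incident" object (an assumption).
    """
    raw = base64.b64decode(push_envelope["message"]["data"])
    payload = json.loads(raw)
    incident = payload.get("incident", {})
    return {
        "policy": incident.get("policy_name", "unknown"),
        "state": incident.get("state", "unknown"),
        "summary": incident.get("summary", ""),
    }


# Synthetic envelope; every value here is made up for illustration.
_sample_body = {
    "incident": {
        "policy_name": "High latency",
        "state": "open",
        "summary": "Latency above threshold",
    }
}
sample_envelope = {
    "message": {
        "data": base64.b64encode(json.dumps(_sample_body).encode()).decode()
    }
}

if __name__ == "__main__":
    print(decode_alert(sample_envelope))
```

A receiver like this could forward the decoded fields to a ticketing or chat system that has no native Cloud Monitoring integration.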
#### Operations and SRE: How Do They Relate?
- Lots of questions in Exam on SRE
What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (Founder Google SRE Team)
SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The site reliability workbook
SLOs are tied to your SLIs
- Measured by SLI
- Can be a single target value or range of values
- SLIs <= SLO
- or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9%, 99.99% (4x 9's)
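As a rough illustration of the two SLO forms above (single target versus lower/upper bound), and of what the common percentages mean in practice, the sketch below checks an SLI against its bounds and converts an availability SLO into tolerated downtime. The function names and the 365-day window are assumptions for the example, not part of any Google tooling.

```python
from typing import Optional


def meets_slo(sli: float, upper: Optional[float] = None,
              lower: Optional[float] = None) -> bool:
    """Return True when the SLI sits inside the SLO bounds.

    With only `upper` set this is the single-target form (SLI <= SLO);
    with both set it is the range form (lower <= SLI <= upper).
    """
    if upper is not None and sli > upper:
        return False
    if lower is not None and sli < lower:
        return False
    return True


def allowed_downtime_minutes(slo_percent: float, days: int = 365) -> float:
    """Minutes of unavailability an availability SLO tolerates over `days`."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


if __name__ == "__main__":
    print(meets_slo(250, upper=300))             # single target
    print(meets_slo(0.95, upper=0.9, lower=0.1))  # outside the range
    for slo in (99.5, 99.9, 99.99):
        print(slo, round(allowed_downtime_minutes(slo), 2))
```

Each extra nine cuts the tolerated downtime by a factor of ten, which is why the jump from 99.9% to 99.99% is a much bigger commitment than it looks.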
SLI - Metrics over time which detail the health of a service
- example: `Site homepage request latency < 300ms over last 5 minutes @ 95th percentile`
SLO - Agreed-upon bounds for how often SLIs must be met
- example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
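The two examples above can be sketched in code. This is a simplified model, assuming latency samples are already grouped into 5-minute windows and using a nearest-rank percentile; the helper names and sample data are hypothetical, and Cloud Monitoring computes these values for you in practice.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sli_met(latencies_ms: Sequence[float], threshold_ms: float = 300.0,
            pct: float = 95.0) -> bool:
    """SLI for one window: is the pct-th percentile latency under the threshold?"""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_compliance(windows: Sequence[Sequence[float]]) -> float:
    """Fraction of windows in which the SLI was met."""
    results = [sli_met(window) for window in windows]
    return sum(results) / len(results)


if __name__ == "__main__":
    good = [100.0] * 19 + [500.0]      # one slow request out of 20
    bad = [100.0] * 18 + [500.0] * 2   # two slow requests out of 20
    compliance = slo_compliance([good, good, good, bad])
    print(compliance, compliance >= 0.999)  # compare against a 99.9% SLO
```

The SLI is evaluated per window; the SLO is the target for how many of those windows must pass over the longer period.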
Phases of Service Lifetime
SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases and gives SREs a great opportunity to:
- Measure and track SLIs (Measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with dev team
General Availability Phase
- After Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs
**Ops & SRE: Key Takeaways**
- SRE: Operations from a software engineer
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- An SLI is lower than the SLO - or - in between the upper and lower bounds
- SREs are most active in limited availability and general availability phases
- Cloud Monitoring - Provides visibility into the performance, uptime, and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from GCP
> Gives SRE's the ability to evaluate SLI's and keep on track with SLO's
Help make operations run more smoothly, reliably and efficiently
- Cloud Debugger - When and not if the system encounters issues, lets you inspect the state of a running application in real-time without interference
- Cloud Trace - Maps out exactly how your service is processing the various <u>requests</u> and responses it receives, all while tracking latency
- Cloud Profiler - Keeps an eye on your code's performance - Continuously gathers CPU usage and memory-allocation information from your production applications, looking for bottlenecks
Ops Tools Working Together
**Gather Information**
- Collect signals
- Through Metrics
- Apps
- Services
- Platform
- Microservices
- Logs
- Apps
- Services
- Platform
- Trace
- Apps
**Handle Issues**
- Alerts
- Error Reporting
- SLO
**Troubleshoot**
**Display and Investigate**
- Dashboards
- Health Checks
- Log Viewer
- Service Monitoring
- Trace
- Debugger
- Profiler
5x Main Services Are:
- Cloud Monitoring
- Cloud Logging
- Cloud Debugger
- Cloud Trace
- Cloud Profiler
> All work together to gather info, manage instance and troubleshoot, by allowing you to display the collective signals and analyze them
- Is something wrong that requires immediate action?
Workspace - 'Single pane of glass' for viewing and monitoring data across projects
Installed Agents (optional) - Additional application-specific signal data
Alerts - Notify someone when something needs to be fixed
#### Monitoring Workspaces Concepts
What is a monitoring Workspace?
- GCP's organization tool for monitoring GCP and AWS resources
- All monitoring tools live in a Workspace
- Uptime checks
- Dashboards
- Alerts
- Charts
```
Monitoring Workspace (within a project) ------ (One or More GCP Projects) ------<GCPProjects
```
- Workspace exists in a Host Project - Create project first, then Monitoring Workspace within the project
- Projects can only contain a single Workspace
- Workspace name same as host project (cannot be changed) - Workspace name will be same as `project_id` (worth having a project dedicated to the workspace name)
- Workspace can monitor resources in same projects and/or other projects
- Workspace can monitor multiple projects simultaneously
- A project can be associated with only a single workspace at a time
- Projects can be moved from one workspace to another
- Two different workspaces can be merged into a single workspace
- A Workspace accesses other projects' metric data, but the data 'lives' in those projects
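The rules above (one Workspace per host project, a project belongs to one Workspace, projects can move, Workspaces can merge) can be captured in a toy model. This is purely an illustrative sketch: the class and method names are invented, it does not call any Google Cloud API, and refusing to move a host project is a simplification of the sketch rather than a documented rule.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Workspace:
    """A Workspace named after its host project, plus the projects it monitors."""
    host_project: str
    monitored: Set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        # The host project is always monitored by its own Workspace.
        self.monitored.add(self.host_project)


class WorkspaceRegistry:
    """Toy registry enforcing 'one Workspace per project' (illustrative only)."""

    def __init__(self) -> None:
        self.workspaces: Dict[str, Workspace] = {}
        self.owner: Dict[str, str] = {}  # project ID -> Workspace name

    def create(self, host_project: str) -> Workspace:
        if host_project in self.owner:
            raise ValueError(f"{host_project} already belongs to a Workspace")
        self.workspaces[host_project] = Workspace(host_project)
        self.owner[host_project] = host_project
        return self.workspaces[host_project]

    def add_project(self, workspace: str, project: str) -> None:
        if project in self.owner:
            raise ValueError(f"{project} already belongs to {self.owner[project]}")
        self.workspaces[workspace].monitored.add(project)
        self.owner[project] = workspace

    def move_project(self, project: str, target: str) -> None:
        source = self.owner[project]
        if project == source:
            raise ValueError("this sketch does not move host projects")
        self.workspaces[source].monitored.discard(project)
        self.workspaces[target].monitored.add(project)
        self.owner[project] = target

    def merge(self, source: str, target: str) -> None:
        # Every project from the source joins the target; the source is deleted.
        merged = self.workspaces.pop(source)
        for project in merged.monitored:
            self.workspaces[target].monitored.add(project)
            self.owner[project] = target


if __name__ == "__main__":
    registry = WorkspaceRegistry()
    registry.create("ws-1")
    registry.create("ws-2")
    registry.add_project("ws-2", "app-dev")
    registry.merge("ws-2", "ws-1")
    print(sorted(registry.workspaces["ws-1"].monitored))
```

Note that the model only tracks which Workspace can see a project; the metric data itself never moves, matching the last rule above.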
**Project Grouping Strategies** - No single correct answer
A single workspace can monitor up to 200 projects
- A single workspace for all app-related projects "app1"
- Pro: Single pane of glass for all application/resource data for single app
- Con: Access is too broad if you need to restrict dev/prod teams from viewing data in each other's projects
- 2x Workspaces for viewing projects in each tier e.g. Dev/Prod/QA/Test
- Limited scope so can restrict teams
- More than one place to investigate should you need to
- Single Workspace per project
- Maximum isolation
- More limited view
Monitoring Workspaces IAM roles
- Applied to the workspace's host project to view the added projects' monitoring data
- Monitoring Viewer/Editor/Admin
- Viewer = Read-only access (view metrics)
- Editor = Edit workspace, write-access to monitor console and API
- Admin = Full access including IAM roles
- Monitoring Metric Writer = Service Account role
- Permit writing data to workspace
- Does not provide read access
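For reference, the roles above map onto IAM role IDs that appear in a project's IAM policy. The sketch below builds such a policy fragment as plain data; the member addresses and project name are placeholders, and the snippet only constructs JSON rather than applying anything to a real project.

```python
import json
from typing import Dict, Iterable, List

# Role IDs for the Monitoring roles described in these notes.
MONITORING_ROLES: Dict[str, str] = {
    "viewer": "roles/monitoring.viewer",               # read-only access
    "editor": "roles/monitoring.editor",               # edit workspace, write access
    "admin": "roles/monitoring.admin",                 # full access
    "metric_writer": "roles/monitoring.metricWriter",  # write-only, for service accounts
}


def binding(role_key: str, members: Iterable[str]) -> Dict[str, object]:
    """Build one IAM policy binding for a Monitoring role."""
    return {"role": MONITORING_ROLES[role_key], "members": sorted(members)}


# Placeholder members; replace with real groups and service accounts.
policy: Dict[str, List[Dict[str, object]]] = {
    "bindings": [
        binding("viewer", ["group:dev-team@example.com"]),
        binding(
            "metric_writer",
            ["serviceAccount:agent@my-project.iam.gserviceaccount.com"],
        ),
    ]
}

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```

Giving the agent's service account only the Metric Writer role follows least privilege: it can push data into the workspace but cannot read anything back.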
#### Monitoring Workspaces
- Creating a workspace
- `Operations > Monitoring > Overview`
- Select a project, and workspace is created for that project
- Moving a project to another workspace
- Go to the Workspace that contains the project you want to move
- `Settings > 'GCP Projects' Section > Select 3x Little dots on line of project you want to move > 'Move to another workspace'`
- If you're moving a project where you've created custom dashboards in the current workspace, these will be lost
- Merging workspaces
- Navigate into the workspace that you want to merge into (`ws-1`):
- `Settings > MERGE` and select the workspace (`ws-2`) you want to move into your current workspace (`ws-1`)
- Merge `ws-2` into `ws-1` whereby `ws-2` is deleted as part of the process
Note: Cloud playgrounds will only have a single project
#### Perspective: Workspaces in Context
How can you make all the data easier to manage? e.g. Don't dump everything into one workspace
#### What Are Metrics?
- Workspace provides visibility into everything that is happening in your GCP resources
- Within Workspace, we view information using:
- Metrics
- ...viewed in Charts...
- ...grouped in Dashboards...
What are Metrics?
- Raw data GCP uses to create charts
- Over 1000 pre-created metrics
- Can create custom metrics
- Built in Monitoring API
- OpenCensus - open source library to create metrics
- Best practice: don't create a custom metric where a default metric already exists
Anatomy of a Metric
- Value Types - metric data type
- BOOL - boolean
- INT64 - 64-bit integer
- DOUBLE - Double precision float
- STRING - a string
- Metric Kind - relation of values
- Gauge - Measure specific instant in time (e.g. CPU Utilization)
- Delta - Measure change since last recording (e.g. requests count since last data point)
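A small sketch may help separate the two ideas in the anatomy above: the value type is the data type of each point, while the metric kind says how successive points relate. The classes and sample numbers below are made up for illustration and are not the Monitoring client library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ValueType(Enum):
    BOOL = "BOOL"
    INT64 = "INT64"
    DOUBLE = "DOUBLE"
    STRING = "STRING"


class MetricKind(Enum):
    GAUGE = "GAUGE"  # each point is a reading at one instant
    DELTA = "DELTA"  # each point is the change since the previous point


@dataclass(frozen=True)
class Metric:
    name: str
    kind: MetricKind
    value_type: ValueType


def to_deltas(counter_readings: List[int]) -> List[int]:
    """Turn successive readings of a running total into per-interval deltas."""
    return [later - earlier
            for earlier, later in zip(counter_readings, counter_readings[1:])]


cpu = Metric("cpu_utilization", MetricKind.GAUGE, ValueType.DOUBLE)
requests = Metric("request_count", MetricKind.DELTA, ValueType.INT64)

if __name__ == "__main__":
    # Gauge points are reported as-is; delta points describe change between readings.
    print(cpu, [0.42, 0.57, 0.49])
    print(requests, to_deltas([100, 130, 130, 190]))
```

Here the running request total `[100, 130, 130, 190]` becomes the delta series `[30, 0, 60]`, one value per interval between readings.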
#### Exploring Workspace and Metrics
> Video just does a quick walkthrough of the dashboard and creating a chart with the metrics explorer
#### Monitoring Agent Concepts
- Cloud Monitoring collects lots of metrics with no additional configuration needed
- "Zero config monitoring"
- Examples: CPU utilization, network traffic
- More granular metrics can be collected using an optional monitoring agent
- Memory usage
- 3rd Party app metrics (Nginx, Apache)
- Separate agents for monitoring and logging
- Monitoring Agent = collectd
- Logging Agent = fluentd
Which GCP (and AWS) Services Need Agents?
- Not all compute services require (or even allow installation of) agents
- Services that support agent installation:
- Compute Engine
- AWS EC2 instances (required for all metrics)
- Services which don't support agent installation:
- <u>Everything else</u>
- Managed services either have an agent already installed or simply don't require one
- Google Kubernetes Engine has Cloud Operations for GKE pre-installed
Installing the Agent - General Process
- Add installation location as repo
- Update repos
- Install agent from repo ('apt install ...')
- (Optional) Configure agent for 3rd party application
#### Installing the Monitoring Agent
- Manually install and configure monitoring agent on Apache web server
- Demonstrate how to automate the agent setup process
The below commands will create two instances with a custom web page. The first instance does not have the agent installed; the second does. You will need to allow port 80 on your VPC firewall if it's not already enabled, scoped to your instance tag (command included for reference).
Create the firewall rule, allowing port 80 access on the default VPC (modify if using a custom VPC):
- View web server metrics in Cloud Monitoring Workspace
> When you have the agent installed, you can see 2x new tabs depending on what's been configured. In the example "Agent" & "Apache" are visible with associated metrics that wouldn't otherwise be there
#### Integration with Monitoring API
What is the Monitoring API?
- Manipulate metrics utilizing Google Cloud APIs
- Accessible via external services
- Monitoring dashboards in Grafana
- Export metric data to BigQuery
- Create and share dashboards with programmatic REST and gRPC syntax
- More efficient than creating dashboards by hand from scratch
External Integration Use Cases
- Keep metrics for long-term analysis/storage
- Cloud Monitoring holds metrics for a limited time (six weeks historically; up to 24 months since the retention increase noted earlier)
- Share metric data with other analytic platforms (e.g. Grafana)
How does Metrics Integration Work?
- Short version: Export/Integrate via Cloud Monitoring API
- If exporting metrics:
- Define metrics with metric descriptor (JSON format)
- Export via Monitoring API to BigQuery
- If using 3rd Party service (e.g. Grafana), authenticate to Cloud Monitoring API with service account
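As a sketch of the "define metrics with a metric descriptor (JSON format)" step, the snippet below builds a descriptor for a hypothetical custom gauge metric. The field names follow the Monitoring API v3 `MetricDescriptor` JSON shape; the metric name and description are invented, and the snippet only builds the JSON body rather than sending it to the API.

```python
import json
from typing import Dict

# Value types covered in these notes.
ALLOWED_VALUE_TYPES = {"BOOL", "INT64", "DOUBLE", "STRING"}


def build_gauge_descriptor(metric_name: str, description: str,
                           value_type: str = "DOUBLE") -> Dict[str, str]:
    """Build the JSON body for a custom GAUGE metric descriptor."""
    if value_type not in ALLOWED_VALUE_TYPES:
        raise ValueError(f"unsupported value type: {value_type}")
    return {
        "type": f"custom.googleapis.com/{metric_name}",
        "metricKind": "GAUGE",
        "valueType": value_type,
        "description": description,
        "displayName": metric_name,
    }


# Hypothetical metric used only for illustration.
descriptor = build_gauge_descriptor(
    "checkout/queue_depth", "Orders waiting to be processed", "INT64"
)

if __name__ == "__main__":
    print(json.dumps(descriptor, indent=2))
```

A body like this would be submitted to the API's metric descriptor creation method for the project, authenticated with a service account as described above.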
- "Enable Cloud Operations for GKE" - Simple tickbox to turn on
- GKE natively integrated with both Cloud Monitoring and Logging
- Ability to toggle 'Cloud Operations for GKE' in cluster settings, enabled by default
- Cloud Operations for GKE replaces older 'Legacy Monitoring and Logging'
- Integrates with Prometheus
What K8s Metrics are Collected?
- k8s_master_component
- k8s_cluster
- k8s_node
- k8s_pod
- k8s_container
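The names above are monitored resource types, which is how you would narrow a metrics query to GKE objects. As a rough sketch, the helper below assembles a Monitoring filter string for one of these types; the helper itself and the example cluster name are hypothetical, and the label keys you can use depend on the resource type.

```python
from typing import Tuple

K8S_RESOURCE_TYPES: Tuple[str, ...] = (
    "k8s_master_component",
    "k8s_cluster",
    "k8s_node",
    "k8s_pod",
    "k8s_container",
)


def build_filter(resource_type: str, **labels: str) -> str:
    """Build a Monitoring filter for one GKE resource type and optional labels."""
    if resource_type not in K8S_RESOURCE_TYPES:
        raise ValueError(f"unknown GKE resource type: {resource_type}")
    parts = [f'resource.type = "{resource_type}"']
    for key, value in sorted(labels.items()):
        parts.append(f'resource.labels.{key} = "{value}"')
    return " AND ".join(parts)


if __name__ == "__main__":
    # "demo-cluster" is a placeholder cluster name.
    print(build_filter("k8s_container", cluster_name="demo-cluster"))
```

A filter like this can be pasted into Metrics Explorer or passed to the Monitoring API when listing time series.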
If you need a 'one-click' script to build out a demonstration GKE web application from scratch, copy and paste the below command. It will download and execute a script to build out the environment. The process will take about five minutes to complete: