As Establishing human-actionable alerts

Alex Soul 2021-02-08 18:08:52 +00:00
parent c51d10402b
commit 172df4a0f0
5 changed files with 430 additions and 42 deletions


@ -11,6 +11,10 @@ Google's certifications are tied to a job class analysis.
PCDE are responsible for efficient development operations that can balance service reliability and delivery speed. They are skilled at using GCP to build software delivery pipelines, deploy and monitor services, and manage and learn from incidents.
Useful Links
[1 - Medium Blog](https://sathishvj.medium.com/notes-from-my-google-cloud-professional-devops-engineer-certification-exam-60d23aca37f5)
### What is the business of Software Development?
- Alignment - Operations with Development
@ -29,7 +33,7 @@ PCDE are responsible for efficient development operations that can balance servi
VALUE/COST = ROI (Get as much value as possible for as little cost as we can)
* A 50%-good solution that people actuall have solves more problems and survives longer than a 99% solution that nobody has... Shipping is a feature. A really important feature. Your product must have it! (Joel Spolsky - Co-founder of stack overflow)
* A 50%-good solution that people actually have solves more problems and survives longer than a 99% solution that nobody has... Shipping is a feature. A really important feature. Your product must have it! (Joel Spolsky - Co-founder of stack overflow)
* The fundamental unit of software development is a code _change_
* Every change has:
@ -60,11 +64,11 @@ DevOps is a structure that naturally leads to smaller and smaller change. Devs f
* Address outages
* Recovering from backup
| Dev | Ops |
|--|--|
| Like buying the machine | Like running the machine |
| Judged by features | Judged by availability |
| - Not by quality | - Regardless of system quality |
| Dev | Ops |
| ----------------------- | ------------------------------ |
| Like buying the machine | Like running the machine |
| Judged by features | Judged by availability |
| - Not by quality | - Regardless of system quality |
### What is a DevOps Engineer?
@ -92,7 +96,7 @@ Trainer says that all these definitions are wrong.
* Develop software to automate tasks all throughout the software development cycle
* Not just Ops
* Not just CI/CD
* Defniitely also includes quality management
* Definitely also includes quality management
* An intentional development risk manager
@ -106,21 +110,21 @@ Trainer says that all these definitions are wrong.
Scale
| Problem | Solution |
| -- | -- |
| More users than expected | Architect service for scale |
| Bad actors ( e.g. DDoS) | Design scaling into ops, too |
| Bad handling | Build in protections |
| Bad design / Assumptions | Quality control / assurance |
| Intermittent failures | Code reviews |
| Uncommon events (corner cases) | Automated testing |
| Bad failure handling | Gradual rollouts |
| Code changes | Automated CI/CD (not manual steps) |
| Config changes | Progressive rollouts e.g. canary releases (groups of users) |
| Infrastructure changes | Timely monitoring |
| | Quick response (automatic) |
| | Safe rollbacks (automatic) |
| | Minimizing impact |
| Problem | Solution |
| ------------------------------ | ----------------------------------------------------------- |
| More users than expected | Architect service for scale |
| Bad actors ( e.g. DDoS) | Design scaling into ops, too |
| Bad handling | Build in protections |
| Bad design / Assumptions | Quality control / assurance |
| Intermittent failures | Code reviews |
| Uncommon events (corner cases) | Automated testing |
| Bad failure handling | Gradual rollouts |
| Code changes | Automated CI/CD (not manual steps) |
| Config changes | Progressive rollouts e.g. canary releases (groups of users) |
| Infrastructure changes | Timely monitoring |
| | Quick response (automatic) |
| | Safe rollbacks (automatic) |
| | Minimizing impact |
Tensions
@ -149,7 +153,7 @@ Balance
Management
1.2 Manage service life cycle
- Manage a service (e.g. introduct a new service, deploy it and maintain and retire it)
- Manage a service (e.g. introduce a new service, deploy it, maintain and retire it)
- Plan for capacity (e.g. quotas and limits management - automatic/elastic scaling)
Culture

422
Part_4.md

@ -205,30 +205,414 @@ General Availability Phase
#### Operation Services at a Glance
> 10,000ft view!
- Cloud Monitoring - Provides visibility into the performance, uptime, and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from GCP
> Gives SREs the ability to evaluate SLIs and keep on track with SLOs
Helps make operations run more smoothly, reliably and efficiently
- Cloud Debugger - When (not if) the system encounters issues, lets you inspect the state of a running application in real time without interference
- Cloud Trace - Maps out exactly how your service is processing the various <u>requests</u> and responses it receives, all while tracking latency
- Cloud Profiler - Keeps an eye on your code's performance - Continuously gathers CPU usage and memory-allocation information from your production applications, looking for bottlenecks
Ops Tools Working Together
**Gather Information**
- Collect signals
- Through Metrics
- Apps
- Services
- Platform
- Microservices
- Logs
- Apps
- Services
- Platform
- Trace
- Apps
**Handle Issues**
- Alerts
- Error Reporting
- SLO
**Troubleshoot**
**Display and Investigate**
- Dashboards
- Health Checks
- Log Viewer
- Service Monitoring
- Trace
- Debugger
- Profiler
5x Main Services Are:
- Cloud Monitoring
- Cloud Logging
- Cloud Debugger
- Cloud Trace
- Cloud Profiler
> All work together to gather information, handle issues and troubleshoot, by allowing you to display the collected signals and analyze them
#### Section Review
Monitoring, troubleshooting and improving application performance.
#### Milestone: The Weight of the World (Teamwork, Not Superheroes)
### Monitoring Your Operations
#### Section Introduction
#### Cloud Monitoring Concepts
> Measures key aspects of your services
- Captures resource and application signal data
- Metrics
- Events
- Metadata
What questions does Cloud Monitoring answer?
- How well are my resources performing?
- Are my applications meeting their SLAs?
- Is something wrong that requires immediate action?
Workspace - 'Single pane of glass' for viewing and monitoring data across projects
Installed Agents (optional) - Additional application-specific signal data
Alerts - Notify someone when something needs to be fixed
#### Monitoring Workspaces Concepts
What is a monitoring Workspace?
- GCP's organization tool for monitoring GCP and AWS resources
- All monitoring tools live in a Workspace
- Uptime checks
- Dashboards
- Alerts
- Charts
```
Monitoring Workspace (within a project) ------ (One or More GCP Projects) ------< GCP Projects
```
- Workspace exists in a Host Project - Create project first, then Monitoring Workspace within the project
- Projects can only contain a single Workspace
- Workspace name same as host project (cannot be changed) - Workspace name will be same as `project_id` (worth having a project dedicated to the workspace name)
- Workspace can monitor resources in same projects and/or other projects
- Workspace can monitor multiple projects simultaneously
- Projects associated with a single workspace
- Projects can be moved from one workspace to another
- Two different workspaces can be merged into a single workspace
- Workspace accesses other projects' metric data, but the data 'lives' in those projects
**Project Grouping Strategies** - No single correct answer
Single workspace can monitor up to 200 projects
- A single workspace for all app-related projects "app1"
- Pro: Single pane of glass for all application/resource data for single app
- Con: Access is too broad if you need to restrict dev/prod teams from viewing data in each other's projects
- 2x Workspaces for viewing projects in each tier e.g. Dev/Prod/QA/Test
- Limited scope so can restrict teams
- More than one place to investigate should you need to
- Single Workspace per project
- Maximum isolation
- More limited view
Monitoring Workspaces IAM roles
- Applied to the workspace host project to view added projects' monitoring data
- Monitoring Viewer/Editor/Admin
- Viewer = Read-only access (view metrics)
- Editor = Edit workspace, write access to Monitoring console and API
- Admin = Full access including IAM roles
- Monitoring Metric Writer = Service Account role
- Permit writing data to workspace
- Does not provide read access
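As a rough sketch of how these roles might be granted from the command line, the function below wraps `gcloud projects add-iam-policy-binding`. The function name and its arguments are hypothetical and only for illustration; the role IDs it builds (`roles/monitoring.viewer`, `roles/monitoring.editor`, `roles/monitoring.admin`, `roles/monitoring.metricWriter`) are the predefined Monitoring roles listed above.

```shell
# Grant a predefined Monitoring role on the workspace host project.
# Usage: grant_monitoring_role HOST_PROJECT_ID MEMBER ROLE_SUFFIX
#   MEMBER      e.g. user:alice@example.com or serviceAccount:NAME@PROJECT.iam.gserviceaccount.com
#   ROLE_SUFFIX one of: viewer, editor, admin, metricWriter
grant_monitoring_role() {
  local project="$1" member="$2" role_suffix="$3"
  case "$role_suffix" in
    viewer|editor|admin|metricWriter) ;;
    *) echo "unknown Monitoring role: $role_suffix" >&2; return 1 ;;
  esac
  gcloud projects add-iam-policy-binding "$project" \
    --member="$member" \
    --role="roles/monitoring.${role_suffix}"
}
```

For example, `grant_monitoring_role my-host-project user:alice@example.com viewer` would give a (made-up) user read-only access, while a service account that only pushes metrics would get `metricWriter`.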
#### Monitoring Workspaces
- Creating a workspace
- `Operations > Monitoring > Overview`
- Select a project, and workspace is created for that project
- Adding projects to a workspace
- Within Overview > `Settings > GCP Projects Section > Add GCP Projects`
- Moving projects between workspaces
- Go to the Workspace containing the project you want to move
- `Settings > 'GCP Projects' Section > Select 3x Little dots on line of project you want to move > 'Move to another workspace'`
- If you're moving a project where you've created custom dashboards in the current workspace, these will be lost
- Merging workspaces
- Navigate into the workspace that you want to merge into (`ws-1`):
- `Settings > MERGE` and select the workspace (`ws-2`) you want to move into your current workspace (`ws-1`)
- Merge `ws-2` into `ws-1` whereby `ws-2` is deleted as part of the process
Note: Cloud playgrounds will only have a single project
#### Perspective: Workspaces in Context
How can you make all the data easier to manage? e.g. Don't dump everything into one workspace
#### What Are Metrics?
- Workspace provides visibility into everything that is happening in your GCP resources
- Within Workspace, we view information using:
- Metrics
- ...viewed in Charts...
- ...grouped in Dashboards...
What are Metrics?
- Raw data GCP uses to create charts
- Over 1000 pre-created metrics
- Can create custom metrics
- Built in Monitoring API
- OpenCensus - open source library to create metrics
- Best practice: don't create a custom metric where a default metric already exists
Anatomy of a Metric
- Value Types - metric data type
- BOOL - boolean
- INT64 - 64-bit integer
- DOUBLE - Double precision float
- STRING - a string
- Metric Kind - relation of values
- Gauge - Measure specific instant in time (e.g. CPU Utilization)
- Delta - Measure change since last recording (e.g. requests count since last data point)
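To make the Delta kind concrete, here is a small sketch that derives delta values from a series of readings. The sample numbers are invented for illustration and nothing here calls Cloud Monitoring. A gauge would simply report each reading as-is, whereas a delta reports the change since the previous data point.

```shell
# Hypothetical request-counter readings, one sample per minute.
samples="100 130 130 190"

# Delta: the change since the previous data point (the first sample has no predecessor).
deltas=$(echo "$samples" | tr ' ' '\n' |
  awk 'NR > 1 { printf "%s%d", sep, $1 - prev; sep = " " } { prev = $1 }')

echo "deltas=$deltas"   # prints: deltas=30 0 60
```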
#### Exploring Workspace and Metrics
> Video just does a quick walkthrough of the dashboard and creating a chart with the metrics explorer
#### Monitoring Agent Concepts
- Cloud Monitoring collects lots of metrics with no additional configuration needed
- "Zero config monitoring"
- Examples: CPU utilization, network traffic
- More granular metrics can be collected using an optional monitoring agent
- Memory usage
- 3rd Party app metrics (Nginx, Apache)
- Separate agents for monitoring and logging
- Monitoring Agent = collectd
- Logging Agent = fluentd
Which GCP (and AWS) Services Need Agents?
- Not all compute services require (or even allow installation of) agents
- Services that support agent installation:
- Compute Engine
- AWS EC2 instances (required for all metrics)
- Services which don't support agent installation:
- <u>Everything else</u>
- Managed services either have an agent already installed or simply don't require one
- Google Kubernetes Engine has Cloud Operations for GKE pre-installed
Installing the Agent - General Process
- Add installation location as repo
- Update repos
- Install agent from repo ('apt install ...')
- (Optional) Configure agent for 3rd party application
#### Installing the Monitoring Agent
- Manually install and configure monitoring agent on Apache web server
- Demonstrate how to automate the agent setup process
The commands below create instances that serve a custom web page. The first instance does not have the agent installed; the second does. You will need to allow port 80 on your VPC firewall if it's not already enabled, scoped to your instance tag (command included for reference).
Create the firewall rule, allowing port 80 access on the default VPC (modify if using a custom VPC):
```
gcloud compute firewall-rules create default-allow-http --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http-server
```
Create a web server without an agent:
```
gcloud beta compute instances create website-agent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-script.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-agent
```
Create web server and install an agent: (The Automated Example)
```
gcloud beta compute instances create website-agent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-with-agent.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-agent
```
To manually install the agent (with Apache configuration) on an instance:
- Add the agent's package repository
```
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh
sudo apt-get update
```
- To install the latest version of the agent, run:
```
sudo apt-get install stackdriver-agent
```
- To verify that the agent is working as expected, run:
```
sudo service stackdriver-agent status
```
- On your VM instance, download apache.conf and place it in the directory /opt/stackdriver/collectd/etc/collectd.d/:
```
(cd /opt/stackdriver/collectd/etc/collectd.d/ && sudo curl -O https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/apache.conf)
```
- Restart the monitoring agent:
```
sudo service stackdriver-agent restart
```
#### Collecting Monitoring Agent Metrics
- Generate traffic to our Apache web server
- View web server metrics in Cloud Monitoring Workspace
> When you have the agent installed, you can see 2x new tabs depending on what's been configured. In the example "Agent" & "Apache" are visible with associated metrics that wouldn't otherwise be there
#### Integration with Monitoring API
What is the Monitoring API?
- Manipulate metrics utilizing Google Cloud APIs
- Accessible via external services
- Monitoring dashboards in Grafana
- Export metric data to BigQuery
- Create and share dashboards with programmatic REST and gRPC syntax
- More efficient than building dashboards by hand from scratch
External Integration Use Cases
- Keep metrics for long-term analysis/storage
- Cloud Monitoring holds metrics for six weeks
- Share metric data with other analytic platforms (e.g. Grafana)
How does Metrics Integration Work?
- Short version: Export/Integrate via Cloud Monitoring API
- If exporting metrics:
- Define metrics with metric descriptor (JSON format)
- Export via Monitoring API to BigQuery
- If using 3rd Party service (e.g. Grafana), authenticate to Cloud Monitoring API with service account
![Example BigQuery Export](./img/example_big_query_export.png)
<br>
![Metric Descriptor json](./img/metric_json_descriptor.png)
<br>
![HTTPS REST API call for Grafana](./img/api_format_grafana.png)
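As a minimal sketch of what reading metric data through the Monitoring API can look like, the function below calls the v3 `timeSeries` list endpoint with `curl`. The function name and its arguments are hypothetical. It also assumes a user token from `gcloud auth print-access-token` for quick experiments, whereas a third-party service such as Grafana would authenticate with a service account as noted above.

```shell
# Read raw time series for one metric type through the Monitoring API (v3).
# Usage: list_time_series PROJECT_ID METRIC_TYPE START_TIME END_TIME
#   START_TIME / END_TIME are RFC 3339 timestamps, e.g. 2021-02-08T17:00:00Z
list_time_series() {
  local project="$1" metric_type="$2" start_time="$3" end_time="$4"
  # -G sends the url-encoded fields below as query parameters on a GET request.
  curl -sS -G \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode "filter=metric.type=\"${metric_type}\"" \
    --data-urlencode "interval.startTime=${start_time}" \
    --data-urlencode "interval.endTime=${end_time}" \
    "https://monitoring.googleapis.com/v3/projects/${project}/timeSeries"
}
```

A call such as `list_time_series my-project compute.googleapis.com/instance/cpu/utilization 2021-02-08T17:00:00Z 2021-02-08T18:00:00Z` (project ID made up) returns JSON that could then be loaded into another analytics platform.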
Programmatically Create Dashboards
- Create and export dashboard via Monitoring API
- Dashboard configuration represented in JSON format
- Create dashboards and configurations via `gcloud` command or directly via REST API
- `gcloud monitoring dashboards create --config-from-file=[file_name.json]`
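For a sense of what such a JSON configuration can contain, here is a minimal sketch that writes a one-chart dashboard definition to a file. The file name, display name, and chart title are made up for illustration; the field names follow the Dashboards API layout (`gridLayout`, `widgets`, `xyChart`) as I understand it, so treat it as a starting point rather than a reference.

```shell
# Write a minimal dashboard definition: one line chart of VM CPU utilization.
# The file name and display names are hypothetical.
cat > cpu-dashboard.json <<'EOF'
{
  "displayName": "Example CPU dashboard",
  "gridLayout": {
    "columns": "1",
    "widgets": [
      {
        "title": "VM CPU utilization",
        "xyChart": {
          "dataSets": [
            {
              "timeSeriesQuery": {
                "timeSeriesFilter": {
                  "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
                  "aggregation": {
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN"
                  }
                }
              },
              "plotType": "LINE"
            }
          ]
        }
      }
    ]
  }
}
EOF
```

The resulting file can be passed to the `gcloud monitoring dashboards create --config-from-file=` command shown above, and kept under source control alongside the rest of the project.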
#### Create Dashboards with Command Line
- Create custom dashboards via command with JSON config files
- Use either the gcloud command or the REST API directly
- Export current dashboard into JSON configuration file
> This is cool, because we can create and hold our dashboards under source control
- Download configuration file
```
wget https://raw.githubusercontent.com/GoogleCloudPlatform/monitoring-dashboard-samples/master/dashboards/compute/gce-vm-instance-monitoring.json
```
- Create dashboard with gcloud command and config file:
```
gcloud monitoring dashboards create --config-from-file=gce-vm-instance-monitoring.json
```
- Do same thing via REST API
```
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
"https://monitoring.googleapis.com/v1/projects/YOUR-PROJECT-ID-HERE/dashboards" -d @gce-vm-instance-monitoring.json
```
- Export current dashboard
Create shell variables for Project ID and Project Number
```
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")
```
- Export dashboard using above variables. Substitute your dashboard ID and exported file name where appropriate
Export current dashboard by dashboard ID
```
gcloud monitoring dashboards describe \
projects/$PROJECT_NUMBER/dashboards/$DASH_ID --format=json > your-file.json
```
#### GKE Metrics
- "Enable Cloud Operations for GKE" - Simple tickbox to turn on
- GKE natively integrated with both Cloud Monitoring and Logging
- Ability to toggle 'Cloud Operations for GKE' in cluster settings, enabled by default
- Cloud Operations for GKE replaces older 'Legacy Monitoring and Logging'
- Integrates with Prometheus
What K8s Metrics are Collected?
- k8s_master_component
- k8s_cluster
- k8s_node
- k8s_pod
- k8s_container
If you need a 'one-click' script to build out a demonstration GKE web application from scratch, copy and paste the below command. It will download and execute a script to build out the environment. The process will take about five minutes to complete:
```
wget https://raw.githubusercontent.com/linuxacademy/content-gcpro-devops-engineer/master/scripts/quick-deploy-to-gke.sh
source quick-deploy-to-gke.sh
```
#### Perspective: What's Up, Doc?
- Uptime checks are very valuable
#### Uptime Checks
What are Uptime Checks?
- Periodic requests sent to a monitored resource to check whether it responds (i.e. is "up")
- Check uptime of:
- VMs
- App Engine services
- Website URLs
- AWS Load Balancer
- Create uptime check via Cloud Monitoring
- Optionally, create an alert to notify you if the uptime check fails
- IMPORTANT: Uptime checks are subject to firewall access
- Allow access to the uptime check IP range
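A firewall rule for the uptime checkers could be sketched as below, reusing the same flags as the port 80 rule created earlier in these notes. The function name and the rule name `allow-uptime-checks` are hypothetical. The source ranges are passed in rather than hard-coded because the uptime-check IP list changes over time and should be taken from the current list published by Cloud Monitoring.

```shell
# Allow uptime-check probes to reach instances tagged http-server on port 80.
# Usage: allow_uptime_checks NETWORK SOURCE_RANGES
#   SOURCE_RANGES is a comma-separated CIDR list taken from the current
#   uptime-check IP list (not hard-coded here because it changes over time).
allow_uptime_checks() {
  local network="$1" source_ranges="$2"
  gcloud compute firewall-rules create allow-uptime-checks \
    --direction=INGRESS \
    --network="$network" \
    --action=ALLOW \
    --rules=tcp:80 \
    --source-ranges="$source_ranges" \
    --target-tags=http-server
}
```

For example, `allow_uptime_checks default 203.0.113.0/24` uses a documentation-only placeholder range; substitute the real ranges before relying on the rule.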
#### Establishing Human-Actionable and Automated Alerts
Why do we care about Alerts?
- Sometimes, things break
- No one wants to endlessly stare at dashboards for something to go wrong
- Solution: Use Alerting Policy to notify you if something goes wrong
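An alerting policy can also be described as a JSON file and created from the command line. The sketch below is an assumption-laden example rather than anything shown in the course: the file name, display names, the 80% threshold, and the helper function are all invented, and it relies on the `gcloud alpha monitoring policies create` command, which sits in the alpha command group and may change.

```shell
# Write a hypothetical alerting policy: trigger when mean VM CPU utilization
# stays above 80% (0.8) for five minutes.
cat > cpu-alert-policy.json <<'EOF'
{
  "displayName": "Example: high CPU utilization",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "CPU above 80% for 5 minutes",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" AND resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "300s",
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_MEAN"
          }
        ]
      }
    }
  ]
}
EOF

# Create the policy from a JSON file (alpha command group, may change).
create_alert_policy() {
  gcloud alpha monitoring policies create --policy-from-file="$1"
}
```

Running `create_alert_policy cpu-alert-policy.json` would create the policy. With no notification channel attached it only opens incidents in the console, so a channel is still needed before anyone is actually notified.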
#### Section Review
#### Milestone: Spies Everywhere! (Check Those Vitals!)
Monitoring Your Operations
Section Introduction
Cloud Monitoring Concepts
Monitoring Workspaces Concepts
Monitoring Workspaces
Perspective: Workspaces in Context
What Are Metrics?
Exploring Workspace and Metrics
Monitoring Agent Concepts
Installing the Monitoring Agent
Collecting Monitoring Agent Metrics
Integration with Monitoring API
Create Dashboards with Command Line
GKE Metrics
Perspective: What's Up, Doc?
Uptime Checks
Establishing Human-Actionable and Automated Alerts
Section Review
Milestone: Spies Everywhere! (Check Those Vitals!)
Hands-On Lab:
Install and Configure Monitoring Agent with Google Cloud Monitoring
Logging Activities

BIN
img/api_format_grafana.png Normal file
