Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
Introduction
About the Course and Learning Path
Make better software, faster
Milestone: Getting Started
Understanding Operations in Context
Section Introduction
What Is Ops?
GCP Defined
- "Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"
Logging Management
- Gather logs, metrics, and traces everywhere
- Audit, Platform, and User logs
- Export, discard, or ingest
Error Reporting
- So much data; how do you pick out the important indicators?
- A centralized error management interface that shows current and past errors
- Identify your app's top or new errors at a glance, in a dedicated dashboard
- Shows error details such as a time chart, occurrences, affected user count, first- and last-seen dates, and a cleaned exception stack trace
Across-the-board Monitoring
- Monitoring Features
  - Visual Dashboards
    - Dashboards for built-in and customizable visualizations
  - Health Monitoring
    - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
  - Service Monitoring
    - Set, monitor, and alert your teams as needed based on Service Level Objectives (SLOs)
SRE Tracking
- Monitoring is critical for SRE
- Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
- Pinpoint SLIs and develop an SLO on top of them
Operational Management
- Debugging
  - Inspects the state of your application at any code location in production without stopping or slowing down requests
- Latency Management
  - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
- Performance Management
  - Offers continuous profiling of resource consumption in your production applications, along with cost management
- Security Management
  - With audit logs, you have near real-time visibility into user activity across all your applications on Google Cloud
What is Ops: Key Takeaways
- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems, quickly
Scratch:
- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management
Clarifying the Stackdriver/Operations Connection
- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops name
Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs (formerly all called Stackdriver)
"StackDriver" lives on - in the exam only
Integration + Upgrades
- Complete UI Integration
  - All of Stackdriver's functionality, and more, is now integrated into the Google Cloud console
- Dashboard API
  - New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
  - Logs can now be retained for up to 10 years, and you have control over the time specified
- Metrics Enhancement
  - In Cloud Monitoring, metrics are kept for up to 24 months, and write granularity has been increased to 10 seconds (metrics can be written out every 10 seconds)
- Advanced Alert Routing
  - Alerts can now be routed to independent systems that support Cloud Pub/Sub
Operations and SRE: How Do They Relate?
- Lots of questions in Exam on SRE
What is SRE? - "SRE is what happens when a software engineer is tasked with what used to be called operations" (Ben Treynor Sloss, founder of Google's SRE team)
Pillars of DevOps
- Accept failure as normal:
  - Try to anticipate, but...
  - Incidents are bound to occur
  - Failures help the team learn
- No-fault postmortems & SLOs:
  - No two failures are the same
  - Track incidents (SLIs)
  - Map to objectives (SLOs)
- Implement gradual change:
  - Small updates are better
  - Easier to review
  - Easier to roll back
- Reduce costs of failures:
  - Limited "canary" rollouts
  - Impact the fewest users
  - Automate where possible
- Measure everything:
  - Critical gauge of success
  - CI/CD needs full monitoring
  - Synthetic, proactive monitoring
- Measure toil and reliability:
  - Key to SLOs and SLAs
  - Reduce toil, increase engineering
  - Monitor it all over time
SLI: "A carefully defined quantitative measure of some aspect of the level of service that is provided"
SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing
Example SLIs:
- Request Latency: how long it takes to return a response to a request
- Failure Rate: a fraction of all requests received (unsuccessful requests / all requests)
- Batch Throughput: proportion of time in which the data processing rate exceeds a threshold
Commit to Memory - Google's 4x Golden Signals!
- Latency
  - The time it takes for your service to fulfill a request
- Errors
  - The rate at which your service fails
- Traffic
  - How much demand is directed at your service
- Saturation
  - A measure of how close to fully utilized the service's resources are
LETS
SLO: "Service level objectives (SLOs) specify a target level for the reliability of your service" - The site reliability workbook
SLOs are tied to your SLIs
- Measured by SLIs
- Can be a single target value or a range of values
- SLI <= SLO
- or
- lower bound <= SLI <= upper bound
- Common SLOs: 99.5%, 99.9%, 99.99% ("four nines") - see the error-budget sketch below
SLI - metric over time which details the health of a service
- Example:
  - Site homepage latency: requests < 300ms over the last 5 minutes at the 95th percentile
SLO - agreed-upon bounds for how often SLIs must be met
- Example:
  - The 95th-percentile homepage latency SLI will succeed 99.9% of the time over the next year
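A quick worked example of the error budget implied by an SLO (a minimal sketch; the numbers are illustrative):

```python
# Error budget implied by a 99.9% SLO over a 30-day window (illustrative numbers)
slo = 0.999
window_minutes = 30 * 24 * 60              # 43,200 minutes in the window
error_budget_minutes = (1 - slo) * window_minutes
print(f"{error_budget_minutes:.1f} minutes of allowed unreliability")  # 43.2
```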
Phases of Service Lifetime
SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta releases, and provides SREs a great opportunity to:
- Measure and track SLIs (Measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with dev team
General Availability Phase
- After Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs
Ops & SRE: Key Takeaways
- SRE: Operations from a software engineer
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- SLIs are lower than the SLO - or - between the upper and lower bounds
- SREs are most active in limited availability and general availability phases
Operation Services at a Glance
10,000ft view!
- Cloud Monitoring - Provides visibility into the performance, uptime, and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from GCP
Gives SREs the ability to evaluate SLIs and keep on track with SLOs
Helps make operations run more smoothly, reliably, and efficiently
- Cloud Debugger - When (not if) the system encounters issues, lets you inspect the state of a running application in real time without interference
- Cloud Trace - Maps out exactly how your service is processing the various requests and responses it receives, all while tracking latency
- Cloud Profiler - Keeps an eye on your code's performance; continuously gathers CPU usage and memory-allocation information from your production applications, looking for bottlenecks
Ops Tools Working Together
Gather Information
- Collect signals through:
  - Metrics
    - Apps
    - Services
    - Platform
    - Microservices
  - Logs
    - Apps
    - Services
    - Platform
  - Trace
    - Apps
Handle Issues
- Alerts
- Error Reporting
- SLO
Troubleshoot
Display and Investigate
- Dashboards
- Health Checks
- Log Viewer
- Service Monitoring
- Trace
- Debugger
- Profiler
5x Main Services Are:
- Cloud Monitoring
- Cloud Logging
- Cloud Debugger
- Cloud Trace
- Cloud Profiler
All work together to gather info, manage instances, and troubleshoot, by allowing you to display the collective signals and analyze them
Section Review
Monitoring, troubleshooting, and improving application performance.
Milestone: The Weight of the World (Teamwork, Not Superheroes)
Monitoring Your Operations
Section Introduction
Cloud Monitoring Concepts
Measures key aspects of your services
- Captures resource and application signal data
- Metrics
- Events
- Metadata
What questions does Cloud Monitoring answer?
- How well are my resources performing?
- Are my applications meeting their SLAs?
- Is something wrong that requires immediate action?
Workspace - 'single pane of glass' for viewing and monitoring data across projects
Installed Agents (optional) - additional application-specific signal data
Alerts - notify someone when something needs to be fixed
Monitoring Workspaces Concepts
What is a monitoring Workspace?
- GCP's organization tool for monitoring GCP and AWS resources
- All monitoring tools live in a Workspace
  - Uptime checks
  - Dashboards
  - Alerts
  - Charts
Monitoring Workspace (within a host project) ------< one or more GCP Projects
- Workspace exists in a Host Project - create the project first, then the Monitoring Workspace within it
- Projects can only contain a single Workspace
- Workspace name is the same as the host project's project_id and cannot be changed (worth having a dedicated project just for the workspace)
- Workspace can monitor resources in its own project and/or other projects
- Workspace can monitor multiple projects simultaneously
- Projects are associated with a single workspace
- Projects can be moved from one workspace to another
- Two different workspaces can be merged into a single workspace
- Workspace accesses other projects' metric data, but the data 'lives' in those projects
Project Grouping Strategies - No single correct answer
A single workspace can monitor up to 200 projects
- A single workspace for all app-related projects, e.g. "app1"
  - Pro: single pane of glass for all application/resource data for a single app
  - Con: access is too broad if you need to restrict dev/prod teams from viewing data in each other's projects
- 2x workspaces for viewing projects in each tier, e.g. Dev/Prod/QA/Test
  - Limited scope, so you can restrict teams
  - More than one place to investigate should you need to
- Single Workspace per project
- Maximum isolation
- More limited view
Monitoring Workspaces IAM roles
- Applied to the workspace project to view added projects' monitoring data
- Monitoring Viewer/Editor/Admin
- Viewer = Read-only access (view metrics)
- Editor = Edit workspace, write-access to monitor console and API
- Admin = Full access including IAM roles
- Monitoring Metric Writer = Service Account role
- Permit writing data to workspace
- Does not provide read access
Monitoring Workspaces
- Creating a workspace
  - Operations > Monitoring > Overview - select a project, and a workspace is created for that project
- Adding projects to a workspace
  - Within Overview > Settings > GCP Projects section > Add GCP Projects
- Moving projects between workspaces
  - Go to the workspace containing the project you want to move
  - Settings > GCP Projects section > select the three dots on the project's line > 'Move to another workspace'
  - If you've created custom dashboards for the project in the current workspace, these will be lost
- Merging workspaces
  - Navigate into the workspace you want to merge into (ws-1): Settings > MERGE, then select the workspace (ws-2) you want to move into your current workspace (ws-1)
  - Merging ws-2 into ws-1 deletes ws-2 as part of the process
Note: Cloud playgrounds will only have a single project
Perspective: Workspaces in Context
How can you make all the data easier to manage? e.g. Don't dump everything into one workspace
What Are Metrics?
- Workspace provides visibility into everything that is happening in your GCP resources
- Within Workspace, we view information using:
- Metrics
- ...viewed in Charts...
- ...grouped in Dashboards...
What are Metrics?
- Raw data GCP uses to create charts
- Over 1000 pre-created metrics
- Can create custom metrics
- Built-in Monitoring API
- OpenCensus - open source library to create metrics
- Best practice: don't create a custom metric where a default metric already exists
Anatomy of a Metric
- Value Types - the metric's data type (see the sketch below)
  - BOOL - boolean
  - INT64 - 64-bit integer
  - DOUBLE - double-precision float
  - STRING - a string
- Metric Kind - the relation of values
  - Gauge - measures a specific instant in time (e.g. CPU utilization)
  - Delta - measures change since the last recording (e.g. request count since the last data point)
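As a concrete illustration of value types and metric kinds, here is a minimal sketch of writing a single INT64 gauge point to a custom metric with the google-cloud-monitoring Python client (the project ID, metric name, and value are hypothetical):

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project = "projects/my-project-id"  # hypothetical project ID

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/queue_depth"  # hypothetical custom metric
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
# Gauge semantics: one INT64 value measured at a specific instant
point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})
series.points = [point]

client.create_time_series(name=project, time_series=[series])
```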
Exploring Workspace and Metrics
Video just does a quick walkthrough of the dashboard and creating a chart with the metrics explorer
Monitoring Agent Concepts
- Cloud Monitoring collects lots of metrics with no additional configuration needed
  - "Zero-config monitoring"
  - Examples: CPU utilization, network traffic
- More granular metrics can be collected using an optional monitoring agent
  - Memory usage
  - 3rd-party app metrics (Nginx, Apache)
- Separate agents exist for monitoring and logging
  - Monitoring Agent = collectd
  - Logging Agent = fluentd
Which GCP (and AWS) Services Need Agents?
- Not all compute services require (or even allow installation of) agents
- Services that support agent installation:
- Compute Engine
- AWS EC2 instances (required for all metrics)
- Services which don't support agent installation:
- Everything else
- Managed services either have an agent already installed or simply don't require one
- Google Kubernetes Engine has Cloud Operations for GKE pre-installed
Installing the Agent - General Process
- Add installation location as repo
- Update repos
- Install agent from repo ('apt install ...')
- (Optional) Configure agent for 3rd party application
Installing the Monitoring Agent
- Manually install and configure monitoring agent on Apache web server
- Demonstrate how to automate the agent setup process
The commands below create an instance with a custom web page. The first instance does not have the agent installed; the second does. (The first instance is named website-noagent here so the two create commands don't collide on the same instance name.) You will need to allow port 80 on your VPC firewall if it's not already enabled, scoped to your instance tag (command included for reference).
Create the firewall rule, allowing port 80 access on the default VPC (modify if using a custom VPC):
gcloud compute firewall-rules create default-allow-http --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http-server
Create a web server without an agent:
gcloud beta compute instances create website-noagent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-script.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-noagent
Create web server and install an agent: (The Automated Example)
gcloud beta compute instances create website-agent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-with-agent.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-agent
To manually install the agent (with Apache configuration) on an instance:
- Add the agent's package repository
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh
sudo apt-get update
- To install the latest version of the agent, run:
sudo apt-get install stackdriver-agent
- To verify that the agent is working as expected, run:
sudo service stackdriver-agent status
- On your VM instance, download apache.conf and place it in the directory /opt/stackdriver/collectd/etc/collectd.d/:
(cd /opt/stackdriver/collectd/etc/collectd.d/ && sudo curl -O https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/apache.conf)
- Restart the monitoring agent:
sudo service stackdriver-agent restart
Collecting Monitoring Agent Metrics
- Generate traffic to our Apache web server
- View web server metrics in Cloud Monitoring Workspace
With the agent installed, you can see 2x new tabs, depending on what's been configured. In this example, "Agent" and "Apache" are visible, with associated metrics that wouldn't otherwise be there.
Integration with Monitoring API
What is the Monitoring API?
- Manipulate metrics utilizing Google Cloud APIs
- Accessible via external services
  - Monitoring dashboards in Grafana
  - Export metric data to BigQuery
- Create and share dashboards with programmatic REST and gRPC syntax
  - More efficient than hand-creating dashboards from scratch
External Integration Use Cases
- Keep metrics for long-term analysis/storage
  - Cloud Monitoring only retains metrics for a limited period (up to 24 months, as noted earlier; formerly six weeks)
- Share metric data with other analytic platforms (e.g. Grafana)
How does Metrics Integration Work?
- Short version: export/integrate via the Cloud Monitoring API (see the sketch below)
- If exporting metrics:
- Define metrics with metric descriptor (JSON format)
- Export via Monitoring API to BigQuery
- If using 3rd Party service (e.g. Grafana), authenticate to Cloud Monitoring API with service account
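A minimal sketch of the export side: pulling recent metric data through the Monitoring API with the Python client, ready to hand off to BigQuery or another system (the project ID is hypothetical):

```python
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

# Pull the last hour of CPU utilization samples
results = client.list_time_series(
    request={
        "name": "projects/my-project-id",  # hypothetical project ID
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(series.resource.labels["instance_id"], point.value.double_value)
```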
Programmatically Create Dashboards
- Create and export dashboard via Monitoring API
- Dashboard configuration represented in JSON format
- Create dashboards and configurations via the gcloud command or directly via the REST API:
gcloud monitoring dashboards create --config-from-file=[file_name.json]
Create Dashboards with Command Line
- Create custom dashboards via command with JSON config files
- Use either the gcloud command or the REST API directly
- Export current dashboard into JSON configuration file
This is cool, because we can create and hold our dashboards under source control
- Download configuration file
wget https://raw.githubusercontent.com/GoogleCloudPlatform/monitoring-dashboard-samples/master/dashboards/compute/gce-vm-instance-monitoring.json
- Create dashboard with gcloud command and config file:
gcloud monitoring dashboards create --config-from-file=gce-vm-instance-monitoring.json
- Do same thing via REST API
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
https://monitoring.googleapis.com/v1/projects/(YOUR-PROJECT-ID-HERE)/dashboards -d @gce-vm-instance-monitoring.json
- Export current dashboard
Create shell variables for Project ID and Project Number
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')
export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")
- Export the current dashboard by dashboard ID, using the variables above; substitute your dashboard ID and exported file name where appropriate:
gcloud monitoring dashboards describe \
projects/$PROJECT_NUMBER/dashboards/$DASH_ID --format=json > your-file.json
GKE Metrics
- "Enable Cloud Operations for GKE" - Simple tickbox to turn on
_ GKE natively integrated with both Cloud Monitoring and Logging
- Ability to toggle 'Cloud Operations for GKE' in cluster settings, enabled by default
- Cloud Operations for GKE replaces older 'Legacy Monitoring and Logging'
- Integrates with Prometheus
What K8s Metrics Are Collected?
- k8s_master_component
- k8s_cluster
- k8s_node
- k8s_pod
- k8s_container
If you need a 'one-click' script to build out a demonstration GKE web application from scratch, copy and paste the below command. It will download and execute a script to build out the environment. The process will take about five minutes to complete:
wget https://raw.githubusercontent.com/linuxacademy/content-gcpro-devops-engineer/master/scripts/quick-deploy-to-gke.sh
source quick-deploy-to-gke.sh
Perspective: What's Up, Doc?
- Uptime checks are very valuable
Uptime Checks
What are Uptime Checks?
- Periodic requests sent to a monitored resource, waiting for a response (the resource is "up")
- Check uptime of:
  - VMs
  - App Engine services
  - Website URLs
  - AWS Load Balancers
- Create uptime checks via Cloud Monitoring (see the sketch below)
- Optionally, create an alert to notify you if an uptime check fails
- IMPORTANT: uptime checks are subject to firewall access
  - Allow access to the uptime check IP range
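A minimal sketch of creating an uptime check programmatically with the monitoring_v3 Python client (the host, display name, and project ID are hypothetical; the console route above works just as well):

```python
from google.cloud import monitoring_v3

client = monitoring_v3.UptimeCheckServiceClient()

config = monitoring_v3.UptimeCheckConfig(
    display_name="homepage-check",  # hypothetical name
    monitored_resource={
        "type": "uptime_url",
        "labels": {"project_id": "my-project-id", "host": "example.com"},
    },
    http_check={"path": "/", "port": 443, "use_ssl": True},
    period={"seconds": 300},   # probe every 5 minutes
    timeout={"seconds": 10},
)

client.create_uptime_check_config(
    request={"parent": "projects/my-project-id", "uptime_check_config": config}
)
```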
Establishing Human-Actionable and Automated Alerts
Why do we care about Alerts?
- Sometimes, things break
- No one wants to endlessly stare at dashboards waiting for something to go wrong
- Solution: Use Alerting Policy to notify you if something goes wrong
Alerting Policy Components
- Conditions - describes conditions to trigger an alert
- Metrics threshold exceeded/not met
- Create an incident when thresholds are violated
- Notifications - who to notify when the alerting policy is triggered
- (optional) - Documentation - included in notifications with action steps
Incident Handling
- Alerting event occurs when alerting policy conditions are violated
- Creates Incident in Open state
- Incident can then be Acknowledged (investigated) and Closed (resolved)
Alerting Policy IAM Roles
- Uses Cloud Monitoring roles to create an alerting policy
- Monitoring Editor, Admin, Project Owner
- Monitoring Alert Policy Editor - minimal permissions to create an alert via Monitoring API
GCP creates actual incidents under alerts, that you can "resolve"
Section Review
Monitoring Your Operations
- Cloud Monitoring Concepts
- Monitoring Workspaces
- What are Metrics?
- Exploring Workspaces and Metrics
- Monitoring Agent
- Monitoring API and CLI usage
- GKE Metrics - Master to individual containers
- Uptime Checks
- Establishing Human-Actionable and Automated Alerts
Milestone: Spies Everywhere! (Check Those Vitals!)
Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring
Logging Activities
Section Introduction
Logging Activities: See next headings
Cloud Logging Fundamentals
What is Cloud Logging?
- Cloud Operations service for storing, viewing, and interacting with logs:
- Reading and writing log entries
- Query logs
- Export to other services (internal to GCP and external)
- Create metrics from logs
- Interact with Logs Viewer and API
- Multiple log types available
- Logs used by other Cloud Operations services (debug, error reporting, etc)
What is a log?
- Record of status or event (string format)
- "What happened?"
- Log Entry - individual logs in a collection
- Log Payload - contents of the Log Entry
- Contains nested fields (see the sketch below)
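A minimal sketch of writing and reading log entries with the google-cloud-logging Python client (the log name and payloads are hypothetical):

```python
from google.cloud import logging

client = logging.Client()
logger = client.logger("my-app-log")  # hypothetical log name

logger.log_text("Payment service started")                  # simple string payload
logger.log_struct({"event": "charge", "amount_usd": 12.5})  # structured payload with nested fields

# Read the entries back; each entry carries its payload plus metadata
for entry in logger.list_entries():
    print(entry.timestamp, entry.payload)
```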
Logs Retention Period
- Varies by log type:
  - Admin Activity, System Event, Access Transparency
    - 400 days
    - Non-configurable
  - All other log types:
    - 30 days by default
    - Configurable retention period
IAM Roles
- Generic and service account varieties
- Service Account:
- Logs Writer: Write logs, no view permissions
- Logs Bucket Writer: Write logs to logs buckets
- Logs Viewer - View logs except Data Access/Access Transparency (known as private logs)
- Private Logs Viewer - View all of the above
- Logs Configuration Writer - Create logs-based metrics, buckets, views, and export sinks
  - 'Change configurations'
- Logging Admin - Full access to all logging actions
- Project Viewer - View all logs except Data Access/Access Transparency
- Project Editor - Write, view, and delete logs; create logs-based metrics
  - Cannot create export sinks or view Data Access/Access Transparency logs
- Project Owner - All logging-based permissions
Log Types and Mechanics
Scope of Collecting and Viewing Logs
- Scoped by project
  - View project-1 logs in project-1
- No built-in "single pane of glass"
- Can export logs org-wide or across multiple projects
Log Types - Primary Categories
- Security logs vs. non-security logs
- Always Enabled (non-configurable) vs. Manually Enabled (configurable):
  - Always Enabled/Required
    - No charge
    - 400 days retention
  - Manually Enabled logs
    - Charged based on log volume
    - 30 days retention (configurable)
- The above categories overlap
Security Logs
Audit logs and Access transparency logs
- "Who did what? where? and when?"
- Also accessible via Activity Log
Admin Activity | System Event | Data Access
Admin Activity
- Records user-initiated resource configuration
- "GCE instance created by (user)"
- "GCS Bucket deleted by (user)"
- Always Enabled
System Event
- Admin (non-user) initiated configuration calls
- Always Enabled
Data Access
- Record configuration (create/modify/read) of resource data
- "Object (x) was created in bucket (y) by (users)"
- Must be manually enabled (except BigQuery)
- Not applicable to public resources
Access Transparency Logs
- Only applicable to Enterprise or paid support plans
- Logs of Google personnel access to your resources/data
  - Example: support request for a VM instance
  - Records actions and access of support personnel
- Always Enabled for applicable support plans
| Log Type | System or User configured | Records what? | Default Setting |
|---|---|---|---|
| Admin Activity | User-initiated | Resource Configuration | Always Enabled |
| System Event | System-initiated | Resource Configuration | Always Enabled |
| Data Access | User-initiated | Resource Data Configuration | Manually Enabled |
| Access Transparency | User-initiated | Google personnel access | Always Enabled (on applicable support plans) |
'Everything Else' Logs
Logs to Debug, Monitor and Troubleshoot:
- Chargeable
- User Logs - generated by software/applications
  - Require the Logging Agent
- Platform Logs - generated by GCP services
  - Example: GCE startup script
  - VPC Flow Logs
  - Firewall Logs
Logs Pricing and Retention
- Always Enabled logs have no charge with 400 days retention
- Admin Activity, System Event, Access Transparency
- ALL other logs are chargeable with configurable retention period (default 30 days)
- Pricing = $0.50/GB
Cloud Logging Tour
Data Access logs - add, edit, or view an object in a bucket
- Enabled through IAM > Audit Logs
- To enable for a single service, find the service (e.g. Google Cloud Storage) and tick Admin Read, Data Read, Data Write
- Can add exempted users, e.g. an admin user
Logging Agent Concepts
- Agent captures additional VM logs
- OS logs/events
- 3rd Party application logs
- The logging agent is based on fluentd (an open-source data collector)
- Only applicable to GCE and EC2 (AWS)
  - GKE uses Cloud Operations for GKE
Configuring the Agent
- Per Google: the "out of the box" setup covers most use cases
- Default installation/configuration covers:
  - OS logs
    - Linux - syslog
    - Windows - Event Viewer
  - Multiple 3rd-party applications, e.g. Apache, nginx, redis, rabbitmq, gitlab, jenkins, cassandra, etc.
Modifying Agent Logs Before Submission
- Why modify logs?
  - Remove sensitive data
  - Reformat log fields (e.g. combine two fields into one)
- Additional configuration "plug-ins" can modify records
  - filter_record_transformer - most common - add/modify/delete fields from logs
Agent Setup Process
- Add Repo (via provided script)
- Update repos
- Install Logging Agent
- Install configuration files
- Start the agent
Install Logging Agent and Collect Agent Logs
curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh
sudo apt update
sudo apt-get install google-fluentd
sudo apt install -y google-fluentd-catch-all-config
sudo service google-fluentd start
Logging Filters
Logs Viewer Query Interface
- View logs through queries
- Basic and Advanced query interfaces
  - Basic
    - Dropdown menus - simple searches
  - Advanced
    - View across log categories - advanced search capabilities
Basic and Advanced Filter Queries
- Different query formats
  - Search field syntax is different for each method
- Basic query
  - Not case-sensitive
  - Built-in field names for some logs
Advanced Filter Boolean Operators
- Group/exclude entries
  - AND requires all conditions to be met
  - OR requires only one condition to be met
  - NOT excludes a condition
- Order of precedence (i.e. order of operations)
  - NOT -> OR -> AND
  - a OR NOT b AND NOT c OR d = (a OR (NOT b)) AND ((NOT c) OR d)
- AND is implied
Constructing Advanced Filter Queries
- Generic text search = just type requested string
- Searching fields
  - Nested JSON format
    - resource.type="gce_instance"
    - resource.labels.zone="us-central1-a"
- Search by set severity or greater
severity >= WARNING
- Filter by timestamp
timestamp>="2018-12-31T00:00:00Z" AND timestamp<="2019-01-01T00:00:00Z"
Hands-On with Advanced Filters
- Create advanced search filters
- Search across log types
- Use AND, OR and NOT operators
- Explore new Logging interface
VPC Flow Logs
What are VPC Flow Logs?
- Recorded sample of network flows sent/received by VPC resources
- Near real-time recording
- Enabled at the VPC subnet level
Use Cases
- Network Monitoring - Understanding traffic growth for capacity forecasting
- Forensics - who are your instances talking to?
- Real-time security analysis
- Integrate (i.e. export) with other security products
VPC Flow Logs - Considerations
- Generates a large number of potentially chargeable log entries
- Does not capture 100% of traffic:
- Samples approximately 1 out of 10 packets. This cannot be adjusted
- TCP/UDP only
- Shared VPC - all VPC flow logs are in the host project
Firewall Logs
What are Firewall logs?
- Logs of firewall rule effects (allow/deny)
- Useful for auditing, verifying, and analyzing effect of rules
- Applied per firewall rule, across entire VPC
- Can be exported for analysis
Considerations
- Logs every firewall connection attempt, on a best-effort basis
- TCP/UDP protocols only
- Default "deny all" ingress and "allow all" egress rules are NOT logged
Viewing Deny/Allow All Logs?
- Create an explicit firewall rule for the denied/allowed traffic you want to view, e.g. a duplicate of the default rule
- Example: view all SSH attempts from outside an allowed location
  - Create a rule to deny all TCP:22 access from all locations, with logging enabled
    - Assign it a low priority figure of 65534
  - Assign a higher-priority "ssh-allow" rule with the allowed location in the source filter
VPC Flow Logs and Firewall Logs Demo
Routing and Exporting Logs
- Main premise - route a copy of logs from Cloud Logging to somewhere else
- BigQuery, Cloud Storage, Pub/Sub, another logging bucket and more
- Can export all logs, or certain logs based on defined criteria
Why Route/Export logs?
- Long-term retention
- Compliance requirements
- Big data analysis
- Analytics in BigQuery
- Stream to other applications
- Pub/Sub connection
- Route to alternate retention buckets
How Routing/Exports Work
- 3 components: Sink, Filter, Destination
- Create a sink
  - Sink = an object pairing a filter with a destination
- Create a filter (query) of the logs to export
  - Can also set exclusions within the filter
- Set the destination for matched logs
- Export only captures new logs created after the sink was created, not previous ones
Export logs across Folder/Organization
Must use command line (or terraform), cannot create via the web console
gcloud logging sinks create my-sink \
storage.googleapis.com/my-bucket --include-children \
--organization=(organization-ID) --log-filter="logName:activity"
Logging Export IAM Roles
- Owner/Logs Configuration Writer - create/edit sink
- Viewer/Logs Viewer - view sink
- Project Editor does NOT have access to create/edit sinks
Export Logs to BigQuery
In this Demo...
Export firewall logs to BigQuery, and analyze access attempts
- Create a sink
- Filter by firewall logs
- Set BigQuery dataset as destination
- Run queries in BigQuery to view denied access attempts
#standardSQL
SELECT
jsonPayload.connection.src_ip,
jsonPayload.connection.dest_port,
jsonPayload.remote_location.continent,
jsonPayload.remote_location.country,
jsonPayload.remote_location.region,
jsonPayload.rule_details.action
FROM `log_export.compute_googleapis_com_firewall`
ORDER BY jsonPayload.connection.dest_port
Logs-Based Metrics
What are logs-based metrics?
- When is a log not just a log?
- When it is also a Cloud Monitoring metric!
- Cloud Monitoring metrics based on defined log entries
- Example: number of denied connection attempts from firewall logs
- Metric is created each time log matches a defined query
- System (auto-created) and User-defined (custom) varieties
Types of logs-based metrics
Counter and Distribution
Counter:
- Counter of logs that match an advanced logs query
- Example: number of logs that match specific firewall log query
- All system logs-based metrics are counter type
Distribution:
- Records distribution of values via aggregation methods
From Logs Viewer > Create Metric, or from the Logs-based Metrics menu (see the sketch below)
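A minimal sketch of creating a counter logs-based metric with the google-cloud-logging Python client; the metric name and filter are hypothetical (the console route above is equivalent):

```python
from google.cloud import logging

client = logging.Client()

# Counter metric: each log entry matching the filter increments the count
metric = client.metric(
    "denied-connections",  # hypothetical metric name
    filter_='resource.type="gce_subnetwork" AND jsonPayload.rule_details.action="DENY"',
    description="Count of denied firewall connections",
)
metric.create()
```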
Section Review
Milestone: Let the Record Show
Custom logs-based distribution metrics
- More powerful: collects a numeric value from each event and shows how those values are distributed over the set of events
- A common use is to track latency
  - From each event received, a latency value is extracted from the log entry and added to the distribution
- Fundamentally a percentile-based concept
Hands-On Lab: Install and Configure Logging Agent on Google Cloud
SRE and Alerting Policies
SLOs and Alerting Strategy
Required Reading
- Google SRE Workbook - "Alerting on SLOs"
- https://sre.google/workbook/alerting-on-slos/
Alerts Review - Why we need them
- Something is not working correctly
- Action is necessary to fix it
- Alerts inform relevant personnel that action is necessary when specified conditions are met
Alerts and SRE
- Continued errors = Danger of violating SLAs/SLOs (error budget being used up)
- If issue isn't fixed, error budget will be used up
- Proper alerting policy based on Service Level Indicators (SLIs) enables us to preserve error budget
- Balance multiple alerting parameters
Precision | Recall | Detection time | Reset time
- Precision: rate of 'relevant' alerts vs. low-priority events
  - Does this event require immediate attention?
- Recall: percent of significant events detected
  - Was every 'real' event properly detected? Did we miss some?
- Detection time: time taken to detect a significant issue
  - Longer detection time = more accurate detection, but a longer duration of errors before detection
- Reset time: how long alerts persist after the issue is resolved
  - Longer reset time = confusion/'white noise'
How do we balance these parameters?
- Window Length: time period measured
  - % of errors over (x) time period
  - Example: average CPU utilization per minute vs. per hour
  - Small windows = faster alert detection, but more 'white noise'
    - CPU averaging 80% for a 1-minute window
    - Hurts precision, helps recall
  - Longer windows = more precise ('real' problem vs. white noise)
    - CPU averaging 95% for a 1-hour window
    - Longer detection time
    - Good precision, poor detection time
    - Once the problem is determined, more error budget may already be used up
- Duration: how long something exceeds SLIs before a 'significant' event is declared
  - The 'For' field (e.g. 1 minute) in an alerting policy sets the duration
  - Short 'blips' vs. sustained errors over a longer time period
  - Reduced 'white noise'
  - Poor recall, good precision
    - An outage of 2 minutes with a 5-minute duration is never detected
  - Misses massive spikes in errors over shorter durations
Optimal Solution - Multiple conditions/notification channels
- No single alerting policy/condition can properly cover all the scenarios
Google recommends (condensed from the multi-page doc listed in the required reading; a burn-rate sketch follows this list):
- Multiple conditions
  - Long and short time windows
  - Long and short durations
- Multiple notification channels based on severity
  - Low-priority anomalies to a reporting system
    - Pub/Sub topic to an analysis application to look for trends
    - No immediate human interaction required
  - Major (customer-impacting) events to the on-call team
    - Requires immediate escalation
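The burn-rate arithmetic underneath these recommendations, as a minimal sketch (numbers are illustrative):

```python
# Burn rate = how fast the error budget is being consumed (SRE Workbook concept)
slo = 0.999
error_budget = 1 - slo           # 0.1% of requests may fail over the period

observed_error_rate = 0.005      # 0.5% of requests failing in the current window
burn_rate = observed_error_rate / error_budget
print(burn_rate)  # 5.0 -> at this pace the entire budget is gone in 1/5 of the period
```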
Service Monitoring
wget https://raw.githubusercontent.com/linuxacademy/content-gcpro-devops-engineer/master/scripts/app-engine-quick-deploy.sh
source app-engine-quick-deploy.sh
Not in exam, but interesting
GCP Console > Operations > Services
Create SLO
- Latency (App Engine)
- Request-based - simply counts individual events (window-based is more advanced: good minutes vs. bad minutes, entries above/below the SLI)
- Define the SLI
  - Latency threshold - 200ms - the response time all our requests must beat to be within the SLI
- Set your SLO based on the SLI
  - Compliance period: Calendar = a fixed 1-day window; Rolling = any 24-hour period
  - Performance goal: 80% of requests with a good response time, e.g. 80% of our customer requests must be 200ms or less
    - Raising this value, e.g. to 99%, will bring the error budget down, because some requests exceed the 200ms response time
- Name: Average Customer Latency - 80% SLO
Milestone: Come Together, Right Now, SRE
The Services console is really valuable: when setting the SLO, GCP already has the historical data and can show you instant feedback (e.g. the P95 value) while you determine your SLO. Automation built into the GCP console reduces the risk of human arithmetic errors.
- Really great tool: no manually calculating error budgets, GCP does it all for you!
Optimize Performance with Trace/Profiler
Section Introduction
What the Services Do and Why They Matter
Cloud Trace
"A Distributed tracing system that collects latency data from your applications and displays it in the GCP Console"
Operational Management
- Latency Management
Google's 4x Golden Signals
- Latency
Cloud Trace: Primary Features
- Works with App Engine, VMs, and containers (e.g. GKE, Cloud Run)
- Shows general aggregated latency data
- Shows performance degradations over time
- Identifies bottlenecks
- Alerts automatically if there's a big shift
- SDK supports Java, Node.js, Ruby, and Go
- API available to work with any source
Cloud Profiler
"Continuously analyzes the performance of CPU or memory-intensive functions executed across an application"
Cloud Profiler: Primary Features
- Improve performance
- Reduce costs
- Supports Java, Node.js, Python and Go
- Agent-based
- Extremely low-impact
- Profiles saved for 30 days
- Export profiles for longer storage
- Free!
Types of Profiling Supported
| Profile Type | Go | Java | Node.js | Python |
|---|---|---|---|---|
| CPU time | X | X | - | X |
| Wall time | - | X | X | X |
| Heap | X | X | X | - |
| Allocated Heap | X | - | - | - |
| Contention | X | - | - | - |
| Threads | X | - | - | - |
- CPU time - time the processor spends running whatever function (in code) is being processed
- Wall time - total (wall-clock) time elapsed between entering and exiting a function; includes all wait time, locks, and thread synchronization
- Heap - amount of memory allocated in the program's heap at the instant the profile is collected
- Allocated heap - total amount of memory allocated in the program's heap during the interval between one collection and the next; this value includes memory that has since been freed or is no longer in use
- Contention (Go-specific) - profiles mutex (mutual exclusion lock) contention for data access across concurrent processes; determines the time spent waiting for mutexes and the frequency at which contention occurs
- Threads (Go-specific) - profiles thread usage and captures information on goroutines and Go concurrency mechanisms
Tracking Latency with Cloud Trace
Set up a sample app that utilizes a Kubernetes Engine cluster and three different services to handle a request and generate a response, all while tracking the latency with Cloud Trace. Once we've made a number of requests, we'll examine the results in the Cloud Trace console, where you can check the overall latency of the request as well as break it down into its component parts.
Traces are gathered at every step, e.g. across each of the 3x load balancers
Enable the Kubernetes Engine API
Enable the Cloud Trace API
Python Code for Cloud Trace Middleware
Python Code for Cloud Trace Execute Trace
Every time the app executes, the trace executes (a stand-in sketch follows)
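The video's code isn't reproduced in these notes; as a stand-in, here is a minimal OpenTelemetry sketch that exports spans to Cloud Trace (uses the opentelemetry-exporter-gcp-trace package; do_work() is a placeholder):

```python
from opentelemetry import trace
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Ship finished spans to Cloud Trace in batches
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def do_work():
    pass  # placeholder for the real request handling

# Each nested span becomes one bar in the Cloud Trace waterfall
with tracer.start_as_current_span("handle-request"):
    with tracer.start_as_current_span("call-backend-service"):
        do_work()
```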
Dashboard outputs the full trace, with waterfall graph showing component parts (spans - to help identify latency bottlenecks)
Accessing the Cloud Trace APIs
Instrumenting Your Code for Cloud Trace
3x methods
| Client Libraries | OpenTelemetry | OpenCensus |
|---|---|---|
| Ruby | Node.js, Python (now) | Python |
| Node.js | Go | Java |
| C# ASP.NET Core | In active development | Go |
| C# ASP.NET | Recommended by Google | PHP |
Comparing APIs
| V1 | vs | V2 |
|---|---|---|
| Send traces to Cloud Trace | - | Send traces to Cloud Trace |
| Update existing traces | - | - |
| Get lists of traces | - | - |
| get the details of a single trace | - | - |
| Supports v1 REST and v2 REST as well as v1 and v2 RPC | - | Supports v1 REST and v2 REST as well as v1 and v2 RPC |
| A trace is represented by a Trace resource (REST) and Trace message (RPC) | - | No explicit trace object; include trace ID in span to identify |
| A span is represented by a TraceSpan resource (REST) and TraceSpan message (RPC) | - | A span is represented by the Span resource |
| Uses Labels fields | - | Uses attributes fields |
Setting Up Your App with Cloud Profiler
Cloud Profiler Supported Environments
- Compute Engine
- App Engine Standard Environment
- App Engine Flexible Environment
- Kubernetes Engine
- Outside of Google Cloud
Python Code to enable Cloud Profiler to track CPU
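The video's code isn't captured here; a minimal sketch using the google-cloud-profiler agent (the service name and version are hypothetical):

```python
import googlecloudprofiler

# Start the profiling agent once at process startup; it samples continuously
# with very low overhead and uploads profiles to Cloud Profiler
googlecloudprofiler.start(
    service="acg-demo-service",      # hypothetical service name
    service_version="1.0.0",
    # project_id="my-project-id",    # only needed when running outside GCP
)
```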
3 Steps to Setting Up Cloud Profiler Outside GCP
- Enable the Profiler API inside your GCP project
- Get credentials for the profiling agent
  - Service account with private-key auth
  - Application Default Credentials (ADC)
- Configure the agent
  - Pass in the project ID via a Config() or similar method
Analyzing Cloud Profiler Data
As Go is the most supported language, the demo in this video uses Go.
TL;DR: Analyze code with graphs (CPU, I/O, locks, etc.); very useful if you're writing code and running your apps within GKE or on GCP
Section Review
Cloud Trace: latency - demo app with 3x load balancers, spans, OpenTelemetry
Cloud Profiler: continuously analyzing CPU and memory performance - supported services, code necessary for implementation, examining flame graphs for profile types (threads, alloc, etc.)
Milestone: It All Adds Up!
Hands-On Lab: Discovering Latency with Google Cloud Trace
- Stand-up infrastructure
gcloud services enable cloudtrace.googleapis.com
gcloud services enable container.googleapis.com
gcloud container clusters create cluster-1 --zone us-central1-c
gcloud container clusters get-credentials cluster-1 --zone us-central1-c # If created through the WebUI
kubectl get nodes
- Update [PROJECT_ID] in ./trace/acg-service-*.yaml.template within the cloned repo
- Build the container image, tag it, and push it to Container Registry
docker build -t acg-image .
docker tag acg-image gcr.io/discovering--219-46060d10/acg-image
docker push gcr.io/discovering--219-46060d10/acg-image
I broke the deployment by changing the tag in the deployment. The tag is referenced in the deployment spec under ./trace/app/acg-service-(a|b|c).yaml
After fixing the issue (re-tag, re-push), you can force a refresh of the containers:
cloud_user_p_cff0f65a@cloudshell:~/content-gcpro-operations/trace (discovering--219-46060d10)$ kubectl rollout restart deployment/cloud-trace-acg-c
deployment.apps/cloud-trace-acg-c restarted
cloud_user_p_cff0f65a@cloudshell:~/content-gcpro-operations/trace (discovering--219-46060d10)$ kubectl get all
NAME READY STATUS RESTARTS AGE
pod/cloud-trace-acg-a-5f49648db7-ztjr5 1/1 Running 0 20s
pod/cloud-trace-acg-b-6966b8cb56-ztnvh 1/1 Running 0 4s
pod/cloud-trace-acg-b-bb7c98995-bfgld 0/1 Terminating 0 16m
pod/cloud-trace-acg-c-66497849b7-xlvjm 1/1 Running 0 16m
pod/cloud-trace-acg-c-6d5d5b489d-bwdbr 0/1 ContainerCreating 0 2s
- Run the setup.sh script:
bash setup.sh
- Test the app
curl $(kubectl get svc cloud-trace-acg-c -ojsonpath='{.status.loadBalancer.ingress[0].ip}')
Identifying Application Errors with Debug/Error Reporting
Section Introduction
Troubleshooting with Cloud Debugger
"Inspect code in real time, without stopping or slowing it down" (killer feature)
- Multiple Source Options
- Cloud Source repositories
- GitHub
- BitBucket
- GitLab
- Code Search
- Quickly find code in a specific file, function, method or by line number
- Code Share
- Debug sessions can be shared with any teammate for collaborative debugging
- IDE Integration
- Integrates with IntelliJ IDE, as well as VSCode and Atom
Completely Free to Use
Key Workflows
Two Key Cloud Debugger Tools
| Snapshots | vs. | Logpoints |
|---|---|---|
| - Captures application state at a specific line location | - | - Inject logs into running apps without redeployment |
| - Captures local variables | - | - Logpoints remain active for 24 hours if not deleted or service not redeployed |
| - Captures call stack | - | - Supports canarying |
| - Take snapshots conditionally (Java, Python and Go) | - | - Add Logpoints conditionally |
| - Supports canarying | - | - Output sent to the target's appropriate log environment |
Demo of Code debugging in Python
Establishing Error Reporting for Your App
Error Reporting Supported Languages
- PHP
- Java
- Python
- .NET
- Node.js
- Ruby
- Go
Setup by compute platform:
- Compute Engine
  - Cloud Logging
  - Error Reporting API
- App Engine
  - Automatic setup
  - Standard Environment:
    - Additional setup may be necessary
    - Only errors with a stack trace are processed
  - Flexible Environment:
    - No additional setup required
    - Analyzes messages written to stderr
- Cloud Run
  - Automatic setup
  - Analyzes messages written to stderr, stdout, or other logs that have a stack trace
- Kubernetes Engine
  - Use Cloud Logging
  - Use the Error Reporting API
- Cloud Functions
  - Automatic setup
  - Unhandled JavaScript exceptions are processed
Working with the Error Reporting API in Python
- Import the error_reporting library
- Instantiate Client()
- Call the report() method, passing in any string
Python example in Video
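A minimal sketch matching the three steps above (the failing handler is hypothetical):

```python
from google.cloud import error_reporting

client = error_reporting.Client()

def handle_request():
    raise ValueError("something went wrong")  # hypothetical failure

try:
    handle_request()
except Exception:
    # Reports the current exception, with stack trace, to Error Reporting
    client.report_exception()

# Or report an arbitrary string directly
client.report("A custom, non-exception error occurred")
```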
Managing Errors and Handling Notifications
Error Reporting Entries Explained
- Occurrences (within selected time/day range)
- Users affected (within selected time/day range)
- Error (extracted from the stack trace with code location; linked if in Cloud Source Repositories)
- Seen in (service and version, if any)
- First seen (time/date first appeared)
- Last seen
- Response code (HTTP status code, if any)
- Link to issue URL (optional text field)
Error Reporting Notifications
- Enabled per project
- Must have the Project Owner, Project Editor, or Project Viewer role, or a custom role with the cloudnotifications.activities.list permission
- Sent to the email of the specified roles
  - May be forwarded to an alias or Slack channel
- Sent to the mobile app, if enabled and subscribed (Cloud Console app)
Section Review
Milestone: Come Together - Reprise (Debug Is De Solution)
Hands-On Lab: Correcting Code with Cloud Debugger
gcloud services list --available | grep -i debug
gcloud services enable clouddebugger.googleapis.com
git clone https://github.com/linuxacademy/content-gcpro-operations
cd content-gcpro-operations/debugger/
gcloud app deploy # This failed first time, as GCP hadn't finished setting up services, just re-run and was successful
gcloud app browse # Opens link to the app
- Update the code, because guru fails to return properly reversed
- Fix the code using the debugger: set an info point and use breakpoints



