
Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)

Introduction

About the Course and Learning Path

Make better software, faster

Milestone: Getting Started

Understanding Operations in Context

Section Introduction

What Is Ops?

  • GCP Defined

    • "Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"

    • Logging Management

      • Gather Logs, metrics and traces everywhere
        • Audit, Platform, User logs
          • Export, Discard, Ingest
    • Error Reporting

      • So much data - how do you pick out the important indicators?
        • A centralized error management interface that shows current and past errors
        • Identify your app's top or new errors at a glance, in a dedicated dashboard
        • Shows error details such as a time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace
    • Across-the-board Monitoring

      • Dashboards for built-in and customizable visualisations
        • Monitoring Features
          • Visual Dashboards
      • Health Monitoring
        • Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
      • Service Monitoring
        • Set, monitor, and alert your teams as needed based on Service Level Objectives (SLOs)
    • SRE Tracking

      • Monitoring is critical for SRE
        • Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
          • Pinpoint SLIs and develop an SLO on top of them
    • Operational Management

      • Debugging
        • Inspects the state of your application at any code location in production without stopping or slowing down requests
      • Latency Management
        • Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
      • Performance Management
        • Offers continuous profiling of resource consumption in your production applications along with cost management
      • Security Management
        • With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

What is Ops: Key Takeaways

  • Ops Defined: Watch, learn and fix
  • Primary services: Monitoring and Logging
  • Monitoring dashboards for all metrics, including health and services (SLOs)
  • Logs can be exported, discarded, or ingested
  • SRE depends on ops
  • Error alerting pinpoints problems, quickly

Scratch:

  • Metric query and tracing analysis
  • Establish performance and reliability indicators
  • Trigger alerts and error reporting
  • Logging Features
  • Error Reporting
  • SRE Tracking (SLI/SLO)
  • Performance Management

Clarifying the Stackdriver/Operations Connection

  • 2012 - Stackdriver Created
  • 2014 - Stackdriver Acquired by Google
  • 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
  • 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops name

Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs (formerly all called Stackdriver)

"StackDriver" lives on - in the exam only

Integration + Upgrades

  • Complete UI Integrations
    • All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
  • Dashboard API
    • New API added to allow creation and sharing of dashboards across projects
  • Log Retention Increased
    • Logs can now be retained for up to 10 years and you have control over the time specified
  • Metrics Enhancement
    • In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
  • Advanced Alert Routing
    • Alerts can now be routed to independent systems that support Cloud Pub/Sub

Operations and SRE: How Do They Relate?

  • Lots of questions in Exam on SRE

What is SRE? - "SRE is what happens when a software engineer is tasked with what used to be called operations" (founder of the Google SRE team)

Pillars of DevOps

  • Accept failure as normal:

    • Try to anticipate, but...
    • Incidents bound to occur
    • Failures help team learn
  • No-fault postmortems & SLOs:

    • No two failures the same
    • Track incidents (SLIs)
    • Map to Objectives (SLOs)
  • Implement gradual change:

    • Small updates are better
    • Easier to review
    • Easier to rollback
  • Reduce costs of failures:

    • Limited "canary" rollouts
    • Impact fewest users
    • Automate where possible
  • Measure everything:

    • Critical gauge of success
    • CI/CD needs full monitoring
    • Synthetic, proactive monitoring
  • Measure toil and reliability:

    • Key to SLOs and SLAs
    • Reduce toil, up engineering
    • Monitor all over time

SLI: "A carefully defined quantitative measure of some aspect of the level of service that is provided"

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:

  • Request Latency: How long it takes to return a response to a request
  • Failure Rate: A fraction of all requests received: (unsuccessful requests / all requests)
  • Batch Throughput: Proportion of time during which the data processing rate exceeds a threshold

Commit to Memory - Google's 4x Golden Signals!

  • Latency
    • The time it takes for your service to fulfill a request
  • Errors
    • The rate at which your service fails
  • Traffic
    • How much demand is directed at your service
  • Saturation
    • A measure of how close to fully utilized the service's resources are

LETS = Latency, Errors, Traffic, Saturation


SLO: "Service level objectives (SLOs) specify a target level for the reliability of your service" - The site reliability workbook

SLOs are tied to your SLIs

  • Measured by SLIs
  • Can be a single target value or a range of values
  • SLI <= SLO target
  • or
  • (lower bound <= SLI <= upper bound) = SLO
  • Common SLOs: 99.5%, 99.9%, 99.99% ('four nines')

SLI - Metrics over time which detail the health of a service

  • example: Site homepage latency < 300ms over the last 5 minutes at the 95th percentile

SLO - Agreed-upon bounds on how often SLIs must be met

  • example: The 95th-percentile homepage latency SLI will succeed 99.9% of the time over the next year
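
A quick worked example of the SLI/SLO arithmetic above (all numbers assumed for illustration): a 99.9% SLO over 1,000,000 requests leaves an error budget of 1,000 'bad' requests.

# Sketch: does a measured SLI meet its SLO, and how much error budget remains?
slo_target = 0.999                    # 99.9% of requests must succeed
total_requests = 1_000_000
bad_requests = 700

sli = (total_requests - bad_requests) / total_requests  # 0.9993
error_budget = (1 - slo_target) * total_requests        # 1,000 allowed failures
print(sli >= slo_target, error_budget - bad_requests)   # True, 300 left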

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta releases and provides SREs a great opportunity to:

  • Measure and track SLIs (Measuring increasing performance)
  • Evaluate reliability
  • Define SLOs
  • Build capacity models
  • Establish incident response, shared with dev team

General Availability Phase

  • After the Production Readiness Review is passed
  • SREs handle the majority of ops work
  • Incident responses
  • Track operational load and SLOs

Ops & SRE: Key Takeaways

  • SRE: Operations from a software engineer
  • Many shared pillars between DevOps/SRE
  • SLIs are quantitative metrics over time
  • Remember the 4x Google Golden Signals (LETS)
  • SLOs are a target objective for reliability
  • SLIs sit at or below the SLO target - or - between the upper and lower bounds
  • SREs are most active in limited availability and general availability phases

Operation Services at a Glance

10,000ft view!

  • Cloud Monitoring - Provides visibility into the performance, uptime, and overall health of cloud-powered applications
  • Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from GCP

Gives SREs the ability to evaluate SLIs and keep on track with SLOs

Help make operations run more smoothly, reliably, and efficiently

  • Cloud Debugger - When (not if) the system encounters issues, lets you inspect the state of a running application in real time without interference
  • Cloud Trace - Maps out exactly how your service is processing the various requests and responses it receives, all while tracking latency
  • Cloud Profiler - Keeps an eye on your code's performance - continuously gathers CPU usage and memory-allocation information from your production applications, looking for bottlenecks

Ops Tools Working Together

Gather Information

  • Collect signals
    • Through Metrics
      • Apps
      • Services
      • Platform
      • Microservices
    • Logs
      • Apps
      • Services
      • Platform
    • Trace
      • Apps

Handle Issues

  • Alerts
  • Error Reporting
  • SLO

Troubleshoot

Display and Investigate

  • Dashboards
  • Health Checks
  • Log Viewer
  • Service Monitoring
  • Trace
  • Debugger
  • Profiler

5x Main Services Are:

  • Cloud Monitoring
  • Cloud Logging
  • Cloud Debugger
  • Cloud Trace
  • Cloud Profiler

All work together to gather info, handle issues, and troubleshoot, by allowing you to display the collective signals and analyze them

Section Review

Monitoring, troubleshooting, and improving application performance.

Milestone: The Weight of the World (Teamwork, Not Superheroes)

Monitoring Your Operations

Section Introduction

Cloud Monitoring Concepts

Measures key aspects of your services

  • Captures resource and application signal data
    • Metrics
    • Events
    • Metadata

What questions does Cloud Monitoring answer?

  • How well are my resources performing?
  • Are my applications meeting their SLAs?
  • Is something wrong that requires immediate action?

Workspace - 'Single pane of glass' for viewing and monitoring data across projects
Installed Agents (optional) - Additional application-specific signal data
Alerts - Notify someone when something needs to be fixed

Monitoring Workspaces Concepts

What is a monitoring Workspace?

  • GCP's organization tool for monitoring GCP and AWS resources
    • All monitoring tools live in a Workspace
      • Uptime checks
      • Dashboards
      • Alerts
      • Charts
Monitoring Workspace (within a host project) ------ monitors ------< one or more GCP projects
  • A workspace exists in a host project - create the project first, then the Monitoring Workspace within it
  • A project can only contain a single workspace
  • The workspace name is the same as the host project's project_id (cannot be changed) - worth having a project dedicated to the workspace
  • A workspace can monitor resources in its own project and/or other projects
  • A workspace can monitor multiple projects simultaneously
  • A project can be associated with only a single workspace
  • Projects can be moved from one workspace to another
  • Two different workspaces can be merged into a single workspace
  • A workspace accesses other projects' metric data, but the data 'lives' in those projects

Project Grouping Strategies - No single correct answer

A single workspace can monitor up to 200 projects

  • A single workspace for all app-related projects ("app1")
    • Pro: Single pane of glass for all application/resource data for a single app
    • Con: If you need to restrict dev/prod teams from viewing data in each other's projects, this access is too broad
  • 2x Workspaces for viewing projects in each tier e.g. Dev/Prod/QA/Test
    • Limited scope so can restrict teams
    • More than one place to investigate should you need to
  • Single Workspace per project
    • Maximum isolation
    • More limited view

Monitoring Workspaces IAM roles

  • Applied to the workspace project to view added projects' monitoring data
  • Monitoring Viewer/Editor/Admin
    • Viewer = Read-only access (view metrics)
    • Editor = Edit workspace, write-access to monitor console and API
    • Admin = Full access including IAM roles
  • Monitoring Metric Writer = Service Account role
    • Permit writing data to workspace
    • Does not provide read access

Monitoring Workspaces

  • Creating a workspace

    • Operations > Monitoring > Overview
    • Select a project, and workspace is created for that project
  • Adding projects to a workspace

    • Within Overview > Settings > GCP Projects Section > Add GCP Projects
  • Moving projects between workspaces

    • Go to the workspace containing the project you want to move
      • Settings > 'GCP Projects' section > select the three dots on the line of the project you want to move > 'Move to another workspace'
        • If you're moving a project for which you've created custom dashboards in the current workspace, these will be lost
  • Merging workspaces

    • Navigate into the workspace that you want to merge into (ws-1):
      • Settings > MERGE and select the workspace (ws-2) you want to move into your current workspace (ws-1)
        • Merge ws-2 into ws-1 whereby ws-2 is deleted as part of the process

Note: Cloud playgrounds will only have a single project

Perspective: Workspaces in Context

How can you make all the data easier to manage? e.g. Don't dump everything into one workspace

What Are Metrics?

  • Workspace provides visibility into everything that is happening in your GCP resources
  • Within Workspace, we view information using:
    • Metrics
    • ...viewed in Charts...
    • ...grouped in Dashboards...

What are Metrics?

  • Raw data GCP uses to create charts
  • Over 1000 pre-created metrics
  • Can create custom metrics
    • Built with the Monitoring API
    • OpenCensus - open source library to create metrics
  • Best practice: don't create a custom metric where a default metric already exists

Anatomy of a Metric

  • Value Types - metric data type

    • BOOL - boolean
    • INT64 - 64-bit integer
    • DOUBLE - Double precision float
    • STRING - a string
  • Metric Kind - relation of values

    • Gauge - Measures a specific instant in time (e.g. CPU utilization)
    • Delta - Measures the change since the last recording (e.g. request count since the last data point)
    • Cumulative - Measures a value that accumulates over time (e.g. total request count since a start time)
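
As a sketch of writing a custom metric through the Monitoring API with the Python client library (the project ID and metric name are hypothetical):

import time

from google.cloud import monitoring_v3

# Write one point of a custom GAUGE metric via the Monitoring API
client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # hypothetical project

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/store/queue_depth"  # hypothetical custom metric
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
series.points = [monitoring_v3.Point({"interval": interval, "value": {"int64_value": 42}})]

client.create_time_series(name=project_name, time_series=[series])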

Exploring Workspace and Metrics

Video just does a quick walkthrough of the dashboard and creating a chart with the metrics explorer

Monitoring Agent Concepts

  • Cloud Monitoring collects lots of metrics with no additional configuration needed

    • "Zero config monitoring"
    • Examples: CPU utilization, network traffic
  • More granular metrics can be collected using an optional monitoring agent

    • Memory usage
    • 3rd Party app metrics (Nginx, Apache)
  • Separate agents exist for monitoring and logging

  • Monitoring Agent = collectd

  • Logging Agent = fluentd

Which GCP (and AWS) Services Need Agents?

  • Not all compute services require (or even allow installation of) agents
  • Services that support agent installation:
    • Compute Engine
    • AWS EC2 instances (required for all metrics)
  • Services which don't support agent installation:
    • Everything else
    • Managed services either have an agent already installed or simply don't require one
    • Google Kubernetes Engine has Cloud Operations for GKE pre-installed

Installing the Agent - General Process

  • Add installation location as repo
  • Update repos
  • Install agent from repo ('apt install ...')
  • (Optional) Configure agent for 3rd party application

Installing the Monitoring Agent

  • Manually install and configure monitoring agent on Apache web server
  • Demonstrate how to automate the agent setup process

The below commands create an instance with a custom web page. The first instance does not have the agent installed; the second instance has the agent installed. You will need to allow port 80 on your VPC firewall if it's not already enabled, scoped to your instance tag (command included for reference).

Create the firewall rule, allowing port 80 access on the default VPC (modify if using a custom VPC):

gcloud compute firewall-rules create default-allow-http --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http-server

Create a web server without an agent:

gcloud beta compute instances create website-agent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-script.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-agent

Create web server and install an agent: (The Automated Example)

gcloud beta compute instances create website-agent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-with-agent.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-agent

To manually install the agent (with Apache configuration) on an instance:

  • Add the agent's package repository
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh
sudo apt-get update
  • To install the latest version of the agent, run:
sudo apt-get install stackdriver-agent
  • To verify that the agent is working as expected, run:
sudo service stackdriver-agent status
  • On your VM instance, download apache.conf and place it in the directory /opt/stackdriver/collectd/etc/collectd.d/:
(cd /opt/stackdriver/collectd/etc/collectd.d/ && sudo curl -O https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/apache.conf)
  • Restart the monitoring agent:
sudo service stackdriver-agent restart

Collecting Monitoring Agent Metrics

  • Generate traffic to our Apache web server
  • View web server metrics in Cloud Monitoring Workspace

When you have the agent installed, you can see 2x new tabs, depending on what's been configured. In the example, "Agent" & "Apache" are visible, with associated metrics that wouldn't otherwise be there

Integration with Monitoring API

What is the Monitoring API?

  • Manipulate metrics utilizing Google Cloud API's
  • Accessible via external services
    • Monitoring dashboards in Grafana
    • Export metric data to BigQuery
  • Create and share dashboards with programmatic REST and gRPC syntax
    • More efficient than hand-creating dashboards from scratch

External Integration Use Cases

  • Keep metrics for long-term analysis/storage
    • Cloud Monitoring holds metrics for a limited period (historically six weeks; now up to 24 months)
  • Share metric data with other analytic platforms (e.g. Grafana)

How does Metrics Integration Work?

  • Short version: Export/Integrate via Cloud Monitoring API
  • If exporting metrics:
    • Define metrics with metric descriptor (JSON format)
    • Export via Monitoring API to BigQuery
  • If using 3rd Party service (e.g. Grafana), authenticate to Cloud Monitoring API with service account
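
For the export path, a minimal sketch of reading raw time-series data back out through the Monitoring API Python client - the extraction step before loading into BigQuery (the project ID and filter are assumptions):

import time

from google.cloud import monitoring_v3

# Read the last hour of CPU utilization data via the Monitoring API
client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)
results = client.list_time_series(
    request={
        "name": "projects/my-project-id",  # hypothetical project
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(series.resource.labels["instance_id"], point.value.double_value)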

(Slides: example BigQuery export, metric descriptor JSON, and HTTPS REST API call for Grafana)

Programmatically Create Dashboards

  • Create and export dashboard via Monitoring API
  • Dashboard configuration represented in JSON format
  • Create dashboards and configurations via gcloud command or directly via REST API
  • gcloud monitoring dashboards create --config-from-file=[file_name.json]

Create Dashboards with Command Line

  • Create custom dashboards via command with JSON config files
    • Use both gcloud command or directly via REST API
  • Export current dashboard into JSON configuration file

This is cool, because we can create and hold our dashboards under source control

  • Download configuration file
wget https://raw.githubusercontent.com/GoogleCloudPlatform/monitoring-dashboard-samples/master/dashboards/compute/gce-vm-instance-monitoring.json
  • Create dashboard with gcloud command and config file:
gcloud monitoring dashboards create --config-from-file=gce-vm-instance-monitoring.json
  • Do same thing via REST API
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
https://monitoring.googleapis.com/v1/projects/(YOUR-PROJECT-ID-HERE)/dashboards -d @gce-vm-instance-monitoring.json
  • Export current dashboard

Create shell variables for Project ID and Project Number

export PROJECT_ID=$(gcloud config list --format 'value(core.project)')

export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")
  • Export the current dashboard by dashboard ID, using the above variables; substitute your dashboard ID ($DASH_ID) and exported file name where appropriate:
gcloud monitoring dashboards describe \
projects/$PROJECT_NUMBER/dashboards/$DASH_ID --format=json > your-file.json

GKE Metrics

  • "Enable Cloud Operations for GKE" - Simple tickbox to turn on

GKE is natively integrated with both Cloud Monitoring and Logging

  • Ability to toggle 'Cloud Operations for GKE' in cluster settings, enabled by default
  • Cloud Operations for GKE replaces older 'Legacy Monitoring and Logging'
  • Integrates with Prometheus

What K8s Metrics Are Collected?

  • k8s_master_component
  • k8s_cluster
  • k8s_node
  • k8s_pod
  • k8s_container

If you need a 'one-click' script to build out a demonstration GKE web application from scratch, copy and paste the below command. It will download and execute a script to build out the environment. The process will take about five minutes to complete:

wget https://raw.githubusercontent.com/linuxacademy/content-gcpro-devops-engineer/master/scripts/quick-deploy-to-gke.sh

source quick-deploy-to-gke.sh

Perspective: What's Up, Doc?

  • Uptime checks are very valuable

Uptime Checks

What are Uptime Checks?

  • A periodic request sent to a monitored resource, waiting for a response to confirm the resource is "up"

  • Check uptime of:

    • VMs
    • App Engine services
    • Website URLs
    • AWS Load Balancer
  • Create uptime check via Cloud Monitoring

  • Optionally, create an alert to notify if the uptime check fails

  • IMPORTANT: Uptime checks are subject to firewall access

    • Allow access to the uptime check IP range
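
Uptime checks can also be created programmatically; a sketch via the Monitoring API Python client (the host, path, and project ID are hypothetical):

from google.cloud import monitoring_v3

# Create an HTTP uptime check against a public URL every 5 minutes
client = monitoring_v3.UptimeCheckServiceClient()
config = monitoring_v3.UptimeCheckConfig(
    display_name="homepage-check",
    monitored_resource={"type": "uptime_url", "labels": {"host": "example.com"}},
    http_check=monitoring_v3.UptimeCheckConfig.HttpCheck(path="/", port=80),
    period={"seconds": 300},   # check interval
    timeout={"seconds": 10},   # per-request timeout
)
client.create_uptime_check_config(
    parent="projects/my-project-id", uptime_check_config=config
)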

Establishing Human-Actionable and Automated Alerts

Why do we care about Alerts?

  • Sometimes, things break
  • No one wants to endlessly stare at dashboards for something to go wrong
  • Solution: Use Alerting Policy to notify you if something goes wrong

Alerting Policy Components

  • Conditions - describe what triggers an alert
    • Metric threshold exceeded/not met
    • Creates an incident when thresholds are violated
  • Notifications - who to notify when the alerting policy is triggered
  • (optional) Documentation - included in notifications with action steps (all three components appear in the sketch below)
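
A sketch of those three components assembled into a policy via the Monitoring API Python client (the metric filter, threshold, and project ID are illustrative):

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

policy = monitoring_v3.AlertPolicy(
    display_name="High CPU utilization",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    # Condition: CPU above 80% for 5 minutes
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="CPU > 80% for 5 minutes",
            condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                filter='metric.type = "compute.googleapis.com/instance/cpu/utilization"'
                       ' AND resource.type = "gce_instance"',
                comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                threshold_value=0.8,
                duration={"seconds": 300},
            ),
        )
    ],
    # Documentation travels with the notification
    documentation=monitoring_v3.AlertPolicy.Documentation(
        content="CPU is high; check the instance and recent deploys.",
        mime_type="text/markdown",
    ),
)
client.create_alert_policy(name="projects/my-project-id", alert_policy=policy)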

Incident Handling

  • Alerting event occurs when alerting policy conditions are violated
  • Creates Incident in Open state
  • Incident can then be Acknowledged (investigated) and Closed (resolved)

Alerting Policy IAM Roles

  • Uses Cloud Monitoring roles to create an alerting policy
  • Monitoring Editor, Admin, Project Owner
  • Monitoring Alert Policy Editor - minimal permissions to create an alert via Monitoring API

GCP creates actual incidents under alerts that you can "resolve"

Section Review

Monitoring Your Operations

  • Cloud Monitoring Concepts
  • Monitoring Workspaces
  • What are Metrics?
  • Exploring Workspaces and Metrics
  • Monitoring Agent
  • Monitoring API and CLI usage
  • GKE Metrics - Master to individual containers
  • Uptime Checks
  • Establishing Human-Actionable and Automated Alerts

Milestone: Spies Everywhere! (Check Those Vitals!)

Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring

Logging Activities

Section Introduction

Logging Activities: See next headings

Cloud Logging Fundamentals

What is Cloud Logging?

  • Cloud Operations service for storing, viewing, and interacting with logs:
    • Reading and writing log entries
    • Query logs
    • Export to other services (internal to GCP and external)
    • Create metrics from logs
  • Interact with Logs Viewer and API
  • Multiple log types available
  • Logs used by other Cloud Operations services (debug, error reporting, etc)

What is a log?

  • Record of status or event (string format)
    • "What happened?"
  • Log Entry - individual logs in a collection
  • Log Payload - contents of the Log Entry
    • Contains nested Fields

Logs Retention Period

  • Varies by log type:
    • Admin Activity, System Event, Access Transparency
      • 400 days
      • Non-configurable
    • All other log types:
      • 30 days by default
      • Configurable retention period

IAM Roles

  • Generic and service account varieties
  • Service Account:
    • Logs Writer: Write logs, no view permissions
    • Logs Bucket Writer: Write logs to logs buckets
  • Logs Viewer - View logs except Data Access/Access Transparency (known as private logs)
  • Private Logs Viewer - View all of the above
  • Logs Configuration Writer - Create logs-based metrics, buckets, views and export sinks
    • 'Change configurations'
  • Logging Admin - Full access to all logging actions
  • Project Viewer - View all logs except Data Access/Access Transparency
  • Project Editor - Write, view, and delete logs. Create logs-based metrics
    • Cannot create export sinks or view Data Access/Access Transparency logs
  • Project Owner - all logging-based permissions

Log Types and Mechanics

Scope of Collecting and Viewing Logs

  • Scoped by project
  • View project-1 logs in project-1
  • No built-in "single pane of glass"
  • Can export logs org-wide or multiple projects

Log Types - Primary Categories

Security logs vs. non-security logs; always enabled (non-configurable) vs. manually enabled (configurable):

  • Always Enabled/Required
    • No change
    • 400 days retention
  • Manually Enabled logs
    • Charged based on log amount
    • 30 days retention (configurable)

The above categories overlap

Security Logs

Audit logs and Access transparency logs

  • "Who did what? where? and when?"
  • Also accessible via Activity Log

Admin Activity | System Event | Data Access

Admin Activity

  • Records user-initiated resource configuration
  • "GCE instance created by (user)"
  • "GCS Bucket deleted by (user)"
  • Always Enabled

System Event

  • Admin (non-user) initiated configuration calls
  • Always Enabled

Data Access

  • Record configuration (create/modify/read) of resource data
  • "Object (x) was created in bucket (y) by (users)"
  • Must be manually enabled (except BigQuery)
  • Not applicable to public resources

Access Transparency Logs

  • Only applicable for Enterprise or paid support plans
  • Logs of Google personnel access to your resources/data
    • Example: Support request for VM instance
    • Records action and access of support personnel
  • Always Enabled for applicable support plans
| Log Type | System or User Initiated | Records What? | Default Setting |
| --- | --- | --- | --- |
| Admin Activity | User-initiated | Resource configuration | Always enabled |
| System Event | System-initiated | Resource configuration | Always enabled |
| Data Access | User-initiated | Resource data configuration | Manually enabled |
| Access Transparency | User-initiated | Google personnel access | Always enabled (on applicable support plans) |

'Everything Else' Logs

Logs to Debug, Monitor and Troubleshoot:

  • Chargeable
  • User Logs - generated by software/applications
    • Require Logging Agent
  • Platform logs - logs generated by GCP services
    • Example: GCE startup script
  • VPC Flow Logs
  • Firewall Logs

Logs Pricing and Retention

  • Always Enabled logs have no charge with 400 days retention
    • Admin Activity, System Event, Access Transparency
  • ALL other logs are chargeable with configurable retention period (default 30 days)
  • Pricing = $0.50/GB

Cloud Logging Tour

Data access logs - Add, edit, view object in a bucket

  • Enabled through IAM > Audit Logs
    • To enable on single service, find the service e.g. Google Cloud Storage, tick Admin Read, Data Read, Data Write
      • Can add exempted users e.g. Admin user

Logging Agent Concepts

  • Agent captures additional VM logs
    • OS logs/events
    • 3rd Party application logs
  • The logging agent is based on fluentd (an open source data collector)
  • Only applicable to GCE and EC2 (AWS)
    • GKE uses Cloud Operations for GKE

Configuring the Agent

  • Per Google: The "out of the box" setup covers most use cases
  • Default installation/configuration covers:
    • OS Logs
      • Linux - syslog
      • Windows - Event viewer
    • Multiple 3rd party applications e.g. Apache, nginx, redis, rabbitmq, gitlab, jenkins, cassandra etc

Modifying Agent Logs Before Submission

  • Why modify logs?
    • Remove sensitive data
    • Reformat log fields (e.g. combine two fields into one)
  • Additional configuration "plug-ins" can modify records
  • filter_record_transformer - most common
    • Add/modify/delete fields from logs

Agent Setup Process

  • Add Repo (via provided script)
  • Update repos
  • Install Logging Agent
  • Install configuration files
  • Start the agent

Install Logging Agent and Collect Agent Logs

curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh
sudo apt update
sudo apt-get install google-fluentd
sudo apt install -y google-fluentd-catch-all-config
sudo service google-fluentd start

Logging Filters

Logs Viewer Query Interface

  • View logs through queries
  • Basic and Advanced query interface
  • Basic
    • Dropdown menus - simple searches
  • Advanced
    • View across log categories - advanced search capabilities

Basic and Advanced Filter Queries

  • Different query formats
    • Search field syntax is different for each method
  • Basic query
    • Not case-sensitive
    • Built in field names for some logs

Advanced Filter Boolean Operators

  • Group/Exclude entries
    • AND requires all conditions are met
    • OR requires only one condition to be met
    • NOT excludes condition
  • Order of precedence (i.e. order of operations)
    • NOT -> OR -> AND
    • a OR NOT b AND NOT c OR d = (a OR (NOT b)) AND ((NOT c) OR d)
    • AND is implied

Constructing Advanced Filter Queries

  • Generic text search = just type requested string
  • Searching fields
    • Nested JSON format
      • resource.type="gce_instance"
      • resource.labels.zone="us-central1-a"
  • Search by set severity or greater
    • severity >= WARNING
  • Filter by timestamp
    • timestamp>="2018-12-31T00:00:00Z" AND timestamp<="2019-01-01T00:00:00Z"

Hands-On with Advanced Filters

  • Create advanced search filters
  • Search across log types
  • Use AND, OR and NOT operators
  • Explore new Logging interface

VPC Flow Logs

What are VPC flow Logs?

  • Recorded sample of network flows sent/received by VPC resources
    • Near real-time recording
  • Enabled at the VPC subnet level

Use Cases

  • Network Monitoring - Understanding traffic growth for capacity forecasting
  • Forensics - who are your instances talking to?
  • Real-time security analysis
    • Integrate (i.e. export) with other security products

VPC Flow Logs - Considerations

  • Generates a large amount of potentially chargeable log data
  • Does not capture 100% of traffic:
    • Samples approximately 1 out of 10 packets. This cannot be adjusted
  • TCP/UDP only
  • Shared VPC - all VPC flow logs are in the host project

Firewall Logs

What are Firewall logs?

  • Logs of firewall rule effects (allow/deny)
  • Useful for auditing, verifying, and analyzing effect of rules
  • Applied per firewall rule, across entire VPC
  • Can be exported for analysis

Considerations

  • Logs every firewall connection attempt - best effort basis
  • TCP/UDP protocols only
  • Default "deny all" ingress and "allow all" egress rules are NOT logged

Viewing Deny/Allow All Logs?

  • Create an explicit firewall rule for the denied/allowed traffic you want to view - e.g. a duplicate of the default rule
  • Example: View all SSH attempts from outside of an allowed location
    • Create a rule to deny all TCP:22 access from all locations - enable logging
    • Assign a low priority, e.g. 65534
    • Assign higher priority "ssh-allow" rule for allowed location in source filter

VPC Flow Logs and Firewall Logs Demo

Routing and Exporting Logs

  • Main premise - route a copy of logs from Cloud Logging to somewhere else
    • BigQuery, Cloud Storage, Pub/Sub, another logging bucket and more
  • Can export all logs, or certain logs based on defined criteria

Why Route/Export logs?

  • Long-term retention
    • Compliance requirements
  • Big data analysis
    • Analytics in BigQuery
  • Stream to other applications
    • Pub/Sub connection
  • Route to alternate retention buckets

How Routing/Exports Work

  • 3 components: Sink, Filter, Destination
  • Create a sink
    • Sink = object for the filter/destination pairing
  • Create filter (query) of logs to export
    • Can also set exclusions within filter
  • Set destination for matched logs
    • Export will only capture new logs since the export was created, not previous ones
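
A project-level sketch of the sink/filter/destination trio via the Cloud Logging Python client (the sink name, filter, and dataset are hypothetical; an org-level command-line version follows below):

from google.cloud import logging

client = logging.Client()
# Sink = filter + destination; here, firewall logs routed to BigQuery
sink = client.sink(
    "firewall-to-bq",
    filter_='logName:"compute.googleapis.com%2Ffirewall"',
    destination="bigquery.googleapis.com/projects/my-project-id/datasets/log_export",
)
sink.create()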

Export logs across Folder/Organization

Must use command line (or terraform), cannot create via the web console

gcloud logging sinks create my-sink \
storage.googleapis.com/my-bucket --include-children \
--organization=(organization-ID) --log-filter="logName:activity"

Logging Export IAM Roles

  • Owner/Logs Configuration Writer - create/edit sink
  • Viewer/Logs Viewer - view sink
  • Project Editor does NOT have access to create/edit sinks

Export Logs to BigQuery

In this Demo...

Export firewall logs to BigQuery, and analyze access attempts

  • Create a sink
  • Filter by firewall logs
  • Set BigQuery dataset as destination
  • Run queries in BigQuery to view denied access attempts
#standardSQL
SELECT  
jsonPayload.connection.src_ip,
jsonPayload.connection.dest_port,
jsonPayload.remote_location.continent,
jsonPayload.remote_location.country,
jsonPayload.remote_location.region,
jsonPayload.rule_details.action
FROM `log_export.compute_googleapis_com_firewall` 
ORDER BY jsonPayload.connection.dest_port

Logs-Based Metrics

What are logs-based metrics?

  • When is a log not just a log?
    • When it is also a Cloud Monitoring metric!
  • Cloud Monitoring metrics based on defined log entries
    • Example: number of denied connection attempts from firewall logs
  • Metric is created each time log matches a defined query
  • System (auto-created) and User-defined (custom) varieties

Types of logs-based metrics

Counter and Distribution

Counter:

  • Counter of logs that match an advanced logs query
  • Example: number of logs that match specific firewall log query
  • All system logs-based metrics are counter type

Distribution:

  • Records distribution of values via aggregation methods

From the Logs Viewer > Create Metric, or the Logs-based Metrics menu
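
A sketch of a user-defined counter metric via the Cloud Logging Python client (the filter is illustrative of the firewall-deny example above):

from google.cloud import logging

client = logging.Client()
# Counter metric: +1 each time a log entry matches the filter
metric = client.metric(
    "denied-connections",
    filter_='resource.type="gce_subnetwork" AND jsonPayload.disposition="DENIED"',
    description="Count of denied firewall connections",
)
metric.create()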

Section Review

Milestone: Let the Record Show

Custom logs based distribution metrics

  • More powerful: collects a numeric value for each event and shows you how those values are distributed over the set of events
    • A common use is to track latency
      • From each event received, a latency value is extracted from the log entry and added to the distribution
        • The distribution view is fundamentally about percentiles

Hands-On Lab: Install and Configure Logging Agent on Google Cloud

SRE and Alerting Policies

SLOs and Alerting Strategy

Required Reading

Alerts Review - Why we need them

  • Something is not working correctly
  • Action is necessary to fix it
  • Alerts inform relevant personnel that action is necessary when specified conditions are met

Alerts and SRE

  • Continued errors = Danger of violating SLAs/SLOs (error budget being used up)
  • If issue isn't fixed, error budget will be used up
  • Proper alerting policy based on Service Level Indicators (SLIs) enables us to preserve error budget
  • Balance multiple alerting parameters

Precision | Recall | Detection time | Reset time

  • Precision: Rate of 'relevant' alerts vs. low priority events
    • Does this event require immediate attention?
  • Recall: Percent of significant events detected
    • Was every 'real' event properly detected? Did we miss some?
  • Detection time: Time taken to detect significant issue
    • Longer detection time = more accurate detection, but a longer duration of errors before action is taken
  • Reset time: How long alerts persist after issue is resolved
    • Longer reset time = confusion/'white noise'

How do we balance these parameters?

  • Window Length: Time period measured
    • % of errors over (x) time period
      • Example: average CPU utilization per minute vs. per hour
      • Small windows = faster alert detection, but more 'white noise'
        • CPU averaging 80% for 1-minute window
        • Hurts precision, helps recall
      • Longer windows = More precise ('real' problem vs. white noise)
        • CPU Averaging 95% for 1-hour window
        • Longer detection time
        • Good precision, poor detection time
        • Once problem determined, more error budget may already be used up
  • Duration: How long something exceeds SLIs before 'significant' event declared
    • In the alert policy, the 'For' field (e.g. 1 minute) sets the duration
    • Short 'blips' vs. sustained errors over a longer time period
      • Reduced 'white noise'
      • Poor recall, good precision
        • An outage of 2 minutes never triggers a 5-minute duration setting
        • Misses massive spikes in errors over shorter durations
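
The window/duration trade-off is often handled with burn-rate math: how fast current errors are consuming the error budget. A sketch with illustrative numbers (the 14.4x fast-burn threshold is taken from the SRE workbook's multi-window alerting examples; the slow-burn threshold here is illustrative):

# Error-budget burn rate for a 99.9% SLO (values illustrative)
slo = 0.999
budget_fraction = 1 - slo                # 0.1% of requests may fail

def burn_rate(errors, requests):
    """How many times faster than 'exactly on budget' we are burning."""
    return (errors / requests) / budget_fraction

# Fast burn over a short window -> page someone immediately
page = burn_rate(errors=150, requests=10_000) >= 14.4
# Slow burn over a long window -> file a ticket
ticket = burn_rate(errors=3_000, requests=1_000_000) >= 3
print(page, ticket)  # True, True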

Optimal Solution - Multiple conditions/notification channels

  • No one single alerting policy/condition can properly cover all the scenarios

Google Recommends (Condensed from the multi-page doc listed in required reading)

  • Multiple conditions
    • Long and short time windows
    • Long and short durations
  • Multiple notification channels based on severity
    • Low priority anomalies to reporting system
      • Pub/Sub topic to analysis application to look for trends
      • No immediate human interaction required
    • Major (customer-impacting) events to on-call team
      • Requires immediate escalation

Service Monitoring

wget https://raw.githubusercontent.com/linuxacademy/content-gcpro-devops-engineer/master/scripts/app-engine-quick-deploy.sh
source app-engine-quick-deploy.sh

Not in exam, but interesting

GCP Console > Operations > Services

Create SLO

  • Latency (App Engine)
  • Request-based - simply counts individual events (window-based is more advanced: good minutes vs. bad minutes, entries above/below the SLI)
  • Define SLI
    • Latency threshold - 200ms - the response time our requests must stay under to count as good for the SLI
  • Set your SLO based on the SLI
    • Compliance period: Calendar (e.g. 1 day) or Rolling (any 24-hour period)
    • Performance goal: 80% of requests with a good response time, e.g. 80% of our customer requests must be 200ms or less
      • Raising this value, e.g. to 99%, will shrink the error budget, because some requests exceed the 200ms response time
  • Name: Average Customer latency - 80% SLO

Milestone: Come Together, Right Now, SRE

The services console is really valuable, because when setting the SLO, GCP already has the historical data and can show you instant feedback when determining your SLO (P95 value) - automation built into the GCP console reduces the risk of human error in manual calculation.

  • Really great tool - no manually calculating error budgets; GCP does it all for you!

Optimize Performance with Trace/Profiler

Section Introduction

What the Services Do and Why They Matter

Cloud Trace

"A Distributed tracing system that collects latency data from your applications and displays it in the GCP Console"

Operational Management

  • Latency Management

Google's 4x Golden Signals

  • Latency

Cloud Trace: Primary Features

  • Works with App Engine, VMs, and containers (e.g. GKE, Cloud Run)
  • Shows general aggregated latency data
  • Shows performance degradations over time
  • Identifies bottlenecks
  • Alerts automatically if there's a big shift
  • SDK supports Java, Node.js, Ruby, and Go
  • API Available to work with any source

Cloud Profiler

"Continuously analyzes the performance of CPU or memory-intensive functions executed across an application"

Cloud Profiler: Primary Features

  • Improve performance
  • Reduce costs
  • Supports Java, Node.js, Python and Go
  • Agent-based
  • Extremely low-impact
  • Profiles saved for 30 days
  • Export profiles for longer storage
  • Free!

Types of Profiling Supported

| Profile Type | Go | Java | Node.js | Python |
| --- | --- | --- | --- | --- |
| CPU time | X | X | - | X |
| Wall time | - | X | X | X |
| Heap | X | X | X | - |
| Allocated heap | X | - | - | - |
| Contention | X | - | - | - |
| Threads | X | - | - | - |

CPU Time - Time the processor spends running whatever function (in code) is being processed
Wall Time - Total ('wall clock') time elapsed between entering and exiting a function, including all wait time, locks, and thread synchronization
Heap - Amount of memory allocated in the program's heap at the instant the profile is collected
Allocated Heap - Total amount of memory allocated in the program's heap during the interval between one collection and the next; this value includes any memory that has since been freed or is no longer in use
Contention - Go-specific - Profiles mutex (mutual exclusion lock) contention for data accessed across concurrent processes; determines the time spent waiting for mutexes and the frequency at which contention occurs
Threads - Go-specific - Profiles thread usage, capturing information on goroutines and Go's concurrency mechanisms

Tracking Latency with Cloud Trace

Set up a sample app that utilizes a Kubernetes Engine cluster and three different services to handle a request and generate a response, all the while tracking the latency with Cloud Trace. Once we've made a number of requests, we'll examine the results in the Cloud Trace console, checking the overall latency of the request as well as breaking it down into its component parts.

Traces are gathered at every step, e.g. across each of the 3x load balancers

Enable the Kubernetes Engine API
Enable the Cloud Trace API

Python Code for Cloud Trace
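
A minimal OpenCensus setup sketch along these lines (OpenCensus is the Python instrumentation option in the table later in this section; the project ID is hypothetical):

from opencensus.ext.stackdriver import trace_exporter
from opencensus.trace.samplers import AlwaysOnSampler
from opencensus.trace.tracer import Tracer

# Export spans to Cloud Trace and sample every request
exporter = trace_exporter.StackdriverExporter(project_id="my-project-id")
tracer = Tracer(exporter=exporter, sampler=AlwaysOnSampler())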


Python Code for Cloud Trace Middleware
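
For the middleware piece, a sketch assuming a Flask service (the opencensus-ext-flask package traces every incoming request automatically):

from flask import Flask
from opencensus.ext.flask.flask_middleware import FlaskMiddleware
from opencensus.ext.stackdriver import trace_exporter
from opencensus.trace.samplers import AlwaysOnSampler

app = Flask(__name__)
# Wrap the app so each incoming HTTP request opens and exports a span
middleware = FlaskMiddleware(
    app,
    exporter=trace_exporter.StackdriverExporter(project_id="my-project-id"),
    sampler=AlwaysOnSampler(),
)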


Python Code for Cloud Trace Execute Trace
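
Executing a trace then amounts to opening spans around the work, using the tracer from the setup sketch above; nested spans become the waterfall's component parts (the helper calls are hypothetical):

# Each handled request produces a trace; nested spans break out the steps
with tracer.span(name="handle-request"):
    with tracer.span(name="call-service-b"):
        response_b = call_service_b()   # hypothetical downstream call
    with tracer.span(name="render-response"):
        body = render(response_b)       # hypothetical helper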

Every time the app executes, the trace executes

Dashboard outputs the full trace, with waterfall graph showing component parts (spans - to help identify latency bottlenecks)

Accessing the Cloud Trace APIs

Instrumenting Your Code for Cloud Trace

3x methods

| Client Libraries | OpenTelemetry | OpenCensus |
| --- | --- | --- |
| Ruby | Node.js, Python (now) | Python |
| Node.js | Go | Java |
| C# ASP.NET Core | In active development | Go |
| C# ASP.NET | Recommended by Google | PHP |

Comparing APIs

| v1 | v2 |
| --- | --- |
| Send traces to Cloud Trace | Send traces to Cloud Trace |
| Update existing traces | - |
| Get lists of traces | - |
| Get the details of a single trace | - |
| Supports v1 REST and v2 REST as well as v1 and v2 RPC | Supports v1 REST and v2 REST as well as v1 and v2 RPC |
| A trace is represented by a Trace resource (REST) and Trace message (RPC) | No explicit trace object; include the trace ID in the span to identify it |
| A span is represented by a TraceSpan resource (REST) and TraceSpan message (RPC) | A span is represented by the Span resource |
| Uses labels fields | Uses attributes fields |

Setting Up Your App with Cloud Profiler

Cloud Profiler Supported Environments

  • Compute Engine
  • App Engine Standard Environment
  • App Engine Flexible Environment
  • Kubernetes Engine
  • Outside of Google Cloud

Python Code to enable Cloud Profiler to track CPU
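
A sketch of the documented start-up call from the google-cloud-profiler Python agent (the service name and version are hypothetical labels):

import googlecloudprofiler

# Start the profiling agent early in startup; it samples continuously
# in the background with very low overhead.
try:
    googlecloudprofiler.start(
        service="acg-demo-service",
        service_version="1.0.0",
        # project_id is only required when running outside Google Cloud:
        # project_id="my-project-id",
    )
except (ValueError, NotImplementedError) as exc:
    print(exc)  # a profiler failure should not take down the app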

3 Steps to Setting up Cloud Profiler Outside GCP

  • Enable the Profiler API inside the GCP project
  • Get credentials for the profiling agent
    • Service account with private-key auth
    • Application default credentials (ADC)
  • Configure the agent
    • Pass in the project ID via a Config() or similar method

Analyzing Cloud Profiler Data

As Go is the most supported language, the demo in this video uses Go.

TLDR: Analyze code with graphs - I/O, CPU, locks, etc. Useful if you're developing and running your apps within GKE or elsewhere on GCP

Section Review

Cloud Trace: Latency - demo app with 3x load balancers, spans, OpenTelemetry
Cloud Profiler: Continuously analyzing CPU and memory performance, supported services, code necessary for implementation; examine flame graphs for profile types (threads, alloc, etc.)

Milestone: It All Adds Up!

Hands-On Lab: Discovering Latency with Google Cloud Trace

  • Stand-up infrastructure
gcloud services enable cloudtrace.googleapis.com
gcloud services enable container.googleapis.com
gcloud container clusters create cluster-1 --zone us-central1-c
gcloud container clusters get-credentials cluster-1 --zone us-central1-c # If created through the WebUI
kubectl get nodes
  • Update [PROJECT_ID] in ./trace/acg-service-*.yaml*.template within the cloned repo

  • Build Container Image, Tag and push to container registry

docker build -t acg-image .
docker tag acg-image gcr.io/discovering--219-46060d10/acg-image
docker push gcr.io/discovering--219-46060d10/acg-image

I broke the deployment by changing the tag in the deployment. The tag is referenced in the deployment spec under ./trace/app/acg-service-(a|b|c).yaml

kubectl Cheatsheet

You can force refresh containers after fixing (re-tag, re-push) the issue

cloud_user_p_cff0f65a@cloudshell:~/content-gcpro-operations/trace (discovering--219-46060d10)$ kubectl rollout restart deployment/cloud-trace-acg-c
deployment.apps/cloud-trace-acg-c restarted
cloud_user_p_cff0f65a@cloudshell:~/content-gcpro-operations/trace (discovering--219-46060d10)$ kubectl get all
NAME                                     READY   STATUS              RESTARTS   AGE
pod/cloud-trace-acg-a-5f49648db7-ztjr5   1/1     Running             0          20s
pod/cloud-trace-acg-b-6966b8cb56-ztnvh   1/1     Running             0          4s
pod/cloud-trace-acg-b-bb7c98995-bfgld    0/1     Terminating         0          16m
pod/cloud-trace-acg-c-66497849b7-xlvjm   1/1     Running             0          16m
pod/cloud-trace-acg-c-6d5d5b489d-bwdbr   0/1     ContainerCreating   0          2s
  • Run setup.sh script
bash setup.sh
  • Test the app
curl $(kubectl get svc cloud-trace-acg-c -ojsonpath='{.status.loadBalancer.ingress[0].ip}')

Identifying Application Errors with Debug/Error Reporting

Section Introduction

Troubleshooting with Cloud Debugger

"Inspect code in real time, without stopping or slowing it down" (killer feature)

  • Multiple Source Options
    • Cloud Source repositories
    • GitHub
    • BitBucket
    • GitLab
  • Code Search
    • Quickly find code in a specific file, function, method or by line number
  • Code Share
    • Debug sessions can be shared with any teammate for collaborative debugging
  • IDE Integration
    • Integrates with IntelliJ IDE, as well as VSCode and Atom

Completely Free to Use

Cloud Debugger Language Support Per Service

Key Workflows

Two Key Cloud Debugger Tools

Snapshots vs. Logpoints
| Snapshots | Logpoints |
| --- | --- |
| Captures application state at a specific line location | Inject logs into running apps without redeployment |
| Captures local variables | Logpoints remain active for 24 hours if not deleted or the service is not redeployed |
| Captures call stack | Supports canarying |
| Take snapshots conditionally (Java, Python, and Go) | Add logpoints conditionally |
| Supports canarying | Output sent to the target's appropriate environment |

Demo of Code debugging in Python

Establishing Error Reporting for Your App

Error Reporting Supported Languages

  • php

  • Java

  • Python

  • .NET

  • Node.js

  • Ruby

  • Go

  • Compute Engine

    • Cloud Logging
    • Error Reporting API
  • App Engine

    • Automatic set-up
    • Standard Environment:
      • Additional set-up may be necessary
      • Only errors with a stack trace are processed
    • Flexible Environment:
      • No additional setup required
      • Analyzes messages written to stderr
  • Cloud Run

    • Automatic setup
    • Analyzes messages written to stderr, stdout, or other logs that have stack trace
  • Kubernetes Engine

    • Use Cloud Logging
    • Use Error Reporting API
  • Cloud Functions

    • Automatic setup
    • Unhandled JavaScript exceptions processed

Working with the Error Reporting API in Python

  1. Import error_reporting library
  2. Instantiate Client()
  3. Call report() method, passing in any string
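
A minimal sketch following those three steps (the error text and failing call are stand-ins):

from google.cloud import error_reporting

client = error_reporting.Client()

# Step 3a: report any string directly
client.report("Something allegedly went wrong!")

# Step 3b: or report the current exception, stack trace included
try:
    raise RuntimeError("demo failure")  # stand-in for real work
except RuntimeError:
    client.report_exception()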

Python example in Video

Managing Errors and Handling Notifications

Error Reporting Entries Explained

Occurrences (within selected time/day range)
Users affected (within selected time/day range)
Error (extracted from stack trace with code location, linked if in CSR)
Seen in (service and version, if any)
First seen (time/date first appeared)
Last seen
Response code (HTTP status code, if any)
Link to issue URL (optional text field)

Error Reporting Notifications

  • Enabled per project
  • Must have Project Owner, Project Editor, Project Viewer role, or custom role with cloudnotifications.activities.list permission
  • Sent to email of specified roles
  • May be forwarded to an alias or Slack channel
  • Send to mobile app, if enabled and subscribed (Cloud Console App)

Section Review

Milestone: Come Together - Reprise (Debug Is De Solution)

Hands-On Lab: Correcting Code with Cloud Debugger

gcloud services list --available | grep -i debug
gcloud services enable clouddebugger.googleapis.com
git clone https://github.com/linuxacademy/content-gcpro-operations
cd content-gcpro-operations/debugger/
gcloud app deploy # This failed first time, as GCP hadn't finished setting up services, just re-run and was successful
gcloud app browse # Opens link to the app
  • Update the code, because "guru" fails to be returned properly reversed

Fix code using debugger, set info point and use breakpoints

Course Conclusion

Milestone: Are We There, Yet?

Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep