
Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
===============================================================================================

### Introduction 

#### About the Course and Learning Path

Make better software, faster

#### Milestone: Getting Started

### Understanding Operations in Context

#### Section Introduction

#### What Is Ops?

- GCP Defined

  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_

  - Logging Management  
    - Gather Logs, metrics and traces everywhere
      - Audit, Platform, User logs
        - Export, Discard, Ingest

  - Error Reporting
    - So much data, How do you pick out the important indicators?
      - A centralized error management interface that shows current & past errors
      - Identify your app's top or new errors at a glance, in a dedicated dashboard
      - Shows the error details such as time chart, occurrences, affected user count, first- and last-seen dates as well as a cleaned exception stack trace


  - Across-the-board Monitoring
    - Dashboards for built-in and customizable visualisations
      - Monitoring Features
        - Visual Dashboards
    - Health Monitoring
      - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
    - Service Monitoring
      - Set, monitor and alert your teams as needed based on Service Level Objectives (SLOs)

  - SRE Tracking
    - Monitoring is critical for SRE
      - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
        - Pinpoint SLIs and develop an SLO on top of them

  - Operational Management
    - Debugging
      - Inspects the state of your application at any code location in production without stopping or slowing down requests
    - Latency Management
      - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
    - Performance Management
      - Offers continuous profiling of resource consumption in your production applications along with cost management
    - Security Management
      - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

**What is Ops: Key Takeaways**

- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems, quickly

Scratch:
- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management


#### Clarifying the Stackdriver/Operations Connection

- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops name

**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called `Stackdriver <service>`)

"Stackdriver" lives on - in the exam only

Integration + Upgrades

- Complete UI Integrations
  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
- Dashboard API
  - New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
  - Logs can now be retained for up to **10 years** and you have control over the time specified
- Metrics Enhancement
  - In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
- Advanced Alert Routing
  - Alerts can now be routed to independent systems that support Cloud Pub/Sub

#### Operations and SRE: How Do They Relate?

- Lots of questions in Exam on SRE

What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (Founder Google SRE Team)

**Pillars of DevOps**

- Accept failure as normal:
  - Try to anticipate, but...
  - Incidents bound to occur
  - Failures help team learn

- No-fault postmortems &  SLOs:
  - No two failures the same
  - Track incidents (SLIs)
  - Map to Objectives (SLOs)

- Implement gradual change:
  - Small updates are better
  - Easier to review
  - Easier to rollback

- Reduce costs of failures:
  - Limited "canary" rollouts
  - Impact fewest users
  - Automate where possible

- Measure everything:
  - Critical gauge of success
  - CI/CD needs full monitoring
  - Synthetic, proactive monitoring

- Measure toil and reliability:
  - Key to SLOs and SLAs
  - Reduce toil, up engineering
  - Monitor all over time

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLI: _"A carefully defined <u>quantitative</u> measure of some aspect of the level of service that is provided"_

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:
- Request Latency: How long it takes to return a response to a request
- Failure Rate: A fraction of all requests received: (unsuccessful requests/all requests)
- Batch Throughput: Proportion of time that the data processing rate is greater than a threshold
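
The ratio-style SLIs above reduce to simple arithmetic. A minimal sketch in Python, assuming made-up request counts and throughput samples; the function names and numbers are illustrative and do not come from the course:

```python
def failure_rate(unsuccessful: int, total: int) -> float:
    """Failure-rate SLI: unsuccessful requests as a fraction of all requests."""
    if total <= 0:
        raise ValueError("total must be positive")
    return unsuccessful / total


def throughput_sli(rates: list[float], threshold: float) -> float:
    """Batch-throughput SLI: share of samples whose processing rate exceeds the threshold."""
    if not rates:
        raise ValueError("need at least one sample")
    return sum(1 for rate in rates if rate > threshold) / len(rates)


print(failure_rate(5, 1000))  # 0.005
print(throughput_sli([120.0, 80.0, 150.0, 95.0], threshold=100.0))  # 0.5
```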

**Commit to Memory - Google's 4x Golden Signals!**

- Latency
  - The time it takes for your service to fulfill a request
- Errors    
  - The rate at which your service fails
- Traffic
  - How much demand is directed at your service
- Saturation
  - A measure of how close to fully utilized the service's resources are

> **LETS**

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook

SLOs are tied to your SLIs
- Measured by SLI
- Can be a single target value or range of values
- Either `SLI <= target`, or `lower bound <= SLI <= upper bound`
- Common SLOs: 99.5%, 99.9%, 99.99% (4x 9's)
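
To get a feel for what the common targets above permit, the percentage can be converted into minutes of tolerated unavailability. This is a sketch of the arithmetic only; the 30-day window and the function name are assumptions made for illustration, not something the course prescribes:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of unavailability a given SLO target tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


for target in (99.5, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min per 30 days")
```

Each extra nine cuts the tolerated downtime by a factor of ten, which is why the higher targets are so much harder to meet.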

SLI - Metrics over time which detail the health of a service
  - example: `Site homepage request latency < 300ms over last 5 minutes @ 95th percentile`

SLO - Agreed-upon bounds for how often SLIs must be met
  - example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
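
The two examples can be sketched in code: the SLI is evaluated per time window, and the SLO asks how often that SLI held. This is an illustration under assumptions: it uses the nearest-rank percentile (one of several common definitions), and the sample latencies and function names are hypothetical:

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of the given values."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def sli_met(latencies_ms: list[float], limit_ms: float = 300.0) -> bool:
    """SLI for one window: 95th-percentile latency is under the limit."""
    return percentile(latencies_ms, 95) < limit_ms


def slo_met(window_results: list[bool], target: float = 0.999) -> bool:
    """SLO: the SLI held in at least `target` of all windows."""
    return sum(window_results) / len(window_results) >= target


windows = [[120, 180, 250, 210], [150, 900, 950, 400]]
results = [sli_met(window) for window in windows]
print(results)  # [True, False]
print(slo_met(results))  # False
```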

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
- Measure and track SLIs (Measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with dev team

General Availability Phase
- After Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs

**Ops & SRE: Key Takeaways**
- SRE: Operations from a software engineer
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- SLIs are lower than SLO - or - in-between upper and lower bound
- SREs are most active in limited availability and general availability phases


#### Operation Services at a Glance

> 10,000ft view!

- Cloud Monitoring - Provides visibility into the performance, uptime, and overall health of cloud-powered applications
- Cloud Logging - Allows you to store, search, analyze, monitor and alert on log data and events from GCP 

> Gives SREs the ability to evaluate SLIs and keep on track with SLOs

Help make operations run more smoothly, reliably and efficiently

- Cloud Debugger - When and not if the system encounters issues, lets you inspect the state of a running application in real-time without interference
- Cloud Trace - Maps out exactly how your service is processing the various <u>requests</u> and responses it receives, all while tracking latency
- Cloud Profiler - Keeps an eye on your code's performance - Continuously gathers CPU usage and memory-allocation information from your production applications, looking for bottlenecks

Ops Tools Working Together 

**Gather Information**
- Collect signals
  - Through Metrics
    - Apps
    - Services
    - Platform
    - Microservices
  - Logs
    - Apps
    - Services
    - Platform
  - Trace
    - Apps

**Handle Issues**
 - Alerts
 - Error Reporting
 - SLO

**Troubleshoot**

**Display and Investigate**
- Dashboards
- Health Checks
- Log Viewer
- Service Monitoring
- Trace
- Debugger
- Profiler

5x Main Services Are:
- Cloud Monitoring
- Cloud Logging
- Cloud Debugger
- Cloud Trace
- Cloud Profiler

> All work together to gather information, handle issues and troubleshoot, by letting you display the collected signals and analyze them

#### Section Review

Monitoring, troubleshooting and improving application performance.

#### Milestone: The Weight of the World (Teamwork, Not Superheroes)

### Monitoring Your Operations

#### Section Introduction

#### Cloud Monitoring Concepts

> Measures key aspects of your services

- Captures resource and application signal data
  - Metrics
  - Events
  - Metadata

What questions does Cloud Monitoring answer?
- How well are my resources performing?
- Are my applications meeting their SLAs?
- Is something wrong that requires immediate action?

- Workspace - 'Single pane of glass' for viewing and monitoring data across projects
- Installed Agents (optional) - Additional application-specific signal data
- Alerts - Notify someone when something needs to be fixed

#### Monitoring Workspaces Concepts

What is a monitoring Workspace?
- GCP's organization tool for monitoring GCP and AWS resources
  - All monitoring tools live in a Workspace
    - Uptime checks
    - Dashboards
    - Alerts
    - Charts

```
Monitoring Workspace (within a project) ------ (One or More GCP Projects) ------< GCP Projects
```

- Workspace exists in a Host Project - Create project first, then Monitoring Workspace within the project
- Projects can only contain a single Workspace
- Workspace name same as host project (cannot be changed) - Workspace name will be same as `project_id` (worth having a project dedicated to the workspace name)
- Workspace can monitor resources in same projects and/or other projects
- Workspace can monitor multiple projects simultaneously 
- Projects associated with a single workspace
- Projects can be moved from one workspace to another
- Two different workspaces can be merged into a single workspace
- Workspace access other projects' metric data, but data 'lives' in those projects

**Project Grouping Strategies** - No single correct answer

Single workspace can monitor up to 200 projects

- A single workspace for all app-related projects "app1"
  - Pro: Single pane of glass for all application/resource data for single app
  - Con: Access is too broad if you need to restrict dev/prod teams from viewing data in each other's projects
- 2x Workspaces for viewing projects in each tier e.g. Dev/Prod/QA/Test
  - Limited scope so can restrict teams
  - More than one place to investigate should you need to
- Single Workspace per project
  - Maximum isolation
  - More limited view

Monitoring Workspaces IAM roles
- Applied to the workspace project to view added projects' monitoring data
- Monitoring Viewer/Editor/Admin
  - Viewer = Read-only access (view metrics)
  - Editor = Edit workspace, write-access to monitor console and API
  - Admin = Full access including IAM roles
- Monitoring Metric Writer = Service Account role
  - Permit writing data to workspace
  - Does not provide read access

#### Monitoring Workspaces

- Creating a workspace

  - `Operations > Monitoring > Overview`
  - Select a project, and workspace is created for that project 

- Adding projects to a workspace
  - Within Overview > `Settings > GCP Projects Section > Add GCP Projects`

- Moving projects between workspaces
  - Go to the Workspace containing the project you want to move
    - `Settings > 'GCP Projects' Section > Select 3x Little dots on line of project you want to move > 'Move to another workspace'`
      - If you're moving a project where you've created custom dashboards in the current workspace, these will be lost

- Merging workspaces
  - Navigate into the workspace that you want to merge into (`ws-1`):
    - `Settings > MERGE` and select the workspace (`ws-2`) you want to move into your current workspace (`ws-1`)
      - Merge `ws-2` into `ws-1` whereby `ws-2` is deleted as part of the process

Note: Cloud playgrounds will only have a single project

#### Perspective: Workspaces in Context

How can you make all the data easier to manage? e.g. Don't dump everything into one workspace

#### What Are Metrics?

- Workspace provides visibility into everything that is happening in your GCP resources
- Within Workspace, we view information using:
  - Metrics
  - ...viewed in Charts...
  - ...grouped in Dashboards...

What are Metrics?
- Raw data GCP uses to create charts
- Over 1000 pre-created metrics
- Can create custom metrics
  - Built in Monitoring API
  - OpenCensus - open source library to create metrics
- Best practice: don't create a custom metric where a default metric already exists

Anatomy of a Metric
- Value Types - metric data type
  - BOOL - boolean
  - INT64 - 64-bit integer
  - DOUBLE - Double precision float
  - STRING - a string

- Metric Kind - relation of values
  - Gauge - Measure specific instant in time (e.g. CPU Utilization)
  - Delta - Measure change since last recording (e.g. requests count since last data point)
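
The difference between the two kinds is easiest to see with data. A small sketch with hypothetical samples: the gauge values are direct readings, while the delta values are derived here from running totals (the sample numbers and the `to_deltas` helper are made up for illustration):

```python
def to_deltas(totals: list[int]) -> list[int]:
    """Turn running totals into DELTA-style values: the change since the previous point."""
    return [later - earlier for earlier, later in zip(totals, totals[1:])]


# GAUGE: each point is a reading at one instant (e.g. CPU utilization).
cpu_utilization = [0.42, 0.55, 0.48]

# Hypothetical running totals of requests served, sampled once a minute.
request_totals = [100, 130, 130, 190]
print(to_deltas(request_totals))  # [30, 0, 60]
```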

#### Exploring Workspace and Metrics

> Video just does a quick walkthrough of the dashboard and creating a chart with the metrics explorer

#### Monitoring Agent Concepts

- Cloud Monitoring collects lots of metrics with no additional configuration needed
  - "Zero config monitoring"
  - Examples: CPU utilization, network traffic
- More granular metrics can be collected using an optional monitoring agent
  - Memory usage
  - 3rd Party app metrics (Nginx, Apache)

- Separate agents for monitoring and logging
- Monitoring Agent = collectd
- Logging Agent = fluentd

Which GCP (and AWS) Services Need Agents?
- Not all compute services require (or even allow installation of) agents
- Services that support agent installation:
  - Compute Engine
  - AWS EC2 instances (required for all metrics)
- Services which don't support agent installation:
  - <u>Everything else</u>
  - Managed services either have an agent already installed or simply don't require one
  - Google Kubernetes Engine has Cloud Operations for GKE pre-installed

Installing the Agent - General Process
- Add installation location as repo
- Update repos
- Install agent from repo ('apt install ...')
- (Optional) Configure agent for 3rd party application

#### Installing the Monitoring Agent

- Manually install and configure monitoring agent on Apache web server
- Demonstrate how to automate the agent setup process

The below commands will create an instance with a custom web page. The first instance does not have the agent installed, and the second instance is with the agent installed. You will need to allow port 80 on your VPC firewall if it's not already enabled, scoped to your instance tag (command included for reference).

Create the firewall rule, allowing port 80 access on the default VPC (modify if using a custom VPC):

```
gcloud compute firewall-rules create default-allow-http --direction=INGRESS --priority=1000 --network=default --action=ALLOW --rules=tcp:80 --source-ranges=0.0.0.0/0 --target-tags=http-server
```

Create a web server without an agent:
```
gcloud beta compute instances create website-agent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-script.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-agent
```

Create web server and install an agent: (The Automated Example)
```
gcloud beta compute instances create website-agent --zone=us-central1-a --machine-type=e2-micro --metadata=startup-script-url=gs://acg-gcloud-course-resources/devops-engineer/operations/webpage-config-with-agent.sh --tags=http-server --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=website-agent
```

To manually install the agent (with Apache configuration) on an instance:
- Add the agent's package repository
```
curl -sSO https://dl.google.com/cloudagents/add-monitoring-agent-repo.sh
sudo bash add-monitoring-agent-repo.sh
sudo apt-get update
```
- To install the latest version of the agent, run:
```
sudo apt-get install stackdriver-agent
```
- To verify that the agent is working as expected, run:
```
sudo service stackdriver-agent status
```
- On your VM instance, download apache.conf and place it in the directory /opt/stackdriver/collectd/etc/collectd.d/:
```
(cd /opt/stackdriver/collectd/etc/collectd.d/ && sudo curl -O https://raw.githubusercontent.com/Stackdriver/stackdriver-agent-service-configs/master/etc/collectd.d/apache.conf)
```
- Restart the monitoring agent:
```
sudo service stackdriver-agent restart
```

#### Collecting Monitoring Agent Metrics

- Generate traffic to our Apache web server
- View web server metrics in Cloud Monitoring Workspace

> When you have the agent installed, you can see 2x new tabs depending on what's been configured. In the example "Agent" & "Apache" are visible with associated metrics that wouldn't otherwise be there

#### Integration with Monitoring API

What is the Monitoring API?
- Manipulate metrics utilizing Google Cloud APIs
- Accessible via external services
  - Monitoring dashboards in Grafana
  - Export metric data to BigQuery
- Create and share dashboards with programmatic REST and gRPC syntax
  - More efficient than creating dashboards by hand from scratch

External Integration Use Cases
- Keep metrics for long-term analysis/storage
  - Cloud Monitoring retention is limited (six weeks for many metrics, up to 24 months for others)
- Share metric data with other analytic platforms (e.g. Grafana)

How does Metrics Integration Work?
- Short version: Export/Integrate via Cloud Monitoring API
- If exporting metrics:
  - Define metrics with metric descriptor (JSON format)
  - Export via Monitoring API to BigQuery
- If using 3rd Party service (e.g. Grafana), authenticate to Cloud Monitoring API with service account
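
A metric descriptor is just a small JSON document. The sketch below builds one as a Python dict; the field names follow the Monitoring API's MetricDescriptor resource as I understand it, while the metric name `app/queue_depth` and the helper function are hypothetical. Check the API reference before relying on the exact shape:

```python
import json


def build_metric_descriptor(name: str, description: str) -> dict:
    """Build a custom-metric descriptor as a plain, JSON-serializable dict."""
    return {
        "type": f"custom.googleapis.com/{name}",  # custom metrics use this prefix
        "metricKind": "GAUGE",
        "valueType": "DOUBLE",
        "description": description,
        "displayName": name,
    }


# Hypothetical metric, for illustration only.
descriptor = build_metric_descriptor("app/queue_depth", "Items waiting in the work queue")
print(json.dumps(descriptor, indent=2))
```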

![Example BigQuery Export](./img/example_big_query_export.png)
<br>
![Metric Descriptor json](./img/metric_json_descriptor.png)
<br>
![HTTPS REST API call for Grafana](./img/api_format_grafana.png)


Programmatically Create Dashboards

- Create and export dashboard via Monitoring API
- Dashboard configuration represented in JSON format
- Create dashboards and configurations via `gcloud` command or directly via REST API
- `gcloud monitoring dashboards create --config-from-file=[file_name.json]`
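
Because the dashboard configuration is plain JSON, it can be generated rather than hand-written. A minimal sketch, assuming a single-chart grid layout modelled on the public dashboard samples; the helper name is hypothetical and the structure should be verified against the Dashboards API reference:

```python
import json


def build_dashboard(display_name: str, metric_type: str) -> dict:
    """Return a one-chart dashboard definition as a JSON-serializable dict."""
    return {
        "displayName": display_name,
        "gridLayout": {
            "columns": "1",
            "widgets": [
                {
                    "title": metric_type,
                    "xyChart": {
                        "dataSets": [
                            {
                                "timeSeriesQuery": {
                                    "timeSeriesFilter": {
                                        "filter": f'metric.type="{metric_type}"',
                                        "aggregation": {"perSeriesAligner": "ALIGN_MEAN"},
                                    }
                                }
                            }
                        ]
                    },
                }
            ],
        },
    }


dashboard = build_dashboard("CPU overview", "compute.googleapis.com/instance/cpu/utilization")
print(json.dumps(dashboard, indent=2))
```

Redirecting the output to a file gives a config that could then be passed to `--config-from-file`.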

#### Create Dashboards with Command Line

- Create custom dashboards via command with JSON config files
  - Use either the gcloud command or the REST API directly
- Export current dashboard into JSON configuration file

> This is cool, because we can create and hold our dashboards under source control

- Download configuration file
```
wget https://raw.githubusercontent.com/GoogleCloudPlatform/monitoring-dashboard-samples/master/dashboards/compute/gce-vm-instance-monitoring.json
```

- Create dashboard with gcloud command and config file:
```
gcloud monitoring dashboards create --config-from-file=gce-vm-instance-monitoring.json
```

- Do same thing via REST API

```
# Replace YOUR-PROJECT-ID-HERE with your project ID
curl -X POST -H "Authorization: Bearer $(gcloud auth application-default print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
"https://monitoring.googleapis.com/v1/projects/YOUR-PROJECT-ID-HERE/dashboards" -d @gce-vm-instance-monitoring.json
```

- Export current dashboard

Create shell variables for Project ID and Project Number
```
export PROJECT_ID=$(gcloud config list --format 'value(core.project)')

export PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")
```

- Export the current dashboard by dashboard ID using the above variables. Substitute your dashboard ID (`$DASH_ID`) and exported file name where appropriate

```
gcloud monitoring dashboards describe \
projects/$PROJECT_NUMBER/dashboards/$DASH_ID --format=json > your-file.json
```

#### GKE Metrics

- "Enable Cloud Operations for GKE" - Simple tickbox to turn on

- GKE natively integrated with both Cloud Monitoring and Logging
- Ability to toggle 'Cloud Operations for GKE' in cluster settings, enabled by default
- Cloud Operations for GKE replaces older 'Legacy Monitoring and Logging'
- Integrates with Prometheus

What K8s Metrics are Collected?

- k8s_master_component
- k8s_cluster
- k8s_node
- k8s_pod
- k8s_container

If you need a 'one-click' script to build out a demonstration GKE web application from scratch, copy and paste the below command. It will download and execute a script to build out the environment. The process will take about five minutes to complete:

```
wget https://raw.githubusercontent.com/linuxacademy/content-gcpro-devops-engineer/master/scripts/quick-deploy-to-gke.sh

source quick-deploy-to-gke.sh
```

#### Perspective: What's Up, Doc?

- Uptime checks are very valuable

#### Uptime Checks

What are Uptime Checks?

- Periodic request sent to a monitored resource to see whether it responds (i.e. is "up")
- Check uptime of:
  - VMs
  - App Engine services
  - Website URLs
  - AWS Load Balancer

- Create uptime check via Cloud Monitoring
- Optionally, create an alert to notify if the uptime check fails
- IMPORTANT: Uptime checks are subject to firewall access
  - Allow access to the uptime check IP range

#### Establishing Human-Actionable and Automated Alerts

Why do we care about Alerts?
- Sometimes, things break
- No one wants to endlessly stare at dashboards for something to go wrong
- Solution: Use Alerting Policy to notify you if something goes wrong

Alerting Policy Components
- Conditions - describes conditions to trigger an alert
  - Metrics threshold exceeded/not met
  - Create an incident when thresholds are violated
- Notifications - who to notify when the alerting policy is triggered
- (optional) - Documentation - included in notifications with action steps
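
A threshold condition usually has to hold for some duration before it triggers, so that a single spike does not open an incident. A minimal sketch of that logic, assuming evenly spaced samples; the function name, the CPU values and the "N samples in a row" rule are illustrative choices, not Cloud Monitoring's exact evaluation:

```python
def condition_violated(samples: list[float], threshold: float, min_consecutive: int) -> bool:
    """True if the metric stayed above the threshold for `min_consecutive` samples in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False


cpu = [0.62, 0.91, 0.93, 0.95, 0.70]
print(condition_violated(cpu, threshold=0.9, min_consecutive=3))  # True
print(condition_violated(cpu, threshold=0.9, min_consecutive=4))  # False
```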

Incident Handling
- Alerting event occurs when alerting policy conditions are violated
- Creates Incident in Open state
- Incident can then be Acknowledged (investigated) and Closed (resolved)
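
The incident states form a small state machine. The sketch below models it in Python; the transition table is an assumption of this example (in particular, it lets an open incident be closed without being acknowledged first) and the names are hypothetical:

```python
from enum import Enum


class IncidentState(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    CLOSED = "closed"


# Allowed transitions in this sketch (assumption: OPEN may go straight to CLOSED).
TRANSITIONS = {
    IncidentState.OPEN: {IncidentState.ACKNOWLEDGED, IncidentState.CLOSED},
    IncidentState.ACKNOWLEDGED: {IncidentState.CLOSED},
    IncidentState.CLOSED: set(),
}


def advance(current: IncidentState, target: IncidentState) -> IncidentState:
    """Move an incident to `target`, rejecting transitions the sketch does not allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = advance(IncidentState.OPEN, IncidentState.ACKNOWLEDGED)
state = advance(state, IncidentState.CLOSED)
print(state.value)  # closed
```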

Alerting Policy IAM Roles
- Uses Cloud Monitoring roles to create an alerting policy
- Monitoring Editor, Admin, Project Owner
- Monitoring Alert Policy Editor - minimal permissions to create an alert via Monitoring API


> GCP creates actual incidents under alerts, that you can "resolve"

#### Section Review

Monitoring Your Operations

- Cloud Monitoring Concepts
- Monitoring Workspaces
- What are Metrics?
- Exploring Workspaces and Metrics
- Monitoring Agent
- Monitoring API and CLI usage
- GKE Metrics - Master to individual containers
- Uptime Checks
- Establishing Human-Actionable and Automated Alerts

#### Milestone: Spies Everywhere! (Check Those Vitals!)

#### Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring

### Logging Activities

#### Section Introduction

Logging Activities: See next headings

#### Cloud Logging Fundamentals

What is Cloud Logging?
- Cloud Operations service for storing, viewing, and interacting with logs:
  - Reading and writing log entries
  - Query logs
  - Export to other services (internal to GCP and external)
  - Create metrics from logs
- Interact with Logs Viewer and API
- Multiple log types available
- Logs used by other Cloud Operations services (debug, error reporting, etc)

What is a log?
- Record of status or event (string format)
  - "What happened?"
- Log Entry - individual logs in a collection
- Log Payload - contents of the Log Entry
  - Contains nested Fields
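
A log entry with its nested fields can be pictured as a nested dictionary. The sketch below is a simplified, hypothetical entry (the project name and message are made up), together with a helper that walks a dotted path into the nested fields:

```python
log_entry = {
    "logName": "projects/my-project/logs/syslog",  # hypothetical project
    "severity": "WARNING",
    "timestamp": "2019-01-01T00:00:00Z",
    "resource": {
        "type": "gce_instance",
        "labels": {"zone": "us-central1-a"},
    },
    "jsonPayload": {"message": "disk usage above 90%"},
}


def get_field(entry: dict, dotted_path: str):
    """Follow a dotted path such as 'resource.labels.zone' into nested fields."""
    value = entry
    for key in dotted_path.split("."):
        value = value[key]
    return value


print(get_field(log_entry, "resource.labels.zone"))  # us-central1-a
print(get_field(log_entry, "jsonPayload.message"))  # disk usage above 90%
```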

Logs Retention Period
- Varies by log type:
  - Admin Activity, System Event, Access Transparency
    - 400 days
    - Non-configurable
  - All other log types:
    - 30 days by default
    - Configurable retention period

IAM Roles
- Generic and service account varieties
- Service Account:
  - Logs Writer: Write logs, no view permissions
  - Logs Bucket Writer: Write logs to logs buckets
- Logs Viewer - View logs except Data Access/Access Transparency (known as private logs)
- Private Logs Viewer - View all of the above
- Logs Configuration Writer - Create logs-based metrics, buckets, views and export sinks
  - 'Change configurations'
- Logging Admin - Full access to all logging actions
- Project Viewer - View all logs except Data Access/Access Transparency
- Project Editor - Write, view and delete logs. Create logs-based metrics
  - Cannot create export sinks or view Data Access/Access Transparency logs
- Project Owner - all logging-based permissions

#### Log Types and Mechanics

<u>Scope of Collecting and Viewing Logs</u>
- Scoped by project
- View `project-1` logs in `project-1`
- No built-in "single pane of glass"
- Can export logs org-wide or multiple projects

<u>Log Types - Primary Categories</u>

- Security Logs vs. Non-security Logs
- Always Enabled (non-configurable) vs. Manually Enabled (configurable):
  - Always Enabled/Required
    - No charge
    - 400 days retention
  - Manually Enabled logs
    - Charged based on log amount
    - 30 days retention (configurable)
- Above categories overlap

<u>Security Logs</u>

Audit logs and Access transparency logs
- "Who did what? where? and when?"
- Also accessible via Activity Log

Admin Activity | System Event | Data Access

Admin Activity
- Records user-initiated resource configuration
- "GCE instance created by (user)"
- "GCS Bucket deleted by (user)"
- Always Enabled

System Event
- Admin (non-user) initiated configuration calls
- Always Enabled

Data Access
- Record configuration (create/modify/read) of resource data
- "Object (x) was created in bucket (y) by (users)"
- Must be manually enabled (except BigQuery)
- Not applicable to public resources

<u>Access Transparency Logs</u>
- Only applicable for Enterprise or paid support plans
- Logs of Google personnel access to your resources/data
  - Example: Support request for VM instance
  - Records action and access of support personnel
- Always Enabled for applicable support plans

| Log Type            | System or User configured | Records what?               | Default Setting                              |
| ------------------- | ------------------------- | --------------------------- | -------------------------------------------- |
| Admin Activity      | User-initiated            | Resource Configuration      | Always Enabled                               |
| System Event        | System-initiated          | Resource Configuration      | Always Enabled                               |
| Data Access         | User-initiated            | Resource Data Configuration | Manually Enabled                             |
| Access Transparency | User-initiated            | Google personnel access     | Always Enabled (on applicable support plans) |

<u>'Everything Else' Logs</u>

Logs to Debug, Monitor and Troubleshoot:
- Chargeable
- User Logs - generated by software/applications
  - Require Logging Agent
- Platform logs - logs generated by GCP services
  - Example: GCE startup script
- VPC Flow Logs
- Firewall Logs

<u>Logs Pricing and Retention</u>

- Always Enabled logs have no charge with 400 days retention
  - Admin Activity, System Event, Access Transparency
- ALL other logs are chargeable with configurable retention period (default 30 days)
- Pricing = $0.50/GB
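
As a quick worked example of the rate above, the charge scales linearly with the volume of chargeable logs. This sketch uses the $0.50/GB figure quoted in these notes and ignores any free allotment or later price changes; the function name is hypothetical:

```python
PRICE_PER_GB = 0.50  # USD, the rate quoted in these notes; check current pricing


def logging_cost(chargeable_gb: float) -> float:
    """Estimated charge for the given volume of chargeable logs."""
    return chargeable_gb * PRICE_PER_GB


print(logging_cost(200))  # 100.0
```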

#### Cloud Logging Tour

Data access logs - Add, edit, view object in a bucket
- Enabled through IAM > Audit Logs
  - To enable on single service, find the service e.g. Google Cloud Storage, tick Admin Read, Data Read, Data Write
    - Can add exempted users e.g. Admin user

#### Logging Agent Concepts

- Agent captures additional VM logs
  - OS logs/events
  - 3rd Party application logs
- Logging agent is based on fluentd (open source data collector)
- Only applicable to GCE and EC2 (AWS)
  - GKE uses Cloud Operations for GKE

Configuring the Agent
- Per Google: The "out of the box" setup covers most use cases
- Default installation/configuration covers:
  - OS Logs
    - Linux - syslog
    - Windows - Event viewer
  - Multiple 3rd party applications e.g. Apache, nginx, redis, rabbitmq, gitlab, jenkins, cassandra etc

<u>Modifying Agent Logs Before Submission</u>
- Why modify logs?
  - Remove sensitive data
  - Reformat log fields (e.g. combine two fields into one)
- Additional configuration "plug-ins" can modify records
- `filter_record_transformer` - most common
  - Add/modify/delete fields from logs

Agent Setup Process
- Add Repo (via provided script)
- Update repos
- Install Logging Agent
- Install configuration files
- Start the agent

#### Install Logging Agent and Collect Agent Logs

```
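# Add the agent's package repository and refresh the package index
curl -sSO https://dl.google.com/cloudagents/add-logging-agent-repo.sh
sudo bash add-logging-agent-repo.sh
sudo apt-get update
# Install the agent, then the catch-all configuration
sudo apt-get install -y google-fluentd
sudo apt-get install -y google-fluentd-catch-all-config
# Start the agent
sudo service google-fluentd start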
```

#### Logging Filters

<u>Logs Viewer Query Interface</u>

- View logs through queries
- Basic and Advanced query interface
- Basic
  - Dropdown menus - simple searches
- Advanced
  - View across log categories - advanced search capabilities

<u>Basic and Advanced Filter Queries</u>
- Different query formats
  - Search field syntax different for each method
- Basic query
  - Not case-sensitive
  - Built in field names for some logs

<u>Advanced Filter Boolean Operators</u>
- Group/Exclude entries
  - AND requires all conditions are met
  - OR requires only one condition to be met
  - NOT excludes condition
- Order of precedence (i.e. order of operations)
  - NOT -> OR -> AND
  - `a OR NOT b AND NOT c OR d` = `(a OR (NOT b)) AND ((NOT c) OR d)`
  - AND is implied
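
This order is easy to get wrong because most programming languages bind AND more tightly than OR. The sketch below writes out the grouping described above and compares it with how Python would read the same expression; the function names are made up for the comparison:

```python
def logging_order(a: bool, b: bool, c: bool, d: bool) -> bool:
    """a OR NOT b AND NOT c OR d, grouped as described above (NOT, then OR, then AND)."""
    return (a or (not b)) and ((not c) or d)


def python_order(a: bool, b: bool, c: bool, d: bool) -> bool:
    """The same expression under Python's own precedence (not, then and, then or)."""
    return a or not b and not c or d


print(logging_order(True, True, True, False))  # False
print(python_order(True, True, True, False))   # True
```

When in doubt, adding explicit parentheses to a query removes the ambiguity.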

<u>Constructing Advanced Filter Queries</u>
- Generic text search = just type requested string
- Searching fields
  - Nested JSON format
    - `resource.type="gce_instance"`
    - `resource.labels.zone="us-central1-a"`
- Search by set severity or greater
  - `severity >= WARNING`
- Filter by timestamp
  - `timestamp>="2018-12-31T00:00:00Z" AND timestamp<="2019-01-01T00:00:00Z"`
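
The individual comparisons above combine into one query when joined with AND. A small sketch that assembles such a filter string from its parts; the `build_filter` helper is hypothetical and only does string formatting, it does not call any API:

```python
def build_filter(resource_type: str, zone: str, min_severity: str, start: str, end: str) -> str:
    """Join individual comparisons with AND into one advanced filter string."""
    clauses = [
        f'resource.type="{resource_type}"',
        f'resource.labels.zone="{zone}"',
        f"severity>={min_severity}",
        f'timestamp>="{start}"',
        f'timestamp<="{end}"',
    ]
    return " AND ".join(clauses)


print(build_filter("gce_instance", "us-central1-a", "WARNING",
                   "2018-12-31T00:00:00Z", "2019-01-01T00:00:00Z"))
```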

#### Hands-On with Advanced Filters

#### VPC Flow Logs

#### Firewall Logs

#### VPC Flow Logs and Firewall Logs Demo

#### Routing and Exporting Logs

#### Export Logs to BigQuery

#### Logs-Based Metrics

#### Section Review

#### Milestone: Let the Record Show

#### Hands-On Lab: Install and Configure Logging Agent on Google Cloud

### SRE and Alerting Policies

#### SLOs and Alerting Strategy

#### Service Monitoring

#### Milestone: Come Together, Right Now, SRE

### Optimize Performance with Trace/Profiler

#### Section Introduction

#### What the Services Do and Why They Matter

#### Tracking Latency with Cloud Trace

#### Accessing the Cloud Trace APIs

#### Setting Up Your App with Cloud Profiler

#### Analyzing Cloud Profiler Data

#### Section Review

#### Milestone: It All Adds Up!

#### Hands-On Lab: Discovering Latency with Google Cloud Trace

### Identifying Application Errors with Debug/Error Reporting

#### Section Introduction

#### Troubleshooting with Cloud Debugger

#### Establishing Error Reporting for Your App

#### Managing Errors and Handling Notifications

#### Section Review

#### Milestone: Come Together - Reprise (Debug Is De Solution)

#### Hands-On Lab: Correcting Code with Cloud Debugger

### Course Conclusion

#### Milestone: Are We There, Yet?

### Practice Exam / Quiz

#### Google Certified Professional Cloud DevOps Engineer Exam Prep