#### Hands-On with Advanced Filters

- Create advanced search filters
- Search across log types
- Use AND, OR and NOT operators (example filter below)
- Explore new Logging interface

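
A hedged example of such a filter, combining all three operators in the Logging query language and run via `gcloud logging read` (the resource type, zone and text values are placeholders for illustration):

```
gcloud logging read \
  'resource.type="gce_instance" AND (severity>=ERROR OR textPayload:"timeout") AND NOT resource.labels.zone="us-central1-a"' \
  --limit=10
```
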
#### VPC Flow Logs

<u>What are VPC Flow Logs?</u>

- Recorded sample of network flows sent/received by VPC resources
- Near real-time recording
- Enabled at the VPC subnet level (see below)

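
Since flow logs are enabled per subnet, a minimal sketch of turning them on for an existing subnet (the subnet and region names are placeholders):

```
# Enable VPC Flow Logs on an existing subnet
gcloud compute networks subnets update my-subnet \
    --region=us-central1 \
    --enable-flow-logs
```
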
<u>Use Cases</u>

- Network Monitoring - Understanding traffic growth for capacity forecasting
- Forensics - who are your instances talking to?
- Real-time security analysis
- Integrate (i.e. export) with other security products

<u>VPC Flow Logs - Considerations</u>

- Generates a large amount of potentially chargeable log data
- Does not capture 100% of traffic:
  - **Samples approximately 1 out of 10 packets**. This cannot be adjusted
  - TCP/UDP only
- Shared VPC - all VPC flow logs are in the host project

#### Firewall Logs

<u>What are Firewall Logs?</u>

- Logs of firewall rule effects (allow/deny)
- Useful for auditing, verifying, and analyzing the effect of rules
- Applied per firewall rule, across the entire VPC
- Can be exported for analysis

<u>Considerations</u>

- Logs every firewall connection attempt on a best-effort basis
- TCP/UDP protocols only
- Default "deny all" ingress and "allow all" egress rules are **NOT** logged

<u>Viewing Deny/Allow All Logs?</u>

- Create an explicit firewall rule for the denied/allowed traffic you want to view - e.g. a duplicate of the default rule
- Example: View all SSH attempts from outside of an allowed location (see the sketch below)
  - Create a rule to deny all TCP:22 access from all locations - enable logging
  - Assign a low priority value of 65534
  - Assign a higher-priority "ssh-allow" rule with the allowed location in the source filter

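
A minimal gcloud sketch of that rule pair - the rule names, network and source range are placeholders:

```
# Log every SSH attempt not matched by a higher-priority rule
gcloud compute firewall-rules create ssh-deny-log \
    --network=default --direction=INGRESS --action=DENY \
    --rules=tcp:22 --priority=65534 --enable-logging

# Allow (and log) SSH only from the approved location
gcloud compute firewall-rules create ssh-allow \
    --network=default --direction=INGRESS --action=ALLOW \
    --rules=tcp:22 --source-ranges=203.0.113.0/24 \
    --priority=1000 --enable-logging
```
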
#### VPC Flow Logs and Firewall Logs Demo

#### Routing and Exporting Logs

- Main premise - route a copy of logs from Cloud Logging to somewhere else
  - BigQuery, Cloud Storage, Pub/Sub, another logging bucket and more
- Can export all logs, or certain logs based on defined criteria

<u>Why Route/Export Logs?</u>

- Long-term retention
  - Compliance requirements
- Big data analysis
  - Analytics in BigQuery
- Stream to other applications
  - Pub/Sub connection
- Route to alternate retention buckets

<u>How Routing/Exports Work</u>

- 3 components: Sink, Filter, Destination (see the project-level sketch below)
- Create a sink
  - Sink = object for the filter/destination pairing
- Create a filter (query) of logs to export
  - Can also set exclusions within the filter
- Set a destination for matched logs
- Export will only capture new logs since the export was created, not previous ones

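
As a hedged illustration of those three components at the project level (the sink name, bucket and filter are placeholders; the organization-level form follows below):

```
# Sink (my-storage-sink) + filter (severity>=ERROR) + destination (Cloud Storage bucket)
gcloud logging sinks create my-storage-sink \
    storage.googleapis.com/my-log-archive-bucket \
    --log-filter='severity>=ERROR'
```
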
<u>Export Logs Across a Folder/Organization</u>

Must use the command line (or Terraform); cannot be created via the web console

```
gcloud logging sinks create my-sink \
  storage.googleapis.com/my-bucket --include-children \
  --organization=(organization-ID) --log-filter="logName:activity"
```

<u>Logging Export IAM Roles</u>

- Owner/Logs Configuration Writer - create/edit sinks (example below)
- Viewer/Logs Viewer - view sinks
- Project Editor does NOT have access to create/edit sinks

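
For example, granting a user just the Logs Configuration Writer role rather than full Owner could look like this (the project and user are placeholders):

```
gcloud projects add-iam-policy-binding my-project \
    --member="user:jane@example.com" \
    --role="roles/logging.configWriter"
```
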
#### Export Logs to BigQuery

<u>In this Demo...</u>

Export firewall logs to BigQuery, and analyze access attempts

- Create a sink (see the sketch below)
- Filter by firewall logs
- Set a BigQuery dataset as the destination
- Run queries in BigQuery to view denied access attempts

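
A hedged sketch of creating that sink from the command line, assuming a pre-existing dataset named `log_export` (the name used in the query below) and a placeholder project ID:

```
# Route firewall log entries to the log_export BigQuery dataset
gcloud logging sinks create firewall-to-bq \
    bigquery.googleapis.com/projects/my-project/datasets/log_export \
    --log-filter='logName:"compute.googleapis.com%2Ffirewall"'

# The sink writes as a service account - look it up, then grant it
# BigQuery Data Editor on the dataset before entries start flowing
gcloud logging sinks describe firewall-to-bq --format='value(writerIdentity)'
```
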
```
#standardSQL
SELECT
  jsonPayload.connection.src_ip,
  jsonPayload.connection.dest_port,
  jsonPayload.remote_location.continent,
  jsonPayload.remote_location.country,
  jsonPayload.remote_location.region,
  jsonPayload.rule_details.action
FROM `log_export.compute_googleapis_com_firewall`
ORDER BY jsonPayload.connection.dest_port
```

#### Logs-Based Metrics

<u>What are logs-based metrics?</u>

- When is a log not just a log?
  - When it is also a Cloud Monitoring metric!
- Cloud Monitoring metrics based on defined log entries
  - Example: number of denied connection attempts from firewall logs
- A metric data point is recorded each time a log entry matches the defined query
- System (auto-created) and User-defined (custom) varieties

<u>Types of logs-based metrics</u>

**Counter and Distribution**

Counter:

- Counts log entries that match an advanced logs query
- Example: number of logs that match a specific firewall log query
- All system logs-based metrics are the counter type

Distribution:

- Records the distribution of values via aggregation methods

Created from the Logs Viewer > `Create Metric`, or from the `Logs-based Metrics` menu (a command-line sketch follows)

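
A hedged command-line equivalent of that console flow for a user-defined counter metric - the metric name and filter here are placeholders:

```
gcloud logging metrics create denied-ssh-attempts \
    --description="Firewall DENY entries for TCP:22" \
    --log-filter='logName:"compute.googleapis.com%2Ffirewall" AND jsonPayload.rule_details.action="DENY"'
```
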
#### Section Review

#### Milestone: Let the Record Show

Custom logs-based distribution metrics

- More powerful: collects a numeric value for each event and shows how those values are distributed over the set of events
- A common use is to track latency
  - From each event received, a latency value is extracted from the log entry and added to the distribution
- Analyzing a distribution metric is fundamentally about percentiles

#### Hands-On Lab: Install and Configure Logging Agent on Google Cloud

### SRE and Alerting Policies

#### SLOs and Alerting Strategy

<u>Required Reading</u>

- Google SRE Workbook - "Alerting on SLOs"
  - https://sre.google/workbook/alerting-on-slos/

<u>Alerts Review - Why We Need Them</u>

- Something is not working correctly
- Action is necessary to fix it
- Alerts inform relevant personnel that action is necessary when specified conditions are met

<u>Alerts and SRE</u>

- Continued errors = danger of violating SLAs/SLOs (error budget being used up)
  - If the issue isn't fixed, the error budget will be used up
- A proper alerting policy based on Service Level Indicators (SLIs) enables us to preserve the error budget
- Balance multiple alerting parameters:

Precision | Recall | Detection time | Reset time

- Precision: Rate of 'relevant' alerts vs. low-priority events
  - Does this event require immediate attention?
- Recall: Percent of significant events detected
  - Was every 'real' event properly detected? Did we miss some?
- Detection time: Time taken to detect a significant issue
  - Longer detection time = more accurate detection, but errors run for longer before an alert fires
- Reset time: How long alerts persist after the issue is resolved
  - Longer reset time = confusion/'white noise'

<u>How do we balance these parameters?</u>

- Window Length: Time period measured
  - % of errors over (x) time period
  - Example: average CPU utilization per minute vs. per hour
  - Small windows = faster alert detection, but more 'white noise'
    - CPU averaging 80% for a 1-minute window
    - Hurts precision, helps recall
  - Longer windows = more precise ('real' problem vs. white noise)
    - CPU averaging 95% for a 1-hour window
    - Longer detection time
    - Good precision, poor detection time
    - Once the problem is detected, more error budget may already be used up
- Duration: How long something must exceed the SLI before a 'significant' event is declared
  - The `For` field (e.g. 1 minute) is the duration
  - Short 'blips' vs. sustained errors over a longer time period
  - Reduces 'white noise'
  - Poor recall, good precision
    - An outage of 2 minutes with a 5-minute duration is never detected
    - Misses massive spikes in errors over shorter durations

<u>Optimal Solution - Multiple Conditions/Notification Channels</u>

- No single alerting policy/condition can properly cover all scenarios

**Google Recommends (condensed from the multi-page doc listed in required reading)**

- Multiple conditions (a sketch follows this list)
  - Long and short time windows
  - Long and short durations
- Multiple notification channels based on severity
  - Low-priority anomalies to a reporting system
    - Pub/Sub topic to an analysis application to look for trends
    - No immediate human interaction required
  - Major (customer-impacting) events to the on-call team
    - Requires immediate escalation

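
A hedged sketch of one such policy, combining a short-window condition (fast detection) with a long-window condition (high precision). The metric, thresholds, and the use of `gcloud alpha monitoring policies create` are assumptions for illustration, not the course's own example:

```
cat > multi-window-policy.json <<'EOF'
{
  "displayName": "CPU utilization - multi-window",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Short window - fast detection",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.8,
        "duration": "60s",
        "aggregations": [{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_MEAN"}]
      }
    },
    {
      "displayName": "Long window - high precision",
      "conditionThreshold": {
        "filter": "metric.type=\"compute.googleapis.com/instance/cpu/utilization\" resource.type=\"gce_instance\"",
        "comparison": "COMPARISON_GT",
        "thresholdValue": 0.95,
        "duration": "3600s",
        "aggregations": [{"alignmentPeriod": "3600s", "perSeriesAligner": "ALIGN_MEAN"}]
      }
    }
  ]
}
EOF

# Apply the policy (alpha command at the time of writing); attach
# notification channels per severity as described above
gcloud alpha monitoring policies create --policy-from-file=multi-window-policy.json
```
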
#### Service Monitoring

wget https://raw.githubusercontent.com/linuxacademy/content-gcpro-devops-engineer/master/scripts/app-engine-quick-deploy.sh
source app-engine-quick-deploy.sh

Not in exam, but interesting

GCP Console > Operations > Services

Create SLO

- Latency (App Engine)
- Request-based - simply counts individual events (Windows-based is more advanced: good minutes vs. bad minutes, entries above/below the SLI)
- Define the SLI
  - Latency threshold - 200ms - the response time our requests must stay under to count as good within the SLI
- Set your SLO based on the SLI
  - Compliance period: Calendar = 1 day; Rolling = any 24-hour period
  - Performance goal: 80% of requests with a good response time, e.g. 80% of our customer requests must be 200ms or less
    - Raising this value (e.g. to 99%) brings the error budget down, because some requests exceed the 200ms response time
  - Name: Average Customer latency - 80% SLO (see the API sketch below)

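
The demo uses the console, but as a hedged sketch the same SLO could be created against the Service Monitoring API - PROJECT_ID and SERVICE_ID are placeholders, and the exact request body is an assumption based on the API's `basicSli` latency criteria:

```
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/services/SERVICE_ID/serviceLevelObjectives" \
  -d '{
        "displayName": "Average Customer latency - 80% SLO",
        "goal": 0.8,
        "rollingPeriod": "86400s",
        "serviceLevelIndicator": {
          "basicSli": { "latency": { "threshold": "0.2s" } }
        }
      }'
```
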
#### Milestone: Come Together, Right Now, SRE

The Services console is really valuable: when setting the SLO, GCP already has the historical data and can give you instant feedback while determining your SLO (e.g. the P95 value) - automation built into the GCP console reduces the risk of human error in the calculations.

- A really great tool - no manually calculating error budgets, GCP does it all for you!

### Optimize Performance with Trace/Profiler

#### Section Introduction

#### What the Services Do and Why They Matter

<u>Cloud Trace</u>

_"A distributed tracing system that collects **latency data** from your applications and displays it in the GCP Console"_

<u>Operational Management</u>

- Latency Management

<u>Google's 4x Golden Signals</u>

- Latency

<u>Cloud Trace: Primary Features</u>

- Works with App Engine, VMs and containers (e.g. GKE, Cloud Run)
- Shows general aggregated latency data
- Shows performance degradations over time
- Identifies bottlenecks
- Alerts automatically if there's a big shift
- SDK supports Java, Node.js, Ruby and Go
- API available to work with any source

<u>Cloud Profiler</u>

_"**Continuously analyzes** the performance of **CPU or memory-intensive** functions executed across an application"_

<u>Cloud Profiler: Primary Features</u>

- Improve performance
- Reduce costs
- Supports Java, Node.js, Python and Go
- Agent-based (see the sketch after this list)
- Extremely low-impact
- Profiles saved for 30 days
  - Export profiles for longer storage
- Free!

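
Since Profiler is agent-based, the project-level prerequisites before adding the language agent to your code are enabling the API and giving the workload's service account the agent role - a minimal sketch with placeholder project and service account names:

```
# Enable the Cloud Profiler API in the project
gcloud services enable cloudprofiler.googleapis.com

# The workload's service account needs the Cloud Profiler Agent role
gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
    --role="roles/cloudprofiler.agent"
```
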
<u>Types of Profiling Supported</u>

| Profile Type   | Go  | Java | Node.js | Python |
| -------------- | --- | ---- | ------- | ------ |
| CPU time       | X   | X    | -       | X      |
| Wall time      | -   | X    | X       | X      |
| Heap           | X   | X    | X       | -      |
| Allocated Heap | X   | -    | -       | -      |
| Contention     | X   | -    | -       | -      |
| Threads        | X   | -    | -       | -      |

CPU Time - Time it takes the processor to run whatever function (in code) is being processed

Wall Time - Total time (wall clock time); time elapsed between entering and exiting a function, including all wait time, locking and thread synchronization

Heap - Amount of memory allocated in the program's heap at the instant the profile is collected

Heap Allocation - Total amount of memory that was allocated in the program's heap during the interval between one collection and the next. This value includes any memory that was allocated and has either been freed or is no longer in use

Contention - Go-specific - Profiles mutex contention for Go (mutual exclusion locks, data access across concurrent processes). Determines the amount of time spent waiting for mutexes and the frequency at which contention occurs

Threads - Profiles thread usage for Go, capturing information on goroutines and Go concurrency mechanisms

#### Tracking Latency with Cloud Trace

Set up a sample app that utilizes a Kubernetes Engine cluster and three different services to handle a request and generate a response, all the while tracking the latency with Cloud Trace. Once we've made a number of requests, we'll examine the results in the Cloud Trace console and I'll show you how you can check the overall latency of the request as well as breaking it down into its component parts.

Traces are gathered at every step, e.g. across each of the 3x load balancers

Enable the Kubernetes Engine API
Enable the Cloud Trace API

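
Both can be enabled from the console, or - as a minimal sketch - with gcloud:

```
gcloud services enable container.googleapis.com cloudtrace.googleapis.com
```
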
[Python Code for Cloud Trace](./img/cloud_trace_python_code.png)
<br>
[Python Code for Cloud Trace Middleware](./img/cloud_trace_python_code_middleware.png)

#### Accessing the Cloud Trace APIs

#### Setting Up Your App with Cloud Profiler

#### Analyzing Cloud Profiler Data

#### Section Review

#### Milestone: It All Adds Up!

#### Hands-On Lab: Discovering Latency with Google Cloud Trace

### Identifying Application Errors with Debug/Error Reporting

#### Section Introduction

#### Troubleshooting with Cloud Debugger

#### Establishing Error Reporting for Your App

#### Managing Errors and Handling Notifications

#### Section Review

#### Milestone: Come Together - Reprise (Debug Is De Solution)

#### Hands-On Lab: Correcting Code with Cloud Debugger

### Course Conclusion

#### Milestone: Are We There, Yet?

#### Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep