Linux Academy takedown by CC big brains

This commit is contained in:
Alex Soul 2021-02-09 17:35:47 +00:00
parent 172df4a0f0
commit 1bec7162af
3 changed files with 56 additions and 1 deletions


@ -14,6 +14,12 @@ PCDE are responsible for efficient development operations that can balance servi
Useful Links
[1 - Medium Blog](https://sathishvj.medium.com/notes-from-my-google-cloud-professional-devops-engineer-certification-exam-60d23aca37f5)
[2 - Exam Portal - Book Exam](https://www.webassessor.com/wa.do?page=login&branding=GOOGLECLOUD)
[3 - Sample Questions](https://docs.google.com/forms/d/e/1FAIpQLSdpk564uiDvdnqqyPoVjgpBp0TEtgScSFuDV7YQvRSumwUyoQ/viewform)
[4 - More Links and Resources from \[1\]](https://github.com/sathishvj/awesome-gcp-certifications/blob/master/professional-cloud-devops-engineer.md)
[5 - Google SRE](https://sre.google/sre-book/release-engineering)
[6 - Google SRE](https://sre.google/sre-book/monitoring-distributed-systems/)
[7 - GCP Free Practice Questions](https://myblockchainexperts.org/gcpfreepracticequestions/)
### What is the business of Software Development?


@ -606,8 +606,25 @@ Why do we care about Alerts?
- No one wants to endlessly stare at dashboards for something to go wrong
- Solution: Use Alerting Policy to notify you if something goes wrong
#
Alerting Policy Components
- Conditions - describes conditions to trigger an alert
- Metrics threshold exceeded/not met
- Create an incident when thresholds are violated
- Notifications - who to notify when the alerting policy is triggered
- (optional) - Documentation - included in notifications with action steps
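The three components above correspond to fields on the Cloud Monitoring API's AlertPolicy resource; a minimal sketch as a request body (the metric filter, threshold, and channel ID are placeholder assumptions, not values from these notes):

```python
# Sketch of an AlertPolicy body for the Cloud Monitoring API (v3).
# All concrete values (filter, threshold, channel ID) are placeholders.
alert_policy = {
    "displayName": "High CPU utilization",
    # Conditions: what triggers the alert (metric threshold exceeded)
    "conditions": [{
        "displayName": "CPU above 80% for 5 minutes",
        "conditionThreshold": {
            "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0.8,
            "duration": "300s",
        },
    }],
    # Notifications: who to notify when the policy triggers
    "notificationChannels": [
        "projects/my-project/notificationChannels/CHANNEL_ID",
    ],
    # Optional documentation included in notifications with action steps
    "documentation": {
        "content": "CPU is high; check for runaway processes before restarting.",
        "mimeType": "text/markdown",
    },
}
```

Creating a policy from a body like this requires one of the Monitoring roles listed below.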
Incident Handling
- Alerting event occurs when alerting policy conditions are violated
- Creates Incident in Open state
- Incident can then be Acknowledged (investigated) and Closed (resolved)
Alerting Policy IAM Roles
- Uses Cloud Monitoring roles to create an alerting policy
- Monitoring Editor, Admin, Project Owner
- Monitoring Alert Policy Editor - minimal permissions to create an alert via Monitoring API
> GCP creates actual incidents under alerts, which you can then acknowledge and "resolve"
#### Section Review

qwiklabs_1.md (new file)

@ -0,0 +1,32 @@
1. Reliability is the most important feature
2. Users, not monitoring, decide reliability
3. Well-engineered...
1. software = 99.9%
2. operations = 99.99%
3. business = 99.999%
> Each additional "9" improves reliability by 10x, but as a rough rule of thumb it also costs the business 10x more
It is best to think of reliability in the inverse, i.e. how much downtime is permissible within a given timespan (the error budget)
28-day error budget
99.9% = 40 minutes - Human manageable e.g. Humans see alerts, respond and fix them
99.99% = 4 minutes - System needs to detect and self-heal complete outages, because there is not enough time to loop in a human
99.999% = 24 seconds - Restrict the rate of change; the margin for error is tiny, and you would probably need to rebuild the monitoring system from the ground up, because metrics are not available at that timescale
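The downtime figures above fall out of simple arithmetic on the 28-day window; a quick check:

```python
# Downtime allowed in a 28-day window for each reliability target.
WINDOW_MINUTES = 28 * 24 * 60  # 40,320 minutes

def error_budget_minutes(slo: float, window_minutes: int = WINDOW_MINUTES) -> float:
    """Minutes of downtime permitted while still meeting the SLO."""
    return (1 - slo) * window_minutes

print(round(error_budget_minutes(0.999), 1))      # ~40.3 minutes
print(round(error_budget_minutes(0.9999), 1))     # ~4.0 minutes
print(round(error_budget_minutes(0.99999) * 60))  # ~24 seconds
```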
> Reliability of a system is its most important feature
SLO: Service Level Objective
- If reliability is a feature, when do you prioritize it vs other features?
- Useful for Product owners and Execs
- Setting a target gives every part of the organisation the ability to determine whether the system is reliable
- Acknowledging that a specific quantity of unreliability is acceptable provides a budget for failure that can be spent on developing and launching new features; the remaining budget provides a signal to feed into your planning cycles, ensuring that work to improve reliability is prioritized
- Everyone must agree that the target accurately represents the desired experience of your users
- A problem with building new features quickly is that there's often a strong negative correlation between development velocity and system reliability
- A missed reliability target signals when too many things users care about have been broken by excessive development velocity
- SLOs can help development teams answer the question: when moving fast and breaking things, how fast is too fast?
- If everyone agrees the SLO represents the point at which you are no longer meeting the expectations of your users, then, broadly speaking, being well within SLO is a signal that you can move faster without causing those users pain; conversely, burning most of (or, in the worst case, multiples of) your error budget means you have to lift your foot off the accelerator
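The "how fast is too fast" decision above can be sketched as a burn check against the error budget (the SLO and downtime figures are illustrative assumptions):

```python
# Sketch: deciding whether to slow releases based on error budget burn.
def budget_burned(slo: float, bad_minutes: float,
                  window_minutes: float = 28 * 24 * 60) -> float:
    """Fraction of the error budget consumed so far (can exceed 1.0)."""
    allowed = (1 - slo) * window_minutes  # total downtime the SLO permits
    return bad_minutes / allowed

# Well within SLO: a signal you can move faster without causing users pain.
print(budget_burned(0.999, bad_minutes=10))  # roughly a quarter of the budget
# Budget exhausted: lift your foot off the accelerator.
print(budget_burned(0.999, bad_minutes=50))  # over 1.0, budget blown
```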