Site Reliability Engineering – Evolution of Automation


We explore the evolution of automation as we continue studying Google’s Site Reliability Engineering, while Michael, ah, forget it, Joe almost said it correctly, and Allen fell for it.

The full show notes for this episode are available at https://www.codingblocks.net/episode187.

News

Thank you for the new reviews: ASobering, rupeshbende, Mnmbrane, angry_little_hamster, jonsmith1982
- Want to help out the show? Leave us a review!
rupeshbende asks: How do you find time to do this along with your day job and hobbies as this involves so much studying on your part?


Automation

Why Do We Automate Things?


Consistency: Humans make mistakes, even on simple tasks. Machines are much more reliable. Besides, tasks like creating accounts, resetting passwords, and applying updates aren’t exactly fun.
Platform: Automation begets automation; smaller tasks can be tweaked or combined into bigger ones.
- Pays dividends, providing value every time it’s used as opposed to toil which is essentially a tax.
- Platforms centralize logic too, making it easier to organize, find, and fix issues.
- Automation can provide metrics, measurements that can be used to make better decisions.
Faster Repairs: The more often automation runs, the more often it hits the same problems and applies the same solutions, which brings down the average time to fix. The more often the process runs, the cheaper it becomes to repair.
Faster Actions: Automations are faster than humans. Many automations would be prohibitively expensive for humans to do.
Time Saving: It’s faster in terms of actions, and anybody can run it.

If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings. Think The Matrix with less special effects and more pissed off System Administrators.
Joseph Bironas

The Value of SRE at Google

Google has a strong bias for automation because of their scale.
Google’s core is software, and they don’t want to use software where they don’t own the code and they don’t want processes in place that aren’t automated. You can’t scale tribal knowledge.
- They invest in platforms, i.e. systems that can be improved and extended over time.

Google’s Use Cases for Automation

Much of Google’s automation is around managing the lifecycle of systems, not their data.
The book mentions tools such as Chef, Puppet, cfengine, and Perl(!?).
The trick is getting the right level of abstraction.
Higher level abstractions are easier to work with and reason about, but are “leaky”.
- Hard to account for things like partial failures, partial rollbacks, timeouts, etc.
The more generic a solution is, the more broadly it can be applied and the more reusable it tends to be, but the downside is that you lose flexibility and resolution.
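The partial-failure problem mentioned above can be sketched in a few lines of Python. This is only an illustration, not anything from the book: the names `FleetResult`, `upgrade_host`, and `upgrade_fleet` are hypothetical, and the "upgrade" is simulated. The idea is that a high-level operation that returns a single success flag hides which hosts failed, while one that reports per-host outcomes keeps the leak visible.

```python
from dataclasses import dataclass, field


@dataclass
class FleetResult:
    """Per-host outcome, so partial failures stay visible to the caller."""
    succeeded: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)

    @property
    def ok(self):
        # The fleet-wide operation only counts as OK if no host failed.
        return not self.failed


def upgrade_host(host, broken_hosts):
    """Hypothetical low-level step; simulates a timeout for broken hosts."""
    if host in broken_hosts:
        raise RuntimeError(f"timeout talking to {host}")


def upgrade_fleet(hosts, broken_hosts=frozenset()):
    """High-level operation that records each host instead of one bool."""
    result = FleetResult()
    for host in hosts:
        try:
            upgrade_host(host, broken_hosts)
            result.succeeded.append(host)
        except RuntimeError as err:
            result.failed[host] = str(err)
    return result
```

A caller can then decide what to do with `result.failed` (retry, roll back, page someone) rather than guessing what a bare `False` meant.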

The Use Cases for Automation

Google’s broad definition of automation is “meta-software”: software that controls software.
Examples:
- Account creation, termination,
- Cluster setup, shutdown,
- Software install and removal,
- Software upgrades,
- Configuration changes, and
- Dependency changes

A Hierarchy of Automation Classes

Ideally you wouldn’t need to stitch systems together to get them to work together.
Separate systems held together by glue code can suffer from “bit rot”, i.e. changes to either system can cause them to work poorly with each other or with the glue code.
- Glue code is some of the hardest to test and maintain.
There are levels of maturity in a system. The more rare and risky a task is, the less likely it is to be fully automated.

Maturity Model

When your levels of abstraction get to be very sophisticated, you can lose the ability to work effectively at a lower level. Kind of like trying to make your own toaster today (Gizmodo).

No automation: A person manually fails the database over to a new location.
Externally maintained system-specific automation: An SRE has a couple of commands in their notes that they run.
Externally maintained generic automation: An SRE adds a script to a playbook.
Internally maintained system-specific automation: the database ships with a script.
System doesn’t need automation: Database notices and automatically fails over.
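To make the last level of the list above concrete, here is a toy Python sketch of a database that notices a failed primary and fails over by itself. It is an assumption-laden illustration: the class `ReplicatedDatabase`, its methods, and the location names are all hypothetical, and real failover involves replication lag, quorum, and much more.

```python
class ReplicatedDatabase:
    """Toy model of a database that fails over without outside help."""

    def __init__(self, locations):
        if not locations:
            raise ValueError("need at least one location")
        # Every location starts healthy; the first one is the primary.
        self.healthy = {loc: True for loc in locations}
        self.primary = locations[0]

    def mark_down(self, location):
        """Simulate a location becoming unavailable."""
        self.healthy[location] = False

    def tick(self):
        """Periodic self-check: promote a healthy location if the primary is down."""
        if self.healthy[self.primary]:
            return self.primary
        for location, is_healthy in self.healthy.items():
            if is_healthy:
                self.primary = location
                return self.primary
        raise RuntimeError("no healthy location to fail over to")
```

At the lower levels of the list, the logic inside `tick` would live in someone's notes, a shared playbook script, or a script shipped with the database, and a human would have to decide when to run it.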

Can you automate so much that developers are unable to manually support systems when a (very rare) need occurs?

Resources we Like

Links to Google’s free books on Site Reliability Engineering (sre.google)
Site Reliability Engineering book (sre.google)
- Chapter 7: The Evolution of Automation at Google (sre.google)
Ultimate List of Programmer Jokes, Puns, and other Funnies (Medium)
Shared success in building a safer open source community (blog.google)
One Man’s Nearly Impossible Quest to Make a Toaster From Scratch (Gizmodo)
The Man Who Spent 17 Years Building The Ultimate Lamborghini Replica In His Basement Wants to Sell It (Jalopnik)

Tip of the Week

There’s an easy way to see the Mongo queries that are running in your Spring app: just set the appropriate logging level, like: logging.level.org.springframework.data.mongodb.core.MongoTemplate=DEBUG
- This can be easily done at runtime if you have actuators enabled: (Spring)
There’s a new, open-core product from Grafana called OnCall that helps you manage production support. Might be really interesting if you’re already invested in Grafana and a lot of organizations are invested in Grafana. (Grafana)
How can you configure your Docker container to run as a restricted user? It’s easy! (docs.docker.com)
- USER <user>[:<group>]
- USER <UID>[:<GID>]
iOS – Remember the days of being able to rearrange your screens in iTunes? Turns out you still can, but in iOS. Tap and hold the dots to rearrange them! (support.apple.com)
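For the Spring tip above about changing the MongoTemplate log level at runtime, here is a minimal Python sketch of the HTTP call involved. It assumes the Actuator `loggers` endpoint is exposed under the default `/actuator` base path and that the app is reachable at a URL such as `http://localhost:8080`; both are assumptions about your setup, and the helper names `build_log_level_request` and `set_log_level` are made up for this example.

```python
import json
import urllib.request

MONGO_TEMPLATE_LOGGER = "org.springframework.data.mongodb.core.MongoTemplate"


def build_log_level_request(base_url, logger_name, level):
    """Build the POST that asks the Actuator loggers endpoint to change a level."""
    body = json.dumps({"configuredLevel": level}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/actuator/loggers/{logger_name}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def set_log_level(base_url, logger_name, level):
    """Send the request; base_url is an assumption about where the app runs."""
    request = build_log_level_request(base_url, logger_name, level)
    with urllib.request.urlopen(request) as response:
        return response.status
```

Calling `set_log_level("http://localhost:8080", MONGO_TEMPLATE_LOGGER, "DEBUG")` against a running app would be the runtime equivalent of the property shown in the tip; if your Actuator endpoints are secured, the request would also need credentials.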
