Site Reliability Engineering – (Still) Monitoring Distributed Systems

We finished. A chapter, that is, of the Site Reliability Engineering book as Allen asks to make it weird, Joe has his own pronunciation, and Michael follows through on his promise.

The full show notes for this episode are available at https://www.codingblocks.net/episode186.

News

Thank you for the review, Lospas!
- Want to help out the show? Leave us a review!
Another great post from @msuriar, this time about the value of hiring junior developers. (suriar.net)

Survey Says

More about Monitoring Less

[Image: cover of the "Site Reliability Engineering" book from O'Reilly, the famous “SRE Book” from Google]

Instrumentation and Performance

Be careful not to track timings, such as latencies, using only medians or means, because an average can hide a slow tail of requests.
- A better way is to bucketize the data as a histogram, meaning to count how many instances of a request occurred in the given bucket, such as the example latency buckets in the book of 0ms – 10ms, 10ms – 30ms, 30ms-100ms, etc.
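As a rough illustration of the bucketing idea, here is a minimal Python sketch. The first three bucket edges follow the book's example quoted above; the 300 ms edge, the overflow bucket, the function name, and the sample latencies are assumptions made up for this example.

```python
from bisect import bisect_right

# Upper bucket edges in milliseconds. The first three follow the book's example;
# the 300 ms edge and the overflow bucket are assumptions for this sketch.
BUCKET_EDGES_MS = [10, 30, 100, 300]
BUCKET_LABELS = ["0-10ms", "10-30ms", "30-100ms", "100-300ms", "300ms+"]


def bucketize(latencies_ms):
    """Count how many latencies fall into each bucket."""
    counts = [0] * (len(BUCKET_EDGES_MS) + 1)
    for latency in latencies_ms:
        # bisect_right returns the index of the first edge greater than latency,
        # so a latency of exactly 10 ms lands in the 10-30ms bucket.
        counts[bisect_right(BUCKET_EDGES_MS, latency)] += 1
    return dict(zip(BUCKET_LABELS, counts))


# Made-up sample: mostly fast requests with two slow outliers.
sample = [4, 7, 9, 12, 25, 45, 80, 250, 900]
histogram = bucketize(sample)
print(histogram)
```

Unlike a single mean, the bucket counts keep the slow outliers visible as their own entries.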

Choosing the Appropriate Resolution for Measurements

The gist is that you should measure at intervals that support your SLOs and SLAs.
- For example, if you’re targeting 99.9% availability, there’s probably no reason to check hard-drive fullness more than once every 1–2 minutes.
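To put the 99.9% figure in perspective, the arithmetic below converts an availability target into the downtime it permits. This is only a back-of-the-envelope sketch; the function name and the 365-day year are assumptions.

```python
def allowed_downtime_hours(availability_target: float,
                           period_hours: float = 365 * 24) -> float:
    """Hours of downtime a target permits over a period (default: a 365-day year)."""
    return (1.0 - availability_target) * period_hours


budget = allowed_downtime_hours(0.999)
print(f"99.9% over a year allows about {budget:.2f} hours of downtime")
```

With a budget measured in hours per year, sampling a slow-moving signal many times a minute adds cost without adding much protection.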
Collecting measurements can be expensive, for both storage and analysis.
- Best to take an approach like the histogram and keep counts in buckets and aggregate the findings, maybe per minute.
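One way to picture "counts in buckets, aggregated per minute" is the sketch below. It assumes per-second CPU utilization samples and a 5% bucket width; the function name, the data layout, and the sample values are all hypothetical.

```python
from collections import Counter, defaultdict

BUCKET_WIDTH_PERCENT = 5  # assumed bucket granularity for this sketch


def aggregate_per_minute(samples):
    """samples: iterable of (unix_timestamp_seconds, cpu_percent) pairs.

    Returns {minute_start_timestamp: Counter({bucket_lower_bound: count})}.
    """
    per_minute = defaultdict(Counter)
    for timestamp, cpu_percent in samples:
        minute_start = int(timestamp) - int(timestamp) % 60
        bucket = int(cpu_percent // BUCKET_WIDTH_PERCENT) * BUCKET_WIDTH_PERCENT
        # Clamp 100% into the top bucket so it does not create a bucket of its own.
        bucket = min(bucket, 100 - BUCKET_WIDTH_PERCENT)
        per_minute[minute_start][bucket] += 1
    return per_minute


# Made-up samples spanning two minutes.
samples = [(0, 12.0), (1, 13.5), (2, 97.0), (59, 100.0), (60, 41.0), (61, 44.9)]
result = aggregate_per_minute(samples)
for minute, counts in sorted(result.items()):
    print(minute, dict(counts))
```

Only the small per-minute counters need to be stored, yet short spikes within the minute still show up in the high buckets.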

As Simple as Possible, No Simpler

It’s easy for monitoring to become very complex:
- Alerting on varying thresholds and measurements,
- Code to detect possible causes,
- Dashboards, etc.
Monitoring can become so complex that it is fragile and difficult to change or maintain.
Some guidelines to follow to keep your monitoring useful and simple include:
- Rules that find incidents should be simple, predictable and reliable,
- Data collection, aggregation and alerting that are rarely exercised (the book said less than once a quarter) should be candidates for the chopping block, and
- Data that is collected but not used in any dashboards or alerting should be considered for deletion.
Avoid pairing simple monitoring with other concerns such as crash detection or log analysis, as this makes for overly complex systems.
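The "collected but unused" guideline boils down to a set difference. The sketch below assumes you can list the metric names you collect and the names your dashboards and alerts reference; the function and every metric name in it are made up for illustration.

```python
def removal_candidates(collected, dashboard_metrics, alert_metrics):
    """Return collected metric names that no dashboard or alert references."""
    used = set(dashboard_metrics) | set(alert_metrics)
    return sorted(set(collected) - used)


# Hypothetical metric names.
collected = {"http_latency_ms", "http_errors_total", "disk_free_bytes", "gc_pause_ms"}
dashboards = {"http_latency_ms", "disk_free_bytes"}
alerts = {"http_errors_total"}

candidates = removal_candidates(collected, dashboards, alerts)
print(candidates)
```

The result is a list to review, not to delete blindly, since a metric may still matter for ad hoc debugging.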

Tying these Principles Together

Google’s monitoring philosophy may be hard to attain in full, but it is a good foundation for setting goals.
Ask the following questions to avoid pager burnout and false alerts:
- Does the rule detect something that is urgent, actionable, and visible to a user?
- Will I ever be able to ignore this alert because it’s benign, and if so, how can I avoid that scenario?
- Does this alert definitely indicate negatively impacted users and are there cases that should be filtered out due to any number of circumstances?
- Can I take action on the alert? Does it need to be done now, and can the action be automated? Will the action be a short-term or long-term fix?
- Are other people getting paged about this same incident, meaning this is redundant and unnecessary?
Those questions reflect these notions on pages and pagers:
- Pages are extremely fatiguing and people can only handle a few a day, so they need to be urgent.
- Every page should be actionable.
- If a page doesn’t require human interaction or thought, it shouldn’t be a page.
- Pages should be about novel events that have never occurred before.
It’s not important whether the alert came from white-box or black-box monitoring.
It’s better to spend much more effort on catching symptoms than causes; when it comes to causes, only worry about definite, imminent ones.
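The questions and notions above can be condensed into a small review checklist. This is only a sketch of one possible encoding, not anything from the book; the class, the field names, and the function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertReview:
    urgent: bool
    actionable: bool
    user_visible: bool             # actively or imminently visible to users
    needs_human_judgment: bool     # False if the response is rote and could be automated
    already_paged_elsewhere: bool  # someone else is paged for the same incident


def should_page(review: AlertReview) -> bool:
    """Page a human only when every criterion from the checklist holds."""
    return (
        review.urgent
        and review.actionable
        and review.user_visible
        and review.needs_human_judgment
        and not review.already_paged_elsewhere
    )


outage = AlertReview(True, True, True, True, False)
rote_restart = AlertReview(True, True, True, False, False)
print(should_page(outage), should_page(rote_restart))
```

An alert that fails the check is not necessarily useless; it may belong on a dashboard or in a ticket queue rather than on a pager.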

Monitoring for the Long Term

Monitoring systems track ever-changing software systems, so decisions about them need to be made with the long term in mind.
Sometimes, short-term fixes are important to get past acute problems and buy you time to put together a long-term fix.

Two case studies that demonstrate the tension between short-term and long-term fixes

Bigtable SRE

Originally, Bigtable’s SLO was based on the mean performance of a synthetic, well-behaved client.
Bigtable had some low level problems in storage that caused the worst 5% of requests to be significantly slower than the rest.
These slow requests would trip alerts but ultimately the problems were transient and unactionable.
People learned to de-prioritize these alerts, which sometimes were masking legitimate problems.
Google SREs temporarily dialed back the SLO target to the 75th percentile request latency to trigger fewer alerts, and disabled email alerts, while they worked on the root cause and fixed the storage problems.
Dialing back the alerts gave engineers the breathing room they needed to dig into the problem.
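To see why a mean-based SLO is so sensitive to a slow 5% while the 75th percentile is not, consider the sketch below. The latency numbers are made up, and the nearest-rank percentile helper is just one common definition, not necessarily what Bigtable's monitoring used.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 95 fast requests and 5 slow ones (made-up numbers, in milliseconds).
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p75_ms = percentile(latencies_ms, 75)
p99_ms = percentile(latencies_ms, 99)
print(mean_ms, p75_ms, p99_ms)
```

In this toy data the slow tail drags the mean far above what a typical request sees, while the 75th percentile does not move at all, which is why alerting on it is quieter.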

Gmail

Gmail was originally built on a distributed process management system called Workqueue, which was adapted to long-lived processes.
Tasks would get de-scheduled causing alerts, but the tasks only affected a very small number of users.
The root cause bugs were difficult to fix because ultimately the underlying system was a poor fit.
Engineers could “fix” the scheduler by manually interacting with it (imagine restarting a server every 24 hours).
Should the team automate the manual fix, or would this just stall out what should be the real fix?
This raises two red flags. Why are engineers performing rote tasks? That’s toil. And why doesn’t the team trust itself to fix the root cause unless an alarm is blaring?
What’s the takeaway? Do not think about alerts in isolation. You must consider them in the context of the entire system and make decisions that are good for the long term health of the entire system.

Resources we Like

Links to Google’s free books on Site Reliability Engineering (sre.google)
Onboarding and mentoring (suriar.net)

Tip of the Week

Use LanCache to make the most of your network when you host your next LAN party. (LanCache.net)
Watch Project Farm [over] analyze everything from windshield wipers to drywall anchors. (YouTube)
Python has built-in functionality for dynamically reloading modules: Reloading modules in Python. (GeeksForGeeks)
Dockerfile tips-n-tricks:
- Concatenate RUN statements like RUN some_command && some_other_command instead of splitting them out into two separate RUN command strings to reduce the layer count.
- Prefer apk add --no-cache some_package over apk update && apk add some_package to reduce the layer and image size. And if you’re using apt-get instead of apk, be sure to include apt-get clean as the final command in the RUN command string to keep the layer small.
- When using ADD and COPY, be aware that Docker will need the file(s)/directory in order to compute the checksum to know if a cached layer already exists. This means that while you can ADD some_url, Docker needs to download the file in order to compute the checksum. Instead, use curl or wget in a RUN statement when possible, because Docker will only compute the checksum of the RUN command string before executing it. This means you can avoid unnecessarily downloading files during builds (especially on a build server and especially for large files). (docs.docker.com)
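For the Python module-reloading tip above, here is a minimal sketch using the standard library's importlib.reload. The module name greeting_demo, the temporary directory, and the helper function are made up for the demonstration; bytecode caching is switched off so the reload always recompiles the freshly edited source.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# With no .pyc files written, a reload always recompiles the current source.
sys.dont_write_bytecode = True


def demo_reload():
    """Import a throwaway module, edit its source on disk, and reload it."""
    with tempfile.TemporaryDirectory() as tmp:
        module_path = Path(tmp) / "greeting_demo.py"  # hypothetical module
        module_path.write_text("MESSAGE = 'hello'\n")
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            module = importlib.import_module("greeting_demo")
            before = module.MESSAGE

            # Change the source, then re-execute it in the existing module object.
            module_path.write_text("MESSAGE = 'hello again'\n")
            importlib.reload(module)
            after = module.MESSAGE
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("greeting_demo", None)
    return before, after


print(demo_reload())
```

Note that reload re-runs the module's code in place; names that other modules already imported with from ... import keep pointing at the old objects.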

Survey Says

Did you intern or co-op while you were in school?