Early SRE Ethnographic Research
As a UX researcher, one of my jobs is to observe users and find problematic patterns in their behavior. My goal in identifying these patterns is to tease out the cause of the problem. As I mentioned in my previous blog post, I am learning about a new species of the internet world: SREs. Though discovery is an ongoing activity, I believe I have enough research to begin grouping several symptoms into specific classifications. In this first post of the series, I’ll discuss the symptoms of one common problem SREs have: alert fatigue.
Symptoms:
Skeptical of their alerts
“Is this alert I’m getting actually critical?”
SREs seem to be both in demand and in a constant state of distraction. One curious thing I’ve witnessed is a deep skepticism toward their own alerts. Sometimes an SRE recognizes a critical alert and jumps on it immediately. But many times they do not believe their alerts, which leads to an odd state of flux where the SRE rewrites alerts, verifies every alert by hand, or both.
Willful blindness: ignoring emails, notifications, and messages
“I don’t check my emails anymore.”
One level beyond skepticism is willful blindness. Because SREs receive so many alerts and notifications about their production and non-production environments, they are swamped by a massive quantity of data. It’s important to note here that notifications and alerts are different. In general, SREs need to know what’s going on in order to understand the context of their environment, but they only need to be alerted when things are actually approaching a danger stage. In my observation, SREs talk about the difference between abnormal and bad but, in practice, there doesn’t seem to be a good solution for separating the two.
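To make the notification-versus-alert distinction concrete, here is a minimal sketch in Python. It is a hypothetical illustration, not the behavior of any real monitoring tool: the `Event` type, the `route` function, and both threshold values are assumptions invented for this example.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
WARN_THRESHOLD = 0.70  # abnormal: worth recording as context
PAGE_THRESHOLD = 0.90  # approaching danger: worth interrupting someone


@dataclass
class Event:
    source: str   # which system reported the reading
    metric: str   # what was measured
    value: float  # utilization as a fraction between 0.0 and 1.0


def route(event: Event) -> str:
    """Classify an event as 'alert', 'notification', or 'ignore'."""
    if event.value >= PAGE_THRESHOLD:
        return "alert"         # bad: someone should act now
    if event.value >= WARN_THRESHOLD:
        return "notification"  # abnormal: useful context, no interruption
    return "ignore"            # normal: nothing to report


events = [
    Event("db-1", "disk_used", 0.95),
    Event("web-3", "cpu", 0.75),
    Event("web-4", "cpu", 0.20),
]
routed = [route(e) for e in events]
```

In this sketch only the first event would interrupt anyone; the second is kept as context and the third is dropped. Real environments rarely separate abnormal from bad with two fixed numbers, which is part of why the problem is hard.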
Laptop carrier — everywhere!
“I have to take my computer to the woods.”
In addition to a tent, sleeping bag, and water bottle, I see SREs bringing a laptop on camping trips. These SREs will save your environment, but will they save themselves if the laptop takes the place of their snake bite kit? I think this shows that SREs feel the monitoring process is not adequate unless they are in the middle of it all.
Conclusions:
Alerts and notifications are certainly critical for minimizing the time to respond to production incidents. However, new situations and loose policies often lead to a sprawl of alerts. It’s clear to me that great alert hygiene is needed to help SREs take effective action, but I don’t have any silver bullets for how to tackle this task. If you are an SRE with thoughts in this area, I’d love to speak with you!
This article was originally published for kaizenOps.io on Sep 19, 2017