Repetitive, monotonous activities that are necessary but don't add significant value or growth to the business are defined as toil.
In its SRE workbook, Google defines toil as:
"The kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows."
Before you can begin sorting out the causes and finding solutions, you must first know how to identify and measure toil.
Identifying toil requires a basic understanding of the characteristics of a routine task. This means evaluating the task against the traits in Google's definition: whether it is manual, repetitive, automatable, tactical, devoid of enduring value, and whether it grows linearly with the service.
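The sketch below illustrates one way to apply that evaluation; the task shape and trait names are hypothetical, but the traits themselves come from the definition quoted above.

```python
# A minimal sketch (hypothetical task shape) of checking a task against the
# toil characteristics from Google's definition: the more boxes a task ticks,
# the stronger the case for treating it as toil.

TOIL_TRAITS = ["manual", "repetitive", "automatable", "tactical",
               "no_enduring_value", "scales_with_service"]

def toil_score(task):
    """task: dict mapping each trait name to True/False."""
    return sum(task.get(trait, False) for trait in TOIL_TRAITS)

deploy_ticket = {"manual": True, "repetitive": True, "automatable": True,
                 "tactical": True, "no_enduring_value": True,
                 "scales_with_service": False}
print(f"{toil_score(deploy_ticket)}/{len(TOIL_TRAITS)} toil traits present")
```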
Next, you can measure the toil you've identified, which essentially means computing the amount of time engineers spend on each activity.
Toil can be measured by analyzing trends in how that time is distributed, for example how much of it goes to tickets, on-call responses, and other manual interventions.
These analyses allow you to prioritize toil, striking a balance between routine operational tasks and production tasks.
The goal for any organization is to reduce toil to less than 50% of an SRE's time to keep the team focused on production work.
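As a rough sketch of that measurement, assuming time entries are logged per engineer (the categories and data shape here are hypothetical), you can compute the share of logged time spent on toil and compare it to the 50% guideline:

```python
# A minimal sketch: sum the hours logged against toil categories and compare
# that share of an engineer's total time to the 50% guideline mentioned above.

TOIL_CATEGORIES = {"tickets", "on-call", "manual-deploys", "alert-triage"}

def toil_share(time_entries):
    """time_entries: iterable of (category, hours) tuples for one engineer."""
    total = sum(hours for _, hours in time_entries)
    toil = sum(hours for category, hours in time_entries
               if category in TOIL_CATEGORIES)
    return toil / total if total else 0.0

entries = [("tickets", 14), ("on-call", 10), ("project-work", 16)]
share = toil_share(entries)
if share > 0.5:
    print(f"Toil is {share:.0%} of logged time, above the 50% guideline")
else:
    print(f"Toil is {share:.0%} of logged time")
```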
Alert management systems should make use of as much automation as possible.
If alerts require manual, repetitive resolution, simply managing them will quickly grow tiring.
A lot of system notifications are helpful but don't represent any threat or even a need to take action—for instance, a system alert telling you that web requests are twice as high as usual at six in the morning can be useful, but it requires no intervention.
With no automation, manually dismissing trivial alerts doesn't just eat away at time and patience; it can also cause people to miss or ignore the important alerts that do require a manual response.
This scenario creates toil, and automation is key to reducing that toil, playing a role throughout every phase of alert configuration.
If you can automate a response to an alert, then you should do so.
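One way to picture that is a simple routing layer in front of the on-call rotation. The sketch below is hypothetical (the alert fields, runbook names, and handler are illustrative, not a real alerting API): informational alerts are acknowledged automatically, alerts with a known remediation are handled by a runbook function, and only the remainder pages a human.

```python
# A minimal sketch of automating alert responses with a routing function.

def restart_service(alert):
    # Illustrative remediation step for a known, automatable failure mode.
    print(f"restarting {alert['service']} on {alert['host']}")

RUNBOOKS = {"service_down": restart_service}

def handle_alert(alert):
    if alert["severity"] == "info":
        return "auto-acknowledged"           # no human action needed
    remediation = RUNBOOKS.get(alert["type"])
    if remediation:
        remediation(alert)                   # known fix, run it automatically
        return "auto-remediated"
    return "paged on-call"                   # genuinely needs a person

print(handle_alert({"severity": "info", "type": "traffic_spike"}))
print(handle_alert({"severity": "critical", "type": "service_down",
                    "service": "checkout", "host": "web-03"}))
```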
Without proper configuration, your alert system may end up generating no alerts at all or far too many.
Sensitivity issues are to blame in both cases, with the architecture being either overly sensitive or under-sensitive.
For example, if you alert on a database service's response time degrading by an absolute 100ms, the slightest fluctuation will produce more alerts than anyone can handle.
Instead of alerting on marginal changes, set a relative threshold, and make sure it's no less than a 50% degradation from the baseline.
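The snippet below sketches that relative condition; the numbers and function name are hypothetical, and the baseline would normally come from observed behavior rather than a constant.

```python
# A minimal sketch of the relative threshold described above: instead of
# firing on any fixed 100 ms change, alert only when latency degrades by at
# least 50% of its baseline.

def should_alert(current_ms, baseline_ms, relative_threshold=0.5):
    """Fire only if latency exceeds the baseline by the relative threshold."""
    return current_ms >= baseline_ms * (1 + relative_threshold)

baseline = 200  # ms, learned from normal operation

print(should_alert(290, baseline))  # False: 45% over baseline, likely noise
print(should_alert(320, baseline))  # True: 60% over baseline, worth a look
```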
If you're dealing with alert configuration on the other end of the spectrum—i.e., under-sensitive alert conditions—this leads to even more pressing issues.
Without alerts, a system can have issues that go undetected by your teams, potentially leading to major outages while they work to track down the root cause that they never got an alert about.
If you're dealing with under-sensitivity, you may need to re-engineer the system to fully characterize and address the problem.
The SRE golden signals are latency, traffic, errors, and saturation. You should also consider variations like RED (Rate, Errors, Duration) and USE (Utilization, Saturation, Errors) when measuring performance.
These signals help SREs monitor a system, and leaving them out can lead to poorly configured database, CPU, and memory utilization alerts in your alert management system.
If the CPU runs at just 1.5x its normal load, a poorly tuned system can end up firing a burst of alerts over what may be a routine spike.
On the other hand, if you ignore saturation levels entirely, abnormalities will go unnoticed and will likely lead to outages at some point.
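As a rough illustration (the metric names, baseline, and thresholds here are hypothetical), a saturation check can combine a relative comparison against normal load with an absolute ceiling, so routine spikes generate at most a warning while genuine saturation still alerts:

```python
# A minimal sketch of a saturation check that distinguishes a routine spike
# from genuine saturation.

NORMAL_CPU_LOAD = 2.0          # baseline load average under typical traffic
SATURATION_CEILING = 0.9       # alert regardless of baseline above 90% CPU

def evaluate_saturation(load_avg, cpu_utilization):
    if cpu_utilization >= SATURATION_CEILING:
        return "alert: CPU near saturation"
    if load_avg >= NORMAL_CPU_LOAD * 2:
        return "alert: sustained load far above baseline"
    if load_avg >= NORMAL_CPU_LOAD * 1.5:
        return "warn only: spike, keep watching"   # 1.5x normal is not a page
    return "ok"

print(evaluate_saturation(load_avg=3.0, cpu_utilization=0.55))  # warn only
print(evaluate_saturation(load_avg=4.5, cpu_utilization=0.75))  # alert
```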
Last, but certainly not least, if alerts don't include sufficient information, your alerting configuration isn't capturing the specifics of the situation.
This leads to unnecessary toil when a team has to figure out where a problem is happening and what's contributing to it.
For example, if a team member gets an alert stating only "CPU utilization high," that's not sufficient information; at a minimum, it should include a hostname or IP address.
With only that minimal detail, an engineer can't respond directly.
They'd have to open their alert management system console, find the location of the server, and then go about troubleshooting, whereas the alert should have allowed them to go straight to a solution.
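To make that concrete, the sketch below builds an enriched alert payload; the field names, hostname, IP, and runbook URL are all hypothetical, but they show the kind of context that lets a responder skip the console hunt.

```python
# A minimal sketch of an enriched alert payload: instead of a bare
# "CPU utilization high" message, the alert carries the host, the measured
# value, the threshold, and a runbook link.

import json

def build_alert(metric, value, threshold, host, ip, runbook_url):
    return {
        "summary": f"{metric} high on {host}",
        "metric": metric,
        "value": value,
        "threshold": threshold,
        "host": host,
        "ip": ip,
        "runbook": runbook_url,
    }

alert = build_alert("cpu_utilization", 0.93, 0.85,
                    "web-03.example.internal", "10.0.4.17",
                    "https://wiki.example.internal/runbooks/high-cpu")
print(json.dumps(alert, indent=2))
```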
As you can see, reducing toil is easy, as long as you understand the variety of factors that can cause it.
Ultimately, your alert management system can work just as planned, provided you put in the right time and effort to configure it from the start.
Investing in that configuration goes a long way toward reducing operational toil, so you can get the most out of every alert you receive.
You can avoid all these pitfalls by reaching out to our professional team and, ultimately, increase your team's productivity.