Rules and thresholds have governed IT monitoring and supervision for the past few decades. Those of you who monitor infrastructure every day probably see a rules-based approach as an ally. With traditional monitoring tools, we combine logical expressions into alerts – "if X happens, then do Y" – to address each known issue.
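The "if X happens, then do Y" pattern can be sketched in a few lines. The metric names and thresholds below are hypothetical, chosen only to illustrate what a hand-written rule set looks like:

```python
# A minimal sketch of threshold-based alerting. Metric names and
# threshold values here are illustrative assumptions, not a real API.
def check_rules(metrics):
    """Evaluate a hand-written rule set against a metrics snapshot."""
    alerts = []
    # Rule 1: CPU saturation
    if metrics.get("cpu_percent", 0) > 90:
        alerts.append("High CPU usage")
    # Rule 2: disk nearly full
    if metrics.get("disk_used_percent", 0) > 85:
        alerts.append("Disk almost full")
    # Rule 3: error-rate spike
    if metrics.get("error_rate", 0) > 0.05:
        alerts.append("Error rate above 5%")
    return alerts

print(check_rules({"cpu_percent": 95, "disk_used_percent": 40, "error_rate": 0.01}))
# -> ['High CPU usage']
```

Every new failure mode means another `if` branch – which is exactly the maintenance problem discussed below.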
As our infrastructures grow, we discover new issues and keep adding rules and thresholds – scale-ups and large enterprises can be managing anywhere between a few hundred and several thousand "manual monitors".
This approach, although fundamentally flawed, may have worked with moderate inefficiency on monolithic infrastructures. But with today's ephemeral infrastructures and the corresponding explosion in generated data (logs, metrics, traces), continuing to use a rules-based approach can prove catastrophic.
Let's sum up the principal issues we encounter with a rules-based approach, and how we can do better using technology, i.e. artificial intelligence.
When we use a rules-based approach, the first step is to define a set of rules to monitor the different parameters of the IT infrastructure. At first glance, this seems simple and obvious. But have you thought about creating rules for exceptions, i.e. events that have never happened before? It is very complicated to create rules manually for every possible scenario: as soon as the rule set encounters an exception in the IT environment, the logic for which the rules were originally written stops working. To cope, a new rule must be created. This is an endless process, riddled with confusion and complexity, which must be avoided at all costs.
If we take scale-ups or large enterprises as a case in point, the rules-based approach quickly becomes a morass of complexity. It is practically impossible to deal with all the alerts such a system generates.
What if the only way to handle a growing number of scenarios were to grow the portfolio of rules correspondingly? A moment's thought shows that this linear scaling breaks down in real-life scenarios.
This approach not only increases complexity – it becomes harder to determine whether rules are consistent with each other – but the number of potential combinations of a set of rules grows factorially, which creates a mathematical problem. For example, with five rules there are 120 possible combinations: 5! = 5×4×3×2×1 = 120. With six rules, there are 720. Ten rules generate 3,628,800 possible combinations. Now imagine thousands of rules!
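The arithmetic above is easy to verify – the factorial values explode far faster than the rule count grows:

```python
import math

# Verify the combinatorial growth described above: the number of
# orderings of n rules is n! (factorial), which explodes quickly.
for n in (5, 6, 10):
    print(f"{n} rules -> {math.factorial(n):,} combinations")
# 5 rules -> 120 combinations
# 6 rules -> 720 combinations
# 10 rules -> 3,628,800 combinations
```

At just 20 rules, the count already exceeds 2.4 quintillion, which is why exhaustively checking rule interactions becomes hopeless.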
Another challenge arises with the rules-based approach: testing the portfolio of rules to ensure it stays accurate. Each combination of rules must be checked to avoid false positives or missed critical incidents. Computer scientists recognise this kind of exhaustive checking as computationally intractable – an "NP-complete problem" – because no realistic amount of computing power can keep up with it.
If you retain only one thing from this section: calling a rules-based system for IT operations "simple" would be a mistake! For complex infrastructures, rules make monitoring and remediation extremely complicated.
By now we may have concluded that rules are not as simple to put in place as they seem. So why still use them? Is it because they are less expensive than the alternatives – investing in AI, for example?
If we go back to the reasoning described above, we see that a rules-based monitoring approach demands an endless cycle of creation, verification and revision. That is the true cost of rules: a daily maintenance problem of gargantuan size.
Moreover, who on your IT team is trained and qualified to manage this maintenance? Given the volume of rules, their constant updating and their interactions, only experts can handle the workload – and to manage it effectively you would need an army of them.
With the proliferation of containers and microservices that appear and disappear on demand, the number of required rules has exploded. When rules don't work as expected, or conflict with one another, accuracy suffers and engineers are inundated with irrelevant alerts.
To cope with the fatigue generated by this flood of alerts, analysts often stop using rules to correlate events altogether. The result is higher costs from downtime and a poorer NPS.
This kind of scenario may suggest that the solution lies in reducing the number of rules. As you will have guessed, that strategy is risky: a monitoring tool often reveals only a partial indication of a new problem, and if the indicator for that problem is disabled, the problem will not be detected. By the time an incident becomes critical, engineers will never have seen it coming. It will then be too late.
Rules are therefore far more complex and costly than we think.
All these issues are driving many companies toward autonomous monitoring solutions that use artificial intelligence and machine learning to solve the problems rules are supposed to address (but cannot). These tools allow IT teams to process all data, even exceptions, without being limited by rules.
An autonomous monitoring tool eliminates the need to create rules for every possible combination of events. Instead, it ingests all operational data from your infrastructure and automatically applies algorithms to determine which events matter and which do not.
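To give a flavour of the idea – learning "normal" from the data itself rather than hand-writing thresholds – here is a deliberately toy sketch that flags values far from the historical mean (a z-score test). Real autonomous monitoring platforms use far richer models; the baseline data and threshold here are invented for illustration:

```python
import statistics

# Toy anomaly detector: instead of a hand-written threshold, "normal"
# is derived from historical data, and points more than z_threshold
# standard deviations from the mean are flagged.
def anomalies(history, new_points, z_threshold=3.0):
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return [x for x in new_points if abs(x - mean) / stdev > z_threshold]

# Hypothetical latency baseline (ms) and a batch of new observations.
baseline = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101]
print(anomalies(baseline, [99, 150, 102]))  # only 150 stands out
```

No rule ever said "alert above 150 ms"; the model inferred that 150 is abnormal relative to what it has seen, which is the key difference from the static rules discussed earlier.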
Unlike a rules-based system, a solution based on machine learning is becoming essential to ensure optimal performance of IT infrastructures. Taking a step back from the rules-based approach will open up new perspectives and make your SREs' daily life easier!
PacketAI is the world's first autonomous monitoring solution built for the modern age. Our solution was developed after 5 years of intensive research in French and Canadian laboratories, and we are backed by leading VCs. To learn more, book a free demo and get started today!