In today’s digital economy, businesses must rely on data more than ever to make strategic decisions. The question business leaders need to ask is: what data would help me uncover useful insights 🤔? High-level business data, such as conversion rate, churn rate, sales per region and retention, is a must. What usually gets ignored, however, is system (IT) data. If you run a digital business, you surely know that the predictability of your revenue stream is closely tied to the robustness of your IT infrastructure. In other words, any infrastructure malfunction can directly impact your revenue, your brand and your customers’ trust. Hence, paying attention to IT data (which reflects the health of your IT stack) and correlating it with your high-level business metrics should be a top priority.
IT data comes in two types: metrics (numerical 📈) and logs (textual 📄). Metrics over time (or time series) describe the usage, status and health of your infrastructure. Logs over time (or log sequences) describe in more detail the inner workings of the many software components that make up your infrastructure. Together, metrics and logs over time tell an accurate story of your business. Now, how can one use this data to get useful insights for better business decisions? One way is to detect anomalies in IT data and use them as early indicators of business impact. In this series, we focus on log data. We already explored anomaly types that could be detected here. In this article in particular, we dig deeper into log sequence anomalies.
Imagine an e-commerce web app that sells socks 🧦 online, called sockshop, built as a microservices architecture and deployed on Kubernetes. The app uses a MongoDB database and RabbitMQ as a message broker. Now, imagine that the app has one recurrent user (Sockshopper) who uses the app once every six months on average. Since Sockshopper is a legit customer, he has a normal behaviour that can be described as: log in, scroll, click on some sock pictures, maybe log out, or continue to scroll, click on some more sock pictures, then maybe add a sock to the cart, or just log out, or add a sock to the cart, then pay, then log out, etc. The collection of all these atomic behaviours constitutes a behaviour tree and reflects the normal behaviour of a random sockshopper 👨 🧦.
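To make the behaviour tree idea concrete, here is a minimal sketch in Python. It assumes each visit is recorded as an ordered list of actions; the action names and the helper functions `build_behaviour_tree` and `is_known_path` are made up for illustration and are not part of sockshop or PacketAI.

```python
def build_behaviour_tree(sessions):
    """Build a prefix tree where each path from the root is an observed run of actions."""
    root = {"count": 0, "children": {}}
    for session in sessions:
        node = root
        node["count"] += 1
        for action in session:
            # Create the child node the first time this action follows this prefix.
            node = node["children"].setdefault(action, {"count": 0, "children": {}})
            node["count"] += 1
    return root


def is_known_path(tree, session):
    """Return True if the sequence of actions follows a path already in the tree."""
    node = tree
    for action in session:
        if action not in node["children"]:
            return False
        node = node["children"][action]
    return True


# Hypothetical sessions of a legit sockshopper.
sessions = [
    ["log-in", "scroll", "click-picture", "log-out"],
    ["log-in", "scroll", "click-picture", "add-to-cart", "pay", "log-out"],
    ["log-in", "scroll", "log-out"],
]
tree = build_behaviour_tree(sessions)

print(is_known_path(tree, ["log-in", "scroll", "click-picture"]))  # True
print(is_known_path(tree, ["log-in", "pay"]))                      # False
```

The counts stored on each node say how often a given prefix was seen, which is what lets "normal" be a matter of frequency rather than a simple yes or no.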
Say the app gains in popularity and grows to thousands of sockshoppers. The individual behaviours now add up to a collective behaviour, made of millions of atomic individual behaviours.
What is interesting is that the logs emitted by the sockshop components reflect the sockshoppers’ behaviour quite accurately. Every log-in, add to cart, scroll, payment, etc. is logged, often in more than one place: database logs, web proxy logs, app logs, etc. The collection of log streams from the different components constitutes the log stream of the sockshop stack.
In theory, an idle stack would preserve the same global log stream unless influenced by an outside force: user behaviour (resource consumption, input data, etc.), changes made by engineers or hardware degradation. Each of these outside forces leads to a deviation from the normal behaviour of the stack, and hence to a deviation from its normal log stream.
Consequently, anomalies in the log stream of the stack (out-of-sequence logs) indicate either hardware degradation, a resource problem (CPU, memory, etc., related to customer usage), abnormal customer behaviour (benign, like holiday peaks, or malicious, like cyber attacks) or the impact of an unwanted change.
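As a rough illustration of how per-component streams combine into one stack-level stream, the sketch below merges three time-ordered streams with Python's standard `heapq.merge`. The `(timestamp, component, message)` tuple layout, the integer timestamps and the log messages are all assumptions made for this example.

```python
import heapq


def merge_log_streams(*streams):
    """Merge streams of (timestamp, component, message) tuples by timestamp.

    Each input stream must already be sorted by timestamp.
    """
    return heapq.merge(*streams, key=lambda entry: entry[0])


# Hypothetical log lines for a single "log in, then add to cart" visit.
proxy_logs = [(1, "proxy", "POST /login"), (4, "proxy", "POST /cart")]
app_logs = [(2, "app", "user logged in"), (5, "app", "item added to cart")]
db_logs = [(3, "mongodb", "query users"), (6, "mongodb", "update carts")]

stack_stream = list(merge_log_streams(proxy_logs, app_logs, db_logs))
for timestamp, component, message in stack_stream:
    print(timestamp, component, message)
```

A single user action shows up as a short, ordered burst across several components, which is why the merged stream carries more behavioural signal than any one component's logs.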
To detect anomalies in your stack’s log stream (that is, in your stack’s behaviour), PacketAI first learns the normal behaviour of your stack to establish a baseline. We do this by training a patented semi-supervised learning engine on log streams for a period of 1 to 2 weeks (long enough to capture seasonal behaviours). Once the engine is trained, it starts observing logs in real time and alerts whenever there is a “statistically significant” deviation from the normal behaviour. Note that this framework requires the Log2template engine described in the previous article.
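The engine itself is patented and not described here, so the following is only a toy stand-in for the "learn a baseline, then score new sequences" idea. It assumes logs have already been mapped to template IDs (the `T1`, `T2`, ... names are invented) and swaps in a simple first-order transition model with Laplace smoothing, which is not PacketAI's actual method.

```python
import math
from collections import Counter, defaultdict


def train_baseline(sequences):
    """Count template-to-template transitions seen during the training period."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            transitions[prev][nxt] += 1
    return transitions


def transition_probability(transitions, prev, nxt, vocab_size, alpha=1.0):
    """Laplace-smoothed probability, so unseen transitions get a small non-zero value."""
    counts = transitions.get(prev, Counter())
    total = sum(counts.values())
    return (counts[nxt] + alpha) / (total + alpha * vocab_size)


def sequence_surprise(transitions, seq, vocab_size):
    """Average negative log-probability per transition; higher means less like the baseline."""
    steps = list(zip(seq, seq[1:]))
    if not steps:
        return 0.0
    total = sum(
        -math.log(transition_probability(transitions, prev, nxt, vocab_size))
        for prev, nxt in steps
    )
    return total / len(steps)


# Hypothetical training data: the same normal template sequence observed 50 times.
training_sequences = [["T1", "T2", "T3", "T4"]] * 50
VOCAB_SIZE = 5  # assumed number of distinct templates (T1..T5)

baseline = train_baseline(training_sequences)
normal_score = sequence_surprise(baseline, ["T1", "T2", "T3", "T4"], VOCAB_SIZE)
odd_score = sequence_surprise(baseline, ["T1", "T4", "T2", "T5"], VOCAB_SIZE)

print(normal_score < odd_score)  # True
```

The out-of-sequence run scores higher because every one of its transitions was rare or absent during training. A real engine would also need to handle seasonality, which is why the training period spans one to two weeks.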
The “statistically significant” part is super important, as the engine is probabilistic and not deterministic. A tolerance margin is built in to absorb small deviations that could reflect small changes, small upward or downward trends in resource usage, log collection defects, etc.
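One simple way to picture a tolerance margin is a threshold set a few standard deviations above the scores seen during training. The sketch below assumes each window of logs has already been reduced to a single anomaly score. The scores, the `k = 3` multiplier and the function names are illustrative assumptions, not the criterion PacketAI actually uses.

```python
import statistics


def fit_threshold(baseline_scores, k=3.0):
    """Set the alert threshold k standard deviations above the baseline mean."""
    mean = statistics.fmean(baseline_scores)
    std = statistics.pstdev(baseline_scores)
    return mean + k * std


def is_significant(score, threshold):
    """Only scores beyond the tolerance margin count as anomalies."""
    return score > threshold


# Hypothetical anomaly scores collected over the training period.
baseline_scores = [0.08, 0.10, 0.09, 0.11, 0.10, 0.12, 0.09, 0.10]
threshold = fit_threshold(baseline_scores)

print(is_significant(0.12, threshold))  # False: a small deviation, inside the margin
print(is_significant(0.90, threshold))  # True: far outside the baseline
```

A larger `k` gives fewer but more trustworthy alerts, while a smaller `k` catches more subtle deviations at the cost of more noise.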
On the other hand, a “statistically significant” anomaly does not necessarily indicate a business-impacting problem. It only means that considering the log stream data, the anomalous sequence is significantly different from the established baseline. For this reason, additional steps are necessary to gauge the anomaly before firing up an alert, including:
This feature is in private beta and will soon be released publicly. Register here to try it for free, along with many other anomaly detection features.
Logs come in sequences that reflect the behaviour of your stack in real time. Detecting anomalies in log sequences is a great way to identify potential problems and risks to your business, be it the consequences of an unwanted change, weird user behaviour or infrastructure degradation. PacketAI gives you the ability to do just that, and trying it out takes about 5 minutes. Register here!
If you like this content, please let me know in a comment. You can also clap 👏 and follow us for more 🙂
Oh, one more thing: remember the sockshop app we talked about? It actually exists (thanks to our friends at Weaveworks), and we will use it to demo our anomaly detection features in future posts. Stay tuned 👋!
PacketAI is the world’s first autonomous monitoring solution built for the modern age. Our solution was developed after 5 years of intensive research in French and Canadian laboratories, and we are backed by leading VCs. To learn more, book a free demo and get started today!