
PacketAI is an AIOps startup that helps companies monitor, anticipate and troubleshoot infrastructure incidents faster using AI. In this blog post, we share some of the engineering best practices we have put in place, focused on testing both software and models, to ensure our platform can work at scale, handling petabytes of production data in real time.

In the previous post, we focused on engineering best practices for writing resilient software. In this part, we focus on how to test and deploy large-scale ML models in production environments. We hope you enjoy reading it.


Testing ML code and models

At PacketAI, our event-streaming pipeline includes serving ML models at scale. Put simply, data flows from the client infrastructure straight into the pipeline, which feeds various processing blocks responsible for applying ML models and writing results back to the pipeline or to OpenSearch.

What is true of software is true of ML software: it should be testable and tested automatically. There is, however, a big difference between the two: a “build” is pure code in the case of classical software, while for ML software it is a mixture of code and data (model parameters). In other words, an ML production build has at least two reasons to change (and therefore to regress): a change introduced in the code, or a change in the data used for training. ML testing must take this into account.
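To make the “two reasons to change” concrete, here is a minimal Python sketch, assuming an ML build is identified by hashing both its source files and its serialized model parameters. The helpers `digest`, `build_fingerprint` and `needs_revalidation` are hypothetical, not PacketAI code; the sketch only shows that changing either input yields a new build identity, which is what a CI system could watch in order to re-run the tests.

```python
import hashlib


def digest(chunks):
    """Return a SHA-256 hex digest over a sequence of byte strings."""
    h = hashlib.sha256()
    for chunk in chunks:
        # Prefix each chunk with its length so [b"ab", b"c"] != [b"a", b"bc"].
        h.update(len(chunk).to_bytes(8, "big"))
        h.update(chunk)
    return h.hexdigest()


def build_fingerprint(code_files, model_params):
    """Identify a (hypothetical) ML build by BOTH its code and its parameters.

    code_files: mapping of file path -> source bytes.
    model_params: serialized model parameters as bytes.
    """
    code_hash = digest(
        path.encode() + b"\0" + source for path, source in sorted(code_files.items())
    )
    data_hash = digest([model_params])
    return digest([code_hash.encode(), data_hash.encode()])


def needs_revalidation(previous_fp, current_fp):
    """A changed fingerprint means code OR data changed: rerun the test suite."""
    return previous_fp != current_fp
```

A retrained model with untouched code still produces a new fingerprint, so it cannot slip into production untested.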


MLOps framework

That is why we have built a CI/CD ML testing platform, which triggers a unit test suite whenever ML software or models change. ML software unit tests are no different from regular unit tests: using data fixtures, they check that boolean assertions still hold after the code has changed. ML model tests differ in spirit, as they evaluate a model’s quality against ground truth data. The goal is no longer to assert exact non-regression but to assert that overall quality stays within certain limits. Take face detection (FD) as an example. An FD model retrained on a different corpus (call it v2) will respond differently from the previous version (v1) on the same ground truth data: its false positives (detected non-faces) and false negatives (missed faces) will differ. The test should determine whether releasing v2 to production adds business value, for instance by checking that v2’s classification accuracy exceeds v1’s.
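The face detection example can be sketched as a quality gate in Python. This is a minimal illustration, assuming each model’s predictions on the ground truth set are available as a mapping from a hypothetical `image_id` to a boolean; `evaluate`, `release_gate` and the `min_gain` margin are illustrative names, not part of PacketAI’s platform. Note that the gate asserts only aggregate quality: v1 and v2 may make different individual mistakes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One ground-truth example: an image id and whether it contains a face."""
    image_id: str
    has_face: bool


def evaluate(predictions, ground_truth):
    """Compare binary predictions (image_id -> bool) with ground truth."""
    fp = fn = correct = 0
    for sample in ground_truth:
        predicted = predictions[sample.image_id]
        if predicted == sample.has_face:
            correct += 1
        elif predicted:
            fp += 1  # detected a face where there is none
        else:
            fn += 1  # missed a real face
    return {"accuracy": correct / len(ground_truth), "fp": fp, "fn": fn}


def release_gate(v1_predictions, v2_predictions, ground_truth, min_gain=0.0):
    """Approve v2 only if its accuracy exceeds v1's by more than min_gain.

    Returns (approved, v1_metrics, v2_metrics).
    """
    v1 = evaluate(v1_predictions, ground_truth)
    v2 = evaluate(v2_predictions, ground_truth)
    return v2["accuracy"] - v1["accuracy"] > min_gain, v1, v2
```

In a CI job, the boolean returned by `release_gate` would decide whether the retrained model may be promoted; the per-version false positive and false negative counts are kept for inspection.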

We introduced model versioning and packaging (think: versioned tarballs for ML models), ground truth data fixtures to evaluate the quality of our learning algorithms, and model traceability at client level, so we know exactly which model, trained on which dataset, runs for which client. As a further improvement, each client’s model configuration is automatically traced into a dedicated Notion page, providing straightforward, user-friendly and up-to-date documentation.
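A “versioned tarball” can be sketched with Python’s standard `tarfile` module. The layout below (a `manifest.json` next to a `model.bin`, with SHA-256 hashes of the parameters and of the training data) is an assumption made for illustration, not PacketAI’s actual package format; it shows how a manifest lets a deployed model be traced back to the dataset that produced it.

```python
import hashlib
import io
import json
import tarfile


def sha256_bytes(data):
    return hashlib.sha256(data).hexdigest()


def package_model(name, version, params, dataset_id, dataset_bytes):
    """Build an in-memory, versioned .tar.gz holding the model and a manifest.

    Returns (archive_name, archive_bytes).
    """
    manifest = {
        "name": name,
        "version": version,
        "params_sha256": sha256_bytes(params),
        "dataset_id": dataset_id,
        "dataset_sha256": sha256_bytes(dataset_bytes),
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for member, payload in (
            ("manifest.json", json.dumps(manifest, sort_keys=True).encode()),
            ("model.bin", params),
        ):
            info = tarfile.TarInfo(name=member)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return f"{name}-{version}.tar.gz", buf.getvalue()


def read_manifest(archive_bytes):
    """Extract the manifest and check it against the packaged parameters."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        manifest = json.load(tar.extractfile("manifest.json"))
        params = tar.extractfile("model.bin").read()
    # Reject archives whose parameters do not match the recorded hash.
    if sha256_bytes(params) != manifest["params_sha256"]:
        raise ValueError("model parameters do not match manifest")
    return manifest
```

Storing hashes rather than the training data itself keeps the package small while still pinning exactly which data the model was trained on.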

Although we initially considered relying entirely on MLflow, we ended up using only MLflow Pipelines because of a Kubernetes version incompatibility. Even so, we are now in full control of the models we ship to production and benefit from a complete MLOps workflow, namely:

  1. ML code versioning & packaging

  2. ML model validation & quality control 

  3. ML model registry storage

  4. ML client deployment configuration traceability
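Steps 2 to 4 above can be tied together in a small sketch. The `ModelRegistry` class below is a hypothetical, in-memory stand-in written for illustration; it is not MLflow’s registry API nor PacketAI’s implementation. It only accepts models that passed validation, only deploys registered models, and can answer which model, trained on which dataset, runs for a given client.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry, a stand-in for a real model store."""
    validated: dict = field(default_factory=dict)    # (name, version) -> manifest
    deployments: dict = field(default_factory=dict)  # client -> (name, version)

    def register(self, manifest, passed_quality_gate):
        """Store a model version only if it passed validation."""
        if not passed_quality_gate:
            raise ValueError(f"{manifest['name']} {manifest['version']} failed validation")
        self.validated[(manifest["name"], manifest["version"])] = manifest

    def deploy(self, client, name, version):
        """Pin a client to a validated, registered model version."""
        if (name, version) not in self.validated:
            raise KeyError(f"{name} {version} is not a validated model")
        self.deployments[client] = (name, version)

    def trace(self, client):
        """Which model, trained on which dataset, runs for this client?"""
        name, version = self.deployments[client]
        manifest = self.validated[(name, version)]
        return {"model": name, "version": version, "dataset_id": manifest["dataset_id"]}
```

The output of `trace` is the kind of record that could be rendered into per-client documentation.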


Testing pipeline data

PacketAI’s data pipeline comprises multiple processing blocks that receive some input (typically a Kafka topic of logs or metrics), process this input (typically by applying ML models) and write it back to the pipeline. 
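The block structure can be sketched with chained Python generators. In this simplified illustration, a plain iterable of JSON strings stands in for the Kafka input topic, and `model` is any hypothetical callable returning a score; a real block would consume from and produce to Kafka through a client library. The sketch shows how each block only sees what its upstream block emitted.

```python
import json


def parse_block(raw_messages):
    """Upstream block: decode raw JSON log lines from the input stream."""
    for raw in raw_messages:
        yield json.loads(raw)


def scoring_block(records, model):
    """Downstream block: apply an ML model and attach its output."""
    for record in records:
        yield {**record, "anomaly_score": model(record)}


def run_pipeline(raw_messages, model):
    """Chain blocks acyclically: each block consumes its upstream's output."""
    return list(scoring_block(parse_block(raw_messages), model))
```

For example, `run_pipeline(messages, lambda r: 1.0 if r["latency_ms"] > 500 else 0.0)` would score each log record with a toy threshold model standing in for a trained one.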

At the time of writing, this pipeline processes up to 1 TB of data per day while maintaining real-time operation and low latency and absorbing traffic peaks. Because data flows acyclically through the pipeline, responsibility for data correctness flows from upstream to downstream: downstream processing units receive and crunch data sent by upstream modules.
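For a sense of scale, reading the daily figure as terabytes in decimal units (1 TB = 10^12 bytes, an assumption), the average sustained throughput works out as follows; traffic peaks sit above this average:

```latex
\frac{10^{12}\ \text{B}}{86\,400\ \text{s}} \approx 1.16 \times 10^{7}\ \text{B/s} \approx 11.6\ \text{MB/s}
```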

These chained dependencies mean that a single data issue or skew immediately propagates to the rest of the downstream pipeline. Moreover, at PacketAI the source data cannot be entirely trusted, since it is external client data (metrics and logs). For reasons that may not be communicated to PacketAI, log formats might change. In other words, the system could break because of untrusted data, without any code change. Sanity check tests are therefore of the utmost importance.

At PacketAI, we have covered all the building blocks that make up the pipeline by implementing sanity checks at block input. These checks verify that the data format is compliant (presence of required fields, non-nullity of certain fields, expected value ranges, etc.). If an issue is found, our guideline is to emit an error log (which can be analyzed later) and update a Prometheus metric (typically a gauge) that we can use for alerting.
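An input sanity check of this kind can be sketched as follows. The schema (`timestamp`, `host`, `value` and the value range) is hypothetical, and to keep the example dependency-free a `collections.Counter` stands in for the Prometheus metric; in production it could be replaced by a gauge from the `prometheus_client` library, labelled by block and reason. This sketch drops invalid records after logging them; whether to drop, quarantine or forward them is a design choice the guideline above leaves open.

```python
import logging
from collections import Counter

logger = logging.getLogger("pipeline.sanity")

# Stand-in for a Prometheus metric, keyed by (block, reason).
invalid_records = Counter()

# Hypothetical schema for a metrics record entering a block.
REQUIRED_FIELDS = ("timestamp", "host", "value")
NON_NULL_FIELDS = ("host", "value")
VALUE_RANGE = (0.0, 1e9)


def check_record(record):
    """Return the list of violations found in one input record."""
    problems = []
    for name in REQUIRED_FIELDS:
        if name not in record:
            problems.append(f"missing:{name}")
    for name in NON_NULL_FIELDS:
        if name in record and record[name] is None:
            problems.append(f"null:{name}")
    value = record.get("value")
    if isinstance(value, (int, float)) and not VALUE_RANGE[0] <= value <= VALUE_RANGE[1]:
        problems.append("out_of_range:value")
    return problems


def sanity_filter(block_name, records):
    """Pass valid records through; log and count invalid ones."""
    for record in records:
        problems = check_record(record)
        if problems:
            logger.error("%s rejected record %r: %s", block_name, record, problems)
            for reason in problems:
                invalid_records[(block_name, reason)] += 1
            continue
        yield record
```

Placing `sanity_filter` in front of each block means an upstream format change surfaces as an error log and a metric increase at the first block it reaches, rather than as silent corruption further down the pipeline.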



PacketAI has built a highly scalable, ML-based SaaS solution that helps companies detect, monitor and resolve issues by analyzing logs and metrics.

We have presented our 360-degree testing strategy, in which we apply testing to code, ML models and pipeline data. We strongly believe that the time invested in testing pays for itself many times over.

Since we started enforcing this approach, we have built more stable, more reliable and highly scalable software that is used by clients of various sizes. The team has gained solid trust and confidence in the software it builds, which in turn fosters innovation and a risk-taking culture: testing reduces risk.

Part one can be found here


PacketAI is the world’s first autonomous monitoring solution built for the modern age. Our solution has been developed after 5 years of intensive research in French and Canadian laboratories, and we are backed by leading VCs. To learn more, book a free demo and get started today!

Follow @PacketSAS on Twitter or PacketAI Company page on LinkedIn to get the latest news!
