PacketAI is an AIOps startup that helps companies monitor, anticipate and troubleshoot infrastructure incidents faster using AI. In this blog post, we share some of the engineering best practices, focused on testing both software and models, that we have put in place to ensure our platform can work at scale, handling petabytes of production data in real time.
Part 1 of this blog focuses on the software engineering side.
PacketAI has built a highly scalable, AI-based SaaS solution to help companies analyze, understand and detect early signals of incidents found in logs and metrics. At the core of this system is an event-streaming data pipeline that can process very large data volumes in real time.
One of the cornerstones of highly available and scalable architectures is engineering quality. While quality is (or should be) a desirable property across the software industry, expectations are higher when working at large scale: defects surface much faster and have a greater impact, both technically and for the business, because of the volume of data that can be corrupted or mismanaged in a very short time.
The rest of this article presents some of the engineering best practices centered on testing, both software and models, that we have put in place at PacketAI.
Software is like a Swiss watch: beautiful and complex, but extremely fragile. Writing automated tests is one of the best time investments a software team can make, as it raises the quality bar, prevents defects from reappearing and increases the team's confidence in the product they build. Investing time in testing is also a mark of good software craftsmanship, as popularized by Extreme Programming.
While a unit test ensures that an individual component is functionally correct, the role of an integration test is to verify that several components interact as expected. A team needs both. There are many other kinds of tests as well, such as load/stress tests, smoke tests and end-to-end tests to name a few, but let's focus on unit and integration tests as they are of primary importance.
Automated unit testing in CI/CD pipelines (at PacketAI we use GitHub Actions) helps detect defects early in the development cycle (the earlier the better), giving developers a chance to fix their commit before it contaminates the production environment. Developers should run unit tests often on their machines, and CI/CD pipelines should trigger unit tests after each commit. To avoid slowing down the development pace, unit tests must be fast and simple. Finally, one other key benefit of testing is that it reduces a developer's fear of introducing a bug into the system.
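To make this concrete, here is a minimal sketch, in Go (one of the languages in our stack), of the kind of fast, dependency-free unit test that can run on every commit. The `logparse` package and `ParseSeverity` function are hypothetical illustrations, not actual PacketAI code.

```go
package logparse

import "testing"

// ParseSeverity is a hypothetical helper that maps a raw log level to a
// normalized severity; it stands in for any small, pure function worth
// unit testing.
func ParseSeverity(raw string) string {
	switch raw {
	case "ERR", "ERROR", "error":
		return "error"
	case "WARN", "WARNING", "warning":
		return "warning"
	default:
		return "info"
	}
}

// TestParseSeverity is fast and has no external dependencies, so it can
// run on every commit without slowing developers down.
func TestParseSeverity(t *testing.T) {
	cases := map[string]string{
		"ERROR":   "error",
		"WARN":    "warning",
		"verbose": "info",
	}
	for raw, want := range cases {
		if got := ParseSeverity(raw); got != want {
			t.Errorf("ParseSeverity(%q) = %q, want %q", raw, got, want)
		}
	}
}
```

Tests of this shape take milliseconds to run, which is exactly what allows CI to execute the whole suite on every push without getting in anyone's way.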
Integration tests are a different and more complex sport: they require writing interaction scenarios, possibly mocking some third-party dependencies, preparing more complex fixtures and clarifying which functional scope is to be tested. Integration tests generally take longer to run, so CI/CD should trigger them not after every commit (which would risk blocking the development flow), but when merging a development branch into the master branch, or when updating the staging environment. The benefit of integration testing in a complex piece of software is to stabilize and secure the correctness of the business logic and processing chains that make it up.
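Below is a rough sketch, again in Go, of the shape such a test can take when a third-party dependency is hidden behind an interface and replaced by an in-memory fake. The `AlertStore`, `Pipeline` and `fakeStore` names are illustrative only and do not correspond to our actual codebase.

```go
package pipeline

import "testing"

// AlertStore stands in for a third-party dependency (e.g. a database or
// message broker client) hidden behind an interface so it can be swapped
// for a fake in tests.
type AlertStore interface {
	Save(alert string) error
}

// Pipeline is a hypothetical processing chain: it filters raw events and
// persists the ones that look like incidents.
type Pipeline struct {
	store AlertStore
}

func (p *Pipeline) Process(events []string) error {
	for _, e := range events {
		if len(e) > 0 && e[0] == '!' { // toy "incident" rule for the sketch
			if err := p.store.Save(e); err != nil {
				return err
			}
		}
	}
	return nil
}

// fakeStore records alerts in memory instead of calling a real backend.
type fakeStore struct{ saved []string }

func (f *fakeStore) Save(alert string) error {
	f.saved = append(f.saved, alert)
	return nil
}

// TestPipelineStoresIncidents verifies that the filtering logic and the
// storage component interact as expected.
func TestPipelineStoresIncidents(t *testing.T) {
	store := &fakeStore{}
	p := &Pipeline{store: store}

	if err := p.Process([]string{"ok", "!disk full", "!cpu spike"}); err != nil {
		t.Fatalf("Process returned error: %v", err)
	}
	if len(store.saved) != 2 {
		t.Fatalf("expected 2 alerts saved, got %d", len(store.saved))
	}
}
```

Hiding the real dependency behind a small interface keeps the scenario hermetic and reasonably fast, while still exercising how the components of the chain interact.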
Company codebases generally have varying levels of test coverage. Some areas of the code might be well over 95% coverage, some might be below 30% while some parts of the code are simply not tested at all.
At PacketAI we decided on a strategy to increase our code coverage, with 3 guidelines:
1. Every new piece of code must come with unit tests.
2. Bugs found in production point to missing tests, which should then be written.
3. Legacy code without tests does not need to be fully covered at once, but should be covered progressively.
To enforce guideline 1 at the team level, we rely on SonarQube Quality Gates to check coverage and reject commits that decrease code coverage, or, put differently, code that comes unaccompanied by unit tests. Since our stack is made of Node.js, Go and Python, we respectively use Jest, GoCover and coverage.py.
Guideline 2 is a practice that must be enforced, even if it is tempting to just “fix and ship” without writing an additional test case. When a bug is found in production, it means the code was not tested against the specific data that triggered it. The first step is thus to extend a test fixture with data similar to what caused the issue. The second step is to run the unit tests, without changing either the tests or the code; this may reveal other bugs that would otherwise have gone unnoticed. The third step is to write a test that fails on the newly added data, and the fourth step is to fix the code so that all tests pass.
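Here is a sketch of how these steps can look with a table-driven Go test: the fixture gains a case shaped like the production payload, the test fails, and the code is then fixed until everything passes. Both `ParseTimestamp` and the timestamp formats are hypothetical, chosen only to keep the example self-contained.

```go
package logparse

import (
	"fmt"
	"testing"
	"time"
)

// ParseTimestamp is a hypothetical normalizer, included only to make the
// example self-contained; the real code under test would live elsewhere.
func ParseTimestamp(raw string) (string, error) {
	layouts := []string{
		time.RFC3339,
		"2006-01-02 15:04:05", // layout added in step 4, while fixing the bug
	}
	for _, layout := range layouts {
		if t, err := time.Parse(layout, raw); err == nil {
			return t.UTC().Format(time.RFC3339), nil
		}
	}
	return "", fmt.Errorf("unrecognized timestamp %q", raw)
}

// TestParseTimestamp shows the fixture being extended after a production
// incident: the last case mirrors the payload that triggered it and was
// added (and seen failing) before the code above was touched.
func TestParseTimestamp(t *testing.T) {
	cases := []struct {
		name, input, want string
	}{
		{"iso8601", "2022-03-01T10:00:00Z", "2022-03-01T10:00:00Z"},
		{"with millis", "2022-03-01T10:00:00.123Z", "2022-03-01T10:00:00Z"},
		// Steps 1 and 3: new case copied from the data that broke production.
		{"space separator", "2022-03-01 10:00:00", "2022-03-01T10:00:00Z"},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got, err := ParseTimestamp(c.input)
			if err != nil {
				t.Fatalf("ParseTimestamp(%q) returned error: %v", c.input, err)
			}
			if got != c.want {
				t.Errorf("ParseTimestamp(%q) = %q, want %q", c.input, got, c.want)
			}
		})
	}
}
```

The discipline matters more than the specific code: the failing case is committed alongside the fix, so the same defect cannot silently reappear later.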
Regarding guideline 3 on legacy code without coverage, we felt it would not make sense to cover everything at once, as the team would have spent the next five sprints doing nothing but writing tests. Instead, we allocate a fraction of each sprint's effort (typically 10%) to covering the most important and most used parts of the code. This tradeoff has proved very useful and agile, as it lets us modulate the amount of time spent covering historical code.
Hope you enjoyed reading. Stay tuned to learn more about the best practices for building and testing highly scalable ML models in part 2.
PacketAI is the world's first autonomous monitoring solution built for the modern age. Our solution was developed after 5 years of intensive research in French and Canadian laboratories, and we are backed by leading VCs. To know more, book a free demo and get started today!
Follow @PacketSAS on Twitter or the PacketAI company page on LinkedIn to get the latest news!