AIOps! Finding Incidents in the Haystack of Alerts

Tal Borenstein
April 11, 2024

The challenge of managing alerts and incidents looms large in the realm of AIOps (Artificial Intelligence for IT Operations). Surprisingly, traditional approaches often rely on rule-based engines, raising the question, "Where is AI?".

At Keep, we're pushing the boundaries with an innovative solution: an AI-powered alert correlation engine.

The Challenge: A Sea of Alerts, Incidents & Events

In today's complex IT landscape, organizations face an avalanche of alerts stemming from diverse sources like Kubernetes clusters, cloud providers, and third-party tools. These alerts often lack context, inundating teams with noise and making it arduous to discern the crucial incidents buried within. This deluge not only overwhelms IT teams but also hampers incident response and resolution times, ultimately impacting business continuity.

A New Approach: AI Automated Alert Correlation

At Keep, we're spearheading a paradigm shift with our AI automated alert correlation engine. Unlike conventional rule-based systems, our approach leverages state-of-the-art large language models (LLMs) trained on real-world incident data.

There are two key components to our model training strategy:

  1. General Incident Training: We train our models on a rich repository of general incidents sourced from open knowledge bases like GitLab's production infrastructure incidents. By analyzing historical incidents across diverse environments, our models learn to discern patterns and anomalies, enabling more accurate incident detection (see the sketch after this list).
  2. Customer-Specific Training: Upon onboarding customers, our engineers fine-tune the model to adapt to their specific data sources and events. This entails learning from past incidents, customizing rules, and refining the model's understanding of the customer's infrastructure. This tailored approach ensures that our AI engine is finely attuned to each customer's unique environment, maximizing its effectiveness.
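
To make the first step concrete, here is a minimal sketch of how historical incidents might be turned into supervised training pairs: alerts that co-occurred in the same past incident become positive examples, while alerts drawn from unrelated incidents become negatives. The data shapes and helper names are illustrative assumptions, not Keep's actual training pipeline.

```python
from itertools import combinations
import random

def build_training_pairs(incidents):
    """incidents: list of dicts like {"id": "...", "alerts": [alert_text, ...]}."""
    positives, negatives = [], []
    for incident in incidents:
        # Alerts that occurred together in one past incident -> positive pairs.
        for a, b in combinations(incident["alerts"], 2):
            positives.append((a, b, 1))
    # Alerts drawn from two different incidents -> negative pairs.
    for first, second in combinations(incidents, 2):
        negatives.append((random.choice(first["alerts"]),
                          random.choice(second["alerts"]), 0))
    return positives + negatives

pairs = build_training_pairs([
    {"id": "INC-1", "alerts": ["pod OOMKilled in payments", "payments p99 latency > 2s"]},
    {"id": "INC-2", "alerts": ["disk 95% full on db-01", "postgres replication lag rising"]},
])
```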

The Quest for Efficiency: Finding the Needle

Our AI engine finds the needle in the haystack, swiftly identifying and correlating related alerts into incidents in real time. By automating the correlation process, we alleviate the burden on IT teams, enabling them to focus their efforts on resolving critical issues promptly. Furthermore, our AI engine continuously learns and evolves, adapting to shifting patterns and emerging threats, thereby enhancing its efficacy over time.
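
As an illustration of what such a correlation step could look like, the sketch below greedily attaches each incoming alert to the most similar open incident. The embed() placeholder and the similarity threshold are assumptions made for the example; the production model, features, and thresholds are not described in this post.

```python
import numpy as np

def embed(alert_text: str) -> np.ndarray:
    # Placeholder embedding; in practice this would call the trained model.
    rng = np.random.default_rng(abs(hash(alert_text)) % (2 ** 32))
    return rng.normal(size=128)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlate(alerts, threshold=0.8):
    """Greedily attach each alert to the most similar open incident, or open a new one."""
    incidents = []  # each incident: {"alerts": [...], "centroid": np.ndarray}
    for alert in alerts:
        vec = embed(alert)
        best = max(incidents, key=lambda inc: cosine(vec, inc["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["alerts"].append(alert)
            best["centroid"] = (best["centroid"] + vec) / 2  # keep a running centroid
        else:
            incidents.append({"alerts": [alert], "centroid": vec})
    return incidents
```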

Feedback Loop

But how does our AI engine achieve such precision?

Our feedback loop is a dynamic mechanism that ensures continuous improvement. Users can flag alerts that were falsely correlated or add missing alerts from the feed. These actions serve as input for refining the model, updating the training data, and enhancing its accuracy.
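
Here is a hedged sketch of what capturing that feedback could look like, with hypothetical field names and a simple JSONL file standing in for the real training-data store.

```python
import json
import time

def record_feedback(alert_id: str, incident_id: str, verdict: str,
                    path: str = "feedback.jsonl") -> None:
    """verdict: 'false_correlation' for wrongly grouped alerts,
    'missing_alert' for alerts the engine should have attached."""
    event = {
        "alert_id": alert_id,
        "incident_id": incident_id,
        "verdict": verdict,
        "timestamp": time.time(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")  # later folded into the fine-tuning set

record_feedback("alert-123", "incident-42", "false_correlation")
```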

Continuous Learning and Improvement

With each iteration of the feedback loop, our AI engine becomes increasingly adept at distinguishing between genuine incidents and false alarms. It learns from past mistakes, fine-tunes its model, and hones its ability to discern subtle nuances in the alert data. As a result, the efficacy of our alert correlation engine evolves organically, ensuring unparalleled accuracy and reliability in incident detection.

Integrations, integrations, integrations!

Last but not least, Keep operates as a plug-in solution, eliminating the need to overhaul your existing IRM stack or migrate users to a new platform. Simply integrate it into your infrastructure, let the model learn, and seamlessly enjoy the benefits of an enhanced workflow.
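
As one way to picture the plug-in idea, a small forwarder could push alerts from an existing monitoring tool into the correlation engine over HTTP. The endpoint URL, header name, and payload shape below are illustrative assumptions, not Keep's documented API.

```python
import requests

def forward_alert(alert: dict, api_key: str) -> None:
    # Forward an alert from an existing tool (e.g. a Prometheus webhook payload)
    # to the correlation engine. URL, header, and fields are hypothetical.
    resp = requests.post(
        "https://keep.example.com/api/alerts",
        json={
            "name": alert.get("alertname", "unknown"),
            "severity": alert.get("severity", "warning"),
            "source": "prometheus",
            "payload": alert,
        },
        headers={"X-API-KEY": api_key},
        timeout=5,
    )
    resp.raise_for_status()
```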
