AIOps! Finding Incidents in the Haystack of Alerts

Tal Borenstein
April 11, 2024

The challenge of managing alerts and incidents looms large in the realm of AIOps (Artificial Intelligence for IT Operations). Surprisingly, traditional approaches often rely on rule-based engines, raising the question, "Where is AI?".

At Keep, we're pushing the boundaries with an innovative solution: an AI-powered alert correlation engine.

The Challenge: A Sea of Alerts, Incidents & Events

In today's complex IT landscape, organizations face an avalanche of alerts stemming from diverse sources like Kubernetes clusters, cloud providers, and third-party tools. These alerts often lack context, inundating teams with noise and making it arduous to discern the crucial incidents buried within. This deluge not only overwhelms IT teams but also hampers incident response and resolution times, ultimately impacting business continuity.

A New Approach: AI Automated Alert Correlation

At Keep, we're spearheading a paradigm shift with our AI automated alert correlation engine. Unlike conventional rule-based systems, our approach leverages state-of-the-art large language models (LLMs) trained on real-world incident data.

There are two key components to our model training strategy:

  1. General Incident Training: We train our models on a rich repository of general incidents sourced from open knowledge bases like GitLab's production infrastructure incidents. By analyzing historical incidents across diverse environments, our models learn to discern patterns and anomalies, enabling more accurate incident detection (see the sketch after this list).
  2. Customer-Specific Training: Upon onboarding customers, our engineers fine-tune the model to adapt to their specific data sources and events. This entails learning from past incidents, customizing rules, and refining the model's understanding of the customer's infrastructure. This tailored approach ensures that our AI engine is finely attuned to each customer's unique environment, maximizing its effectiveness.
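
To make the first step concrete, here is a minimal sketch of how historical incidents might be turned into supervised training pairs: alerts that co-occurred in the same past incident become positive examples, while alerts drawn from unrelated incidents become negatives. The data shapes and helper names are illustrative assumptions, not Keep's actual training pipeline.

```python
from itertools import combinations
import random

def build_training_pairs(incidents):
    """incidents: list of dicts like {"id": "...", "alerts": [alert_text, ...]}."""
    positives, negatives = [], []
    for incident in incidents:
        # Alerts that occurred together in one past incident -> positive pairs.
        for a, b in combinations(incident["alerts"], 2):
            positives.append((a, b, 1))
    # Alerts drawn from two different incidents -> negative pairs.
    for first, second in combinations(incidents, 2):
        negatives.append((random.choice(first["alerts"]),
                          random.choice(second["alerts"]), 0))
    return positives + negatives

pairs = build_training_pairs([
    {"id": "INC-1", "alerts": ["pod OOMKilled in payments", "payments p99 latency > 2s"]},
    {"id": "INC-2", "alerts": ["disk 95% full on db-01", "postgres replication lag rising"]},
])
```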

The Quest for Efficiency: Finding the Needle

Our AI engine finds the needle in the haystack, swiftly identifying and correlating related alerts into incidents in real time. By automating the correlation process, we alleviate the burden on IT teams, enabling them to focus their efforts on resolving critical issues promptly. Furthermore, our AI engine continuously learns and evolves, adapting to shifting patterns and emerging threats, thereby enhancing its efficacy over time.
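
As an illustration of what such a correlation step could look like, the sketch below greedily attaches each incoming alert to the most similar open incident. The embed() placeholder and the similarity threshold are assumptions made for the example; the production model, features, and thresholds are not described in this post.

```python
import numpy as np

def embed(alert_text: str) -> np.ndarray:
    # Placeholder embedding; in practice this would call the trained model.
    rng = np.random.default_rng(abs(hash(alert_text)) % (2 ** 32))
    return rng.normal(size=128)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlate(alerts, threshold=0.8):
    """Greedily attach each alert to the most similar open incident, or open a new one."""
    incidents = []  # each incident: {"alerts": [...], "centroid": np.ndarray}
    for alert in alerts:
        vec = embed(alert)
        best = max(incidents, key=lambda inc: cosine(vec, inc["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["alerts"].append(alert)
            best["centroid"] = (best["centroid"] + vec) / 2  # keep a running centroid
        else:
            incidents.append({"alerts": [alert], "centroid": vec})
    return incidents
```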

Feedback Loop

But how does our AI engine achieve such precision?

Our feedback loop is a dynamic mechanism that ensures continuous improvement. Users can flag alerts that were falsely correlated or add missing alerts from the feed. These actions serve as input for refining the model, updating the training data, and enhancing its accuracy.
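
Here is a hedged sketch of what capturing that feedback could look like, with hypothetical field names and a simple JSONL file standing in for the real training-data store.

```python
import json
import time

def record_feedback(alert_id: str, incident_id: str, verdict: str,
                    path: str = "feedback.jsonl") -> None:
    """verdict: 'false_correlation' for wrongly grouped alerts,
    'missing_alert' for alerts the engine should have attached."""
    event = {
        "alert_id": alert_id,
        "incident_id": incident_id,
        "verdict": verdict,
        "timestamp": time.time(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")  # later folded into the fine-tuning set

record_feedback("alert-123", "incident-42", "false_correlation")
```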

Continuous Learning and Improvement

With each iteration of the feedback loop, our AI engine becomes increasingly adept at distinguishing between genuine incidents and false alarms. It learns from past mistakes, fine-tunes its model, and hones its ability to discern subtle nuances in the alert data. As a result, the efficacy of our alert correlation engine evolves organically, ensuring unparalleled accuracy and reliability in incident detection.

Integrations, integrations, integrations!

Last but not least, Keep operates as a plug-in solution, eliminating the need to overhaul your existing IRM stack or migrate users to a new platform. Simply integrate it into your infrastructure, let the model learn, and seamlessly enjoy the benefits of an enhanced workflow.
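
As one way to picture the plug-in idea, a small forwarder could push alerts from an existing monitoring tool into the correlation engine over HTTP. The endpoint URL, header name, and payload shape below are illustrative assumptions, not Keep's documented API.

```python
import requests

def forward_alert(alert: dict, api_key: str) -> None:
    # Forward an alert from an existing tool (e.g. a Prometheus webhook payload)
    # to the correlation engine. URL, header, and fields are hypothetical.
    resp = requests.post(
        "https://keep.example.com/api/alerts",
        json={
            "name": alert.get("alertname", "unknown"),
            "severity": alert.get("severity", "warning"),
            "source": "prometheus",
            "payload": alert,
        },
        headers={"X-API-KEY": api_key},
        timeout=5,
    )
    resp.raise_for_status()
```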
