Current problems in the alerting space

Shahar Glazner
March 19, 2023

In the past month, we have engaged in conversations with over 50 engineers, engineering managers, and SREs to gather feedback on the products we are developing at Keep. Here is a summary of what we have learned.

TL;DR: Creating and maintaining effective alerts, avoiding alert fatigue, and promoting a strong alerting culture can be difficult tasks. Keep addresses these challenges by treating alerts as code, integrating with observability tools, and using LLMs. Want to learn more? Reach out to me at shahar@keephq.dev, join the Keep Slack, or just start playing with Keep.

Why alerting?

Now, let’s discuss why alerting is crucial.

With the increasing reliance on digital systems, monitoring and alerting have become more critical than ever. Downtime or slow website performance can lead to significant financial losses and drive customers to competitors.

To meet the growing demand for observability, there has been a significant proliferation of observability tools, with companies like Datadog, Grafana, New Relic, Elasticsearch, and Splunk dominating the market. In addition, many other tools like Sentry, Coralogix, Sumo Logic, and BugSnag have also gained widespread adoption.

According to the Grafana Labs Observability Survey 2023, 52% of companies use six or more observability tools (!), which highlights the scale of the problem.

Tools that fire alerts

So what’s the problem, if any?

There are several current problems around alerting that hinder companies from getting the most out of their monitoring systems. Let’s review them.

1. Alert fatigue

One of the significant problems with alerting is alert fatigue. When you receive too many alerts, it can be challenging to determine which ones are critical and which ones can be ignored. This can lead to a lack of attention to alerts, which can ultimately result in missing critical issues.

2. Monitoring your monitoring

It’s essential to ensure that your monitoring tools are working correctly and providing accurate and timely alerts. However, it can be challenging to keep track of all the monitoring tools that you’re using, and it can be even more challenging to keep them all in sync.

3. Alert maintenance

As systems change, alerts may need to be updated to reflect these changes. However, it can be challenging to keep track of all the alerts that need to be updated, leading to outdated alerts being triggered (or not triggered at all).

4. Lack of developer experience

Many companies rely on developers to set up and maintain their alerting systems. However, not all developers have the necessary experience to create effective alerting systems.

5. Too many tools

Finally, many companies run too many alerting tools, which makes it hard to keep track of alerts and leads to confusion and missed ones.

In conclusion, monitoring and alerting are critical for companies to ensure their systems run efficiently and effectively. However, several current problems with alerting can hinder companies from getting the most out of their monitoring systems. By understanding these problems, companies can work to overcome them and ensure that their monitoring systems provide accurate and timely alerts.

How does Keep solve this?

Keep High-Level Architecture

Keep takes a holistic approach to solving all these problems with alerting. By treating alerts as code and integrating them with existing observability tools, along with leveraging AI, Keep can achieve the following objectives:

  1. Measure engagement, reduce noise, add context, and fine-tune alerts.
  2. Single pane of glass: by integrating with all of your observability tools, Keep decouples and deduplicates alerts, so if your database is down, you’ll know the alerts from the frontend are a symptom of that.
  3. Using Keep’s CI/CD integration, you can maintain your alerts as easily as adding a new step in your GitHub Action.
  4. Decouple what you want to alert on from the actual tool: using Keep’s semantic layer, a developer can simply write “Using Datadog, alert me when service X is down for more than Y minutes” (a hypothetical sketch of this idea follows below).
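
To make “alerts as code” concrete, here is a minimal Python sketch of what a declarative alert rule could look like. Everything in it (the AlertRule dataclass, apply_alert, the provider name, and the thresholds) is hypothetical and for illustration only; it is not Keep’s actual API. The point is that the rule lives in your repository, is reviewed like any other code, and a CI step (for example, a GitHub Action) can apply it to whichever observability tool you use.

```python
# Hypothetical "alerts as code" sketch; NOT Keep's actual API.
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    name: str          # human-readable alert name
    provider: str      # observability tool that evaluates it, e.g. "datadog"
    condition: str     # what to alert on, independent of the provider's query syntax
    for_minutes: int   # how long the condition must hold before firing
    notify: list[str] = field(default_factory=list)  # where to route the alert

# "Using Datadog, alert me when service X is down for more than Y minutes"
payments_down = AlertRule(
    name="payments-service-down",
    provider="datadog",
    condition="service:payments is down",
    for_minutes=5,
    notify=["slack#oncall", "pagerduty"],
)

def apply_alert(rule: AlertRule) -> None:
    """Stand-in for the step a CI job would run: translate the rule into the
    provider's native monitor and create or update it via the provider's API."""
    print(
        f"Would create '{rule.name}' on {rule.provider}: "
        f"{rule.condition} for > {rule.for_minutes}m -> {rule.notify}"
    )

if __name__ == "__main__":
    apply_alert(payments_down)
```

Because the rule is plain code, renaming a service or changing a threshold becomes a pull request rather than a manual edit in a vendor UI, which addresses the alert-maintenance problem described above.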

Summary

In today’s digital age, monitoring and alerting are critical for ensuring the smooth functioning of a company’s systems. However, several problems with alerting hinder companies from getting the most out of their monitoring systems: alert fatigue, the difficulty of monitoring your monitoring, alert maintenance, lack of developer experience, and too many alerting tools. Keep, a platform that treats alerts as code and integrates with existing observability tools, takes a holistic approach to solving these issues. By leveraging AI and a semantic layer, Keep measures engagement, reduces noise, adds context, and fine-tunes alerts; provides a single pane of glass across all of your observability tools; and makes alerts easy to maintain.
