Extending Grafana with Workflows

Gil Zellner
September 14, 2023

About the writer

Gil runs infrastructure at HourOne.ai, a Generative AI video startup.

Being the single infrastructure guy at a company, Gil has to be very prudent with his tool selection and overall time spent.

Here is a nice example of his work:

At HourOne we love our Grafana Cloud for graphs and alerts. We connected all our sources and have tons of graphs, logs, traces, and so on, but we were unhappy with the alert management flow and with how alerts actually reach our users.

Grafana’s alert manager leaves much to be desired, and we were not getting the results we wanted with it.

Its alert filtering and routing flow is complex to manage, which resulted in us simply using it less.

Furthermore, we want to be able to attempt some auto-remediation before we actually wake people up in the middle of the night.

Unfortunately, Grafana does not offer that. Fortunately, Keep does.

I heard of Keep from a friend who saw them on Hacker News (https://news.ycombinator.com/item?id=37381268) and recommended we try to build something together.

Here are a few examples of cool things we can do:

We all have that one service that, for some Phantom-de-la-machina reason, gets stuck and requires some manual action, like a reboot or a REST call.

Why not automate it? You can do it with Keep's Python provider or its bash provider. Keep also supports providers that Grafana does not, like Auth0, which we use. This lets us automate around authentication issues our users run into, which we previously could not do.
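
To make this concrete, here is a minimal sketch of the kind of remediation logic you could hand to a Python-provider step (the same idea works as a one-liner in the bash provider). The service, its health endpoint, and the restart URL are hypothetical placeholders for whatever your stuck service exposes, not part of Keep's or Grafana's API.

```python
# A minimal sketch of the remediation logic we drop into a Python-provider step.
# Everything here is a placeholder: the health and restart endpoints are
# hypothetical internal URLs, not anything provided by Keep or Grafana.
import requests

HEALTH_URL = "http://internal-service:8080/health"          # hypothetical
RESTART_URL = "http://internal-service:8080/admin/restart"  # hypothetical


def remediate() -> str:
    """Restart the service via its REST API if it looks stuck."""
    try:
        healthy = requests.get(HEALTH_URL, timeout=5).ok
    except requests.RequestException:
        healthy = False

    if healthy:
        return "service healthy, nothing to do"

    # The service is stuck: fire the same REST call we used to run by hand.
    requests.post(RESTART_URL, timeout=10).raise_for_status()
    return "restart triggered"


if __name__ == "__main__":
    print(remediate())
```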

Another thing that Keep allows us to do is cross-reference multiple sources very easily.

It's not that Grafana can’t do that, but Keep makes it way more convenient. For example, you may get an alert from your scaling system, but you want to cross-reference it with data coming in from another system before you actually raise an alert.

This is made easier with Keep's workflows, which let you run multiple successive steps.
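
As a rough illustration of what such a chain of steps boils down to (plain Python for readability, not Keep's workflow syntax), picture one step that reads the pending alert from the scaling system and a second step that checks a metric from another system; only when both agree do we actually raise the alert. The endpoints, field names, and threshold below are invented for the example.

```python
# A rough sketch of the cross-referencing idea: the scaling alert is only
# escalated when a second, independent source confirms there is a real problem.
# The URLs, field names, and threshold are made up for the example.
import requests

SCALING_ALERT_URL = "http://scaler.internal/api/pending-alert"  # hypothetical
QUEUE_METRICS_URL = "http://queue.internal/api/depth"           # hypothetical
QUEUE_DEPTH_THRESHOLD = 1000


def should_escalate() -> bool:
    """Escalate only when both sources agree something is wrong."""
    # Step 1: read the alert the scaling system wants to raise.
    scaling_alert = requests.get(SCALING_ALERT_URL, timeout=5).json()

    # Step 2: cross-reference it with a metric from another system.
    queue_depth = requests.get(QUEUE_METRICS_URL, timeout=5).json()["depth"]

    return scaling_alert.get("firing", False) and queue_depth > QUEUE_DEPTH_THRESHOLD


if __name__ == "__main__":
    print("escalate" if should_escalate() else "suppress")
```

In a workflow, each of those checks becomes its own step, and the final condition decides whether a human gets paged.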
