In this blog post, we will demonstrate the strength of a unified API in consolidating and managing alerts. We will create a workflow that, when an alert triggers, generates a ServiceNow ticket, enriches it with data from a production database, and notifies the stakeholders.
What's in it for you
This technical blog post will guide you on how to:
Connect with any tool that generates alerts.
Aggregate all alerts in a single interface.
Enhance alerts with additional information from various sources.
Automate processes based on these alerts.
Introduction
Before we delve into the technicalities, let's have a brief introduction.
Despite a trend towards consolidation in the observability space, many organizations still utilize multiple tools to generate alerts.
Grafana's 2023 Observability Survey indicates that over 52% of companies employ more than six observability tools, often due to legacy systems, cost considerations, and specific functionality needs.
Keep terminology
Providers - These are third-party tools that either trigger alerts, enrich alerts with data, or notify about alerts. Providers can include monitoring tools, databases, ticketing systems, or communication platforms.
Alerts - Essentially, these are events or signals triggered by your monitoring tools.
Workflows - Configurable automated processes that are initiated in response to alerts, designed to streamline your response to incidents by executing predefined actions, such as opening tickets, sending notifications, or initiating scripts.
Enough talking, let's get started
Install the CLI
# Clone Keep's repo and install Keep CLI using poetry
gh repo clone keephq/keep
cd keep && poetry install
# or just install it using pip
pip install keepcli
# for other installation options (e.g. docker) see https://docs.keephq.dev/cli/installation
Configure the CLI
You can easily start using Keep's managed platform without any other prerequisites by running:
# This will launch an oauth2 flow that will create a tenant for you and set you up
keep auth login
If you are using the open source version of Keep, run keep config to configure the CLI:
You can start using Keep without an API key (the default docker-compose configuration). Once you deploy Keep to production, read about how to add authentication.
keep config
Enter your keep url [http://localhost:8080]:
Enter your api key (leave blank for localhost) []:
Config file created at .keep.yaml
Verify everything is OK
keep whoami
Api key valid{'tenant_id': 'XXXXXX-YYYY-ZZZZ-8b5a-939af9d7f63b'}
Connect your tools
Now we are going to connect all the providers we need: Datadog to get the alerts, ServiceNow to create and track the tickets, MySQL to enrich alerts with production data, and Slack to notify whoever needs to know.
# no providers
keep provider list
+----+------+------+--------------+-------------------+
| ID | Type | Name | Installed by | Installation time |
+----+------+------+--------------+-------------------+
+----+------+------+--------------+-------------------+
# list available providers
keep provider list --available
+-----------------+-------------------------------------------------------+
| Provider | Description |
+-----------------+-------------------------------------------------------+
| aks | Enrich alerts using data from AKS. |
...
| zabbix | Pull/Push alerts from Zabbix into Keep. |
| zenduty | Create incident in Zenduty. |
+-----------------+-------------------------------------------------------+
Now, let's connect Datadog, MySQL, ServiceNow, and Slack.
If we go to the UI at http://localhost:3000, we can see that the providers are installed:
Review the alerts
In this section, we are going to review the alerts, show how the alert looks in Keep, and demonstrate enrichment and filtering capabilities.
# list all alerts
keep alert list
+---------------------+------------------------------------------------------------------+--------------------------------+----------+-----------+-------------+---------+-------------+---------------------+
| ID | Fingerprint | Name | Severity | Status | Environment | Service | Source | Last Received |
+---------------------+------------------------------------------------------------------+--------------------------------+----------+-----------+-------------+---------+-------------+---------------------+
| 7308482322424796476 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c589d7464bd8bf863cdeb76b864 | Unauthorized access to API | high | Recovered | undefined | None | ['datadog'] | 2023-11-13T15:32:38 |
| 7308433771057253905 | 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0 | Test Alert | critical | Recovered | undefined | None | ['datadog'] | 2023-11-13T14:44:24 |
...
more alerts
...
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
# Filter by attribute
keep alert list --filter service=keep-api
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
| ID | Fingerprint | Name | Severity | Status | Environment | Service | Source | Last Received |
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
| 120458754 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c589d7464bd8bf863cdeb76b864 | 4xx-5xx Status Code Alert | medium | OK | production | keep-api | ['datadog'] | 2023-05-31T10:59:29+00:00 |
| 122655180 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c389d7464bd8bf863cdeb76b864 | Unauthorized access to API | high | OK | production | keep-api | ['datadog'] | 2023-11-08T13:29:31+00:00 |
+-----------+----------------------------+----------------------------+----------+--------+-------------+----------+-------------+---------------------------+
keep alert list --filter severity=critical
+-----------+-------------+------------+----------+--------+-------------+----------+-------------+---------------------------+
| ID | Fingerprint | Name | Severity | Status | Environment | Service | Source | Last Received |
+-----------+-------------+------------+----------+--------+-------------+----------+-------------+---------------------------+
| 117493674 | 5bcafb4ea94749f36871a2e1169d5252ecfb1c589d7464bd8bf863cdeb76b862 | Prod Alert | critical | OK | production | tal-test | ['datadog'] | 2023-09-13T11:20:25+00:00 |
+-----------+-------------+------------+----------+--------+-------------+----------+-------------+---------------------------+
But what's even cooler is that we can filter on ANY alert attribute. Combined with Keep's ability to enrich alerts with attributes from different sources, this lets you achieve very cool things.
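To make the idea of filtering on any attribute concrete, here is a minimal Python sketch. It is not Keep's implementation: the alert records are hypothetical dictionaries shaped like the rows above, and `parse_filter` and `filter_alerts` are illustrative names that only mimic a `key=value` filter.

```python
from typing import Any

# Hypothetical alert records, shaped like the rows printed by `keep alert list`.
ALERTS: list[dict[str, Any]] = [
    {"name": "4xx-5xx Status Code Alert", "severity": "medium", "service": "keep-api"},
    {"name": "Unauthorized access to API", "severity": "high", "service": "keep-api"},
    {"name": "Prod Alert", "severity": "critical", "service": "tal-test"},
]


def parse_filter(expression: str) -> tuple[str, str]:
    """Split a 'key=value' expression into its key and value."""
    key, sep, value = expression.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {expression!r}")
    return key.strip(), value.strip()


def filter_alerts(alerts: list[dict[str, Any]], expression: str) -> list[dict[str, Any]]:
    """Keep only the alerts whose attribute `key` equals `value`.

    Alerts that do not carry the attribute at all never match.
    """
    key, value = parse_filter(expression)
    return [alert for alert in alerts if key in alert and str(alert[key]) == value]


if __name__ == "__main__":
    for alert in filter_alerts(ALERTS, "service=keep-api"):
        print(alert["name"])
```

Because the filter only looks at dictionary keys, an attribute added later by enrichment is filterable in exactly the same way as a built-in one such as severity.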
To bring this down to earth, let's say we created a ticket in our ticketing system (we will of course automate this later). We want to correlate the alert with the ticket so that we can sync any further changes to the ticket.
We also want information about the customer, which is stored in our customers database. We can get this information by running:
select * from customers where customer_id = %customer_id%
+----+---------------------+------------+---------------------+--------------+---------------+-----------------------------+--------------------------------------+
| id | name | tier | email | phone_number | address | notes | customer_id |
+----+---------------------+------------+---------------------+--------------+---------------+-----------------------------+--------------------------------------+
| 1 | ABC Corporation | Enterprise | abc@example.com | 123-456-7890 | 123 Main St | Customer since 2010 | 05bc71af-820a-11ee-b23f-0242ac110002 |
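For readers who want to try the lookup without a MySQL server, here is a small sketch that runs the same kind of query against an in-memory SQLite database. SQLite is swapped in purely for convenience, the table holds only a few of the columns shown above, and `build_demo_db` and `lookup_customer` are hypothetical helper names. The customer ID is passed as a bound parameter rather than pasted into the SQL string.

```python
import sqlite3
from typing import Optional


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the customers table with one sample row."""
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets us read columns by name
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, tier TEXT, email TEXT, customer_id TEXT)"
    )
    conn.execute(
        "INSERT INTO customers (name, tier, email, customer_id) VALUES (?, ?, ?, ?)",
        (
            "ABC Corporation",
            "Enterprise",
            "abc@example.com",
            "05bc71af-820a-11ee-b23f-0242ac110002",
        ),
    )
    conn.commit()
    return conn


def lookup_customer(conn: sqlite3.Connection, customer_id: str) -> Optional[dict]:
    """Return name, tier and email for a customer, or None if there is no match."""
    row = conn.execute(
        "SELECT name, tier, email FROM customers WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return dict(row) if row is not None else None


if __name__ == "__main__":
    connection = build_demo_db()
    print(lookup_customer(connection, "05bc71af-820a-11ee-b23f-0242ac110002"))
```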
Assuming we want to enrich the alert with the customer ID, customer email, and ticket ID:
keep alert enrich --fingerprint 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0 customer_id=1234 ticket_id=INC00001 customer_email=abd@example.com
# Now we can filter by ticket ID:
keep alert list --filter ticket_id=INC00001
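Conceptually, enrichment attaches extra key/value pairs to the alert identified by a fingerprint. The sketch below shows that idea with plain dictionaries. It is an illustration under assumptions, not Keep's storage model: `parse_enrichments` and `enrich_alert` are hypothetical names, and merging the new values on top of the existing attributes is a choice made for this sketch.

```python
from typing import Any


def parse_enrichments(pairs: list[str]) -> dict[str, str]:
    """Turn ['ticket_id=INC00001', ...] into {'ticket_id': 'INC00001', ...}."""
    enrichments: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        enrichments[key] = value
    return enrichments


def enrich_alert(
    alerts_by_fingerprint: dict[str, dict[str, Any]],
    fingerprint: str,
    enrichments: dict[str, str],
) -> dict[str, Any]:
    """Merge the enrichments into the alert stored under `fingerprint`.

    Raises KeyError when no alert with that fingerprint exists.
    """
    alert = alerts_by_fingerprint[fingerprint]
    alert.update(enrichments)
    return alert


if __name__ == "__main__":
    fingerprint = "39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0"
    store = {fingerprint: {"name": "Test Alert", "severity": "critical"}}
    enrich_alert(store, fingerprint, parse_enrichments(["customer_id=1234", "ticket_id=INC00001"]))
    # Enriched attributes can now be used to find the alert again.
    print([a["name"] for a in store.values() if a.get("ticket_id") == "INC00001"])
```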
Create workflows
So far, we connected the providers, reviewed our Datadog alerts, and enriched them with customer data and ServiceNow tickets.
Now we will wrap it up and automate the whole process using Keep Workflows.
Anatomy of a Workflow
Before diving into the CLI commands, let's review the workflow we are going to run. Keep Workflows are very similar to GitHub Actions workflows. We didn't want to reinvent the wheel here, so the syntax should feel pretty familiar.
workflow:
  # some metadata
  id: example-workflow
  description: Enriches the alert and creates a ServiceNow ticket
  # The first part is the triggers. We want this workflow to execute only on critical alerts. We can filter on any alert attribute and also use regex.
  triggers:
    - type: alert
      filters:
        - key: severity
          value: critical
  steps:
    # The first step is to enrich the alert based on the SQL query. We want to add the customer name, email, and tier.
    - name: get-more-details
      provider:
        type: mysql
        config: " {{ providers.mysql-prod }} "
        # {{ alert.customer_id }} will be rendered at runtime
        with:
          query: "select * from customers where customer_id = {{ alert.customer_id }}"
          # Add those fields to the alert so we can use them
          enrich_alert:
            - key: customer_name
              value: results[0].name
            - key: customer_email
              value: results[0].email
            - key: customer_tier
              value: results[0].tier
  # second part - the actions
  actions:
    # create the ServiceNow ticket
    - name: create-service-now-ticket
      # If the alert already has a ticket ID, don't create a new one (imagine an alert that was triggered and then resolved: we don't want another ticket for the resolved alert). Also, we want to create a ticket only for Enterprise customers.
      if: "not '{{ alert.ticket_id }}' and '{{ alert.customer_tier }}' == 'Enterprise'"
      provider:
        type: servicenow
        config: " {{ providers.servicenow }} "
        with:
          table_name: INCIDENT
          payload:
            short_description: "{{ alert.name }} - {{ alert.description }} [created by Keep]"
            description: "{{ alert.description }}"
          # Enrich the alert with these fields so we will have correlation between the alert and the ticket
          enrich_alert:
            - key: ticket_type
              value: servicenow
            - key: ticket_id
              value: results.sys_id
            - key: ticket_url
              value: results.link
            - key: ticket_status
              value: results.stage
            - key: table_name
              value: "{{ alert.annotations.ticket_type }}"
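Two mechanics in this workflow are worth spelling out: deciding whether an alert matches the trigger filters, and substituting `{{ ... }}` placeholders at runtime. The Python sketch below illustrates both under stated assumptions. It is not Keep's engine: `matches_trigger` and `render` are hypothetical names, the `regex` flag on a filter is an invention of this sketch rather than Keep's syntax, and rendering unknown attributes as empty text is an assumption.

```python
import re
from typing import Any

# Matches placeholders such as {{ alert.customer_id }} or {{ providers.mysql-prod }}.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.-]+)\s*\}\}")


def matches_trigger(alert: dict[str, Any], filters: list[dict[str, Any]]) -> bool:
    """Return True only if the alert satisfies every filter.

    A filter with the (hypothetical) flag regex=True treats its value as a
    regular expression that must match the whole attribute value.
    """
    for flt in filters:
        if flt["key"] not in alert:
            return False
        actual = str(alert[flt["key"]])
        if flt.get("regex"):
            if re.fullmatch(flt["value"], actual) is None:
                return False
        elif actual != flt["value"]:
            return False
    return True


def render(template: str, context: dict[str, Any]) -> str:
    """Replace each {{ dotted.path }} with the value found in `context`.

    Paths that cannot be resolved render as an empty string (an assumption).
    """

    def resolve(match: re.Match) -> str:
        value: Any = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                return ""
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(resolve, template)


if __name__ == "__main__":
    alert = {"name": "Prod Alert", "severity": "critical", "customer_id": "1234"}
    if matches_trigger(alert, [{"key": "severity", "value": "critical"}]):
        query = "select * from customers where customer_id = {{ alert.customer_id }}"
        print(render(query, {"alert": alert}))
```

With empty rendering for missing attributes, a placeholder like `'{{ alert.ticket_id }}'` becomes an empty quoted string when the alert has no ticket yet, which is what lets a condition of the form `not '...'` detect the missing ticket.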
Now that we have the workflow, let's apply and run it.
# no workflows
keep workflow list
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------------------------+--------------------------+----------------+
| ID | Workflow ID | Start Time | Triggered By | Status | Execution Time |
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------------------------+--------------------------+----------------+
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------------------------+--------------------------+----------------+
# Apply it:
keep workflow apply -f workflow.yaml
Workflow examples/workflows/blogpost.yml applied successfully
Workflow id: 652fe84e-5239-425b-8271-40accb1af72f
Workflow revision: 1
keep workflow list
+--------------------------------------+-------------------+-----------------------------------+----------+--------------+----------------------------+----------------------------+----------------------------+-----------------------+
| ID | Name | Description | Revision | Created By | Creation Time | Update Time | Last Execution Time | Last Execution Status |
+--------------------------------------+-------------------+-----------------------------------+----------+--------------+----------------------------+----------------------------+----------------------------+-----------------------+
| 652fe84e-5239-425b-8271-40accb1af72f | blogpost-workflow | Enrich the alerts and open ticket | 10 | keep | 2023-11-12T08:08:43.585226 | 2023-11-12T14:34:07.544301 | None | None |
+--------------------------------------+-------------------+-----------------------------------+----------+--------------+----------------------------+----------------------------+----------------------------+-----------------------+
# Run it with alert as input
keep workflow run --workflow-id blogpost-workflow --fingerprint 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0
Workflow blogpost-workflow run successfully
Workflow Run ID 33e71955-81f4-4118-9771-7b638f8c59b0
# Let's review the run
keep workflow runs logs 33e71955-81f4-4118-9771-7b638f8c59b0
+-----+----------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ID | Timestamp | Message |
+-----+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 733 | 2023-11-13T16:11:40.462000 | Running step get-more-details |
| 734 | 2023-11-13T16:11:40.463000 | Action get-more-details evaluated to run! Reason: no condition, hence true. |
| 735 | 2023-11-13T16:11:40.524000 | Step get-more-details ran successfully |
| 736 | 2023-11-13T16:11:40.525000 | Running action create-service-now-ticket |
| 737 | 2023-11-13T16:11:40.525000 | Action create-service-now-ticket evaluated to run! Reason: no condition, hence true. |
| 738 | 2023-11-13T16:11:44.784000 | Created ticket: {'result': {'parent': '', 'made_sla': 'true', 'caused_by': '', 'watch_list': '', 'upon_reject': 'cancel', 'sys_updated_on': '2023-11-13 14:11:41', 'child_incidents': '0', 'hold_reason': '', 'origin_table': '', 'task_effective_number': 'INC' |
| 740 | 2023-11-13T16:12:47.552000 | Enriching alert |
| 741 | 2023-11-13T16:12:47.572000 | Alert enriched |
| 742 | 2023-11-13T16:12:47.573000 | Action create-service-now-ticket ran successfully |
| 743 | 2023-11-13T16:12:47.574000 | Finish to run workflow blogpost-workflow |
+-----+----------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
keep workflow runs list
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------+-------------+----------------------------------------------------+----------------+
| ID | Workflow ID | Start Time | Triggered By | Status | Error | Execution Time |
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------+-------------+----------------------------------------------------+----------------+
| 103df0aa-d6be-4290-9938-1563f8005e55 | 75c7eba2-51dc-411d-b39c-a500c98e3893 | 2023-11-13T14:11:37.911898 | manually by apikey@keephq.dev | success | None | 69 |
+--------------------------------------+--------------------------------------+----------------------------+-------------------------------+-------------+----------------------------------------------------+----------------+
# Let's make sure the alert was enriched with the ticket id
keep alert get 39f3a0d2cfe87885be0283c94ffd1cc35be1fd1bdd108c86ddf8e9db5d3bd7f0 | jq .ticket_id
"0f9982ec97667110beb0f0571153afa1"
# :)
Voila! Now, whenever a critical alert is triggered, it is automatically enriched with data from our production database. If the alert belongs to an Enterprise customer and has no ticket yet, a ServiceNow ticket is created and the alert is updated with the ticket ID. Notifying the relevant people, for example through the Slack provider we connected earlier, can be added as another action.
Next steps
1. Join our Slack and start talking about alerting and monitoring.
2. ⭐️ Keep repo.
3. Start playing with Keep (no credit card needed!) at https://platform.keephq.dev
4. Missing any provider/feature? Just open an issue at https://github.com/keephq/keep and we will add it ASAP (and of course contributions are welcome!)