Alert Triage is a powerful feature of Castrel that transforms how SRE teams handle alerts, eliminating alert fatigue and enabling data-driven decision-making directly within your existing workflow.
Alert Triage is an AI-powered alert classification and analysis system that automatically evaluates incoming alerts from your monitoring tools (Prometheus, Grafana, etc.) and provides intelligent insights directly in Slack. Instead of manually logging into multiple systems to investigate each alert, Castrel acts as your intelligent "co-pilot," delivering contextual analysis within seconds.
Alert Triage automatically classifies alerts into three categories:
1. Trigger Alert Triage
Alert Triage can be triggered in two ways:
2. Receive Report
When an alert is triggered, Castrel automatically analyzes it and responds with a report containing three parts:
3. Further Actions
Based on the classification, take appropriate action. If the alert is confirmed as an Incident, you can ask Castrel to initiate Incident Investigation to perform root cause analysis.
You can also provide feedback (Helpful / Not Helpful) to help Castrel improve its analysis accuracy.
Castrel follows a systematic approach to analyze each alert:
1. Alert Rule & History Analysis
Castrel first retrieves the monitoring rule configuration and historical alert data to determine:
2. Alert Object Identification & Observability Check
Castrel extracts the alert object (Service or Infrastructure) from the alert message and fields, then checks its observability data if integrations are available. You can also create Knowledges with alert trigger mode to inform Castrel how to investigate specific alerts.
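As a rough illustration of the identification step, the sketch below picks an alert object out of an alert's label set. It is not Castrel's actual logic: the function name, the label keys it checks, and the rule that service labels win over infrastructure labels are all assumptions made for the example. The `job` and `instance` keys follow common Prometheus labeling conventions.

```python
from typing import Optional, Tuple

# Hypothetical label keys; a real deployment may use different ones.
SERVICE_KEYS = ("service", "job")
INFRA_KEYS = ("instance", "node", "host")


def identify_alert_object(labels: dict) -> Optional[Tuple[str, str]]:
    """Return (object_type, object_name), or None if no known label is set."""
    # Assumption: a service-level label takes priority over a host-level one.
    for key in SERVICE_KEYS:
        if labels.get(key):
            return ("Service", labels[key])
    for key in INFRA_KEYS:
        if labels.get(key):
            return ("Infrastructure", labels[key])
    return None


print(identify_alert_object({"alertname": "HighLatency", "service": "checkout"}))
print(identify_alert_object({"alertname": "DiskFull", "instance": "db-1:9100"}))
```

An alert that carries none of the expected labels returns `None`. Cases like that are where documenting the alert in a Knowledge would help most.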
3. Anomaly & Impact Assessment
Based on the observability data, Castrel evaluates:
slo.md in the object's knowledge)

| Tip | Description |
|---|---|
| Connect All Data Sources | The more data Castrel can access (metrics, logs, traces), the more accurate its classifications |
| Provide Feedback | Rate each report to help Castrel improve. Unless you explicitly authorize it, our team cannot access your feedback conversations or related alerts. |
| Document Known Behaviors | Use Knowledges to inform Castrel about expected behaviors (e.g., backup windows) |
Castrel combines two analysis approaches:
Black Box Analysis examines the alert rule's historical data to identify patterns like frequent firing, auto-recovery behavior, and correlations with scheduled tasks.
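The kind of pattern summary described here can be sketched in a few lines of Python. This is a minimal illustration, not Castrel's implementation: the `AlertEvent` type, the `summarize_history` function, the 7-day window, and the 10-minute "quick recovery" cutoff are all hypothetical choices.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AlertEvent:
    fired_at: datetime
    resolved_at: Optional[datetime]  # None while the alert is still firing


def summarize_history(
    events: List[AlertEvent],
    now: datetime,
    window: timedelta = timedelta(days=7),             # assumed lookback
    quick_recovery: timedelta = timedelta(minutes=10),  # assumed cutoff
) -> dict:
    """Summarize how often a rule fired and how often it recovered by itself."""
    recent = [e for e in events if now - e.fired_at <= window]
    fired = len(recent)
    auto_recovered = sum(
        1
        for e in recent
        if e.resolved_at is not None
        and e.resolved_at - e.fired_at <= quick_recovery
    )
    # A dominant hour of day hints at a scheduled task (e.g., a nightly backup).
    hours = Counter(e.fired_at.hour for e in recent)
    peak_hour, peak_count = hours.most_common(1)[0] if hours else (None, 0)
    return {
        "fired": fired,
        "auto_recovery_ratio": auto_recovered / fired if fired else 0.0,
        "peak_hour": peak_hour,
        "peak_hour_share": peak_count / fired if fired else 0.0,
    }
```

A high `auto_recovery_ratio` together with a high `peak_hour_share` would suggest a recurring, self-healing alert tied to a schedule. A rule that rarely fires and does not recover quickly deserves closer inspection.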
White Box Analysis inspects the alert object's observability data to determine if there's actual impact on golden signals (latency, errors, traffic, saturation) or SLO violations.
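A golden-signal check of this kind might look like the following sketch. The `GoldenSignals` type, the `assess_impact` function, and every threshold (the 20% traffic-drop tolerance and the 0.9 saturation limit) are assumptions made for illustration. They are not values taken from Castrel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GoldenSignals:
    latency_p99_ms: float
    error_rate: float    # fraction of failed requests, 0.0 to 1.0
    traffic_rps: float   # requests per second
    saturation: float    # resource utilization, 0.0 to 1.0


def assess_impact(
    current: GoldenSignals,
    baseline: GoldenSignals,
    slo_latency_ms: float,
    slo_error_rate: float,
    traffic_drop_tolerance: float = 0.2,  # assumed: flag a drop of more than 20%
    saturation_limit: float = 0.9,        # assumed utilization ceiling
) -> List[str]:
    """Return the names of the golden signals that show real impact."""
    impacted = []
    if current.latency_p99_ms > slo_latency_ms:
        impacted.append("latency")
    if current.error_rate > slo_error_rate:
        impacted.append("errors")
    if current.traffic_rps < baseline.traffic_rps * (1 - traffic_drop_tolerance):
        impacted.append("traffic")
    if current.saturation > saturation_limit:
        impacted.append("saturation")
    return impacted


baseline = GoldenSignals(120.0, 0.001, 500.0, 0.40)
current = GoldenSignals(480.0, 0.030, 490.0, 0.55)
print(assess_impact(current, baseline, slo_latency_ms=300.0, slo_error_rate=0.01))
```

An empty result means none of the four signals crossed its limit, so the alert likely has no user-facing impact. Any entry in the list points to the signal worth investigating first.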