
Alert Triage

Efficiently triage and manage alerts with AI-powered insights.

Alert Triage is a powerful feature of Castrel that transforms how SRE teams handle alerts, eliminating alert fatigue and enabling data-driven decision-making directly within your existing workflow.

What is Alert Triage?

Alert Triage is an AI-powered alert classification and analysis system that automatically evaluates incoming alerts from your monitoring tools (Prometheus, Grafana, etc.) and provides intelligent insights directly in Slack. Instead of manually logging into multiple systems to investigate each alert, Castrel acts as your intelligent "co-pilot," delivering contextual analysis within seconds.

Alert Triage automatically classifies alerts into three categories:

  • Noise (False Positive): Expected system behavior, no action required
  • Potential Risk (Warning): Requires attention but not urgent
  • Incident: Confirmed user-facing impact, immediate action needed
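As a rough illustration, the three categories could be modeled as an enumeration in your own tooling. The names `TriageCategory` and `requires_immediate_action` are hypothetical and are not part of any Castrel API.

```python
from enum import Enum


class TriageCategory(Enum):
    """Hypothetical names for the three categories listed above."""

    NOISE = "noise"                    # false positive: expected behavior, no action
    POTENTIAL_RISK = "potential_risk"  # warning: needs attention but is not urgent
    INCIDENT = "incident"              # confirmed user-facing impact


def requires_immediate_action(category: TriageCategory) -> bool:
    # Only a confirmed incident calls for immediate action.
    return category is TriageCategory.INCIDENT
```

For example, `requires_immediate_action(TriageCategory.POTENTIAL_RISK)` evaluates to `False`, since a warning needs attention but is not urgent.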

How to Use Alert Triage

1. Trigger Alert Triage

Alert Triage can be triggered in two ways:

  • Webhook Integration: Automatically triggered when alerts are sent to connected IM tools. Currently supports Slack, with more integrations coming soon (see Roadmap).
  • Manual Trigger: Initiate triage manually from the Castrel interface or an integrated IM tool for any alert you want to analyze.

2. Receive Report

When an alert is triggered, Castrel automatically analyzes it and responds with an analysis report containing three parts:

  • Classification Result: Whether it's noise, potential risk, or an incident, along with a confidence score
  • Alert Self-Analysis: Evaluates the alert's characteristics based on monitoring rules and historical data, including:
    • Whether this alert fires frequently
    • Whether it typically auto-recovers
    • Historical firing patterns and recovery times
  • Evidence Summary: Key data points and observations supporting the classification
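To make the report structure concrete, the sketch below models the three parts as plain data classes. Every class and field name here is an assumption chosen for illustration, as is the 0.0 to 1.0 range of the confidence score; the actual report format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class SelfAnalysis:
    # Hypothetical fields mirroring the three self-analysis bullets above.
    fires_frequently: bool
    usually_auto_recovers: bool
    typical_recovery_minutes: float


@dataclass
class TriageReport:
    classification: str   # "noise", "potential_risk", or "incident"
    confidence: float     # assumed to lie between 0.0 and 1.0
    self_analysis: SelfAnalysis
    evidence: list[str] = field(default_factory=list)

    def summary(self) -> str:
        """Render the report as short plain text, one part per line."""
        lines = [
            f"Classification: {self.classification} ({self.confidence:.0%} confidence)",
            f"Fires frequently: {self.self_analysis.fires_frequently}",
            f"Usually auto-recovers: {self.self_analysis.usually_auto_recovers}",
            f"Typical recovery: {self.self_analysis.typical_recovery_minutes} min",
        ]
        lines += [f"Evidence: {item}" for item in self.evidence]
        return "\n".join(lines)
```

A report built with `TriageReport("noise", 0.87, SelfAnalysis(True, True, 4.0), ["CPU spike matches the backup window"])` would render its first line as `Classification: noise (87% confidence)`.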

3. Further Actions

Based on the classification, take appropriate action. If the alert is confirmed as an Incident, you can ask Castrel to initiate Incident Investigation to perform root cause analysis.

You can also provide feedback (Helpful / Not Helpful) to help Castrel improve its analysis accuracy.

How Castrel Triages Alerts

Castrel follows a systematic approach to analyze each alert:

1. Alert Rule & History Analysis

Castrel first retrieves the monitoring rule configuration and historical alert data to determine:

  • Whether this alert fires frequently
  • Whether it typically auto-recovers
  • Historical patterns and recovery times
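The kind of history summary described above can be sketched as follows. This is not Castrel's implementation: the `AlertFiring` record, the `manually_resolved` flag, and the `frequent_threshold` default are all assumptions, and the input is assumed to be already limited to the time window of interest.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional


@dataclass
class AlertFiring:
    fired_at: datetime
    resolved_at: Optional[datetime]  # None while the alert is still firing
    manually_resolved: bool = False  # hypothetical flag: a human intervened


def summarize_history(firings, frequent_threshold=5):
    """Summarize past firings of one alert rule.

    frequent_threshold is an illustrative default, not a Castrel setting.
    """
    recovered = [f for f in firings if f.resolved_at is not None]
    auto = [f for f in recovered if not f.manually_resolved]
    recovery_minutes = [
        (f.resolved_at - f.fired_at).total_seconds() / 60 for f in recovered
    ]
    return {
        "fires_frequently": len(firings) >= frequent_threshold,
        "auto_recovery_ratio": len(auto) / len(firings) if firings else 0.0,
        "median_recovery_minutes": median(recovery_minutes) if recovery_minutes else None,
    }
```

An alert that fires often, almost always recovers on its own, and does so within minutes is a plausible candidate for the Noise category, while one that rarely fires and rarely recovers unaided deserves a closer look.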

2. Alert Object Identification & Observability Check

Castrel extracts the alert object (Service or Infrastructure) from the alert message and fields, then checks that object's observability data if the relevant integrations are available. You can also create Knowledges with alert trigger mode to tell Castrel how to investigate specific alerts.
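One simple way to picture alert object identification is a lookup over the alert's label fields. The label keys below (`service`, `app`, `job`, `instance`, `node`, `host`, `pod`) are common conventions assumed for this sketch; they are not a list of the fields Castrel actually inspects.

```python
from typing import Optional, Tuple

# Hypothetical label names; real alerts may use different keys.
SERVICE_LABELS = ("service", "app", "job")
INFRA_LABELS = ("instance", "node", "host", "pod")


def identify_alert_object(labels: dict) -> Optional[Tuple[str, str]]:
    """Return (kind, name) for the first matching label, preferring services."""
    for key in SERVICE_LABELS:
        if labels.get(key):
            return ("service", labels[key])
    for key in INFRA_LABELS:
        if labels.get(key):
            return ("infrastructure", labels[key])
    # No recognizable object; the alert message itself would need parsing.
    return None
```

With labels such as `{"alertname": "HighLatency", "service": "checkout", "instance": "10.0.0.5:9100"}`, this sketch picks the service `checkout` over the instance, on the assumption that the service is the more useful starting point for an observability check.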

3. Anomaly & Impact Assessment

Based on the observability data, Castrel evaluates:

  • Whether the metric shows a sudden anomaly or frequent fluctuations
  • Whether golden signals (latency, traffic, errors, saturation) are degraded
  • Whether SLOs are violated (you can define custom SLOs via slo.md in the object's knowledge)
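The first two checks above can be approximated with basic statistics. The sketch below is one possible heuristic, not Castrel's method: it compares the most recent samples against a baseline using a z-score, uses the baseline's coefficient of variation to spot a metric that always fluctuates, and treats each golden-signal limit as an upper bound (which suits latency, errors, and saturation, but not a drop in traffic). All thresholds and metric names are assumptions.

```python
from statistics import mean, pstdev


def classify_metric_shape(values, recent=3, z_threshold=3.0, cv_threshold=0.3):
    """Label a series 'sudden_anomaly', 'frequent_fluctuation', or 'stable'."""
    if len(values) <= recent + 1:
        raise ValueError("need more history than the recent window")
    baseline, tail = values[:-recent], values[-recent:]
    mu, sigma = mean(baseline), pstdev(baseline)
    deviation = abs(mean(tail) - mu)
    # A flat baseline has sigma == 0; any deviation then counts as a jump.
    if sigma > 0:
        z = deviation / sigma
    else:
        z = float("inf") if deviation > 0 else 0.0
    if z >= z_threshold:
        return "sudden_anomaly"
    # High relative spread in the baseline means the metric always swings.
    if mu != 0 and sigma / abs(mu) >= cv_threshold:
        return "frequent_fluctuation"
    return "stable"


def degraded_signals(current: dict, limits: dict) -> list:
    """Return the names of signals whose current value exceeds its upper limit."""
    return sorted(name for name, limit in limits.items() if current.get(name, 0) > limit)
```

A series that sits near 100 and then jumps to about 185 is labeled `sudden_anomaly`, whereas one that swings between roughly 50 and 150 throughout is labeled `frequent_fluctuation`, because its recent values are not unusual relative to its own history.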

Tips for Better Results

  • Connect All Data Sources: The more data Castrel can access (metrics, logs, traces), the more accurate its classifications become.
  • Provide Feedback: Your feedback helps Castrel improve. Unless you explicitly authorize it, our team cannot access your feedback conversations or related alerts.
  • Document Known Behaviors: Use Knowledges to inform Castrel about expected behaviors (e.g., backup windows).

Common Questions