Incident Investigation is a core feature of Castrel that enables SRE teams to quickly identify root causes through AI-powered analysis combined with human expertise. Whether you're responding to an alert on your phone or conducting a deep investigation at your desk, Castrel provides the right tools for every scenario.
Incident Investigation is an AI-driven root cause analysis system that helps you pinpoint the source of production issues. When an alert is confirmed as an incident (via Alert Triage), Castrel automatically scans your infrastructure—K8s events, pod logs, database metrics, and more—to identify potential root causes and visualize how failures propagate through your system.
Unlike traditional monitoring tools that only tell you what went wrong, Incident Investigation tells you why it happened and where to look, enabling you to resolve incidents faster.
1. Start an Investigation
You can initiate an incident investigation in two ways:
2. Review the Analysis Report
After initiating an investigation, Castrel performs a comprehensive scan and generates an analysis report containing:
3. Choose Your Next Step
Based on the analysis results, you can take one of three paths:
| Scenario | Action | Description |
|---|---|---|
| Clear Root Cause | Confirm & Close | AI identified a definitive root cause with strong evidence. Review and confirm to close the investigation. |
| Need Manual Investigation | Get Report | Download a context summary listing excluded possibilities and remaining suspects so you can continue the investigation manually. |
| Right Direction, Need More | Provide Guidance | Use your domain knowledge to guide AI deeper into a specific direction. |
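The three paths above can be sketched as a simple decision rule. This is an illustrative sketch only: the confidence threshold, the `operator_has_lead` flag, and the action names are assumptions, not Castrel's actual decision logic.

```python
# Hypothetical sketch: choosing among the three next-step paths.
# The 0.9 threshold and parameter names are illustrative assumptions.

def next_step(ai_confidence: float, operator_has_lead: bool) -> str:
    if ai_confidence >= 0.9:
        return "confirm_and_close"   # Clear Root Cause: strong evidence, confirm it
    if operator_has_lead:
        return "provide_guidance"    # Right Direction, Need More: steer the AI deeper
    return "get_report"              # Need Manual Investigation: take the summary offline

print(next_step(0.95, False))  # confirm_and_close
```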
Incident Investigation is designed for bidirectional collaboration—you're not just a passive recipient of AI analysis. You can actively guide the investigation using your domain expertise.
Propagation Topology
The Propagation Topology view visualizes how failures spread through your system, organized into four layers:
| Layer | Icon | Description |
|---|---|---|
| Root Cause | 🔴 | The source of the failure |
| Key Propagation | 🟠 | Critical intermediate nodes through which the failure propagates |
| Direct Impact | 🟡 | Services directly affected by the failure |
| Indirect Impact | ⚪ | Edge services affected through multiple hops |
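One way to picture the four layers is as hop distance from the root-cause node in the service dependency graph. The sketch below is a minimal assumption of how such a classification could work (BFS from the root, capping at three hops); the graph structure, service names, and distance-to-layer mapping are all illustrative, not Castrel's actual model.

```python
from collections import deque

# Layer names mirror the table above; mapping by hop distance is an assumption.
LAYERS = ["Root Cause", "Key Propagation", "Direct Impact", "Indirect Impact"]

def classify_layers(edges: dict, root: str) -> dict:
    """BFS from the root-cause node; map each service to a layer by distance."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    # Distance 0 = root cause, 1 = key propagation, 2 = direct, 3+ = indirect.
    return {svc: LAYERS[min(d, 3)] for svc, d in dist.items()}

# Hypothetical dependency graph for illustration.
topology = {
    "postgres": ["order-service"],
    "order-service": ["checkout-api", "inventory-api"],
    "checkout-api": ["web-frontend"],
}
print(classify_layers(topology, "postgres"))
```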
You can interact with the topology to:
Hypothesis List
The Hypothesis List shows AI-generated potential root causes that you can manage:
| Action | Description |
|---|---|
| Add Hypothesis | Add your own hypothesis based on domain knowledge (e.g., "DBA changed indexes last week") |
| Verify Hypothesis | Request AI to collect more evidence for a specific hypothesis |
| Confirm Hypothesis | Mark a hypothesis as the confirmed root cause |
| Reject Hypothesis | Exclude a hypothesis from consideration |
Each hypothesis includes supporting evidence with logs, metrics, code diffs, or events that you can review.
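The four hypothesis actions form a small lifecycle. The sketch below models it under assumed state names (`open` → `verifying` → `confirmed`/`rejected`); the fields and transitions are illustrative, not Castrel's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    summary: str
    source: str = "ai"        # "ai" for generated, "human" for Add Hypothesis
    status: str = "open"      # assumed states: open -> verifying -> confirmed | rejected
    evidence: list = field(default_factory=list)

    def verify(self):
        # Verify Hypothesis: ask the AI to gather more evidence.
        if self.status == "open":
            self.status = "verifying"

    def confirm(self):
        # Confirm Hypothesis: mark as the root cause.
        self.status = "confirmed"

    def reject(self):
        # Reject Hypothesis: exclude from consideration.
        self.status = "rejected"

hyp = Hypothesis("DBA changed indexes last week", source="human")
hyp.verify()
hyp.evidence.append("slow-query log shows full table scans")
hyp.confirm()
print(hyp.status)  # confirmed
```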
Provide Guidance via Chat
You can guide the investigation using natural language:
> Check the recent deployments of order-service, especially changes to transaction logic

> Look into the database lock issues around 3:15 PM
Castrel will focus its analysis based on your guidance, combining your contextual knowledge with its comprehensive data scanning capabilities.
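Conceptually, free-text guidance narrows the scan scope to specific services and signal types. The sketch below is a toy keyword matcher to make that idea concrete; the `ScanScope` shape, service list, and keyword table are illustrative assumptions, not how Castrel actually parses guidance.

```python
from dataclasses import dataclass

@dataclass
class ScanScope:
    services: list
    signals: list

# Assumed inventory and keyword-to-signal mapping, for illustration only.
KNOWN_SERVICES = ["order-service", "payment-service"]
SIGNAL_KEYWORDS = {"deployment": "change_events", "lock": "db_locks"}

def narrow_scope(guidance: str) -> ScanScope:
    """Toy matcher: pick out known services and signal keywords from guidance."""
    words = guidance.lower().replace(",", " ").split()
    services = [s for s in KNOWN_SERVICES if s in words]
    signals = sorted({sig for kw, sig in SIGNAL_KEYWORDS.items()
                      if any(kw in w for w in words)})
    return ScanScope(services, signals)

print(narrow_scope("Check the recent deployments of order-service"))
```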
Castrel follows a systematic approach to root cause analysis:
1. Data Collection
Castrel collects data from connected sources within the specified time range:
2. Hypothesis Generation
Based on collected data, Castrel generates hypotheses by:
3. Propagation Analysis
Castrel builds a propagation model by:
4. Evidence Synthesis
For each hypothesis, Castrel compiles supporting evidence:
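The four stages above can be sketched as a pipeline of functions, each feeding the next. Everything below the stage names is an assumption made for illustration; the real analysis is far richer than these stubs.

```python
def collect_data(sources, window):
    # 1. Data Collection: pull raw signals from each connected source.
    return {src: f"{src} data for {window}" for src in sources}

def generate_hypotheses(data):
    # 2. Hypothesis Generation: here, one candidate cause per source (a stub).
    return [f"anomaly in {src}" for src in data]

def analyze_propagation(hypotheses):
    # 3. Propagation Analysis: rank candidates (stable sort as a stand-in).
    return sorted(hypotheses)

def synthesize_evidence(data, hypotheses):
    # 4. Evidence Synthesis: attach the collected signals to each hypothesis.
    return {h: list(data.values()) for h in hypotheses}

data = collect_data(["k8s_events", "pod_logs"], "14:00-15:00")
report = synthesize_evidence(data, analyze_propagation(generate_hypotheses(data)))
print(list(report))
```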
| Tip | Description |
|---|---|
| Connect All Data Sources | More integrations (metrics, logs, traces, change management) lead to more accurate root cause identification |
| Use Knowledge Base | Document expected behaviors and runbooks in Knowledges to help Castrel understand your system |
| Provide Domain Context | Share your expertise—AI handles data scanning, you provide business context |
| Review All Evidence | Examine the supporting evidence before confirming a hypothesis |