Incident Investigation

Investigate incidents faster with automated root cause analysis and human-AI collaboration.

Incident Investigation is a core feature of Castrel that enables SRE teams to quickly identify root causes through AI-powered analysis combined with human expertise. Whether you're responding to an alert on your phone or conducting a deep investigation at your desk, Castrel provides the right tools for every scenario.

What is Incident Investigation?

Incident Investigation is an AI-driven root cause analysis system that helps you pinpoint the source of production issues. When an alert is confirmed as an incident (via Alert Triage), Castrel automatically scans your infrastructure, including Kubernetes events, pod logs, and database metrics, to identify potential root causes and visualize how failures propagate through your system.

Unlike traditional monitoring tools that only tell you what went wrong, Incident Investigation tells you why it happened and where to look, enabling you to resolve incidents faster.

How to Use Incident Investigation

1. Start an Investigation

You can initiate an incident investigation in two ways:

- From Alert Triage: After an alert is classified as an Incident, click "Start Investigation" to begin root cause analysis
- Manual Trigger: From the Castrel interface, click "Start Investigation" and configure:
  - Time Range: Current (last 1 hour), recent alert time, or custom time period
  - Application: Select the affected application or service
  - Additional Context (optional): Paste alert content, specify resources, or describe observed symptoms
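
The manual trigger form can be pictured as a small structured request. The sketch below is illustrative only: `InvestigationRequest`, its field names, and the `resolve_time_range` helper are hypothetical and do not describe a documented Castrel API. Only the three form fields and the one-hour default come from the list above. The "recent alert time" option is left out because the window it implies is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass
class InvestigationRequest:
    """Hypothetical model of the manual-trigger form (not a Castrel API)."""
    application: str
    # "current" means the last 1 hour; "custom" requires start and end.
    time_range: str = "current"
    start: Optional[datetime] = None
    end: Optional[datetime] = None
    additional_context: Optional[str] = None  # alert text, resources, symptoms


def resolve_time_range(req: InvestigationRequest, now: datetime) -> Tuple[datetime, datetime]:
    """Turn the selected time-range option into a concrete (start, end) window."""
    if req.time_range == "current":
        return now - timedelta(hours=1), now
    if req.time_range == "custom":
        if req.start is None or req.end is None or req.start >= req.end:
            raise ValueError("custom range needs start < end")
        return req.start, req.end
    raise ValueError(f"unknown time range option: {req.time_range!r}")
```

Resolving the option into explicit timestamps up front keeps every later step working against the same window.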

2. Review the Analysis Report

After you initiate an investigation, Castrel performs a comprehensive scan and generates an analysis report containing:

- Hypothesis List: AI-generated potential root causes, each with supporting evidence
- Propagation Topology: A visual map showing how the failure spreads through your services
- Evidence Summary: Key data points including logs, metrics, code changes, and events

3. Choose Your Next Step

Based on the analysis results, you can take one of three paths:

Scenario	Action	Description
Clear Root Cause	Confirm & Close	AI identified a definitive root cause with strong evidence. Review and confirm to close the investigation.
Needs Manual Investigation	Get Report	Download a context summary listing the possibilities already excluded and the remaining suspects, then continue the investigation manually.
Right Direction, Needs More Depth	Provide Guidance	Use your domain knowledge to steer the AI toward a deeper analysis in a specific direction.

Human-AI Collaborative Investigation

Incident Investigation is designed for two-way collaboration. Rather than passively receiving AI analysis, you can steer the investigation with your domain expertise.

Propagation Topology

The Propagation Topology view visualizes how failures spread through your system, organized into four layers:

Layer	Icon	Description
Root Cause	🔴	The source of the failure
Key Propagation	🟠	Critical nodes through which the failure propagates
Direct Impact	🟡	Services directly affected by the failure
Indirect Impact	⚪	Edge services affected through multiple hops

You can interact with the topology to:

- Mark a node as the suspected root cause for focused analysis
- Explore the propagation path to understand blast radius
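
To make the layers and the idea of blast radius concrete, here is a minimal sketch that walks outward from a marked root cause with a breadth-first search and assigns each reachable service to a layer. The layer rules are an assumed reading of the table above (a node that passes the failure on counts as Key Propagation, a leaf one hop away as Direct Impact, a leaf further away as Indirect Impact). Castrel's actual classification logic is not described here, and `classify_layers` and the example service names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def classify_layers(dependents: Dict[str, List[str]], root: str) -> Dict[str, str]:
    """Assign each service reachable from root to a topology layer.

    dependents[s] lists the services that call s, i.e. where a failure
    in s can spread next. The layer rules are an assumption.
    """
    hops = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in dependents.get(node, []):
            if nxt not in hops:  # first visit gives the shortest hop count
                hops[nxt] = hops[node] + 1
                queue.append(nxt)

    layers = {}
    for node, dist in hops.items():
        # True when at least one caller sits further from the root than node.
        spreads_further = any(hops[d] > dist for d in dependents.get(node, []))
        if dist == 0:
            layers[node] = "Root Cause"
        elif spreads_further:
            layers[node] = "Key Propagation"
        elif dist == 1:
            layers[node] = "Direct Impact"
        else:
            layers[node] = "Indirect Impact"
    return layers


# Hypothetical dependency data: each key is called by the services in its list.
example = {
    "db": ["order-service", "reporting"],
    "order-service": ["checkout"],
    "checkout": ["web"],
}
layers = classify_layers(example, "db")
```

The set of keys in the result is the blast radius of the marked node. Marking a different node as the suspected root cause amounts to rerunning the walk from that node.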

Hypothesis List

The Hypothesis List shows AI-generated potential root causes that you can manage:

Action	Description
Add Hypothesis	Add your own hypothesis based on domain knowledge (e.g., "DBA changed indexes last week")
Verify Hypothesis	Request AI to collect more evidence for a specific hypothesis
Confirm Hypothesis	Mark a hypothesis as the confirmed root cause
Reject Hypothesis	Exclude a hypothesis from consideration

Each hypothesis includes supporting evidence (logs, metrics, code diffs, or events) that you can review.
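
The four actions can be read as operations on a small state model. The following sketch is one way to picture that. The class names, the `source` field, and the rule that a hypothesis needs some evidence before it can be confirmed are all assumptions made for illustration, not Castrel's data model. In particular, `verify` takes the evidence as an argument, standing in for the collection that the AI performs in the product.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    OPEN = "open"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Hypothesis:
    summary: str
    source: str  # "ai" or "user"
    status: Status = Status.OPEN
    evidence: List[str] = field(default_factory=list)


class HypothesisList:
    def __init__(self) -> None:
        self.items: List[Hypothesis] = []

    def add(self, summary: str, source: str = "user") -> Hypothesis:
        hypothesis = Hypothesis(summary, source)
        self.items.append(hypothesis)
        return hypothesis

    def verify(self, hypothesis: Hypothesis, new_evidence: List[str]) -> None:
        # Stand-in for "Verify Hypothesis": the caller supplies the evidence.
        hypothesis.evidence.extend(new_evidence)

    def confirm(self, hypothesis: Hypothesis) -> None:
        # Assumed rule: nothing is confirmed without evidence to review.
        if not hypothesis.evidence:
            raise ValueError("review supporting evidence before confirming")
        hypothesis.status = Status.CONFIRMED

    def reject(self, hypothesis: Hypothesis) -> None:
        hypothesis.status = Status.REJECTED

    def open_items(self) -> List[Hypothesis]:
        return [h for h in self.items if h.status is Status.OPEN]


hypotheses = HypothesisList()
index_change = hypotheses.add("DBA changed indexes last week")
bad_deploy = hypotheses.add("Recent order-service deployment", source="ai")
# The evidence text below is made up for the example.
hypotheses.verify(index_change, ["slow query log shows full table scans"])
hypotheses.confirm(index_change)
hypotheses.reject(bad_deploy)
```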

Provide Guidance via Chat

You can guide the investigation using natural language:

Check the recent deployments of order-service, especially changes to transaction logic

Look into the database lock issues around 3:15 PM

Castrel will focus its analysis based on your guidance, combining your contextual knowledge with its comprehensive data scanning capabilities.

How Castrel Investigates Incidents

Castrel follows a systematic approach to root cause analysis:

1. Data Collection

Castrel collects data from connected sources within the specified time range:

- Kubernetes events and pod logs
- Application metrics and traces
- Database performance data
- Deployment and configuration change history

2. Hypothesis Generation

Based on collected data, Castrel generates hypotheses by:

- Detecting anomalies in metrics (latency spikes, error rate increases)
- Correlating changes (deployments, config updates) with incident timing
- Analyzing error logs and stack traces
- Identifying resource saturation patterns
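
Two of these steps, anomaly detection and change correlation, can be illustrated with a short sketch. It flags the first sample whose z-score against a baseline exceeds a threshold, then lists the changes that landed shortly before that point. The z-score test, the threshold of 3, and the lookback window are choices made for this example only. The detection methods Castrel actually uses are not described here, and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]


def first_anomaly(samples: List[Sample], baseline_size: int,
                  z_threshold: float = 3.0) -> Optional[datetime]:
    """Return the timestamp of the first sample that spikes above the baseline.

    The first baseline_size samples (at least one) are treated as normal.
    """
    baseline = [value for _, value in samples[:baseline_size]]
    mu, sigma = mean(baseline), pstdev(baseline)
    if sigma == 0:
        sigma = 1e-9  # avoid division by zero on a perfectly flat baseline
    for ts, value in samples[baseline_size:]:
        if (value - mu) / sigma > z_threshold:
            return ts
    return None


def changes_before(onset: datetime, changes: List[Tuple[datetime, str]],
                   lookback: timedelta) -> List[str]:
    """List changes that landed within lookback before onset, newest first."""
    hits = [(ts, name) for ts, name in changes if onset - lookback <= ts <= onset]
    return [name for _, name in sorted(hits, reverse=True)]


# Made-up latency samples (ms), one per minute, with a spike at the end.
t0 = datetime(2024, 1, 1, 15, 0)
latency = [(t0 + timedelta(minutes=i), v)
           for i, v in enumerate([100, 102, 98, 101, 99, 100, 500])]
onset = first_anomaly(latency, baseline_size=5)
suspects = changes_before(
    onset,
    [(t0 - timedelta(hours=3), "config update"),
     (t0 + timedelta(minutes=2), "deploy order-service")],
    lookback=timedelta(minutes=30),
)
```

A change that lands just before the onset is only a candidate. Timing alone does not establish causation, which is why each hypothesis carries its own evidence.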

3. Propagation Analysis

Castrel builds a propagation model by:

- Tracing service dependencies
- Identifying the failure's origin point
- Mapping how the failure spreads through your architecture
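
One simple heuristic for locating an origin point in a dependency graph is to look for unhealthy services whose own dependencies are all healthy, since their failure cannot be explained by something further upstream. The sketch below applies that heuristic. It is an assumed simplification, not Castrel's documented method, and `origin_candidates` and the service names are hypothetical.

```python
from typing import Dict, List, Set


def origin_candidates(depends_on: Dict[str, List[str]], unhealthy: Set[str]) -> List[str]:
    """Return unhealthy services none of whose dependencies are unhealthy.

    depends_on[s] lists the services that s calls.
    """
    return sorted(
        svc for svc in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(svc, []))
    )


# Hypothetical call chain: web -> checkout -> order-service -> db
depends_on = {
    "web": ["checkout"],
    "checkout": ["order-service"],
    "order-service": ["db"],
}
candidates = origin_candidates(depends_on, {"web", "checkout", "order-service"})
```

In this example `db` is healthy, so `order-service` is the only candidate. The heuristic returns nothing when the unhealthy services form a dependency cycle, and it can return several candidates when independent failures overlap, so it narrows the search rather than settling it.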

4. Evidence Synthesis

For each hypothesis, Castrel compiles supporting evidence:

- Relevant log entries with timestamps
- Metric charts showing anomalies
- Code diffs from recent changes
- Correlation with similar past incidents

Tips for Better Results

Tip	Description
Connect All Data Sources	More integrations (metrics, logs, traces, change management) lead to more accurate root cause identification
Use Knowledge Base	Document expected behaviors and runbooks in Knowledges to help Castrel understand your system
Provide Domain Context	Share your expertise: the AI handles data scanning, and you provide business context
Review All Evidence	Examine the supporting evidence before confirming a hypothesis
