Incident Investigation

Investigate incidents faster with automated root cause analysis and human-AI collaboration.

Incident Investigation is a core feature of Castrel that enables SRE teams to quickly identify root causes through AI-powered analysis combined with human expertise. Whether you're responding to an alert on your phone or conducting a deep investigation at your desk, Castrel provides the right tools for every scenario.

What is Incident Investigation?

Incident Investigation is an AI-driven root cause analysis system that helps you pinpoint the source of production issues. When an alert is confirmed as an incident (via Alert Triage), Castrel automatically scans your infrastructure (Kubernetes events, pod logs, database metrics, and more) to identify potential root causes and visualize how failures propagate through your system.

Unlike traditional monitoring tools that only tell you what went wrong, Incident Investigation tells you why it happened and where to look, enabling you to resolve incidents faster.

How to Use Incident Investigation

1. Start an Investigation

You can initiate an incident investigation in two ways:

  • From Alert Triage: After an alert is classified as an Incident, click "Start Investigation" to begin root cause analysis
  • Manual Trigger: From the Castrel interface, click "Start Investigation" and configure:
    • Time Range: Current (the last hour), the time of a recent alert, or a custom period
    • Application: Select the affected application or service
    • Additional Context (optional): Paste alert content, specify resources, or describe observed symptoms

2. Review the Analysis Report

After initiating an investigation, Castrel performs a comprehensive scan and generates an analysis report containing:

  1. Hypothesis List: AI-generated potential root causes, each with supporting evidence
  2. Propagation Topology: A visual map showing how the failure spreads through your services
  3. Evidence Summary: Key data points including logs, metrics, code changes, and events

3. Choose Your Next Step

Based on the analysis results, you can take one of three paths:

| Scenario | Action | Description |
| --- | --- | --- |
| Clear Root Cause | Confirm & Close | The AI identified a definitive root cause with strong evidence. Review and confirm to close the investigation. |
| Need Manual Investigation | Get Report | Download a context summary listing excluded possibilities and remaining suspects, then continue the investigation manually. |
| Right Direction, Need More | Provide Guidance | Use your domain knowledge to steer the AI further in a specific direction. |

Human-AI Collaborative Investigation

Incident Investigation is designed for bidirectional collaboration—you're not just a passive recipient of AI analysis. You can actively guide the investigation using your domain expertise.

Propagation Topology

The Propagation Topology view visualizes how failures spread through your system, organized into four layers:

| Layer | Icon | Description |
| --- | --- | --- |
| Root Cause | 🔴 | The source of the failure |
| Key Propagation | 🟠 | Critical nodes where the failure propagates |
| Direct Impact | 🟡 | Services directly affected by the failure |
| Indirect Impact | | Edge services affected through multiple hops |

You can interact with the topology to:

  • Mark a node as the suspected root cause for focused analysis
  • Explore the propagation path to understand blast radius
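
One way to reason about blast radius is as hop distance from the suspected root cause in the service dependency graph. The sketch below is illustrative only and is not Castrel's implementation: the `dependents` map, the service names, and the rule that one hop means Direct Impact while more than one hop means Indirect Impact are assumptions made for this example. Key Propagation nodes are left out because the text does not define how they are selected.

```python
from collections import deque


def impact_layers(dependents, root):
    """Label each reachable service by hop distance from the suspected root.

    `dependents` maps a service to the services that call it
    (hypothetical data, not a Castrel data structure).
    """
    hops = {root: 0}
    queue = deque([root])
    while queue:
        service = queue.popleft()
        for caller in dependents.get(service, []):
            if caller not in hops:  # first visit in BFS is the shortest path
                hops[caller] = hops[service] + 1
                queue.append(caller)

    layers = {}
    for service, distance in hops.items():
        if distance == 0:
            layers[service] = "Root Cause"
        elif distance == 1:
            layers[service] = "Direct Impact"
        else:
            layers[service] = "Indirect Impact"
    return layers


# Made-up topology: orders-db is called by order-service, and so on upward.
dependents = {
    "orders-db": ["order-service"],
    "order-service": ["checkout-api", "billing-worker"],
    "checkout-api": ["web-frontend"],
}
layers = impact_layers(dependents, "orders-db")
for service, layer in layers.items():
    print(f"{service}: {layer}")
```

Marking a different node as the suspected root cause corresponds to calling `impact_layers` with a different `root`, which changes which services count as direct or indirect impact.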

Hypothesis List

The Hypothesis List shows AI-generated potential root causes that you can manage:

| Action | Description |
| --- | --- |
| Add Hypothesis | Add your own hypothesis based on domain knowledge (e.g., "DBA changed indexes last week") |
| Verify Hypothesis | Ask the AI to collect more evidence for a specific hypothesis |
| Confirm Hypothesis | Mark a hypothesis as the confirmed root cause |
| Reject Hypothesis | Exclude a hypothesis from consideration |

Each hypothesis includes supporting evidence with logs, metrics, code diffs, or events that you can review.
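
The four actions can be thought of as operations on a small list of hypothesis records. The following is a minimal sketch of that idea, not Castrel's data model: the `Hypothesis` and `HypothesisList` classes, their fields, and the status names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    CONFIRMED = "confirmed"
    REJECTED = "rejected"


@dataclass
class Hypothesis:
    summary: str
    source: str = "ai"  # "ai" or "user" (assumed labels)
    evidence: list = field(default_factory=list)
    status: Status = Status.OPEN


class HypothesisList:
    """Hypothetical container mirroring the Add/Verify/Confirm/Reject actions."""

    def __init__(self):
        self.items = []

    def add(self, summary, source="user"):
        hypothesis = Hypothesis(summary, source)
        self.items.append(hypothesis)
        return hypothesis

    def verify(self, hypothesis, new_evidence):
        # Stands in for asking the AI to gather more evidence.
        hypothesis.evidence.extend(new_evidence)

    def confirm(self, hypothesis):
        hypothesis.status = Status.CONFIRMED

    def reject(self, hypothesis):
        hypothesis.status = Status.REJECTED

    def remaining(self):
        """Hypotheses that are neither confirmed nor rejected."""
        return [h for h in self.items if h.status is Status.OPEN]


hypotheses = HypothesisList()
deploy = hypotheses.add("Recent order-service deployment", source="ai")
indexes = hypotheses.add("DBA changed indexes last week")
pool = hypotheses.add("Connection pool exhaustion", source="ai")

hypotheses.verify(indexes, ["made-up evidence: slow query log entry"])
hypotheses.reject(pool)
print([h.summary for h in hypotheses.remaining()])
```

In this model, rejected entries are kept rather than deleted, so a later summary can still list which possibilities were excluded.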

Provide Guidance via Chat

You can guide the investigation using natural language:

Check the recent deployments of order-service, especially changes to transaction logic
Look into the database lock issues around 3:15 PM

Castrel will focus its analysis based on your guidance, combining your contextual knowledge with its comprehensive data scanning capabilities.

How Castrel Investigates Incidents

Castrel follows a systematic approach to root cause analysis:

1. Data Collection

Castrel collects data from connected sources within the specified time range:

  • Kubernetes events and pod logs
  • Application metrics and traces
  • Database performance data
  • Deployment and configuration change history
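
Conceptually, this step amounts to restricting each source to the investigation window and merging the results into a single timeline. The sketch below shows that idea under stated assumptions: the source names, the `(timestamp, message)` record shape, and the sample records are invented for illustration and do not reflect Castrel's integrations or storage format.

```python
import heapq
from datetime import datetime, timedelta


def collect_timeline(sources, start, end):
    """Merge per-source records into one time-ordered list within [start, end].

    `sources` maps a source name to a list of (timestamp, message) tuples that
    is already sorted by timestamp (an assumption of this sketch).
    """
    streams = [
        [(ts, name, msg) for ts, msg in records if start <= ts <= end]
        for name, records in sources.items()
    ]
    # heapq.merge lazily combines already-sorted inputs into one sorted stream.
    return list(heapq.merge(*streams))


# "Current" window: the hour leading up to a made-up reference time.
end = datetime(2024, 1, 1, 15, 30)
start = end - timedelta(hours=1)

sources = {
    "k8s_events": [
        (datetime(2024, 1, 1, 14, 10), "node drained"),
        (datetime(2024, 1, 1, 15, 12), "pod restarted"),
    ],
    "deployments": [(datetime(2024, 1, 1, 15, 5), "order-service deployed")],
    "db_metrics": [(datetime(2024, 1, 1, 15, 15), "lock wait time elevated")],
}

timeline = collect_timeline(sources, start, end)
for ts, name, msg in timeline:
    print(ts.strftime("%H:%M"), name, msg)
```

The 14:10 event falls outside the one-hour window and is dropped, which is why the chosen time range directly limits what later steps can find.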

2. Hypothesis Generation

Based on collected data, Castrel generates hypotheses by:

  • Detecting anomalies in metrics (latency spikes, error rate increases)
  • Correlating changes (deployments, config updates) with incident timing
  • Analyzing error logs and stack traces
  • Identifying resource saturation patterns
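
The first two techniques can be illustrated with a simple z-score check followed by a lookback over recent changes. This is a sketch of the general approach, not the method Castrel uses: the threshold, baseline size, lookback window, and sample data are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def first_anomaly(samples, baseline_size=5, threshold=3.0):
    """Return the index of the first sample more than `threshold` standard
    deviations above the baseline mean, or None if no sample qualifies."""
    baseline = samples[:baseline_size]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return None  # a flat baseline gives no scale to compare against
    for i in range(baseline_size, len(samples)):
        if (samples[i] - mu) / sigma > threshold:
            return i
    return None


def changes_before(changes, onset_minute, lookback=30):
    """Return changes that landed within `lookback` minutes before the onset."""
    if onset_minute is None:
        return []
    return [c for c in changes if 0 <= onset_minute - c["minute"] <= lookback]


# Made-up p95 latency in milliseconds, one sample per minute.
latency_ms = [100, 102, 98, 101, 99, 100, 103, 250, 400, 420]
# Made-up change history; "minute" is relative to the start of the series.
changes = [
    {"minute": 5, "what": "deploy order-service"},
    {"minute": -120, "what": "config update"},
]

onset = first_anomaly(latency_ms)
suspects = changes_before(changes, onset)
print(onset, [c["what"] for c in suspects])
```

A change that lands shortly before the anomaly onset is only a candidate, not a proven cause, which is why each hypothesis still needs supporting evidence.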

3. Propagation Analysis

Castrel builds a propagation model by:

  • Tracing service dependencies
  • Identifying the failure's origin point
  • Mapping how the failure spreads through your architecture
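
A common heuristic for locating an origin point is to look for unhealthy services whose own dependencies are all healthy. The sketch below applies that heuristic as an illustration. It is an assumption about how such analysis can work, not a description of Castrel's algorithm, and the `depends_on` map and service names are hypothetical.

```python
def likely_origins(depends_on, unhealthy):
    """Return unhealthy services that have no unhealthy dependency.

    `depends_on` maps a service to the services it calls (hypothetical data).
    """
    return sorted(
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in depends_on.get(service, []))
    )


# Made-up dependency graph and health state.
depends_on = {
    "web-frontend": ["checkout-api"],
    "checkout-api": ["order-service", "auth-service"],
    "order-service": ["orders-db"],
}
unhealthy = {"web-frontend", "checkout-api", "order-service", "orders-db"}

origins = likely_origins(depends_on, unhealthy)
print(origins)
```

The heuristic can return several candidates when unrelated failures overlap, and it depends on the dependency map being complete, so its output is best treated as a starting point for the hypothesis list.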

4. Evidence Synthesis

For each hypothesis, Castrel compiles supporting evidence:

  • Relevant log entries with timestamps
  • Metric charts showing anomalies
  • Code diffs from recent changes
  • Correlation with similar past incidents

Tips for Better Results

| Tip | Description |
| --- | --- |
| Connect All Data Sources | More integrations (metrics, logs, traces, change management) lead to more accurate root cause identification |
| Use Knowledge Base | Document expected behaviors and runbooks in Knowledges to help Castrel understand your system |
| Provide Domain Context | Share your expertise: AI handles data scanning, you provide business context |
| Review All Evidence | Examine the supporting evidence before confirming a hypothesis |

Common Questions