
The diagram below shows the core workflow of Castrel's incident troubleshooting Agent.

The effectiveness of AI troubleshooting largely depends on the context data it can access. A complete observability context should include the following dimensions:
| Data Type | Purpose | Typical Sources |
|---|---|---|
| Metrics | Detect anomalies, quantify problem severity | Prometheus, Zabbix, CloudWatch |
| Logs | Locate specific errors, obtain contextual details | Elasticsearch, Loki, Splunk |
| Traces | Track request paths, locate slow calls | Jaeger, Tempo, SkyWalking |
Relying solely on any single data type makes efficient troubleshooting difficult. Metrics tell you "something is wrong," Logs tell you "what specific error occurred," and Traces tell you "where in the chain the problem happened."
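As a rough illustration, the sketch below pulls one metric from Prometheus and leaves room for logs and traces in a single context object. The Prometheus URL, metric name, and service label are placeholders, not Castrel's actual integration.

```python
# A minimal sketch of assembling the three telemetry types into one context.
# The Prometheus URL, metric name, and label below are assumptions.
from dataclasses import dataclass, field

import requests


@dataclass
class ObservabilityContext:
    metrics: dict = field(default_factory=dict)   # detect anomalies, quantify severity
    logs: list = field(default_factory=list)      # concrete errors and context
    traces: list = field(default_factory=list)    # request paths and slow spans


def fetch_error_rate(prometheus_url: str, service: str) -> float:
    """Query Prometheus' HTTP API for a service's 5xx rate over 5 minutes."""
    query = f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[5m]))'
    resp = requests.get(f"{prometheus_url}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


# Metrics come first; logs and traces are then fetched on demand from
# Loki/Elasticsearch and Jaeger/Tempo and attached to the same context.
ctx = ObservabilityContext()
ctx.metrics["error_rate"] = fetch_error_rate("http://prometheus:9090", "order-service")
```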
Besides the three observability data types, AI also needs to understand the system's topology:
• Call relationships: let AI determine whether a fault propagated from an upstream service or originated in the current service itself.
• Deployment relationships: let AI correlate infrastructure-level anomalies (such as a host CPU spike or a full disk).
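A minimal way to make topology usable for this kind of reasoning is to store call relationships as a graph and check whether the anomalous component lies on the alerting service's dependency chain. The service names below are illustrative.

```python
# Sketch: call topology as an adjacency map, plus a check for whether an
# anomalous component sits on the alerting service's dependency chain.
from collections import deque

# "A calls B" is stored as B being a member of calls[A].
calls = {
    "api-gateway": {"order-service"},
    "order-service": {"mysql-cluster", "redis"},
}


def depends_on(alerting: str, anomalous: str, calls: dict) -> bool:
    """True if `alerting` calls `anomalous` directly or transitively, i.e. the
    anomaly could have propagated along the call chain to the alerting service."""
    seen, queue = set(), deque([alerting])
    while queue:
        for dep in calls.get(queue.popleft(), ()):
            if dep == anomalous:
                return True
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return False


# An order-service latency alert plus a mysql-cluster anomaly: worth treating
# the database as a candidate cause rather than the service itself.
print(depends_on("order-service", "mysql-cluster", calls))  # True
```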
The traditional AI analysis approach is to collect large amounts of telemetry data, then have the model summarize it all at once. This "summary engine" mode has obvious limitations: as data volume increases, the model is easily distracted by irrelevant signals, and output quality actually decreases.
A more efficient approach is to have AI think like a human SRE:
| Dimension | Traditional Summary Mode | Hypothesis-Driven Mode |
|---|---|---|
| Data Processing | Collect all data at once | Query specific data on demand |
| Noise Interference | Easily distracted by irrelevant anomalies | Focus on causal relationships |
| Investigation Depth | Stays at surface symptoms | Recursively drills down to root cause |
| Explainability | Conclusions hard to trace | Complete hypothesis verification chain |
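One round of that loop can be sketched in a few lines: each hypothesis carries its own targeted check, and only verified hypotheses survive into the next, deeper round. The verifiers below are stubs standing in for real telemetry queries; none of this is Castrel's actual implementation.

```python
# Sketch of one round of hypothesis-driven investigation.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    description: str
    verify: Callable[[], bool]   # runs an on-demand query, True if supported


def investigate(hypotheses: list[Hypothesis]) -> list[Hypothesis]:
    verified = []
    for h in hypotheses:
        supported = h.verify()
        print(("✓ " if supported else "✗ ") + h.description)
        if supported:
            verified.append(h)
    return verified   # the next round drills down only into these


# Stub verifiers; a real agent would issue PromQL / log / trace queries here.
round_one = [
    Hypothesis("Database slow queries increasing", lambda: True),
    Hypothesis("Network latency anomaly", lambda: False),
    Hypothesis("Resource shortage (CPU/memory)", lambda: False),
]
investigate(round_one)
```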
Hypothesis-driven investigation works best with bidirectional human-AI collaboration. Traditional AI analysis is one-way: the AI gives conclusions and users accept or reject them. A more efficient model is a two-way exchange:
AI: (Completed preliminary hypothesis analysis)
"Verified 3 hypotheses:
✓ Database slow queries increasing - matches alert timing
✗ Network latency - no anomaly
✗ Resource shortage - CPU/memory sufficient
Continuing deep investigation in database direction..."
User: (Sees hypothesis list, recalls last week's change)
"DBA modified the order table index last week"
AI: (Adds new hypothesis and verifies)
"New hypothesis: Index change caused query plan change
→ Check DDL change records: Found order table index deleted last Thursday
→ Compare query plans: Used index scan before deletion, changed to full table scan after
→ Time correlation: Index deletion time matches slow query start time
✓ Hypothesis verified, confirmed as root cause"
User: (Verification successful)
"Confirmed! Need to restore the index."
AI will not always find the root cause directly, especially when data integration is incomplete. But that does not mean its analysis has no value.
In complex incidents, the root cause may span multiple systems or require multiple steps to find. The hypothesis-driven approach allows AI to recursively investigate deeper until the search space is exhausted.
Case: Pod Frequently Restarting (CrashLoopBackOff)
Alert: Kubernetes Pod enters CrashLoopBackOff state
First Layer Analysis:
→ Hypothesis: Memory shortage causing OOM Kill
→ Verification: Check Pod events, confirmed OOMKilled
→ Conclusion: Verified, but this is only the surface cause
Second Layer Analysis (Recursive Deep Dive):
→ Hypothesis: Abnormally large request load causing memory spike
→ Verification: Check inbound traffic, found Kafka message size abnormal
→ Conclusion: Verified, continue deeper
Third Layer Analysis:
→ Hypothesis: Upstream system sending abnormally large messages
→ Verification: Check message source, found certain batch data contains corrupted large files
→ Conclusion: Root cause confirmed - upstream data anomaly causing message size overflow
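The first-layer verification in this case could be automated along the lines of the sketch below, which reads the Pod's last termination reason via the official Kubernetes Python client. The namespace and Pod name are made up, and the cluster is assumed reachable via kubeconfig.

```python
# Sketch of the first-layer check: confirm an OOM kill from Pod status.
from kubernetes import client, config


def last_termination_reason(namespace: str, pod_name: str) -> str | None:
    config.load_kube_config()                 # or load_incluster_config() in-cluster
    pod = client.CoreV1Api().read_namespaced_pod(pod_name, namespace)
    for status in pod.status.container_statuses or []:
        terminated = status.last_state.terminated
        if terminated is not None:
            return terminated.reason          # e.g. "OOMKilled"
    return None


if last_termination_reason("prod", "order-consumer-0") == "OOMKilled":
    print("Confirmed OOMKilled; next hypothesis: why did memory spike?")
```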
Earlier AI approaches might stop at the first layer and give a "Pod OOM" conclusion. That provides limited help to engineers, who already know as much from the alert; what is truly valuable is finding out why the Pod is OOMing.
Even if AI lacks sufficient data to locate the root cause directly, it can usually rule out a large share of unlikely causes and narrow the investigation to a few probable directions.
This "elimination method" itself saves users significant time. In traditional troubleshooting, engineers often need to check network, resources, cache, and other infrastructure one by one before ruling out these possibilities. AI can complete these checks in minutes, letting users focus directly on the truly probable problem directions.
When AI cannot continue deeper due to insufficient data, it can provide structured context handoff for users:
📋 Investigation Progress Handoff
⏱️ Analysis Time: 5 minutes | Components Scanned: 12
✅ Ruled Out:
• Network connectivity normal (Ping <1ms, no packet loss)
• K8s resources sufficient (CPU <60%, Memory <70%)
• Cache hit rate normal (Redis 99.2%)
🎯 General Direction:
• Problem concentrated on order-service → mysql-cluster link
• Probability of database performance-related issues is high
⚠️ Needs Manual Confirmation (Missing Data Sources):
• Database slow query logs (not integrated)
• Recent Schema change records (not integrated)
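Such a handoff is easy to represent as a structured object that renders into the report above. The field names and format below are assumptions, not a fixed Castrel schema.

```python
# Sketch of a structured handoff object that renders the report shown above.
from dataclasses import dataclass, field


@dataclass
class InvestigationHandoff:
    analysis_minutes: int
    components_scanned: int
    ruled_out: list[str] = field(default_factory=list)
    probable_direction: list[str] = field(default_factory=list)
    needs_manual: list[str] = field(default_factory=list)   # missing data sources

    def render(self) -> str:
        lines = [
            "📋 Investigation Progress Handoff",
            f"⏱️ Analysis Time: {self.analysis_minutes} minutes | Components Scanned: {self.components_scanned}",
            "✅ Ruled Out:",
            *[f"  • {item}" for item in self.ruled_out],
            "🎯 General Direction:",
            *[f"  • {item}" for item in self.probable_direction],
            "⚠️ Needs Manual Confirmation (Missing Data Sources):",
            *[f"  • {item}" for item in self.needs_manual],
        ]
        return "\n".join(lines)


handoff = InvestigationHandoff(
    analysis_minutes=5,
    components_scanned=12,
    ruled_out=["Network connectivity normal (Ping <1ms, no packet loss)"],
    probable_direction=["Problem concentrated on order-service → mysql-cluster link"],
    needs_manual=["Database slow query logs (not integrated)"],
)
print(handoff.render())
```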

Without SOPs or Runbooks, AI may need to do extensive exploration when first encountering certain types of problems. But these exploration results shouldn't be wasted.
The core of the hypothesis-driven investigation approach is verifying causal relationships—determining whether a certain anomaly actually caused the current alert. However, causal verification is far more complex than it appears:
| Verification Dimension | Description | Challenge |
|---|---|---|
| Time Correlation | Whether anomaly occurrence time matches alert time | Timestamps may have skew in distributed systems |
| Propagation Path | Whether the anomaly is on the upstream/downstream link of the alert | Requires complete call topology map |
| Impact Scope | Whether resources affected by anomaly are related to the alert | Requires understanding dependencies between resources |
| Business Semantics | Whether the anomaly makes sense at the business level | Requires deep understanding of business logic |
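Time correlation, the most mechanical of the four, still needs a tolerance for clock skew between hosts. A small sketch follows, with an arbitrary 30-second window and illustrative timestamps.

```python
# Sketch: time-correlation check with a clock-skew tolerance.
from datetime import datetime, timedelta


def time_correlated(anomaly_start: datetime, alert_start: datetime,
                    max_skew: timedelta = timedelta(seconds=30)) -> bool:
    """True if the anomaly began at or before the alert, allowing for skew."""
    return anomaly_start <= alert_start + max_skew


index_dropped = datetime(2024, 1, 11, 2, 10)       # from DDL change records (illustrative)
slow_queries_start = datetime(2024, 1, 11, 2, 12)  # from slow query metrics (illustrative)
print(time_correlated(index_dropped, slow_queries_start))  # True
```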
The last item, "Business Semantics," especially relies on deep understanding of customer business. For example:
This business knowledge cannot be directly obtained from telemetry data; it must be accumulated through knowledge accumulation.
When an incident investigation is completed, AI can summarize the troubleshooting process into knowledge entries. These entries can be bound to specific alert types or resources, so the next time a similar problem occurs:
First Time:
• Alert: order-service P95 latency increase
• Investigation Process: Check network → Check resources → Check database → Found index issue
• Accumulated Knowledge: Bound to order-service + latency-type alerts
Second Time:
• Same alert triggers
• AI automatically correlates knowledge: "Last time a similar problem was caused by an index, should we prioritize checking the database?"
• After user confirms, directly jump to database check, skip network and resource investigation
• Investigation time reduced from 30 minutes to 5 minutes
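The binding itself can be as simple as a map keyed by resource and alert type. The sketch below is illustrative; a real system would persist entries and rank them.

```python
# Sketch: bind knowledge entries to (resource, alert type) and recall them
# when the same kind of alert fires again.
knowledge_base: dict[tuple[str, str], list[str]] = {}


def record(resource: str, alert_type: str, finding: str) -> None:
    knowledge_base.setdefault((resource, alert_type), []).append(finding)


def recall(resource: str, alert_type: str) -> list[str]:
    return knowledge_base.get((resource, alert_type), [])


# First incident: root cause found the long way, then recorded.
record("order-service", "latency",
       "Caused by a dropped index on the order table; check query plans first")

# Second incident: the same alert fires, prior knowledge is surfaced immediately.
for hint in recall("order-service", "latency"):
    print("Prior knowledge:", hint)
```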
To summarize, an effective incident troubleshooting Agent combines the following capabilities:
| Capability | Description |
|---|---|
| Observability Context | Integrating Metrics, Logs, Traces, and call topology |
| Hypothesis-Driven | Form hypothesis → Verify → Recursively drill down, rather than simple summarization |
| Human-AI Collaboration | AI scans data, humans provide business context and historical experience |
| Exit Strategies | Even if root cause cannot be located, can rule out distractions and output key findings |
| Knowledge Accumulation | Accumulate business knowledge to improve subsequent troubleshooting accuracy and efficiency |
The goal of Castrel's incident troubleshooting Agent is not "AI replacing humans," but making human-AI collaboration efficiency far exceed pure AI or pure human efforts.