LLM-Driven Root Cause Analysis for Distributed System Incidents: A Reproducible Framework

Akhil Reddy Mandadi

doi:10.63956/jitar.v1i2.66

Authors

Akhil Reddy Mandadi, Independent Researcher


Keywords:

Root Cause Analysis; Large Language Models; Distributed Systems; Observability; AIOps; Site Reliability Engineering; Chaos Engineering; Retrieval-Augmented Generation

Abstract

In distributed systems, Root Cause Analysis (RCA) is a complex, time-sensitive task that requires correlating heterogeneous telemetry such as logs, metrics, and distributed traces. Traditionally it depends on the expertise of Site Reliability Engineers (SREs); incident diagnosis is time-consuming and becomes harder as applications move to increasingly complex cloud-native environments. Recent advances in Large Language Models (LLMs) open opportunities for automated reasoning, but existing LLM-based RCA methods are often limited to unstructured summaries and are prone to hallucination and unreliable conclusions. VerifiedRCA is an extensible and reproducible approach that combines LLM reasoning, structured telemetry retrieval, and tool-call verification to make automated incident diagnosis trustworthy. The proposed architecture is a four-layer pipeline: telemetry ingestion, retrieval-augmented context construction, hypothesis generation, and evidence-based verification. For evaluation, 120 synthetic scenarios were injected into a realistic microservices environment built on Google Hipster Shop, the Istio service mesh, and Chaos Mesh fault injection. Performance was compared against rule-based systems, classical machine learning approaches, and ungrounded LLM baselines using accuracy, time-to-diagnosis, and explanation-quality metrics. The results show that grounding the LLM's reasoning with telemetry verification greatly improves diagnostic accuracy, with higher precision, faster convergence, and fewer unsupported reasoning paths. This study provides a clear RCA framework and an open benchmark to support reproducible research and future development in LLM-assisted operational intelligence.
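The four-layer pipeline named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every function name, the record layout, the `p99_latency_ms` metric, and the 500 ms threshold are assumptions made for illustration, and the LLM call in the hypothesis layer is replaced by a deterministic stub so that the example runs on its own.

```python
from dataclasses import dataclass


@dataclass
class Telemetry:
    service: str
    kind: str    # "log", "metric", or "trace"
    name: str
    value: float


@dataclass
class Hypothesis:
    service: str
    cause: str
    claims: dict          # metric name -> minimum value expected if the cause is real
    verified: bool = False


def ingest(raw):
    """Layer 1 (telemetry ingestion): normalise raw tuples into records."""
    return [Telemetry(*row) for row in raw]


def build_context(records, alert_service, dependencies):
    """Layer 2 (retrieval): keep telemetry for the alerting service and its dependencies."""
    scope = {alert_service, *dependencies.get(alert_service, [])}
    return [r for r in records if r.service in scope]


def generate_hypotheses(context):
    """Layer 3 (hypothesis generation): stub standing in for an LLM call.

    Assumption: it proposes one latency hypothesis per service in the context.
    """
    services = sorted({r.service for r in context})
    return [Hypothesis(s, "latency fault", {"p99_latency_ms": 500.0}) for s in services]


def verify(hypotheses, context):
    """Layer 4 (verification): keep hypotheses whose every claim is backed by telemetry."""
    observed = {(r.service, r.name): r.value for r in context}
    for h in hypotheses:
        h.verified = all(
            observed.get((h.service, metric), 0.0) >= threshold
            for metric, threshold in h.claims.items()
        )
    return [h for h in hypotheses if h.verified]


def diagnose(raw, alert_service, dependencies):
    """Run the four layers in order and return only the verified hypotheses."""
    context = build_context(ingest(raw), alert_service, dependencies)
    return verify(generate_hypotheses(context), context)


# Illustrative data: service names are examples, values are made up.
RAW = [
    ("frontend", "metric", "p99_latency_ms", 120.0),
    ("cartservice", "metric", "p99_latency_ms", 900.0),
    ("paymentservice", "metric", "p99_latency_ms", 950.0),  # outside the retrieval scope
]
DEPS = {"frontend": ["cartservice"]}

result = diagnose(RAW, "frontend", DEPS)
print([h.service for h in result])
```

In this sketch the unverified hypothesis for `frontend` is discarded because no telemetry supports it, and `paymentservice` is never considered because retrieval scopes the context to the alerting service's dependencies. That mirrors, in miniature, the idea of grounding generated hypotheses in telemetry before reporting them.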

