Automated Incident Runbook Generation via LLMs: An Open Framework and Benchmark

Authors

  • Akhil Reddy Mandadi, Independent Researcher

DOI:

https://doi.org/10.63956/jitar.v1i1.65

Keywords:

Incident Response, Large Language Models, Runbook Automation, Site Reliability Engineering, AIOps, Incident Postmortems, LLM-as-Judge, Cloud Reliability, Operational Intelligence.

Abstract

Modern software architectures rely heavily on highly distributed cloud-native systems, microservice architectures, and automated operational environments, which continually produce complex incidents. Incident runbooks are a critical component of production systems: they give structure to procedures and help engineers with diagnosis, mitigation, and recovery. Traditional runbook creation, however, is still largely manual. It demands substantial expert knowledge and effort, and authors must keep pace with infrastructure that is continually being integrated, configured, and developed. As a result, operational documentation is often poorly structured, out of date, and hard to maintain. This study aims to create and assess a framework for generating incident runbooks (IRs) with Large Language Models (LLMs). It sets two specific aims: (1) to distill incident artifacts (postmortems, resolution transcripts, or logs of operational communication) into actionable, systematized runbooks, and (2) to provide an open benchmark for systematic evaluation.

Design/methodology/approach:

The proposed framework has three linked elements. First, structured incident feature extraction identifies key operational elements in incident artifacts, such as symptoms, triggering events, root causes, mitigation actions, and resolution timelines. Second, the extracted features are fed into an LLM-based procedural synthesis mechanism that generates the runbook automatically. Third, a rubric-based validation framework uses an LLM-as-judge to score the generated outputs on four rubric attributes: completeness, coherence, correctness, and operational usefulness. To enable experimentation, an open benchmark of over 200 publicly available incident reports was assembled from cloud service providers and software platforms. Several LLM configurations were evaluated against synthesized gold-standard runbooks.
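
The three stages described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify prompts, schemas, or scoring scales, so the prompt wording, the JSON schema, the 1 to 5 scale, the unweighted average, and every function name here (`extract_features`, `synthesize_runbook`, `judge_runbook`, `overall_score`, `stub_llm`) are assumptions made for illustration. The model is passed in as a plain callable so the sketch runs offline with a canned stub.

```python
import json
from dataclasses import dataclass, field, fields
from typing import Callable, Dict, List

# Rubric attributes named in the abstract.
RUBRIC = ("completeness", "coherence", "correctness", "operational_usefulness")

# Hypothetical model interface: prompt text in, completion text out.
LLM = Callable[[str], str]


@dataclass
class IncidentFeatures:
    """Operational elements the abstract lists as extraction targets."""
    symptoms: List[str] = field(default_factory=list)
    triggering_events: List[str] = field(default_factory=list)
    root_causes: List[str] = field(default_factory=list)
    mitigation_actions: List[str] = field(default_factory=list)
    resolution_timeline: List[str] = field(default_factory=list)


def extract_features(artifact: str, llm: LLM) -> IncidentFeatures:
    """Stage 1 (assumed design): ask the model for JSON keyed by field name."""
    names = [f.name for f in fields(IncidentFeatures)]
    prompt = (
        "Extract these fields as a JSON object of string lists: "
        + ", ".join(names) + "\n\n" + artifact
    )
    data = json.loads(llm(prompt))
    # Missing keys fall back to empty lists rather than failing.
    return IncidentFeatures(**{n: list(data.get(n, [])) for n in names})


def synthesize_runbook(features: IncidentFeatures, llm: LLM) -> str:
    """Stage 2 (assumed design): turn extracted features into ordered steps."""
    lines = [f"{f.name}: {getattr(features, f.name)}" for f in fields(features)]
    return llm("Write a numbered incident runbook from:\n" + "\n".join(lines))


def judge_runbook(runbook: str, llm: LLM, lo: int = 1, hi: int = 5) -> Dict[str, int]:
    """Stage 3 (assumed design): one judge call per rubric attribute."""
    scores: Dict[str, int] = {}
    for attr in RUBRIC:
        reply = llm(
            f"Rate the {attr} of this runbook from {lo} to {hi}. "
            f"Reply with one integer.\n\n{runbook}"
        )
        value = int(reply.strip())
        if not lo <= value <= hi:
            raise ValueError(f"judge score out of range for {attr}: {value}")
        scores[attr] = value
    return scores


def overall_score(scores: Dict[str, int]) -> float:
    """Unweighted mean over the rubric; the weighting is an assumption."""
    return sum(scores[a] for a in RUBRIC) / len(RUBRIC)


def stub_llm(prompt: str) -> str:
    """Canned stand-in for a real model, used only to exercise the sketch."""
    if prompt.startswith("Extract"):
        return json.dumps({
            "symptoms": ["elevated 5xx rate"],
            "root_causes": ["bad config push"],
            "mitigation_actions": ["roll back config"],
        })
    if prompt.startswith("Rate"):
        return "4"
    return "1. Confirm elevated 5xx rate.\n2. Roll back config."
```

Calling `extract_features`, `synthesize_runbook`, and `judge_runbook` in sequence with `stub_llm` exercises the whole flow without network access. Keeping the model behind a single callable is one way to compare several LLM configurations against the same rubric, since only the callable changes between runs.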

Findings:

The results indicate that combining structured extraction with LLM-based synthesis improves the procedural uniformity and contextual appropriateness of the generated runbooks. The experiments also show measurable performance differences between the evaluated models, and that the rubric-based approach is an effective tool for assessing operational documentation quality.

Research limitations/implications:

The study relies on publicly available incident data, which is reported inconsistently and varies in completeness. In addition, the generated outputs remain dependent on the models' reasoning capabilities and on prompt variations.

Practical implications:

The proposed approach can significantly decrease documentation maintenance expenses, enhance incident readiness, speed up operational knowledge sharing, and enable scalable incident response processes within an enterprise.

Originality/value/Novelty:

This research circumvents the need for human experts to evaluate the quality of operational documentation by presenting a new validation method: rubric-based validation using an LLM-as-judge.

Published

01-10-2025

How to Cite

Mandadi, A. R. (2025). Automated Incident Runbook Generation via LLMs: An Open Framework and Benchmark. JITAR : Journal of Information Technology and Applications Research, 1(1), 162–182. https://doi.org/10.63956/jitar.v1i1.65