- Title
- Towards Intelligent Incident Management: Why We Need It and How We Make It
- Creator
- Chen, Zhuangbin; Kang, Yu; Li, Liqun; Zhang, Xu; Zhang, Hongyu; Xu, Hui; Zhou, Yangfan; Yang, Li; Sun, Jeffrey; Xu, Zhangwei; Dang, Yingnong; Gao, Feng; Zhao, Pu; Qiao, Bo; Lin, Qingwei; Zhang, Dongmei; Lyu, Michael R.
- Relation
- ESEC/FSE '20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Online 8-13 November, 2020) p. 1487-1497
- Publisher Link
- http://dx.doi.org/10.1145/3368089.3417055
- Publisher
- Association for Computing Machinery (ACM)
- Resource Type
- conference paper
- Date
- 2020
- Description
- The management of cloud service incidents (unplanned interruptions or outages of a service or product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. In practice, however, we still observe critical service incidents that occur in unexpected ways and that the orchestrated diagnosis workflow fails to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ the strategy of AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and an understanding of modern incident management systems, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate the underlying reasons from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show the practical benefits it brings to the cloud services of Microsoft.
- Subject
- AIOps; cloud computing; incident management; artificial intelligence
- Identifier
- http://hdl.handle.net/1959.13/1443064
- Identifier
- uon:41877
- Identifier
- ISBN:9781450370431
- Language
- eng