- Title
- How long will it take to mitigate this incident for online service systems?
- Creator
- Wang, Weijing; Chen, Junjie; Xu, Zhangwei; Dang, Yingnong; Zhang, Dongmei; Yang, Lin; Zhang, Hongyu; Zhao, Pu; Qiao, Bo; Kang, Yu; Lin, Qingwei; Rajmohan, Saravanakumar; Gao, Feng
- Relation
- 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE). Proceedings of the 2021 IEEE 32nd International Symposium on Software Reliability Engineering (ISSRE) (Wuhan, China 25-28 October, 2021) p. 36-46
- Publisher Link
- http://dx.doi.org/10.1109/ISSRE52982.2021.00017
- Publisher
- Institute of Electrical and Electronics Engineers (IEEE)
- Resource Type
- conference paper
- Date
- 2021
- Description
- Online service systems may encounter a large number of incidents, which should be mitigated as soon as possible to minimize service disruption time and ensure high service availability. The ability to predict the TTM (Time To Mitigation) of incidents can help service teams better organize their maintenance efforts. Although there are many traditional bug-fixing time prediction methods, we find that they are not readily applicable to incident-TTM prediction due to the characteristics of incidents. To better understand how incidents are mitigated, we conduct the first empirical study of incident TTM on 20 large-scale online service systems at Microsoft. We investigate the time distribution across the main stages of the incident life cycle and explore the factors affecting TTM. Based on our empirical findings, we propose TTMPred, a deep-learning-based approach for incident-TTM prediction in a continuous triage scenario. TTMPred uses a two-level attention-based bidirectional GRU model to capture both the semantic information in text data and the temporal information in incremental discussions. Based on a novel continuous loss function, it builds a regression model that aims to predict TTM as accurately as possible at each prediction time point. Our experiments on four large-scale online service systems at Microsoft show that TTMPred is effective and significantly outperforms the compared approaches. For example, TTMPred improves on the state-of-the-art regression-based approach by 25.66% on average in terms of MAE (Mean Absolute Error).
- Subject
- incident management; online service systems; mitigation time; prediction
- Identifier
- http://hdl.handle.net/1959.13/1435498
- Identifier
- uon:39739
- Identifier
- ISBN:9781665425872
- Identifier
- ISSN:2332-6549
- Language
- eng
- Reviewed