Starting at 1:35 a.m. local time on July 2, a large-scale communication failure occurred in the mobile network of Japanese operator KDDI, leaving users across Japan unable to make calls or send and receive text messages, and slowing data communications. The outage was wide in scope and long in duration, affecting 39.15 million users, and service was not basically restored until the afternoon of July 4. It caused great inconvenience and losses across Japanese society, and it was the largest network failure KDDI has ever experienced. After the outage, KDDI senior management promptly held a press conference, bowed in apology to the many affected individual and corporate users, and said the company was considering compensation for the losses. So what exactly caused this large-scale communication failure? KDDI's report is thought-provoking.

Cause 1: Core router cutover failed

In the early morning of July 2, KDDI engineers began cutting over a core router connecting the national mobile core network to the relay network, replacing the old core router with a new product. Unfortunately, the worst nightmare for communications engineers happened: the cutover failed. During the replacement, the new core router malfunctioned for unknown reasons. Anyone who works in communications knows that core routers sit at the heart of the network and serve as its "transportation hub." They are not only powerful and expensive, but must also stay in stable operation at all times; a single problem can affect millions or even tens of millions of users across the entire network. For this reason, a core router cutover is like replacing the "heart" of a living person.
It is an extremely challenging task, and it places extremely high demands on the maturity, stability, interoperability, and other qualities of the new product being swapped in. KDDI failed at this job, which demands extreme caution, and the consequences were serious: because the new core router could not correctly route voice traffic to the VoLTE switching nodes, some VoLTE voice services were interrupted for 15 minutes.

Cause 2: Signaling storm overwhelms the VoLTE network

The core router cutover failed. The scene is so unthinkable that it makes me break into a cold sweat even through the screen! What to do? Roll back, fast. KDDI engineers quickly initiated the rollback and switched the connection back to the old core router at 1:50 a.m. on July 2. But a bigger problem followed. Because VoLTE terminals register their location every 50 minutes, after the rollback a huge number of terminals sent location-registration signaling to the VoLTE switching nodes to reconnect to the network. The massive signaling burst quickly congested the VoLTE switching nodes, leaving large numbers of users unable to communicate over VoLTE. The mobile network also contains a "user database," which stores each user's subscription data and location information. Because the VoLTE switching nodes were congested, the location information registered in the user database could not be reflected on the VoLTE switches, producing data mismatches that left still more users unable to communicate or make calls. In response, after 3:00 a.m. on July 2 KDDI implemented traffic-control measures on both the radio side and the VoLTE core network side, and reduced the load on the user database by disconnecting the PGW to relieve congestion. It also performed "session resets" at the PGW to resolve the data inconsistencies in the user database.
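The mechanism behind the signaling storm can be illustrated with a toy simulation. This is a minimal sketch, not KDDI's actual system: the figures (1,000,000 terminals, a node capacity of 20,000 registrations per second, a 5-minute jitter window) are illustrative assumptions. It shows why a rollback that makes every terminal re-register at once overwhelms a node that could easily absorb the same registrations spread over time.

```python
import random

def rejected_registrations(n_terminals, capacity_per_sec, window_sec):
    """Count registration attempts a switching node cannot serve when
    n_terminals each pick a reconnect time uniformly within window_sec,
    against a node that processes capacity_per_sec registrations/second.
    (Toy model; all numbers are illustrative assumptions.)"""
    random.seed(42)  # make the sketch reproducible
    per_second = {}
    for _ in range(n_terminals):
        t = random.randrange(window_sec)
        per_second[t] = per_second.get(t, 0) + 1
    # Anything beyond per-second capacity is rejected (no retry modeled).
    return sum(max(0, load - capacity_per_sec) for load in per_second.values())

# Rollback burst: every terminal re-registers within the same second.
burst = rejected_registrations(1_000_000, 20_000, 1)
# Same terminals with re-registration jittered across 5 minutes.
jittered = rejected_registrations(1_000_000, 20_000, 300)
print(burst, jittered)  # burst drops 980,000 attempts; jittered drops none
```

In reality rejected terminals retry, which is what turns a one-off burst into a sustained storm; the sketch only shows the initial overload that KDDI's traffic controls were meant to contain.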
The traffic controls, in turn, made data communications and voice calls hard to connect nationwide. KDDI then began intensive restoration work. At 11 a.m. on July 3, KDDI announced that restoration was basically complete in western Japan; at 5:30 p.m., it was basically complete in eastern Japan. Even so, some users still had difficulty with data communications and voice calls. It was not until 4 p.m. on July 4, 62 hours after the outage began, that KDDI said it had basically restored services nationwide.

Thought-provoking

This is not the first major network outage of its kind in Japan. On October 14, 2021, a nationwide communications accident hit the mobile network of NTT DoCoMo, another Japanese operator, leaving large numbers of mobile users unable to make calls or use data. That accident was likewise caused by a rollback after a failed cutover, which triggered an explosion of signaling traffic and severe network congestion. Specifically, NTT DoCoMo ran into problems while replacing network equipment that stores subscriber and location information for IoT terminals, and immediately rolled back to the old equipment. The rollback caused huge numbers of IoT terminals to re-send location registrations to the old devices. The surging "signaling storm" quickly congested the network and affected the voice and packet core equipment of the 3G/4G/5G networks, leaving large numbers of users unable to make calls or use data. Unlike DoCoMo's outage, KDDI's was triggered by a failed core router cutover, and it lasted much longer. It is worth noting, though, that KDDI does not seem to have learned from DoCoMo's lesson.
KDDI has 6 exchange centers in Japan with a total of 18 VoLTE exchange nodes, and the VoLTE exchange nodes across centers serve as mutual backups. The core router cutover interrupted VoLTE service only at the VoLTE exchange nodes of one exchange center. "We had done stress testing. Because of the redundant backup, congestion should not occur even if all terminals within a switching center initiate reconnection requests at the same time," KDDI said. "But for some unknown reason, congestion still occurred. We have not yet fully figured out what went wrong." I hope KDDI ultimately identifies every cause of this accident, and I hope the communications industry never repeats the same mistake. Because for the communications industry, the words "major network failure" are simply too frightening.