Starting at 1:35 a.m. local time on July 2, a large-scale communication failure occurred in the mobile network of Japanese operator KDDI, leaving users across Japan unable to make calls or send and receive text messages, and slowing data communications. The outage was wide in scope and long in duration, affecting 39.15 million users, and service was not basically restored until the afternoon of July 4. It caused great inconvenience and losses across Japanese society, and it was the largest network failure KDDI has ever experienced. After the outage, KDDI senior management promptly held a press conference, bowed in apology to the many affected individual and corporate users, and said the company was considering compensation for the losses. So what exactly caused this large-scale communication failure? KDDI's report is thought-provoking.

Cause 1: Core router cutover failed

In the early morning of July 2, KDDI engineers began cutting over a core router connecting the national mobile core network to the relay network, replacing the old core router with a new product. Unfortunately, the worst nightmare for communications engineers happened: the cutover failed. During the replacement, the new core router malfunctioned for unknown reasons. Anyone who works in communications knows that core routers sit at the heart of the network and serve as its "transportation hub." They are not only powerful and expensive, but must also stay in stable operation at all times; a single problem can affect millions or even tens of millions of users across the entire network. For this reason, a core router cutover is like replacing the "heart" of a living person.
It is an extremely challenging task, and it places extremely high demands on the maturity, stability, interoperability, and other qualities of the new product being swapped in. KDDI failed at this job, which demands extreme caution, and the consequences were serious: because the new core router could not correctly route voice traffic to the VoLTE switching nodes, some VoLTE voice services were interrupted for 15 minutes.

Cause 2: Signaling storm overwhelms the VoLTE network

The core router cutover failed. The scene is so unthinkable that it makes me break into a cold sweat even through the screen! What to do? Roll back, fast. KDDI engineers quickly initiated the rollback and switched the connection back to the old core router at 1:50 a.m. on July 2. But a bigger problem followed. Because VoLTE terminals register their location every 50 minutes, after the rollback a huge number of terminals sent location-registration signaling to the VoLTE switching nodes to reconnect to the network. The massive signaling burst quickly congested the VoLTE switching nodes, leaving large numbers of users unable to communicate over VoLTE. The mobile network also contains a "user database," which stores each user's subscription data and location information. Because the VoLTE switching nodes were congested, the location information registered in the user database could not be reflected on the VoLTE switches, producing data mismatches that left still more users unable to communicate or make calls. In response, after 3:00 a.m. on July 2 KDDI implemented traffic-control measures on both the radio side and the VoLTE core network side, and reduced the load on the user database by disconnecting the PGW to relieve congestion. It also performed "session resets" at the PGW to resolve the data inconsistencies in the user database.
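The mechanism behind the signaling storm can be illustrated with a toy simulation. This is a minimal sketch, not KDDI's actual system: the figures (1,000,000 terminals, a node capacity of 20,000 registrations per second, a 5-minute jitter window) are illustrative assumptions. It shows why a rollback that makes every terminal re-register at once overwhelms a node that could easily absorb the same registrations spread over time.

```python
import random

def rejected_registrations(n_terminals, capacity_per_sec, window_sec):
    """Count registration attempts a switching node cannot serve when
    n_terminals each pick a reconnect time uniformly within window_sec,
    against a node that processes capacity_per_sec registrations/second.
    (Toy model; all numbers are illustrative assumptions.)"""
    random.seed(42)  # make the sketch reproducible
    per_second = {}
    for _ in range(n_terminals):
        t = random.randrange(window_sec)
        per_second[t] = per_second.get(t, 0) + 1
    # Anything beyond per-second capacity is rejected (no retry modeled).
    return sum(max(0, load - capacity_per_sec) for load in per_second.values())

# Rollback burst: every terminal re-registers within the same second.
burst = rejected_registrations(1_000_000, 20_000, 1)
# Same terminals with re-registration jittered across 5 minutes.
jittered = rejected_registrations(1_000_000, 20_000, 300)
print(burst, jittered)  # burst drops 980,000 attempts; jittered drops none
```

In reality rejected terminals retry, which is what turns a one-off burst into a sustained storm; the sketch only shows the initial overload that KDDI's traffic controls were meant to contain.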
The traffic controls, in turn, made data communications and voice calls hard to connect nationwide. KDDI then began intensive restoration work. At 11 a.m. on July 3, KDDI announced that restoration was basically complete in western Japan; at 5:30 p.m., it was basically complete in eastern Japan. Even so, some users still had difficulty with data communications and voice calls. It was not until 4 p.m. on July 4, 62 hours after the outage began, that KDDI said it had basically restored services nationwide.

Thought-provoking

This is not the first major network outage of its kind in Japan. On October 14, 2021, a nationwide communications accident hit the mobile network of NTT DoCoMo, another Japanese operator, leaving large numbers of mobile users unable to make calls or use data. That accident was likewise caused by a rollback after a failed cutover, which triggered an explosion of signaling traffic and severe network congestion. Specifically, NTT DoCoMo ran into problems while replacing network equipment that stores subscriber and location information for IoT terminals, and immediately rolled back to the old equipment. The rollback caused huge numbers of IoT terminals to re-send location registrations to the old devices. The surging "signaling storm" quickly congested the network and affected the voice and packet core equipment of the 3G/4G/5G networks, leaving large numbers of users unable to make calls or use data. Unlike DoCoMo's outage, KDDI's was triggered by a failed core router cutover, and it lasted much longer. It is worth noting, though, that KDDI does not seem to have learned from DoCoMo's lesson.
KDDI has 6 exchange centers in Japan with a total of 18 VoLTE exchange nodes, and the VoLTE exchange nodes across centers serve as mutual backups. The core router cutover interrupted VoLTE service only at the VoLTE exchange nodes of one exchange center. "We had done stress testing. Because of the redundant backup, congestion should not occur even if all terminals within a switching center initiate reconnection requests at the same time," KDDI said. "But for some unknown reason, congestion still occurred. We have not yet fully figured out what went wrong." I hope KDDI ultimately identifies every cause of this accident, and I hope the communications industry never repeats the same mistake. Because for the communications industry, the words "major network failure" are simply too frightening.