Signaling analysis: Why did KDDI's major outage last for 60 hours?

The KDDI network failure that occurred a few days ago was the largest in KDDI's history and one of the rare major network failures worldwide in recent years. It deserves study by the entire communications industry, which can draw lessons from it.

Out of curiosity, we collected some fragmentary information and put together the following analysis of the accident. Our technical knowledge is limited, so if anything here is wrong, please point it out in the comment area. We simply hope this will prompt further thinking and discussion in the industry.

Review of the accident process

According to the KDDI briefing, the accident happened as follows:

  • The incident started at 1:35 a.m. on July 2.
  • Because of a malfunction in the replacement router, voice traffic could not be routed correctly to one of the "VoLTE switches", which directly interrupted some VoLTE voice services for 15 minutes.
  • At 1:50 a.m. on July 2, a rollback was initiated, switching the connection back to the old router.
  • At 2:17 a.m. on July 2, the "VoLTE switch" was found to be congested, because a large number of terminals were sending location registration signaling to the IMS network to reconnect.
  • From 3:00 a.m. to 3:22 p.m. on July 2, KDDI applied traffic control on both the wireless side and the core network side to relieve congestion in the "VoLTE switch".
  • At 3:22 p.m. on July 2, the "user database" was found to be congested as well, so two PGW devices in East Japan and two in West Japan were disconnected to reduce the load on the "user database".
  • From 5:31 p.m. on July 2, to deal with data inconsistency between the "user database" and the "VoLTE switch", KDDI performed a "session reset" on two PGW devices in East Japan and two in West Japan, which resolved the inconsistency.
  • Disconnection and session reset were then also performed on the remaining 13 PGW devices (7 in East Japan and 6 in West Japan).
  • By 5:30 p.m. on July 3, with the above measures in place, restoration work in East Japan and West Japan was basically complete.
  • As of 4 a.m. on July 4, despite the above measures, network testing and verification showed that the load on the "VoLTE switch" and "user database" had not been fully relieved.
  • More than two days into the failure, KDDI discovered that 6 of its 18 "VoLTE switches" were continuously sending "unnecessary redundant signaling" to the "user database".
  • The six "VoLTE switches" were disconnected between 12:18 and 1:18 p.m. on July 4, after which the load on the remaining "VoLTE switches" and the "user database" fell back to the level before the failure.
  • The wireless-side traffic control was lifted at 2:51 p.m. on July 4.
  • At this point, KDDI's major network failure was finally basically resolved.

It is not difficult to see that this accident was not caused by a single fault. One fault point set off a chain of further problems, which is why the failure lasted for more than 60 hours.
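The "more than 60 hours" figure can be checked directly from the first and last timestamps in the timeline. The sketch below is only arithmetic on those two times. The year is an assumption (the briefing excerpt gives only month and day) and does not affect the result.

```python
from datetime import datetime

# Assumed year; only the month, day and time come from the timeline above.
YEAR = 2022

outage_start = datetime(YEAR, 7, 2, 1, 35)     # fault appears during router replacement
control_lifted = datetime(YEAR, 7, 4, 14, 51)  # wireless-side traffic control lifted

duration = control_lifted - outage_start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"Outage lasted {hours} h {minutes} min")  # Outage lasted 61 h 16 min
```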

This raises a question that probably every communications professional is curious about: which network elements of the 4G core network are the "VoLTE switch" and "user database" that KDDI refers to, and which links went wrong?

Signaling tracking and analysis

We would like to thank our Japanese colleagues for tracking and recording the network signaling after the failure occurred. From the signaling screenshots, we can see two major failure phenomena.

Fault phenomenon 1:

After the VoLTE mobile phone sends a SIP REGISTER request to the IMS core network, a 500 Cx Unable To Comply or 500 Server Internal Error response is returned, causing the IMS registration to fail.

According to the SIP specification, 500 Server Internal Error means that the server encountered an unexpected condition that prevented it from fulfilling the request. The client may retry the request after a few seconds.

Cx Unable To Comply: we found no documentation of the cause behind this error text, although the wording resembles the Diameter base protocol result code DIAMETER_UNABLE_TO_COMPLY (5012). Since Cx is the interface between the IMS core network elements I/S-CSCF and the HSS, and it uses Diameter signaling, the error may indicate a problem with the I/S-CSCF, with the HSS, or with the link between them.

Fault phenomenon 2:

After the mobile phone attaches to the LTE network and establishes the default EPS bearer, it sends a PDN Connectivity Request to the network, but the network returns a PDN Connectivity Reject message, so the SIP signaling bearer with QCI=5 cannot be established.

The PDN Connectivity Reject message carries the cause "Insufficient resources", indicating that the requested service cannot be provided because resources are insufficient.

Both of these signaling anomalies will cause VoLTE user registration failure, which is consistent with KDDI's failure phenomenon, that is, users cannot make or receive VoLTE voice calls.
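The two fault phenomena can be written down as a small lookup from the observed message to the registration stage it blocks. This is a hypothetical sketch built only from the two traces described above; the names `Diagnosis`, `OBSERVED_FAILURES` and `diagnose` are invented for illustration and are not part of any real tool.

```python
from typing import NamedTuple, Optional


class Diagnosis(NamedTuple):
    failed_stage: str  # stage of VoLTE registration that cannot complete
    user_impact: str   # what the subscriber experiences


# Hypothetical table: keys are (protocol, message) pairs seen in the traces.
NO_VOLTE = "cannot make or receive VoLTE calls"
OBSERVED_FAILURES = {
    ("SIP", "500 Cx Unable To Comply"): Diagnosis("IMS registration", NO_VOLTE),
    ("SIP", "500 Server Internal Error"): Diagnosis("IMS registration", NO_VOLTE),
    ("NAS", "PDN Connectivity Reject (Insufficient resources)"): Diagnosis(
        "QCI5 bearer establishment", NO_VOLTE
    ),
}


def diagnose(protocol: str, message: str) -> Optional[Diagnosis]:
    """Return the matching diagnosis, or None for a message not in the traces."""
    return OBSERVED_FAILURES.get((protocol, message))


for (protocol, message) in OBSERVED_FAILURES:
    result = diagnose(protocol, message)
    print(f"{protocol}: {message} -> fails at {result.failed_stage}")
```

Both failures end in the same user-visible outcome even though they stop the procedure at different stages, which is why the next step is to walk through the full registration flow.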

Next, let's compare these traces with the VoLTE user registration process to see which link went wrong.

EPS and IMS network architecture diagram

The overall VoLTE user registration process includes: EPS attachment, QCI5 bearer establishment, and IMS registration.

The QCI5 bearer needs some explanation first.

Usually, VoLTE uses a dual APN architecture, including Internet APN and IMS APN. Internet APN is the default APN, and a PDN connection is first established with it after the phone is turned on. The QCI value of its default EPS bearer is usually 9.

After the mobile phone establishes a PDN connection with the Internet APN, it also establishes a PDN connection with the IMS APN. The default EPS bearer of this second connection has a QCI value of 5 and is mainly responsible for carrying SIP signaling.

A bearer is, as the name suggests, a carrier that transports signaling and data from one point to another. The 4G specifications define QCI values for the different bearer services. Among them, QCI5 has the highest priority and is used for the default bearer of IMS (SIP) signaling; QCI1-4 are the guaranteed bit rate (GBR) classes that come next and can be used for VoLTE voice and video calls; QCI6-9 have the lowest priority and provide only best-effort data transmission.
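The priority ordering described above can be made concrete with a small table. The resource types and priority levels below are the values we recall from the standardized QCI table in 3GPP TS 23.203 for the original nine QCIs; treat them as an assumption to re-check against the release in use. The name `QCI_TABLE` is ours.

```python
# QCI -> (resource type, priority level); a lower number means higher priority.
# Values as recalled from 3GPP TS 23.203 (original nine QCIs only).
QCI_TABLE = {
    1: ("GBR", 2),      # conversational voice
    2: ("GBR", 4),      # conversational video
    3: ("GBR", 3),
    4: ("GBR", 5),
    5: ("Non-GBR", 1),  # IMS (SIP) signaling
    6: ("Non-GBR", 6),
    7: ("Non-GBR", 7),
    8: ("Non-GBR", 8),
    9: ("Non-GBR", 9),  # typical default Internet bearer
}

# Sort the QCI values from highest to lowest priority.
by_priority = sorted(QCI_TABLE, key=lambda qci: QCI_TABLE[qci][1])
print(by_priority)  # [5, 1, 3, 2, 4, 6, 7, 8, 9]
```

Note that inside the QCI1-4 group the order is not strictly numeric, but the overall picture matches the text: QCI5 first, then QCI1-4, then QCI6-9.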

The specific process is as follows.

EPS attachment and QCI9 default bearer establishment

1, 2, 3, 4, 5: After the UE sends an Attach Request to the MME, the MME and HSS authenticate the UE. After the authentication is successful, the MME obtains the UE's subscription data from the HSS.

6, 7, 8, 9: MME requests SGW/PGW to establish an EPC default bearer (QCI is generally 9) through a Create Session Request message based on the default APN and PDN subscription context in the user subscription data. SGW/PGW sends a Credit-Control-Request (CCR) to PCRF to request a PCC policy for the default bearer. PCRF determines the PCC policy based on the received user subscription data and responds with a Credit-Control-Answer (CCA). Then SGW/PGW sends a Create Session Response to MME to complete the GTP-C session creation process.

10, 11: The MME sends an Attach Accept message to the UE and requests activation of the default EPS bearer; the UE notifies the MME that the default EPS bearer has been activated through an Attach Complete message.

At this point, the UE completes EPS attachment and establishes the QCI9 default bearer.

QCI5 bearer establishment

12, 13, 14, 15, 16: The UE sends a PDN Connectivity Request to the MME. The MME sends a Create Session Request to the SGW/PGW to request establishment of the QCI5 default bearer. The SGW/PGW sends a CCR to the PCRF to request a PCC policy for this default bearer. After the PCRF responds with a CCA, the SGW/PGW sends a Create Session Response to the MME.

17, 18: The MME sends an Activate Default EPS Bearer Context Request message to the UE to activate the default EPS bearer. The UE responds with an Activate Default EPS Bearer Context Accept message to inform the MME that the default EPS bearer has been activated.

At this point, a default EPS bearer with a QCI value of 5 is established between the UE and the IMS APN. Subsequently, all SIP signaling traffic will be carried over QCI5.

IMS Registration

19, 20, 21: The UE initiates IMS registration by sending a SIP REGISTER to the P-CSCF, which forwards it to the I-CSCF. The I-CSCF sends a User-Authorization-Request (UAR) to the HSS to query the user's registration status. After the HSS authorizes the user to use IMS services, it returns the user's S-CSCF address in the User-Authorization-Answer (UAA) response.

22, 23, 24, 25, 26: The I-CSCF forwards the SIP REGISTER to the designated S-CSCF. The S-CSCF sends a Multimedia-Auth-Request (MAR) to the HSS to request authentication information. After the HSS responds with a Multimedia-Auth-Answer (MAA), the S-CSCF sends the authentication information to the UE in a 401 Unauthorized message, which the UE uses to authenticate the network side.

27, 28, 29, 30, 31, 32, 33: The UE sends a second REGISTER request to the IMS. This second request and response exchange completes the network side's authentication of the UE and downloads the user's IMS subscription data. The detailed steps are similar to the first registration.

Comparing the signaling traces with the VoLTE registration process, the cause of this VoLTE voice failure may lie between the CSCF and the HSS, and between the S/PGW and the PCRF (marked with red stars in the signaling flow chart).
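The three stages walked through above can be modelled as a list of exchanges, each tagged with its interface and protocol, to see which Diameter interfaces each stage depends on. This is a simplified sketch of the flow as described in this article, not a complete call flow. The S11/S5 labels for the GTP-C hops are the usual EPC interface names and are our addition; the article does not name them.

```python
from typing import NamedTuple


class Step(NamedTuple):
    stage: str
    exchange: str
    interface: str
    protocol: str


# Simplified model of the registration flow described in this article.
FLOW = [
    Step("EPS attach", "Attach Request", "UE-MME", "NAS"),
    Step("EPS attach", "Authentication and subscription data", "S6a", "Diameter"),
    Step("EPS attach", "Create Session Request/Response", "S11/S5", "GTP-C"),
    Step("EPS attach", "CCR/CCA", "Gx", "Diameter"),
    Step("EPS attach", "Attach Accept/Complete", "UE-MME", "NAS"),
    Step("QCI5 bearer", "PDN Connectivity Request", "UE-MME", "NAS"),
    Step("QCI5 bearer", "Create Session Request/Response", "S11/S5", "GTP-C"),
    Step("QCI5 bearer", "CCR/CCA", "Gx", "Diameter"),
    Step("QCI5 bearer", "Activate Default EPS Bearer Context", "UE-MME", "NAS"),
    Step("IMS registration", "REGISTER / 401 Unauthorized", "UE-CSCF", "SIP"),
    Step("IMS registration", "UAR/UAA", "Cx", "Diameter"),
    Step("IMS registration", "MAR/MAA", "Cx", "Diameter"),
]


def diameter_interfaces(stage: str) -> list:
    """Return the Diameter interfaces that the given stage relies on."""
    return sorted(
        {s.interface for s in FLOW if s.stage == stage and s.protocol == "Diameter"}
    )


print(diameter_interfaces("QCI5 bearer"))       # ['Gx']
print(diameter_interfaces("IMS registration"))  # ['Cx']
```

The two stages that failed in the traces each depend on exactly one Diameter interface in this model: Gx for the QCI5 bearer and Cx for IMS registration.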

Set against the KDDI fault report, the "VoLTE switch" it mentions is probably the CSCF network element, and the "user database" is probably the HSS network element, or a converged HSS and PCRF network element.

CSCF, Call Session Control Function, is a key functional entity in the IMS network architecture. It comes in three types, P/S/I, according to location and function. P-CSCF (Proxy CSCF) is the initial access point of the IMS network; all sessions that start or end at SIP terminals pass through the P-CSCF. S-CSCF (Serving CSCF) holds the core control position in the IMS core network. It works with the HSS to authenticate users, downloads user subscription information from the HSS, performs routing triggers and service control according to the IMS trigger rules in the user's subscription, and manages basic session routing. I-CSCF (Interrogating CSCF) is the entry point of the IMS home network. During registration, the I-CSCF selects an S-CSCF for the user by querying the HSS.

HSS, Home Subscriber Server, stores and manages user subscription data, including user authentication information, location information, routing information, etc. In the VoLTE network architecture, EPC HSS and IMS HSS can be deployed in a converged manner.

PCRF, Policy and Charging Rules Function, is used for differentiated service operations such as user information management, PCC policy management, dynamic generation of PCC policies, and event triggering.

Diameter signaling abnormality?

Let’s review the KDDI failure report again. There are two points worth noting.

(1) KDDI stated at a press conference that although a considerable number of users tried to reconnect to the "VoLTE switch" after the rollback, they were not all of KDDI's users. KDDI also has 18 "VoLTE switches" nationwide, and they are mutually redundant. In addition, KDDI had run simulation tests and found that even if all users tried to reconnect at once, it would not cause VoLTE congestion. Therefore, there may be other causes lurking in this accident.

(2) After the congestion of the "VoLTE switch", despite the implementation of access restrictions, flow control, disconnection of some PGW network elements and other measures, the load of the "VoLTE switch" and "user database" was not fully alleviated. It was not until the failure lasted for more than two days that KDDI further discovered that 6 of its 18 "VoLTE switches" continued to send "unnecessary redundant signaling" to the "user database". After disconnecting these 6 "VoLTE switches", the load of the remaining "VoLTE switches" and "user database" was greatly reduced to the level before the failure.

The so-called "VoLTE switch" continuously sending "unnecessary redundant signaling" to the "user database" means that the CSCF network element was continuously sending abnormal signaling to the HSS (or to the converged HSS and PCRF network element).
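A toy calculation shows why the earlier measures could not clear the congestion while the six faulty "VoLTE switches" stayed connected. Every number below is invented purely for illustration; KDDI has not published signaling rates, and the function `hss_request_rate` is hypothetical. The model assumes that legitimate signaling is unchanged by disconnecting switches, because subscribers are served by the remaining redundant ones.

```python
def hss_request_rate(baseline_total: int, faulty_switches: int,
                     redundant_per_switch: int) -> int:
    """Requests per second reaching the user database in this toy model."""
    return baseline_total + faulty_switches * redundant_per_switch


# Invented figures, for illustration only.
BASELINE_TOTAL = 18_000       # legitimate requests/s across all subscribers
REDUNDANT_PER_SWITCH = 5_000  # extra requests/s from each faulty switch

before = hss_request_rate(BASELINE_TOTAL, 6, REDUNDANT_PER_SWITCH)
after = hss_request_rate(BASELINE_TOTAL, 0, REDUNDANT_PER_SWITCH)

print(before)  # 48000
print(after)   # 18000
```

In this model, throttling terminals or PGWs only reduces the baseline term. The redundant term disappears only when the faulty switches themselves are isolated, which matches the sequence of events in the report.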

In the 4G network architecture, the interface between the I/S-CSCF and the HSS is the Cx interface, which uses Diameter signaling.

Diameter signaling is used mainly in the EPC system, the policy and charging control (PCC) system, and the IMS domain, where it handles user authentication, data, policy, and charging management.

The interfaces that use Diameter signaling in EPC, PCC, and IMS networks include the Cx interface between the I/S-CSCF and the HSS, the Gx interface between the PCRF and the PGW, the S6a interface between the HSS and the MME, and others.

From the previous analysis, we can see that the fault points of this accident all occurred in the interfaces and network elements related to Diameter signaling.

Therefore, it is suspected that there is another major fault lurking in this accident: an abnormality in the Diameter signaling network.

Of course, the above is only a preliminary analysis based on fragmentary information. The specific causes can only be known after KDDI releases a detailed report.
