The KDDI network failure that occurred a few days ago was the largest in KDDI's history and one of the few major network failures worldwide in recent years. It deserves study by the entire communications industry. Out of curiosity, we collected the fragmentary information available and put together the following analysis of the incident. Our expertise is limited, so if anything here is wrong, please point it out in the comments. We only hope this prompts further thinking and discussion in the industry.

Review of the incident

According to the KDDI briefing, the incident unfolded as follows:
It is easy to see that this incident was not a single fault but a chain of problems set off by one initial fault point, which is why the outage lasted more than 60 hours. The question every communications professional is probably asking is: which 4G core network elements are the "VoLTE switch" and "user database" that KDDI refers to, and which links went wrong?

Signaling tracking and analysis

We would like to thank our Japanese colleagues for tracing and recording the network signaling after the failure occurred. The signaling screenshots show two major failure phenomena.

Fault phenomenon 1: After the VoLTE phone sends a SIP REGISTER request to the IMS core network, the network returns 500 Cx Unable To Comply or 500 Server Internal Error, and IMS registration fails.

According to the SIP specification, 500 Server Internal Error means that the server encountered an unexpected condition that prevented it from fulfilling the request. The client may retry the request after a few seconds.

Cx Unable To Comply: We found no documentation for this exact reason phrase, although the wording resembles the Diameter base protocol result code DIAMETER_UNABLE_TO_COMPLY (5012). Since Cx is the interface between the IMS core network elements I/S-CSCF and the HSS, and it uses Diameter signaling, the error may indicate a problem with the I/S-CSCF, the HSS, or the link between them.

Fault phenomenon 2: After the phone attaches to the LTE network and establishes the default EPS bearer, it sends a PDN Connectivity Request to the network, but the network returns a PDN Connectivity Reject message, so the SIP signaling bearer with QCI=5 cannot be established. The cause carried in the PDN Connectivity Reject message is Insufficient resources, meaning that the requested service cannot be provided because resources are insufficient.
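The two phenomena above can be captured as a small lookup, which is handy when sorting through many trace lines. This is only a minimal sketch: the table `KNOWN_FAILURES`, the `Finding` type, and the function `classify` are hypothetical names introduced for illustration, and the matching is a plain substring test rather than a real SIP or NAS decoder.

```python
from typing import NamedTuple, Optional


class Finding(NamedTuple):
    stage: str   # registration stage in which the failure shows up
    effect: str  # consequence as described in the text above


# Hypothetical lookup table: lowercase substring of the traced response
# mapped to a finding. Only the two reported phenomena are covered.
KNOWN_FAILURES = (
    ("pdn connectivity reject",
     Finding("QCI5 bearer establishment",
             "no QCI=5 bearer for SIP signaling")),
    ("500 cx unable to comply",
     Finding("IMS registration",
             "SIP REGISTER rejected (Cx side suspected)")),
    ("500 server internal error",
     Finding("IMS registration",
             "SIP REGISTER rejected by the IMS core")),
)


def classify(trace_line: str) -> Optional[Finding]:
    """Return the finding for a traced response line, or None if unknown."""
    text = trace_line.lower()
    for needle, finding in KNOWN_FAILURES:
        if needle in text:
            return finding
    return None


if __name__ == "__main__":
    for line in ("SIP/2.0 500 Cx Unable To Comply",
                 "PDN Connectivity Reject (Insufficient resources)"):
        print(line, "->", classify(line))
```

Either match means the phone never completes VoLTE registration, only at a different stage.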
Both of these signaling anomalies cause VoLTE user registration to fail, which matches the symptom KDDI reported: users could not make or receive VoLTE voice calls.

Next, let's walk through the VoLTE user registration process to see which link went wrong.

[Figure: EPS and IMS network architecture diagram]

The overall VoLTE user registration process consists of EPS attachment, QCI5 bearer establishment, and IMS registration.

The QCI5 bearer needs some explanation first. VoLTE usually uses a dual-APN architecture with an Internet APN and an IMS APN. The Internet APN is the default APN: the phone establishes a PDN connection to it first after power-on, and the QCI value of its default EPS bearer is usually 9. After establishing the PDN connection to the Internet APN, the phone also establishes a PDN connection to the IMS APN. The default EPS bearer of that connection has a QCI value of 5 and is mainly used to carry SIP signaling.

A bearer is a logical transport channel that carries signaling and data from one point to another. The 4G specifications define the QCI values that correspond to different bearer services. QCI5 has the highest priority and is used for the default bearer of IMS (SIP) signaling. QCI1-4 come next and can be used for VoLTE voice and video calls. QCI6-9 have the lowest priority and offer only best-effort data delivery.

The specific process is as follows.

EPS attachment and QCI9 default bearer establishment

Steps 1-5: After the UE sends an Attach Request to the MME, the MME and HSS authenticate the UE. Once authentication succeeds, the MME obtains the UE's subscription data from the HSS.

Steps 6-9: Based on the default APN and PDN subscription context in the user subscription data, the MME sends a Create Session Request to the SGW/PGW to establish an EPC default bearer (the QCI is generally 9).
The SGW/PGW sends a Credit-Control-Request (CCR) to the PCRF to request a PCC policy for the default bearer. The PCRF determines the PCC policy based on the user subscription data it has received and responds with a Credit-Control-Answer (CCA). The SGW/PGW then sends a Create Session Response to the MME, completing GTP-C session creation.

Steps 10-11: The MME sends an Attach Accept message to the UE and requests activation of the default EPS bearer. The UE notifies the MME with an Attach Complete message that the default EPS bearer has been activated. At this point the UE has completed EPS attachment and established the QCI9 default bearer.

QCI5 bearer establishment

Steps 12-16: The UE sends a PDN Connectivity Request to the MME. The MME sends a Create Session Request to the SGW/PGW to request a QCI5 default bearer. The SGW/PGW sends a CCR to the PCRF to request a PCC policy for that default bearer. After the PCRF responds with a CCA, the SGW/PGW sends a Create Session Response to the MME.

Steps 17-18: The MME sends an Activate Default EPS Bearer Context Request message to the UE to activate the default EPS bearer. The UE responds with an Activate Default EPS Bearer Context Accept message to inform the MME that the bearer has been activated. At this point a default EPS bearer with a QCI value of 5 is established between the UE and the IMS APN, and all subsequent SIP signaling traffic is carried over QCI5.

IMS Registration

Steps 19-21: The UE initiates IMS registration by sending a SIP REGISTER to the P-CSCF, which forwards it to the I-CSCF. The I-CSCF sends a User-Authorization-Request (UAR) to the HSS to query the user's registration status. After the HSS authorizes the user to use IMS services, it returns the user's S-CSCF address in the User-Authorization-Answer (UAA).

Steps 22-26: The I-CSCF forwards the SIP REGISTER to the designated S-CSCF. The S-CSCF sends a Multimedia-Auth-Request (MAR) to the HSS to request authentication information.
After the HSS responds with a Multimedia-Auth-Answer (MAA), the S-CSCF sends the authentication information to the UE in a 401 Unauthorized message, which lets the UE authenticate the network side.

Steps 27-33: The UE initiates a second registration request and response exchange with the IMS, through which the network side authenticates the UE and the user's IMS subscription data is downloaded. The detailed steps are similar to those of the first registration.

Comparing the signaling traces with the VoLTE registration process, the cause of this VoLTE voice failure may lie between the CSCF and the HSS, and between the S/PGW and the PCRF (marked with red stars in the signaling flow chart).

Set against the KDDI fault report, the "VoLTE switch" it mentions may be the CSCF network element, and the "user database" may be the HSS network element, or a converged HSS and PCRF network element.

CSCF, Call Session Control Function, is a key functional entity in the IMS network architecture. It comes in three types, P, S, and I, according to location and function. The P-CSCF (Proxy CSCF) is the initial access point of the IMS network: all sessions that start or end at SIP terminals pass through it. The S-CSCF (Serving CSCF) holds the core control position in the IMS core network: it works with the HSS to authenticate users, downloads user subscription information from the HSS, performs routing triggers and service control according to the IMS trigger rules the user has subscribed to, and manages basic session routing. The I-CSCF (Interrogating CSCF) is the entry point of the IMS home network: during registration, it selects an S-CSCF for the user by querying the HSS.

HSS, Home Subscriber Server, stores and manages user subscription data, including user authentication information, location information, and routing information. In the VoLTE network architecture, the EPC HSS and the IMS HSS can be deployed in a converged manner.
PCRF, Policy and Charging Rules Function, handles differentiated-service operations such as user information management, PCC policy management, dynamic generation of PCC policies, and event triggering.

Diameter signaling abnormality?

Let's look at the KDDI failure report again. Two points are worth noting.

(1) KDDI stated at a press conference that after the rollback operation, a considerable number of users did initiate reconnection to the "VoLTE switch", but they were not KDDI's entire user base. KDDI also has 18 "VoLTE switches" nationwide, which are mutually redundant, and its own simulation tests had found that even if all users initiated reconnection at once, it would not cause VoLTE congestion. There may therefore be other causes lurking behind this incident.

(2) After the "VoLTE switch" became congested, the load on the "VoLTE switch" and the "user database" was not fully relieved despite access restrictions, flow control, disconnection of some PGW network elements, and other measures. Only after the failure had lasted more than two days did KDDI discover that 6 of its 18 "VoLTE switches" kept sending "unnecessary redundant signaling" to the "user database". After these 6 "VoLTE switches" were disconnected, the load on the remaining "VoLTE switches" and the "user database" dropped sharply to its pre-failure level.

A "VoLTE switch" that continuously sends "unnecessary redundant signaling" to the "user database" means, in our reading, that the CSCF network element kept sending abnormal signaling to the HSS (or to the converged HSS and PCRF network element). In the 4G network architecture, the interface between the I/S-CSCF and the HSS is the Cx interface, which uses Diameter signaling.

Diameter signaling is used mainly in the EPC system, the policy and charging control (PCC) system, and the IMS domain, chiefly for user authentication, data, policy, and charging management.
The interfaces that use Diameter signaling in the EPC, PCC, and IMS networks include the Cx interface between the I/S-CSCF and the HSS, the Gx interface between the PCRF and the PGW, and the S6a interface between the HSS and the MME.

From the analysis above, every fault point in this incident lies on an interface or network element related to Diameter signaling. We therefore suspect that another major fault was lurking in this incident: an abnormality in the Diameter signaling network.

Of course, this is only a tentative analysis based on fragmentary information. The specific causes can only be known once KDDI releases a detailed report.
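The reasoning that leads to the Diameter suspicion can be sketched as a small model of the registration flow. This is an illustrative sketch, not a description of KDDI's network: the step ranges and interface names come from the walkthrough above, while the names `STAGES`, `ENDPOINTS`, `stage_of`, and `suspect_interfaces` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Stage -> (first step, last step, Diameter interfaces exercised),
# following the step numbering used in the registration walkthrough.
STAGES: Dict[str, Tuple[int, int, Tuple[str, ...]]] = {
    "EPS attachment": (1, 11, ("S6a", "Gx")),
    "QCI5 bearer establishment": (12, 18, ("Gx",)),
    "IMS registration": (19, 33, ("Cx",)),
}

# Diameter interface -> the two network elements it connects.
ENDPOINTS: Dict[str, Tuple[str, str]] = {
    "S6a": ("MME", "HSS"),
    "Gx": ("PGW", "PCRF"),
    "Cx": ("I/S-CSCF", "HSS"),
}


def stage_of(step: int) -> str:
    """Return the registration stage that contains the given step number."""
    for name, (first, last, _) in STAGES.items():
        if first <= step <= last:
            return name
    raise ValueError(f"step {step} is outside the registration flow")


def suspect_interfaces(failed_stages: Iterable[str]) -> List[str]:
    """Collect, in order and without duplicates, the Diameter interfaces
    used by the stages in which a failure was observed."""
    seen: List[str] = []
    for stage in failed_stages:
        for iface in STAGES[stage][2]:
            if iface not in seen:
                seen.append(iface)
    return seen


if __name__ == "__main__":
    observed = ["QCI5 bearer establishment", "IMS registration"]
    for iface in suspect_interfaces(observed):
        print(iface, "connects", " and ".join(ENDPOINTS[iface]))
```

Feeding in the two stages where failures were traced yields Gx and Cx, whose endpoints are the PGW and PCRF and the I/S-CSCF and HSS respectively, which is the same set of elements singled out in the text.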