December 6, 2018 was a nightmare day for Japanese operator SoftBank. At 1:39 p.m., 18 4G core network elements in SoftBank's two major central computer rooms in East Japan and West Japan suddenly malfunctioned, leaving a large number of users across the network unable to communicate normally. The sudden failure caught SoftBank off guard, and everyone from the CTO to the engineers was scrambling. It took nearly two hours to locate the cause, and service was not restored until 6:04 p.m. The outage lasted 4 hours and 25 minutes and left approximately 30.6 million SoftBank users unable to communicate normally. It was a rare major accident in the history of Japanese communications. Afterward, SoftBank executives publicly apologized to users and promised to strengthen equipment backup management to prevent a recurrence.
Because the failure occurred during the day and affected the whole country, it hurt SoftBank badly: its stock price plummeted and more than 10,000 users terminated their contracts within 5 days. Even the Japanese Ministry of Internal Affairs and Communications seems to have been shocked. It was not until 20 days later, that is, today, that the ministry officially announced and confirmed that it had "received a serious accident report submitted by SoftBank."

The following is from SoftBank's outage report…

Summary

Time of occurrence: December 6, 2018, 13:39 to 18:04 (4 hours and 25 minutes)

Impact content:
• Voice calls and data communications were unavailable on 4G LTE mobile phones
• Some LTE fixed lines and home Wi-Fi did not work properly
• The 3G network became congested because of the 4G network failure

Scope of impact: Nationwide (about 30.6 million lines)

Cause: A software defect in the 4G core network equipment (MME).

Failure Cause Analysis

The specific cause of the failure is that the digital certificate (TLS certificate) of the core network element MME (Mobility Management Entity), that is, the 4G packet switching equipment, had expired. TLS (Transport Layer Security) is a security protocol that provides confidentiality and data integrity for network communications.

SoftBank explained that it has deployed a total of 18 packet switching devices in the two major central computer rooms in East Japan and West Japan. These devices are dimensioned for long-term demand and have ample load redundancy: currently only 30%-40% of their capacity is in use. The 18 devices also back each other up and are deployed as a pool, which means that even if one or several devices fail, service is not affected. However, the situation is different when a digital certificate expires.
The expiration of the TLS digital certificate means that the system can no longer verify whether other devices connected to the packet switching device are legitimate. The system detects this as an abnormality and, under SoftBank's current network settings, tries to recover by restarting. However, an expired digital certificate stays expired no matter how many times the device restarts, so the devices fell into an endless restart loop, which led to this major failure. In addition, because 4G service was interrupted, a large number of users fell back to the 3G network, which caused serious congestion there as well.

The digital certificate has expired. Why wasn't it discovered earlier?

SoftBank explained that the digital certificates for the packet switching equipment are different from those for other network equipment. For other network devices, the operator can usually check the expiration date of the digital certificate itself after purchasing the device. The digital certificates of the packet switching equipment, however, are fixed in the hardware through embedded software, so SoftBank, as an operator, cannot check their expiration date.

Solution

Temporary solution

The failure was caused by software version Ver.1.14, installed by an upgrade in April 2018; the previous version, Ver.1.08, did not have the problem. The temporary solution was therefore to roll back from Ver.1.14 to Ver.1.08, although this makes some 4G IoT functions unusable.

Mid-term solutions

1) Conduct a network-wide survey of all equipment, including all base station equipment, to check whether any relevant certificates have expired.
2) Develop more stringent network access testing specifications for new equipment and new software versions.
3) Retain the old software version for one year after equipment is upgraded, so that it can be quickly rolled back if similar problems occur in the new version.
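The network-wide certificate survey in the first mid-term solution can be sketched as a simple inventory audit. This is a minimal illustration, not SoftBank's tooling: the `CertRecord` fields, the device names, the dates, and the 90-day warning window are all assumptions, and it presumes the expiry dates have already been obtained (for embedded certificates, presumably from the vendor, since the report says the operator cannot read them directly).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CertRecord:
    """One inventory entry; the fields are hypothetical, not SoftBank's schema."""
    device: str
    not_after: datetime  # certificate expiry time, timezone-aware


def classify(record, now, warn_window=timedelta(days=90)):
    """Return 'expired', 'expiring', or 'ok' for a single certificate."""
    if record.not_after <= now:
        return "expired"
    if record.not_after - now <= warn_window:
        return "expiring"
    return "ok"


def audit(records, now, warn_window=timedelta(days=90)):
    """Group device names by certificate status."""
    report = {"expired": [], "expiring": [], "ok": []}
    for record in records:
        report[classify(record, now, warn_window)].append(record.device)
    return report


if __name__ == "__main__":
    # Illustrative inventory; names and dates are made up for the example.
    now = datetime(2018, 11, 1, tzinfo=timezone.utc)
    inventory = [
        CertRecord("mme-east-01", datetime(2018, 12, 6, tzinfo=timezone.utc)),
        CertRecord("enb-west-17", datetime(2021, 1, 1, tzinfo=timezone.utc)),
    ]
    print(audit(inventory, now))
```

Reporting an "expiring" bucket separately from "expired" is the point of the sketch: a survey is only useful if it raises the alarm while there is still time to replace the certificate.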
Long-term measures

1) Operators are required to check whether the digital certificates of all purchased network equipment and software have expired.
2) Change the system's anomaly detection and emergency response mechanism. When the system detects a network anomaly, it should no longer simply restart to recover; instead it sets an alarm level and decides whether to restart or keep running based on a threshold.
3) Since one of the causes of this major accident was that all the equipment came from the same supplier, multiple equipment suppliers must be introduced before June 30, 2019 to spread the risk.

After reading SoftBank's fault report, I felt there were thousands of "never expected" moments between the lines. All kinds of backup and disaster recovery were in place, and the accident still happened. Network security is truly no small matter, and the responsibility of operation and maintenance is as heavy as a mountain. This is a wake-up call.
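The threshold-based response in the second long-term measure can be illustrated with a small restart policy. This is one possible reading of that measure, not the mechanism SoftBank or its supplier implemented: the class name `RestartPolicy` and the thresholds (3 restarts within 600 seconds) are assumptions chosen for the example.

```python
from collections import deque


class RestartPolicy:
    """Decide between restarting and alarming, based on recent restart count.

    max_restarts and window_s are illustrative thresholds, not values
    taken from SoftBank's report.
    """

    def __init__(self, max_restarts=3, window_s=600.0):
        self.max_restarts = max_restarts
        self.window_s = window_s
        self._restarts = deque()  # timestamps (seconds) of recent restarts

    def on_anomaly(self, now):
        """Return 'restart' while under the threshold, otherwise 'alarm'."""
        # Forget restarts that have fallen out of the sliding window.
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            # Restarting has not helped; stop looping and escalate instead.
            return "alarm"
        self._restarts.append(now)
        return "restart"


if __name__ == "__main__":
    policy = RestartPolicy(max_restarts=3, window_s=600.0)
    # Anomalies one minute apart: three restarts, then an alarm.
    print([policy.on_anomaly(t) for t in (0, 60, 120, 180)])
```

With a cap like this, a fault that a restart cannot fix, such as an expired certificate, produces an alarm after a few attempts instead of an endless restart loop.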