Failures happen every year, but this year they came especially early.

A few days ago, in "Top Ten Things to Watch in the Communications Industry in 2022", I stressed the importance of network security and stability. Less than a week later, the year's first major network outage in China occurred.

According to media reports, a large number of users reported that a certain operator's service went down in the early hours of this morning. Judging by the locations netizens reported, the outage affected Beijing, Guangzhou, Hangzhou, Changchun, Urumqi, and other places.

Reportedly, this large-scale outage left 47% of the operator's users unable to access the Internet, and the backbone network was affected: route traces (traceroute) returned no information.

The outage occurred late at night and was resolved quickly (in less than half an hour), so objectively speaking the impact was not significant. However, the cause has not been announced (and I suspect it never will be).

The outage gives one pause: it's 2022, and operators and equipment manufacturers boast about their technology every day, so why do our networks still go down?

Today, Xiaozaojun will briefly talk about this topic.

Analysis of the causes of network disconnection

Some time ago, when the Xi'an Yimatong system went down, almost everyone focused on "single-source procurement" and "project subcontracting".

This is easy to understand. Although China has been strengthening its legal system ever since reform and opening up, it has never managed to completely eradicate corruption and gray transactions.
In our daily work, we often see, and even take part in, projects like this: a bid is won at a high price, the work is subcontracted down through several layers, and in the end a project worth tens of millions may well be completed by a team of college students.

This is not uncommon in the software industry, so people may wonder whether the problems in the communications industry are also tied to subcontracting or low-price bidding.

In fact, the communications industry is somewhat different from the IT industry.

Communications projects, especially public network projects for the three major operators, have extremely high requirements for security and reliability. Public communication networks (mobile and broadband) carry a huge number of users, support many sectors of the national economy (finance, manufacturing, transportation), and matter greatly to social stability.

Every operator is held to very strict assessment indicators for safe network operation. When a problem occurs, the consequence is not just a pay cut: the responsible leader may be dismissed, or even imprisoned for dereliction of duty.

Operators are therefore rigorous and serious when purchasing the main equipment and core services of their networks. The winning bidders are large companies such as Huawei, ZTE, Ericsson, and Nokia.

If operators take it this seriously, equipment vendors take it even more so. In today's fierce market competition, equipment vendors, however tight-fisted, dare not slack on safety. When something goes wrong, their top leaders rush over to apologize. When something serious goes wrong, their market share in that province is basically gone, and they pack up and leave.

So in the procurement of communication network equipment, especially centralized procurement at the group level, the room for fraud is limited.
At present, it is the small and medium-sized procurement projects that are more prone to fraud, low-price competition, and gray transactions. Visit any operator's procurement website and you will see dozens of projects posted every day: office building renovation, monitor purchases, information system operation and maintenance, and so on.

Many people in the communications industry complain about operators squeezing prices. In fact, the projects where operators squeeze prices are mainly labor-intensive ones such as operation and maintenance, outsourced maintenance, and site surveys.

The client (Party A) uses construction standards to keep suppliers in check. Whether a supplier can offer a low price and still meet the standards depends on the supplier's own capability.

Hardware and material costs are transparent and hard to cut, so suppliers turn to their staff instead, sharply cutting the wages and bonuses of partner employees to lower costs.

You may not believe it, but when recruiting outsourcing partners, some equipment manufacturers explicitly specify what share of the fee must go to the outsourced employees. For example, if the equipment manufacturer pays the subcontractor 10,000, the subcontractor must promise to pay the employee at least 7,000, so that front-line staff stay motivated and conscientious.

To keep subcontractor employees from acting carelessly, equipment manufacturers and operators have also drawn up a large body of processes and codes of conduct. Documents are checked constantly and operating procedures are strictly controlled to prevent accidents.

The failure in Guangxi a few years ago was caused by an outsourced employee's misoperation, and it cost a certain equipment manufacturer hundreds of millions of yuan. Equipment manufacturers will therefore not cut costs in core areas just to save a little money.
All in all, my point is that a major network failure caused by subcontracting and corner-cutting is extremely unlikely in communication networks. Operators, equipment manufacturers, and subcontractors all dare not take network security lightly.

The real reason

So the question is: what is the main cause of major failures in communication networks?

In fact, the cause is still technical.

Those of us who work in communications technology know that today's networks are extremely robust, and that it is hard to paralyze one even deliberately.

When a communication network is first designed, countless experts work on and review the architecture, considering all kinds of redundancy and disaster recovery schemes.

To avoid system failure, every board comes as an active/standby pair. One level up, network elements are also protected, whether through pooling or through 1+1 or 1:1 backup. Transmission equipment goes further still, with ring networks and primary/backup protection designed to cope with equipment failures and unexpected events (earthquakes, floods, terrorist attacks, and so on).

Electronic devices are unreliable. CPUs, memory, motherboards, hard disks, and high-voltage and low-voltage electrical systems can all fail. Public communication networks need disaster recovery backup to achieve reliability above 99.9999%.

To put it bluntly, it comes down to throwing money at the problem. What looks like one set of equipment is in fact a whole pile of equipment.

However, the more complex the network, the harder it is to spot hidden dangers.

Having gone through 2G, 3G, 4G, and now 5G, the network has become bloated and overly complex. Its openness has also brought in a mixed bag of manufacturers. Old equipment lingers because no one wants to retire it, while new equipment (and new technology) has only just gone into service, which makes this a period when things are especially prone to going wrong.
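The "99.9999%" figure and the active/standby idea above can be made concrete with a little arithmetic. The following is only a minimal Python sketch with made-up function names, not any operator's or vendor's tooling. It rests on a strong assumption: redundant units fail independently of one another, which real networks often violate (shared power, shared software bugs, cascading overload).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def annual_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(unit_availability: float, units: int) -> float:
    """Availability of `units` redundant units when service survives as long
    as at least one unit works.

    ASSUMPTION: units fail independently, and switchover is instant and
    flawless. Both are idealizations made for illustration.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    return 1.0 - (1.0 - unit_availability) ** units


if __name__ == "__main__":
    six_nines = 0.999999
    print(f"Downtime at 99.9999%: {annual_downtime_minutes(six_nines):.4f} min/year")

    single = 0.999  # hypothetical availability of one unit
    pair = parallel_availability(single, 2)
    print(f"One unit at 99.9%:  {annual_downtime_minutes(single):.1f} min/year")
    print(f"Two units (1+1):    {annual_downtime_minutes(pair):.4f} min/year")
```

Under these assumptions, six nines leaves roughly half a minute of downtime per year, and two independent 99.9% units in parallel reach about that level on paper. The independence assumption is exactly what tends to break down as a network grows more complex, which is why redundancy alone does not rule out large outages.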
Operation and maintenance staff who do not fully understand the equipment and the network, combined with information asymmetry, respond to emergencies in panic and haste.

To be honest, some operator employees are no longer keeping their technical skills up to date. They are growing increasingly dependent on equipment vendors and are losing control of, and leadership over, the technology.

Some of the operators' front-line technical experts have, for career reasons, been promoted into management, laid off, or have resigned. This leaves a gap in succession: the remaining staff cannot talk to the vendors' engineers on an equal footing, which slows emergency recovery. It is good enough if they avoid causing secondary damage.

Major network-wide failures originate in either the core network or the transmission network. In an outage like this one, the likely causes are clear: a backbone routing failure, an optical fiber failure, or a failure of basic services such as DNS and authentication.

Even top technology giants such as Facebook and Google have stumbled over basic routing protocols such as BGP. What else is impossible?

We like to brag that we have the network under control. In fact, front-line technicians know that many technical matters feel more like metaphysics: you have no idea why it works when it works, and no idea why it breaks when it breaks.

There are too many ways for a network to fail, and the butterfly effect is very pronounced.

Our domestic projects have strict quality control, so things are better here. Many overseas projects are simply maddening.

For example, in a project in India, a local employee wired a trunk incorrectly: the primary line was correct, but the backup line was wrong. Later, when the transmission link was interrupted and traffic switched from the primary line to the backup line, the link went down. The link failed, the resulting data overflow crashed the transmission equipment, and the failures kept spreading.
The crash of the transmission equipment caused signaling congestion across the entire network, which then brought down the MGW media gateway (which also served as the signaling gateway). The failures kept cascading until the entire state (equivalent to a province in China) was cut off.

Isn't that amazing?

As technical people, we should hold technology in awe. Our mastery of it is far from perfect, so it is impossible to eliminate network failures completely.

If you walk by the river often, you will get your feet wet. To make a living in this trade, you still need some luck.

Looking at long-term technological development, people now talk about autonomous driving networks (nothing to do with cars; it means the network manages itself) and about intelligent operation and maintenance using AI.

I think AI-assisted operation and maintenance is feasible, but AI fully taking over is still quite far away.

At present, our communication networks are too complex and staff skill levels are uneven. Even without active external attacks, we cannot guarantee that the network is 100% secure. If hostile forces were to launch unrestricted warfare against the network, no one knows what would happen.

We cannot control objective circumstances, but we can still take preventive action on our side.

First, showing more respect for technical staff, paying them appropriately, and planning a clear technical career track will help stabilize the technical workforce.

Second, timely technical training and hands-on exercises for employees can close skill gaps and speed up fault recovery.

Third, disaster recovery drills must be carried out for real, with less going through the motions. Designing more extreme emergency scenarios and improving disaster recovery plans will help a great deal.
Fourth, simplifying network architecture, speeding up the retirement of old equipment, and streamlining the network will help reduce the risk of failures.

Well, those are some of Xiaozaojun's thoughts on network failures. Everyone is welcome to add to them and comment.

2022 did not get off to a good start. I hope everyone stays safe and sound from here on. We should pray more.