I would like to say a few more words about this communication failure...

These days, everyone is paying attention to the large-scale communication failure at the Japanese telecommunications operator KDDI.

The outage had a huge impact, affecting 39.15 million users across Japan. It also lasted a long time: it took almost two days for service to be largely restored.

Many public accounts have already written about the specific cause of the failure, so I will not repeat that analysis here.

In today's article, I would like to broaden the topic a little and discuss a deeper question with you: it is 2022, so why do our communication networks still fail so often, and is there an ultimate solution?

Communication failure: a century-long game

Failure is a natural attribute of communication networks. Just as people get sick, communication networks have lived with failures since their birth. In other words, we created communication networks in the very process of solving failures.

Bell, the father of the telephone, invented it only after solving countless problems

For more than a hundred years, countless communications professionals have been battling failures. They have worked hard to develop all kinds of technologies and adopted all kinds of measures against communication failures.

From a macro perspective, the effect of the struggle is significant. With the continuous accumulation of experience and the continuous advancement of processes and technologies, the probability of communication network failures is constantly decreasing.

Young readers may not know that 20 years ago it was common for landline phones to stop working (and few households even had one), just as water and power outages were common. Ten years ago, it was still common for mobile phones and Internet access to go down.

In the past decade, these events have become rarer and rarer. When one does happen, people find it strange. When the network drops, many people's first reaction is that their phone is broken or their account is out of credit, so they quickly restart or top up. Isn't that so?

In the information society we live in, communication networks are infrastructure as important as water and electricity. Our work and life, as well as the operation of every industry, depend on communication networks.

Under this premise, telecommunications operators, as state-owned enterprises and as builders and maintainers of networks, will always put network security and stability first.

The Ministry of Industry and Information Technology has set strict assessment indicators for operators in terms of network stability. If a network failure occurs in a province or city, the top leader will definitely be held responsible, and his career will be in jeopardy.

The pressure on operator leaders is passed on to employees, as well as to equipment vendors and outsourcing companies.

Market competition is fierce now. Once something goes wrong, the vendor either pays a huge amount of compensation or loses its market share in that province. This is a loss that equipment manufacturers and outsourcing companies cannot afford.

So the communications industry as a whole certainly takes the security and stability of communication networks seriously enough. The key question is still capability and execution.

Where exactly are the weaknesses of communication networks?

First of all, I would like to talk to you about the definition of security levels of communication networks.

According to different scenarios, the security of communication networks is divided into different levels, from low to high, namely home level, enterprise level, and telecommunications level.

Security level of communication system

The routers we use at home are all home-grade. The security and reliability of this type of equipment is very low, and it can break down at any time, which can easily cause network interruption.

Enterprise-level equipment refers to network equipment used within a company. Based on the size of the network and the number of users, enterprise-level equipment has higher security and reliability, and is less likely to interrupt service.

The requirements for telecommunications-grade networks are even higher. Networks such as those of China Mobile, China Telecom, and China Unicom serve hundreds of millions of users, and failures cannot be tolerated lightly. Generally speaking, telecommunications-grade reliability must reach at least five nines (99.999% availability).
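To make the five-nines figure concrete, here is a minimal sketch that converts a number of nines into the downtime it allows per year. It assumes a 365-day year, and the function name is only illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Maximum yearly downtime for an availability with `nines` nines."""
    unavailability = 10 ** (-nines)  # e.g. 5 nines -> 0.00001
    return MINUTES_PER_YEAR * unavailability


for n in range(3, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, which puts an outage lasting almost two days in perspective.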

The communication network that Xiaozaojun is talking about today refers to the public communication network provided by operators to the public, including both cellular mobile communication networks and fixed broadband networks. They are all telecommunications-grade.

The architecture of cellular mobile communication networks and fixed-line broadband networks is actually similar, with the main difference lying in the access network part.

The cellular mobile communication network is a wireless access network, and the access equipment is the base station. The fixed broadband network is a wired access network, and the access equipment is the PON equipment (passive optical network equipment, including optical modems).

Let's take the cellular mobile communication network as an example for analysis.

Public communication networks serve hundreds of millions of users, so they usually adopt a pyramid-like hierarchical architecture, with the core network at the center, the transmission network (bearer network) as the backbone, and the access network as the limbs.

Everyone can see at a glance that the biggest weakness of this architecture lies in the core network and transmission network (especially the backbone network).

The core network is the management center, the heart and brain of the network. Once it fails, the entire network fails. Therefore, core network engineers (such as me back then) are in the most risky and stressful positions.

Core network room

The transmission network (bearer network) is the blood vessels and nerves of the communication network. If a peripheral part is damaged, only a small area is affected. But what if the major vessels of the heart and brain are damaged? Then the whole network is paralyzed as well.

Optical transmission equipment

This KDDI failure, the DoCoMo failure in October 2021, the failure of the four major operators in the UK in 2020, and the CenturyLink failure in the United States in 2020 were all related to core routers. To put it bluntly, something went wrong with the cardiovascular system, and the whole person (network) was paralyzed.

In contrast, the probability of a major problem on the access network side is very low. If a single base station goes down, it affects at most a few hundred or a few thousand people. The scope is small, so the complaints are manageable.

Base station equipment

If a large-scale failure occurs in the access network, it is most likely due to a software version problem or a hardware batch problem of the equipment vendor. The probability of this happening is extremely low.

What do communications professionals do to prevent failures?

So, in order to ensure the safe and smooth operation of the communication network and prevent failures from occurring, what methods have we communications people adopted?

(1) First, the top-level architecture design needs to be improved.

The architecture of a network is the foundation of network security. A good architecture should take into account performance and capacity, cost, security, and redundancy.

Here everyone must remember one thing: communication equipment is a complex product. No matter how carefully you design it or how generously you specify its components, it can still fail. It is only a matter of probability and time.

Rather than trying to prevent every possible failure at any cost, it is better to focus on what to do after a failure occurs.

Therefore, introducing a backup mechanism is the most effective way to deal with failures.

Backup mechanism

We have all studied "Probability and Statistics". If the probability of one device failing is 1%, then the probability of two devices failing at the same time, assuming the failures are independent, is 1% × 1% = 0.01%. Is that right?
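The arithmetic above can be sketched in a few lines. The important assumption, stated here explicitly, is that device failures are independent; a shared software bug or a shared power feed would break that assumption and make a simultaneous failure far more likely than this estimate. The function name is illustrative.

```python
def all_fail_probability(p_single: float, devices: int) -> float:
    """Probability that every device in a group is down at the same time.

    Assumes each device fails independently with probability `p_single`.
    """
    return p_single ** devices


# Two devices, each with a 1% chance of being down.
print(f"{all_fail_probability(0.01, 2):.4%}")
```

Adding a third independent device would cut the figure by another factor of one hundred, which is why redundancy is so effective against random hardware faults.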

To guard against such failures, the network architecture is designed with the POOL networking method:

Several devices together form a pool (POOL), each handling part of the service. If one breaks down, the others immediately take over so that service is not affected.

There are usually two or more core devices, located in different districts of the provincial capital and physically far apart.

In addition, when designing the network architecture, important network elements are usually placed in core computer rooms with a higher security level.
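The takeover behavior of a pool can be illustrated with a small sketch. This is not how any vendor's equipment actually selects nodes; the `Node` and `Pool` classes, the node names, and the modulo-based selection are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str          # hypothetical node name
    healthy: bool = True


@dataclass
class Pool:
    nodes: list[Node] = field(default_factory=list)

    def pick(self, subscriber_id: int) -> Node:
        """Return a healthy node for this subscriber, skipping failed ones."""
        healthy = [n for n in self.nodes if n.healthy]
        if not healthy:
            raise RuntimeError("all nodes in the pool are down")
        # Spread subscribers over whichever nodes are still healthy.
        return healthy[subscriber_id % len(healthy)]


pool = Pool([Node("node-a"), Node("node-b"), Node("node-c")])
before = pool.pick(7)            # served by node-b while everything is up
pool.nodes[1].healthy = False    # node-b fails
after = pool.pick(7)             # the same subscriber is now served elsewhere
print(before.name, "->", after.name)
```

The point of the design is that a single node failure changes who serves a subscriber, not whether the subscriber is served. Only the loss of every node in the pool interrupts service.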

Core computer room

For example, the HSS, the most important element in the mobile communication network, stores and manages user data (it was formerly the HLR, and holds each user's mobile phone number, authentication data, service information, and so on). It is housed in the core computer room of the provincial capital. Maintenance personnel also regularly back up its data to physically isolated, off-site locations.

In recent years, because of geological disasters, wars, terrorist attacks, and other risks, operators have even begun to keep backups in other provinces.

For example, when Zhengzhou was hit by a flood last year, the core computer room was flooded and the HLR went out of service. The HLR located in the capital of a neighboring province was urgently activated to restore service temporarily.

Different levels of disaster recovery

(2) The second method is the low-level active/standby mechanism.

We have just talked about redundancy in the top-level design. Down at the level of computer rooms, racks, boards, and cables there are also active and standby designs, which can be called the low-level active/standby mechanism.

If you have been to a computer room, you will have noticed that the subracks in the cabinets are filled with various boards, and that these boards almost always come in pairs.

Front view of a manufacturer's 3G device

That is to say, there are usually two boards of a certain type.

The same goes for network cables and optical fibers. You hardly see single cables; they are all in pairs.

Front appearance of a manufacturer's 4G device

The reason is mutual backup. If one board fails, the other keeps working, so service is not affected. At the same time, the system raises an alarm to remind staff to replace the failed board as soon as possible.

The same goes for power supply. All cabinet equipment in the telecommunications room must have at least two power inputs.

Multiple power inputs (one red and one blue for one channel)

In addition to the mains electricity, important computer rooms will also be equipped with emergency power supply equipment such as batteries, UPS, and generators.

Battery pack in the machine room

(3) Third, sound management systems and regulations.

Technology is never the only factor that affects network security and stability. The biggest threat to communication networks is actually people, not technology.

Regarding this point, Xiaozaojun believes that every communications person will have the same feeling.

We have learned countless lessons the hard way in terms of management processes and systems, and engineering technical specifications.

Why do upgrade plans need to be reviewed repeatedly? Why are engineering specifications so strict? Why do we need to build a spare parts warehouse? Why do we need to double-check or even triple-check the cutover steps? Why do we need to keep staff on watch after major operations? Why do we need to freeze network changes during important holidays? ...

All of these are lessons drawn from the mistakes of our predecessors.

Always be wary of network failures

In addition to internal management systems and process standards, the state has also established increasingly strict laws and regulations to punish the frequent deliberate destruction of communication networks.

Reckless construction that severs optical fibers, deliberate destruction of base stations, and malicious cable cutting are all subject to legal sanctions.

Base station feeder lines cut maliciously

The underlying reasons behind communication failures

With a reasonable network architecture, a complete active/standby mechanism, and sound systems and regulations, why do so many failures still occur?

Next, let me talk about the deeper reasons.

The first reason, and probably the one most people agree on, is the cutthroat internal competition in the communications industry.

In recent years, malicious competition and low-price bidding have been prevalent. Equipment manufacturers and subcontractors have to win orders and still make a profit, so they can only cut costs wherever possible: product design costs, material costs, construction material costs and, above all, employee salaries.

Continuous cost cutting inevitably affects product reliability and engineering quality. Low wages have driven away large numbers of experienced people. To get projects done, subcontractors can only recruit fresh graduates and send them to the site after brief training, or even none at all.

These personnel lack the necessary training and practice, and their professionalism and technical skills fall short, which becomes a major risk.

It is even possible that some individuals, under severe pressure, will simply delete the database and run.

In the past few years, to keep front-line employees' wages from being skimmed, some manufacturers even wrote a minimum income for outsourced employees into their contracts with subcontractors.

The second important factor affecting the security of network operations, in addition to low-price competition, is increasing technological complexity.

The more advanced the technology, the higher the complexity and the lower the reliability. As technology evolves, operators' networks grow larger and their topologies more complex, which greatly increases the probability of problems.

The tidal effect of communication networks is very obvious. Sometimes the difference between off-peak and busy hours can be ten or even a hundred times. If an unexpected event (such as a disaster) occurs, the traffic volume will surge, and the difference may be a thousand times.

It is impossible for operators to design a thousand times more redundancy. Therefore, if there is no reasonable bypass design or threshold design, the probability of network congestion is extremely high. (Several major failures in recent years have been caused by signaling traffic congestion.)
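The threshold design mentioned above can be sketched as a simple admission guard: once load reaches a fixed limit, new requests are rejected quickly instead of being allowed to pile up and congest the node. The class name, the limit of 100, and the tenfold surge are assumptions for illustration, not a description of any operator's actual overload control.

```python
class OverloadGuard:
    """Reject new requests once the number in flight reaches a fixed threshold."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self.in_flight = 0
        self.rejected = 0

    def try_admit(self) -> bool:
        """Admit a request if there is capacity; otherwise count a rejection."""
        if self.in_flight >= self.max_in_flight:
            self.rejected += 1
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        if self.in_flight > 0:
            self.in_flight -= 1


guard = OverloadGuard(max_in_flight=100)
# Simulate a surge of ten times the capacity arriving before anything completes.
results = [guard.try_admit() for _ in range(1000)]
print(results.count(True), "admitted,", guard.rejected, "rejected")
```

Rejecting the excess keeps the node responsive for the requests it did accept. Without such a limit, every request would compete for the same resources and the whole node could stall.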

Today, few people fully understand the complex networking of an operator. As time goes by and personnel change, that understanding fades even further.

Communication networks have always been something of a dark art, with all sorts of strange problems. Who dares to say that they can account for every possibility?

The third potential network security risk, which is also the risk that Xiaozaojun is most worried about, is external network attacks, such as hackers, viruses and system vulnerabilities.

Nowadays, communication equipment is largely IP-based and cloud-based, networks are becoming more and more open, and some systems are deployed directly on the public cloud. Physical isolation from the outside world keeps weakening, which makes networks more vulnerable to attack than before.

Today's attackers are much more sophisticated than before, and their methods are more diverse, posing a huge threat to the network.

Of course, operators and equipment manufacturers have also invested heavily in preventing cyber attacks.

Now, all manufacturers are paying attention to the concept of "security hardening". As the name suggests, security hardening means closing system vulnerabilities to make the system more robust. Operators use third-party tools or hire third-party vendors to run security scans on equipment in the live network, find the vulnerabilities, and then require equipment manufacturers to fix them.

All for safety

This game of "virtue rises one foot, vice rises ten", an arms race between defenders and attackers, will continue for a long time.

However, I personally think that the defending side currently has serious weaknesses in security awareness and technical capability. In the future, we will see more and more security incidents.

I hope that the relevant organizations and departments will not just pay lip service to security, but will actually invest time in improving their people and strengthening training. Otherwise, once something goes wrong, it will be too late to fix it.

Final Words

The KDDI failure in Japan is not the first, and it will certainly not be the last. Communication network failures are like a game of pass the parcel: no one knows whether they will be next.

Now, manufacturers are proposing to introduce AI to take over the network to reduce the failure rate of the network. Some manufacturers are also doing grayscale upgrades (i.e. local upgrades) based on network cloudification, which can also significantly reduce network risks. These are all good trends.
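The idea behind a grayscale upgrade can be sketched as upgrading in small waves and stopping as soon as something looks unhealthy, so a bad version never reaches the whole network. The function, its parameters, the wave sizes, and the site names below are hypothetical; real rollout tooling is considerably more involved.

```python
from typing import Callable, Sequence


def staged_upgrade(
    sites: Sequence[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    wave_sizes: Sequence[int] = (1, 3),
) -> tuple[list[str], list[str]]:
    """Upgrade sites in growing waves; halt after the first unhealthy wave.

    Returns (upgraded_sites, untouched_sites).
    """
    done: list[str] = []
    remaining = list(sites)
    sizes = list(wave_sizes)
    while remaining:
        # Use the configured wave sizes first, then take everything left.
        size = sizes.pop(0) if sizes else len(remaining)
        wave, remaining = remaining[:size], remaining[size:]
        for site in wave:
            upgrade(site)
            done.append(site)
        if not all(is_healthy(site) for site in wave):
            break  # untouched sites keep running the old version
    return done, remaining


upgraded_log: list[str] = []
done, untouched = staged_upgrade(
    sites=["a", "b", "c", "d", "e", "f"],
    upgrade=upgraded_log.append,           # stand-in for a real upgrade step
    is_healthy=lambda site: site != "c",   # pretend site "c" misbehaves
)
print("upgraded:", done, "untouched:", untouched)
```

In this run the problem surfaces in the second wave, so the last two sites are never touched and continue to carry traffic on the old version.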

I think we still have a long way to go in the fight against communication network failures. The road is long and arduous, and communication people should keep exploring.

Well, that’s all for today’s article.
