Hello everyone, I am Lao Yang. I have said many times that troubleshooting is a required course for every network engineer. Many people are clumsy at the beginning, but it is relatively rare for that clumsiness to cause serious problems.

Today, one of Lao Yang's fans shared with me a "tragic situation" he witnessed while troubleshooting at work. After reading the whole article, Lao Yang felt that it was very instructive and a good reference for novice network engineers. There is one thing he said that I agree with: "Failures caused by problems with the software or hardware of the device itself are actually relatively rare. Most of them are man-made."

Could any troubleshooting story be more outrageous than this one?

This is a network failure I encountered at a previous job. I think the most important point in troubleshooting is that once you discover a network failure, you must have a basic understanding of the existing network environment: its topology and its configuration. You need to collect this basic information before you can analyze the fault. So let me first explain the background of the failure and give you some clues, so that you can make a judgment along with me.

The background is a very simple configuration change. During the lunch break, my colleague changed the configuration of the aggregation switch in one of the factory buildings. He wanted to add an access switch: the number of people in the factory had grown, the number of terminals had grown with it, and the existing access equipment could no longer meet demand, so a new access layer switch was needed. As we all know, adding a new access switch to an existing network is very simple. Generally, you only need to mount the access switch in the rack and run two optical fibers to the aggregation switch.
What configuration do you usually need when you connect an access switch to an aggregation switch? Generally speaking, you configure the VLANs that the access switch needs, then configure the one or two aggregation switch interfaces that face the access switch as trunks, and you're done. Links between switches need to be configured as trunks. This is very basic CCNA material, so I will not repeat it here. Then, to make the uplink reliable, you configure link aggregation from the access switch to the aggregation switch. This is also fairly basic, and the configuration is very simple.

As planned, my colleague pushed the configuration and went back to his desk to rest. The fault did not show up immediately: it was the lunch break and no one was working, so no one noticed the problem. At 2 o'clock in the afternoon, when people started using the network, they realized, "Ah? The floor switch has a problem!" or "I can't access the Internet!" For end users, the most intuitive symptom is that the computer cannot get online. And what do you do if you can't get online? You file a complaint. So they called tech support.

The configuration change really was very simple. It was just one new device and a little new configuration, yet the network failed. If you ran into this situation, how would you analyze the failure?

Let's take a look at the network architecture. It is a traditional three-layer architecture: the core connects to the aggregation layer, and the aggregation layer connects to the floor access switches. Simple, right? The only twist is that it uses stacking: the aggregation layer below the core is built by stacking two switches.
Leaving the core switch aside, the two aggregation switches are stacked. With the stack in place, the access switch has two uplinks, one to each aggregation switch. Although there are two physical aggregation switches, they are logically one device, and the two uplinks are bundled into one logical link. Bundling links like this goes hand in hand with stacking. The logical topology is shown in the figure.

Is this topology a loop-free network? Because stacking is used, it is in fact loop-free. The two aggregation devices act as one, so the access switch only needs three things: a VLAN assignment on the downstream interfaces that connect to the terminals (the PCs), link aggregation on the upstream interfaces, and trunk mode on the aggregated upstream port. Those are the three simple pieces of configuration.

Why would these three simple pieces of configuration cause a loop in the network? Of course, I didn't know it was a loop at the time. My colleague believed this network had no loops: he had used stacking and configured link aggregation, so there should be no loop in such a network. So his first reaction was not to suspect a loop.

Now let's look at our first clue: a loop-free network. And the second clue: the configuration is very simple, only VLAN, trunk and link aggregation. Have you worked anything out yet?

Let's look at the third clue: the devices could not be logged in to remotely. How should we understand this? When a network failure occurs, what is the first step people usually take? You log in to the network device remotely and check whether the configuration is correct. Since the configuration had just been pushed, the first step was to log in remotely to the aggregation switch or the access switch. But when I tried to log in, I found that I couldn't get in.
There was no way to log in to either device. What do you do if you can't log in?

What does it usually look like when a device cannot be logged in to remotely? It cannot be pinged and cannot be reached by Telnet. I think some of you have run into this at work. If the device simply cannot be pinged, first make sure there is no routing problem. If routing is fine and the device can be neither pinged nor managed remotely, its CPU is most likely maxed out. Every packet addressed to the device's own IP address, whether a ping or a Telnet session, has to be processed by the CPU. If the CPU is at 100%, remote login and management become impossible.

At this point my colleague finally began to suspect a network loop. In a Layer 2 network, CPU usage at 100% is most likely caused by a loop, so we could be fairly sure there was one. But why? Why is there a loop? I clearly configured it without loops! Where does the loop come from?

What can you do at this point? You can only go on site. I happened to be nearby, and since the problem had appeared suddenly that day, my colleague asked me to come over and help. So I went on site with them to see what was going on. Generally speaking, you bring a console cable and a laptop when you go on site, connect directly to the device through the console cable, and log in to view the configuration. But as everyone knows, when the device CPU is at 100%, logging in through the console is very slow. It takes several seconds to get a response after typing each command.
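The symptom reasoning above can be written down as a small decision function. This is only a sketch of the rule of thumb described in the text, not a diagnostic tool: the `Symptoms` fields and the `triage` function are names invented for illustration, and the final "partial reachability" branch is an assumption added to make the function total.

```python
from dataclasses import dataclass


@dataclass
class Symptoms:
    """What the management station observes (illustrative field names)."""
    ping_ok: bool          # does the device answer ping?
    remote_login_ok: bool  # does a remote login (e.g. Telnet) succeed?
    route_ok: bool         # is the route to the management IP known to be good?


def triage(s: Symptoms) -> str:
    """Return a first guess at where to look, following the article's reasoning."""
    if s.ping_ok and s.remote_login_ok:
        return "reachable: log in and check the configuration"
    if not s.route_ok:
        return "check routing to the management address first"
    if not s.ping_ok and not s.remote_login_ok:
        # Traffic addressed to the device itself is handled by its CPU, so
        # losing both ping and login with good routing points at the CPU.
        return "suspect CPU overload; on a Layer 2 network, suspect a loop"
    # Assumption, not from the article: only one of the two checks failed.
    return "partial reachability: check filtering or the login service"
```

For the situation in this story (no ping, no remote login, routing unchanged), `triage(Symptoms(False, False, True))` returns the "suspect CPU overload" hint, which matches the conclusion my colleague eventually reached.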
So when you hit a loop problem, the device's CPU is maxed out, and because it is hard to view or change anything on the device, it is also hard to troubleshoot. So I used the dumbest method there is: unplugging cables. On the aggregation switch, I unplugged the cables from the connected ports one by one. When the cable to the newly added device was unplugged, the network recovered. That pinned the problem on the newly added switch.

The configuration was so simple, and yet a loop occurred. It's outrageous. What is the only scenario in which a loop could occur here? I will tell you directly: the link aggregation configuration was incorrect. When they configured link aggregation, they configured only one end, pushing the link aggregation configuration to the aggregation switch alone. The two ports on the aggregation switch were bundled together, but link aggregation was never configured on the downstream access switch, so it had two independent ports.

Once the cable was unplugged and the network had recovered, I logged in to the device to check the configuration. I didn't know whether to laugh or cry. The problem was very simple: two commands were missing, the two link aggregation commands under the physical interfaces. That was all.

Such a simple omission caused the network loop, and spanning tree did nothing to stop it, even though spanning tree was in fact enabled. My colleague was very puzzled: "Why is there still a loop when I enabled spanning tree? Is there a bug in your device? Is there something wrong with your device?"

So why doesn't spanning tree work? Because from the aggregation switch's point of view, there is only one port, whereas from the access switch's point of view, there are two independent ports.
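To make that asymmetry concrete, here is a minimal Python sketch of a pre-change check that compares how each end of the links groups its ports. Everything in it is an assumption made for illustration: the interface names, the group number, and the dictionary layout are invented, and a real check would have to read the actual device configurations.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical inventory: interface name -> aggregation group (None = standalone).
AGG_SWITCH: Dict[str, Optional[int]] = {"Gi1/0/1": 1, "Gi2/0/1": 1}
ACCESS_SWITCH: Dict[str, Optional[int]] = {"Gi0/1": None, "Gi0/2": None}

# Physical cabling: (aggregation switch port, access switch port).
LINKS: List[Tuple[str, str]] = [("Gi1/0/1", "Gi0/1"), ("Gi2/0/1", "Gi0/2")]


def logical_ports(config: Dict[str, Optional[int]], ports: List[str]) -> Set[str]:
    """Collapse member ports of the same group into one logical port."""
    result = set()
    for port in ports:
        group = config[port]
        result.add(f"lag{group}" if group is not None else port)
    return result


def check_lag(agg: Dict[str, Optional[int]],
              access: Dict[str, Optional[int]],
              links: List[Tuple[str, str]]) -> List[str]:
    """Report a problem if the two ends bundle the same links differently."""
    agg_view = logical_ports(agg, [a for a, _ in links])
    access_view = logical_ports(access, [b for _, b in links])
    problems = []
    if len(agg_view) != len(access_view):
        problems.append(
            f"one-sided aggregation: aggregation side sees {len(agg_view)} "
            f"logical port(s), access side sees {len(access_view)}"
        )
    return problems


print(check_lag(AGG_SWITCH, ACCESS_SWITCH, LINKS))
```

With the sample data, the aggregation side collapses to one logical port while the access side still has two, so the check reports a one-sided aggregation. Assigning both access ports to a group makes the list of problems empty.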
A spanning tree BPDU sent by the access switch goes out an uplink and arrives at the aggregation switch, which will not send it back out of the interface it arrived on. Since the aggregation switch sees only one logical port there, the BPDU never comes back on the other link, so the devices do not conclude that there is a loop in the network. As a result, spanning tree is useless in this situation: the loop cannot be detected even with spanning tree turned on.

Later I configured link aggregation properly on both ends and the network recovered. It was just a very simple oversight. Two commands were missed when pushing the configuration, and that led to the network failure.

In fact, more than 50% of failures in production networks are caused by human error and configuration changes. Failures due to software or hardware problems in the device itself are relatively rare. Most are man-made, the result of unreasonable configuration or unreasonable planning, which leads to problems of one kind or another. Therefore: carefulness + solid fundamentals = a good network engineer.

I think many network engineers run into similar situations early in their careers, making one mistake or another that shouldn't have happened. It doesn't matter. As long as you master the theory and execute every command carefully, many problems can be solved easily.

For a network loop like mine, the troubleshooting steps are actually very simple. If there is no way to log in to your network devices, you can only fall back on the dumbest method, unplugging the cables one by one, because the device cannot be inspected and barely responds.

The above is the troubleshooting experience I wanted to share today. I hope it gives you some inspiration.