In practice: ping latency suddenly high? In a spanning tree architecture, even the Cisco switches that network engineers praise suffer from an old problem!

Background

Party A is a ship machinery parts manufacturer whose network uses a redundant spanning tree architecture built entirely on Cisco switches.

The root bridge is a core-layer switch that is used only for LAN communication. The access devices are industrial cameras and collectors, which send their data back to the central control console. The network topology is as follows:

Network topology description:

  • The enterprise has several processing workshops; each workshop network is in its own VLAN and is logically isolated;
  • The workshop switching networks are aggregated and connected to the core switching network;
  • The workshop switches are interconnected in a ring with STP enabled, and the blocked port is the switch interface on the wireless bridge link;
  • The core-layer switching network also runs STP, with redundant multi-link interconnection;
  • The interfaces that connect terminals such as industrial cameras and collectors are STP edge interfaces, so they are not included in topology change calculation.

Problem Description

Recently, IT staff found that accessing the industrial camera in workshop B from the console computer was very slow. The ping delay to the industrial camera and to the switch it is attached to was generally around 20 ms:

Note: The camera IP is 192.168.1.153, and the IP of switch No. 3 is 192.168.10.3

For more than half a year there had been no such problem, and the ping delay was stable at ≤1 ms. Then the fault appeared suddenly. The delay occurred at:

The problem looks tricky, so let's see how to analyze it!

Troubleshooting Analysis

Step 1: Check if key configurations have changed

In the design topology, the STP backup link is the wireless bridge backhaul link. Wireless latency is higher and less stable than wired latency, so could traffic have switched over to the wireless bridge? Check the configuration of switches 3 and 4:

Because the STP root bridge is in the core layer, all the workshop switches are non-root bridges, so each ring has to settle on a blocking port. Here switches 3 and 4 form a ring, and in the configuration port 12 on each of them has a slightly better priority than port 11, so the blocking port can only appear on port 11 of switch 3 or 4. In other words, the blocked backup link is the wireless bridge link, and the configuration is as expected.
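The port preference described above can be sketched roughly in Python. This is only an illustration of the usual STP comparison order (lowest root path cost, then lowest sender bridge ID, then lowest sender port ID, then lowest local port number). The class name, the function name, and every number below are assumptions made for the example, not values read from the switches in this case.

```python
from typing import NamedTuple

class Candidate(NamedTuple):
    """What a non-root switch knows about one uplink (all values hypothetical)."""
    root_path_cost: int    # cost to reach the root bridge through this port
    sender_bridge_id: int  # bridge ID of the neighbor on this port
    sender_port_id: int    # neighbor's port priority and number as one integer
    local_port: int        # local port number, used as the last tie-breaker

def pick_root_port(candidates):
    """Return the local port whose tuple is lowest, i.e. the preferred uplink."""
    return min(candidates).local_port

# Assumption: port 12 is the wired uplink, port 11 is the wireless bridge link.
ports = [
    Candidate(root_path_cost=100, sender_bridge_id=32768,
              sender_port_id=0x800B, local_port=11),
    Candidate(root_path_cost=19, sender_bridge_id=32768,
              sender_port_id=0x800C, local_port=12),
]

root_port = pick_root_port(ports)
# The port that loses the comparison is the one left as the backup candidate.
backup_ports = [c.local_port for c in ports if c.local_port != root_port]
print(root_port, backup_ports)
```

With these made-up values the wired port wins the comparison and the wireless port is left as the backup, which matches the intent of the configuration described in this step.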

Step 2: Confirm whether the spanning tree topology meets expectations

The switch configuration is confirmed to be correct. The next step is to check how the STP topology has converged. Here we focus on the equipment in processing workshop B, where the problem is; other areas can be ignored for now. Command:

 show spanning-tree interface

View the status of the relevant Cisco switch ports:

You can see:

  • On switch No. 3, which the terminal connects to, port 11 is an AP (alternate port) in the blocking state and port 12 is a DP (designated port) in the forwarding state;
  • On upstream core switches 11 and 12, the DP ports are in the forwarding state.

This shows that the switching network has converged as expected, which rules out the possibility that traffic is being forwarded over the wireless bridge and picking up latency there. Next, consider whether the delay is high because traffic passes through the core-layer network. To check, connect directly to access switch No. 3 and test.

Step 3: Confirm the delay at switch No. 3, to which the industrial camera is directly connected

Connect the PC directly to switch No. 3 and ping the IP addresses of the switch and the industrial camera at the same time:

The delay is present even with a direct connection, and the response delays of the terminal and the switch are the same. The switch itself is probably at fault and is introducing a forwarding delay. To verify this, the next step is to capture packets and look at the ICMP exchange.
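The reasoning in this step can be written down as a small Python helper. It is a sketch only: the function name, the 1 ms baseline (taken from the healthy figure quoted earlier), the 5 ms tolerance, and the sample round-trip times are all assumptions for illustration.

```python
from statistics import median

def locate_delay(switch_rtts_ms, camera_rtts_ms, baseline_ms=1.0, tolerance_ms=5.0):
    """Guess where the delay is added from two sets of ping round-trip times."""
    sw = median(switch_rtts_ms)
    cam = median(camera_rtts_ms)
    if sw <= baseline_ms and cam <= baseline_ms:
        return "no abnormal delay"
    # Both targets are slow by about the same amount: the shared hop,
    # the switch itself, is the likely source.
    if sw > baseline_ms and abs(sw - cam) <= tolerance_ms:
        return "delay appears at the switch"
    if cam > sw:
        return "delay appears beyond the switch"
    return "inconclusive"

# Hypothetical samples in the range reported in this case (around 20 ms).
result = locate_delay([19.0, 20.0, 22.0], [20.0, 21.0, 21.0])
print(result)
```

If only the camera were slow while the switch answered within the baseline, the same helper would point past the switch instead.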

Step 4: Capture packets on the PC interface

Open Wireshark on the PC and capture packets. The network turns out to be flooded with a large number of UDP unicast packets, at nearly 10,000 packets/second and a throughput of 100 Mbps:

This is very strange. The PC's own IP address is not 192.168.1.102, and the switch has no port mirroring configured, so how can the PC receive the unicast stream that industrial camera 192.168.1.153 sends to .102? After checking with the site, 192.168.1.102 turns out to be the collector: the industrial camera sends its video back to the central control console and also to the collector.
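The check made by eye in the capture, namely that the PC is receiving unicast frames addressed to some other host, can be expressed as a short Python test on the destination MAC. The function name and the MAC addresses are hypothetical. The only fact relied on is that group (broadcast or multicast) addresses have the lowest bit of the first octet set.

```python
def is_foreign_unicast(dst_mac: str, own_mac: str) -> bool:
    """True if a received frame is unicast but addressed to another host."""
    first_octet = int(dst_mac.split(":")[0], 16)
    is_group = bool(first_octet & 0x01)  # set for broadcast and multicast
    return not is_group and dst_mac.lower() != own_mac.lower()

OWN_MAC = "00:11:22:33:44:55"  # hypothetical MAC of the capturing PC

print(is_foreign_unicast("00:11:22:33:44:66", OWN_MAC))  # addressed to another host
print(is_foreign_unicast("ff:ff:ff:ff:ff:ff", OWN_MAC))  # broadcast
```

On a switched port without mirroring, a sustained stream of frames for which this returns True is a strong hint that the switch is flooding them.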

Given the above, there is only one root cause for the UDP unicast flooding: collector 102 is no longer on the network, but industrial camera 153 has a fixed destination IP and MAC for its stream and keeps streaming even when the target does not exist. The UDP stream therefore consists of unknown unicast frames, which the switch floods through the network like broadcasts!
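The behavior can be illustrated with a toy MAC-learning switch in Python. This is a deliberately simplified model and not how any Cisco platform is implemented: the class, the port numbers, and the MAC strings are made up, and the sketch simply drops learned entries when a port goes down instead of modeling aging timers.

```python
class LearningSwitch:
    """Toy layer 2 switch: learns source MACs, floods unknown destinations."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.mac_table = {}  # MAC address -> port it was learned on

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source, then return the list of egress ports."""
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if out_port is None:
            # Unknown unicast: send out of every port except the ingress port.
            return [p for p in self.ports if p != in_port]
        return [] if out_port == in_port else [out_port]

    def link_down(self, port):
        """Forget every MAC learned on a port that lost its link."""
        self.mac_table = {m: p for m, p in self.mac_table.items() if p != port}

# Hypothetical layout: camera on port 1, collector on port 2, test PC on port 3.
sw = LearningSwitch(ports=[1, 2, 3])
sw.receive(2, "collector", "camera")        # collector talks once and is learned
known = sw.receive(1, "camera", "collector")
sw.link_down(2)                             # the collector's cable comes loose
flooded = sw.receive(1, "camera", "collector")
print(known, flooded)
```

While the collector's entry exists, the stream leaves only through the collector's port. Once the entry is gone, the same stream also reaches the PC's port, which is what the capture showed.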

Step 5: Confirm whether the collector is online

The suspected cause is that collector 102 is offline, which turns the industrial camera's unicast stream into a flood of unknown unicast frames. Ping the collector from the PC to confirm connectivity:

Check the MAC table of switch No. 3:

The terminal is not present on the network. The cable may be loose, or the RJ45 connector may have aged.

Solution

Cause: The Cisco switch floods a large volume of unicast frames, which increases its own forwarding delay

  • The collector in processing workshop B dropped off the network because of a loose network cable and an aging RJ45 connector;
  • The industrial camera in processing workshop B keeps sending its UDP unicast stream to the collector's IP and MAC, at nearly 10,000 packets/second and a throughput of 100 Mbps;
  • Because the MAC address table on Cisco switch No. 3 has no entry for the destination MAC of these UDP packets, the stream is unknown unicast and is flooded like a broadcast;
  • Possibly because of a performance limit or some other unknown reason, flooding this volume of unicast frames produces a forwarding delay on the Cisco switch, so the PC sees high delay when accessing both the switch itself and the terminal.
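To get a feel for the size of the flooded stream, the two figures quoted above can be combined in a few lines of Python. The function names are illustrative, and the 1 Gbps link speed in the second calculation is purely an assumption, since the port speeds are not stated in this case.

```python
def average_frame_bytes(throughput_bps: float, packets_per_second: float) -> float:
    """Average bytes per packet implied by a bit rate and a packet rate."""
    return throughput_bps / 8 / packets_per_second

def link_utilization(stream_bps: float, link_bps: float) -> float:
    """Fraction of one egress link consumed by a flooded copy of the stream."""
    return stream_bps / link_bps

STREAM_BPS = 100e6   # 100 Mbps, as reported
PACKET_RATE = 10000  # about 10,000 packets/second, as reported

avg_bytes = average_frame_bytes(STREAM_BPS, PACKET_RATE)
utilization = link_utilization(STREAM_BPS, 1e9)  # assumed 1 Gbps port
print(avg_bytes, utilization)
```

So the reported numbers correspond to roughly 1250 bytes per packet, and because the stream is flooded, every other port in the VLAN has to carry its own copy of it.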

Solution: Fix the collector's network cable and RJ45 connector to restore its network connection

Once the collector is back online, switch No. 3 learns its MAC address entry again:

The unknown unicast frames become known unicast frames, and the switch forwards them as ordinary unicast. The network returns to normal and the latency drops:
