How to quickly troubleshoot data center networks

As a data center network grows, more network devices must be added and cascaded in multiple layers. Today's data centers are usually built as a tree: a few devices with large forwarding capacity sit at the core, and several layers of devices hang below them (more than one layer is often needed because a single device does not have enough ports). Dozens or even hundreds of network devices end up cascaded together. When a fault occurs, quickly finding the faulty device is a problem that troubles many network operations and maintenance engineers.

Network equipment in a data center is deployed redundantly. When a network failure occurs, service can be restored as soon as the faulty device is found and isolated, and the root cause can then be investigated at leisure. Finding the specific faulty device among hundreds, however, is not easy. Network failures are usually reported first by the application side, and troubleshooting starts from there. At that point the application staff often describe only a symptom, such as an application access failure. They do not say which addresses cannot reach which other addresses, and sometimes they even pass on wrong information, which greatly delays fault location. Most of the time spent locating a problem goes into sorting out the symptoms. So how can a data center network be troubleshot quickly? This article gives an answer.

Starting the analysis from the symptoms reported by the application side is already too late, and it is easy to be misled. Application staff report only what they see, which is often a local symptom that does not reflect the state of the whole network. Network engineers therefore have to rely on themselves: monitor the network well, discover problems through monitoring, and quickly find the faulty device so that it can be isolated or repaired.

Early network monitoring mainly watched device logs and port traffic. This information was often insufficient, and problems could not be discovered in time. Many network equipment vendors say that their device logs are very complete, but in practice there are still extreme cases or software bugs in which no log is produced when a fault occurs. The fault then has to be located by tracing traffic. Network engineers ask the application staff about the symptoms, identify on site some IP addresses that are losing packets or are unreachable, and then collect traffic statistics on every device along the path of the affected traffic until the faulty device is found. Because the network is a tree with many devices at each layer, the amount of statistics work is considerable. Moreover, not every device can count every kind of characteristic traffic. If a device on the path lacks that support, the statistics are incomplete, which makes the faulty device harder to find. This is how network operations and maintenance has been carried out over the years.
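The hop-by-hop comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `locate_loss`, the device names, and the idea that each device exposes a (received, sent) packet count for the traced flow are all hypothetical and do not correspond to any particular vendor's interface. A device that cannot count the flow is represented as `None`.

```python
from typing import Dict, List, Optional, Tuple

# (packets received, packets sent) for the traced flow on one device.
# None means the device cannot count this kind of traffic (assumed convention).
FlowCounters = Optional[Tuple[int, int]]


def locate_loss(path: List[str],
                counters: Dict[str, FlowCounters],
                tolerance: int = 0) -> List[str]:
    """Walk the forwarding path and return the devices suspected of dropping the flow.

    An empty list means no loss larger than `tolerance` was observed.
    """
    prev_dev: Optional[str] = None
    prev_tx: Optional[int] = None
    blind: List[str] = []  # devices since prev_dev that have no statistics

    for dev in path:
        stats = counters.get(dev)
        if stats is None:
            blind.append(dev)
            continue
        rx, tx = stats
        if prev_tx is not None and prev_tx - rx > tolerance:
            # Packets vanished between prev_dev and dev. If there were devices
            # without statistics in between, they are the suspects; otherwise
            # the loss is on the link, so both ends are returned.
            return blind if blind else [prev_dev, dev]
        if rx - tx > tolerance:
            # The device received the packets but did not forward them all.
            return [dev]
        prev_dev, prev_tx, blind = dev, tx, []
    return []


# Hypothetical path through a tree-shaped network; core1 has no flow statistics.
path = ["leaf1", "agg1", "core1", "agg2", "leaf2"]
counters = {
    "leaf1": (1000, 1000),
    "agg1": (1000, 1000),
    "core1": None,
    "agg2": (400, 400),
    "leaf2": (400, 400),
}
print(locate_loss(path, counters))
```

In this example the loss can only be narrowed down to the device without statistics, which shows why unsupported devices make the manual method less precise.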

The older troubleshooting methods work, but they are inefficient: locating a fault takes a long time, and the business impact is significant. Modern network monitoring is centered on data flows. Specific flows in the network are monitored, so that when a flow is interrupted the fault location can be found immediately. Several emerging monitoring methods, also referred to as network visualization technologies, deserve mention here as the most effective means of rapid troubleshooting.

  • The first is INT (In-band Network Telemetry). INT monitors the network by collecting and reporting network state in the data plane. When a packet enters the first network device, a sampling rule configured on that device samples and mirrors the service flow packet. INT then adds an INT header to the mirrored packet and fills the switch information to be collected into the INT data segment. Every network device the packet passes through processes it in the same way, until the last device, the one connected to the destination server, strips off the INT header. Each device on the path sends the collected INT information to a remote monitoring server in gRPC messages for parsing and presentation. The INT information carries data such as forwarding delay and device congestion, all of which can be displayed on the monitoring server. When packets are lost or a destination becomes unreachable, the monitoring server detects it immediately and can determine the scope of the problem and the faulty device within a few seconds.
  • The second is ERSPAN (Encapsulated Remote Switched Port Analyzer), a remote traffic mirroring technology that works across Layer 3 IP networks. ERSPAN copies the packets seen on a source port, encapsulates the copies in GRE (Generic Routing Encapsulation), and forwards them to any destination reachable by IP routing, where a server analyzes them. The physical location of the collection server is therefore not restricted. Key traffic from the whole network can be mirrored to the monitoring server through ERSPAN, and it becomes clear at a glance in which part of the network packets are being discarded.
  • The third is sFlow and NetStream. Both are traffic sampling technologies. NetStream collects more complete data but needs dedicated hardware. Once sFlow and NetStream are deployed, the monitoring data is sent to the monitoring server (natively as UDP datagrams, or over gRPC where the device supports it), which computes and sorts it and displays the results graphically. A problem in any part of the network then shows up on the monitoring server immediately. sFlow and NetStream collect only the main fields of the packet header rather than the whole packet, which is quite different from INT and ERSPAN. They handle most troubleshooting without difficulty, unless the application packets have unusual characteristics that NetStream cannot capture, in which case INT or ERSPAN has to be used.

There is no harm in deploying all three monitoring solutions in the same network: when a fault occurs, the problem can then be analyzed from data collected from several angles. Another important point is to send the collected data to the monitoring server over the management network wherever possible. Otherwise, a fault in the data network may prevent the monitoring data from reaching the monitoring server. In most cases a data network failure does not affect the management network, and all devices remain accessible. If a device cannot be reached through the management network during a failure, it can generally be concluded that this device is the fault point.
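To illustrate how a monitoring server might use the per-hop information that INT delivers, here is a minimal Python sketch. It assumes the reports for one sampled packet have already been received and decoded. The `HopReport` record, its field names, and the functions `locate_drop` and `slowest_hop` are hypothetical and are not part of any INT specification or vendor API. The idea is simply that the first device on the expected path that did not report marks where the packet was lost.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class HopReport:
    """Decoded telemetry from one device for one sampled packet (assumed fields)."""
    device: str
    hop_latency_us: int
    queue_depth: int


def locate_drop(path: List[str],
                reports: List[HopReport]) -> Optional[Tuple[Optional[str], str]]:
    """Return (last reporting device, first silent device), or None if every hop reported.

    The packet was most likely lost on the silent device or on the link leading to it.
    """
    seen = {r.device for r in reports}
    for i, dev in enumerate(path):
        if dev not in seen:
            upstream = path[i - 1] if i > 0 else None
            return upstream, dev
    return None


def slowest_hop(reports: List[HopReport]) -> Optional[HopReport]:
    """Return the report with the highest forwarding latency, a hint of congestion."""
    return max(reports, key=lambda r: r.hop_latency_us, default=None)


# Hypothetical three-hop path; the last device never reported the packet.
expected_path = ["leaf1", "spine1", "leaf2"]
received = [HopReport("leaf1", 12, 0), HopReport("spine1", 480, 37)]
print(locate_drop(expected_path, received))
print(slowest_hop(received))
```

A real collector would also have to group reports by flow and packet and tolerate late or missing reports before drawing a conclusion; the sketch leaves that out.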

With these monitoring methods, finding a fault as soon as it occurs is not difficult, and the process can be fully automated: when a fault is found, the monitoring server automatically sends an isolation command to isolate the faulty device, and service recovers automatically. In this way the fault can be located, the faulty device isolated, and the business restored before the application side even reports a problem. This greatly shortens fault analysis time and keeps the business impact small; the business side may not notice the fault at all. How well technologies such as INT and ERSPAN work in real deployments is still unknown. They have only recently come into discussion and need to be proven in practice. sFlow and NetStream are relatively mature, but they are not yet widely used for network troubleshooting and deserve more promotion in this role. With these monitoring technologies, network faults can be cleared quickly, which matters greatly for data center operations and considerably improves operations and maintenance efficiency.
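The automated isolation step could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the class `IsolationController`, the `isolate` callback standing in for whatever command the monitoring server would send, the mapping of each device to a redundant peer, and the rule of waiting for several consecutive fault detections. The two safeguards (repeated detections, and never isolating a device whose redundant peer is already isolated) are design choices of this sketch, not something the article prescribes.

```python
from collections import defaultdict
from typing import Callable, Dict, Set


class IsolationController:
    """Decide when the monitoring server should isolate a device (illustrative only)."""

    def __init__(self,
                 isolate: Callable[[str], None],
                 peers: Dict[str, str],
                 threshold: int = 3) -> None:
        self._isolate = isolate      # hypothetical hook that sends the isolation command
        self._peers = peers          # device -> its redundant peer
        self._threshold = threshold  # consecutive fault detections required
        self._strikes: Dict[str, int] = defaultdict(int)
        self.isolated: Set[str] = set()

    def report(self, device: str, faulty: bool) -> bool:
        """Record one monitoring result; return True if the device was isolated now."""
        if not faulty:
            self._strikes[device] = 0
            return False
        self._strikes[device] += 1
        peer = self._peers.get(device)
        # Only isolate when a redundant peer exists and is still in service.
        peer_in_service = peer is not None and peer not in self.isolated
        if (self._strikes[device] >= self._threshold
                and device not in self.isolated
                and peer_in_service):
            self._isolate(device)
            self.isolated.add(device)
            return True
        return False


# Hypothetical usage: commands are collected in a list instead of being sent.
sent_commands = []
controller = IsolationController(sent_commands.append,
                                 {"spine1": "spine2", "spine2": "spine1"},
                                 threshold=2)
controller.report("spine1", True)
controller.report("spine1", True)
print(sent_commands)
```

Keeping the decision logic separate from the command that is actually sent makes it possible to test the policy without touching real devices.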
