A brief discussion on SD-WAN troubleshooting

What do you do when your SD-WAN has a problem or you suspect it’s causing application problems? Troubleshoot, of course.


But SD-WAN troubleshooting requires IT teams to have a very good understanding of the network devices, connections, and topologies they are dealing with, as well as many other factors. Here are some helpful monitoring and practical troubleshooting steps that IT teams can follow when dealing with SD-WAN issues.

The first step in SD-WAN troubleshooting is to understand when the network is not functioning properly. In most cases, monitoring an SD-WAN is not much different than monitoring a regular network. Physical components are usually the easiest to monitor: they either work or they don't. Logical functions can be more challenging because abstractions can make multiple network links appear as if they are one.

Monitoring SD-WAN

1. Event handling.

The most useful element of a good network management architecture is examining events from network devices, including SD-WAN devices. Think of events as the network letting you know that something noteworthy has happened. Because the devices send events on their own, the process does not require polling, and it scales as the network grows.

I prefer syslog events to Simple Network Management Protocol (SNMP) traps because they do not require a specific management information base to be loaded into the management system to view the details. IT teams should configure SD-WAN devices to send events to a common event processing system where they can be stored, correlated, and acted upon.

Organizations with limited budgets can use open source collectors such as syslog-ng, along with various analysis tools, to summarize the large number of events that a network can generate. Organizations with larger budgets can look into the ELK stack: Elasticsearch, Logstash, and Kibana. If you need vendor support, consider a vendor-supported version of ELK, your equipment vendor, or a log-processing vendor.

Event handling systems should be configured to automatically generate trouble tickets or send real-time alerts to the IT organization when critical events are detected. All events should be reported in daily or weekly summaries to ensure that missed events are eventually seen - for example, it would be good to know that half of a redundant design is not working.
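As a rough illustration of the alert-versus-summary split described above, here is a minimal Python sketch. It assumes events arrive as raw syslog lines that begin with the standard <PRI> field, where the priority value is the facility multiplied by 8 plus the severity. The function names and the alert threshold (severity 2, "crit", or more urgent) are my own assumptions for the example, not something a particular collector provides.

```python
import re

# Matches the leading "<PRI>" field of a syslog message.
PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_priority(line):
    """Split a raw syslog line into (facility, severity, rest_of_message)."""
    match = PRI_PATTERN.match(line)
    if match is None:
        raise ValueError("no syslog priority field found")
    pri = int(match.group(1))
    if pri > 191:
        raise ValueError("priority value out of range")
    # PRI = facility * 8 + severity; severity 0 is the most urgent.
    return pri // 8, pri % 8, match.group(2)


def triage(line, alert_at_or_below=2):
    """Return 'alert' for critical events and 'digest' for everything else.

    The default threshold is an illustrative assumption; tune it to
    whatever your team treats as ticket-worthy.
    """
    _, severity, _ = parse_priority(line)
    return "alert" if severity <= alert_at_or_below else "digest"
```

Events that come back as 'alert' would open a ticket or page someone, while 'digest' events would only be counted for the daily or weekly summary.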

2. Active link test.

SD-WAN uses multiple links to provide reliable end-to-end service. Active link monitoring allows the system to verify that the SD-WAN is actually providing the required reliability. Multiple tests may be required to verify the paths for different types of traffic, such as real-time data versus batch data. As the number of SD-WAN sites increases, being able to deploy these tests easily becomes critical to a successful implementation.

Make sure the test is configured to simulate actual application traffic, including packet size, transmission rate, and quality of service markings. An advantage of active link testing is that it can detect problems outside of normal business hours when there is no application traffic. Active link testing simulates realistic application traffic and tests the entire end-to-end system, including link selection.

IT teams can use this type of test during a proof-of-concept evaluation by disabling each WAN link and monitoring how the test results change. This is particularly useful for determining how well an inexpensive broadband link can handle high-priority or real-time traffic when a low-latency path is down. Configure the test to run continuously so you can also get a sense of how well an application is likely to perform at different times of the day. You may also want to know the performance level when other applications are running, such as backups or database synchronization, or when the broadband network is busy.
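To make the test results comparable across runs, it helps to reduce each run to a few numbers. The Python sketch below assumes each run yields one round-trip time per probe sent, with None where no reply arrived. Sending the probes themselves, with the right packet size and quality of service markings, is left to your test tool. The function names and policy keys are hypothetical, and jitter is approximated as the mean difference between consecutive replies.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Reduce one test run to loss, latency, and jitter figures.

    rtts_ms holds one round-trip time in milliseconds per probe sent,
    or None where no reply arrived.
    """
    if not rtts_ms:
        raise ValueError("no probes were sent")
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(replies)) / len(rtts_ms)
    latency_ms = mean(replies) if replies else None
    # Jitter: mean absolute difference between consecutive replies.
    deltas = [abs(b - a) for a, b in zip(replies, replies[1:])]
    jitter_ms = mean(deltas) if deltas else 0.0
    return {"loss_pct": loss_pct, "latency_ms": latency_ms, "jitter_ms": jitter_ms}


def meets_policy(summary, policy):
    """Check a run summary against per-application limits (hypothetical keys)."""
    if summary["latency_ms"] is None:
        return False  # every probe was lost
    return (summary["loss_pct"] <= policy["max_loss_pct"]
            and summary["latency_ms"] <= policy["max_latency_ms"]
            and summary["jitter_ms"] <= policy["max_jitter_ms"])
```

Running the same summary before and after disabling a WAN link gives a direct before-and-after comparison for each traffic type.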

3. Physical state.

SD-WAN appliances are typically based on x86 systems with internal CPUs, memory, interfaces, power supplies, and cooling. Network events (usually syslog) should report problems with these components. Monitoring with SNMP can provide additional data about the use of these resources and provide answers to questions such as:

How many buffers are in use on each path?
Is the CPU saturated at critical times of the day?
Is the power supply functioning properly, or is the AC mains input fluctuating beyond the specifications that the power supply can handle?

The default configuration for parameters such as buffering is usually correct, but sometimes you need to be able to modify the number of buffers to suit the functional characteristics of your application, such as processing a large number of very small packets. Make sure you can modify the queue depth as needed.

You should verify that the SD-WAN controller provides alerts and reporting when there are issues with the physical link. It should be able to detect flapping links, interface errors, packet loss due to congestion, and duplex mismatches. Duplex mismatches are still a common problem, so use auto-negotiation whenever possible. Use daily or weekly reports to identify alerts that may have been overlooked.
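If your controller does not flag flapping links on its own, the idea is simple enough to sketch from the up/down events you already collect. The Python below assumes you have the timestamps (in seconds) of each state change for one interface. The window length and transition limit are illustrative defaults, not vendor values.

```python
def is_flapping(transition_times, window_s=300, max_transitions=4):
    """Return True if more than max_transitions state changes fall
    within any span of window_s seconds.

    transition_times: timestamps in seconds of link up/down events
    for a single interface, in any order.
    """
    times = sorted(transition_times)
    start = 0
    for end, current in enumerate(times):
        # Slide the window start forward until it spans at most window_s.
        while current - times[start] > window_s:
            start += 1
        if end - start + 1 > max_transitions:
            return True
    return False
```

A link that bounces five times in a minute is reported, while one that changes state a few times over a whole day is not.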

4. Topology diagram.

Understanding the topology is important when troubleshooting, but manually updating the topology map is a time-consuming and error-prone process. Look for SD-WAN control systems that provide dynamic mapping of the physical and logical topology. A baseline acts as a network source of truth for the SD-WAN physical topology, and understanding the difference between the actual state and the desired state can make SD-WAN troubleshooting much easier.
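Comparing the actual state with the desired state amounts to a set difference. As a minimal Python sketch, assume both the baseline and the observed topology are available as lists of (device, device) link pairs. How you export them depends on your control system, and the function names here are hypothetical.

```python
def normalize(links):
    """Treat each link as undirected by sorting its two endpoints."""
    return {tuple(sorted(link)) for link in links}


def topology_drift(baseline_links, observed_links):
    """Report links missing from, or unexpected in, the observed topology."""
    baseline = normalize(baseline_links)
    observed = normalize(observed_links)
    return {
        "missing": sorted(baseline - observed),
        "unexpected": sorted(observed - baseline),
    }
```

An empty result in both lists means the network matches its baseline. Anything under "missing" is a good place to start troubleshooting.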

Identify the problem

The key to troubleshooting network problems is to be methodical. Start at one end and work toward the other, or use a divide-and-conquer strategy. Based on the symptoms, determine the type of problem that may exist. The Open Systems Interconnection model makes it easy to determine the type of problem and direct troubleshooting in the right direction, for example:

Physical problems, such as failed interfaces;
Link problems, such as duplex mismatch;
Routing issues, such as only some destinations being reachable even though single-hop tests succeed;
Application issues, such as a firewall or maximum transmission unit (MTU) mismatch.

If some of the data passes the test, the lower-level functionality is likely working properly, so you can focus your work on the higher levels.

SD-WAN Troubleshooting Steps

The analysis of the problem usually includes the following points:

Verify basic functionality of the SD-WAN node. This step checks the CPU, memory, and interface connectivity. The node should be able to communicate with the controller and download its configuration.
Check basic interface functionality. The required interfaces should be up and communicating with the device at the other end of the link. Basic connectivity should be established with the SD-WAN controller in order to download its configuration.
Verify VPN functionality. SD-WAN products create a logical VPN overlay on top of the physical topology. You need to understand how the VPN's encryption process works, how it can fail, and how to verify that it is working properly.
Integration with the overall routing architecture. SD-WAN devices enable multiple links to function as if they were one link. Reachability information for each site needs to be communicated to the other sites without impacting the overall routing architecture - i.e., no routing black holes, routing loops, or unreachable subnets. You need to understand how route distribution works and how to troubleshoot it.
Verify forwarding policies. Are packets taking the appropriate paths between SD-WAN devices? SD-WAN devices measure latency, packet loss, and jitter between them and use policies to determine which link each application should use. When a link fails for an application—or it falls outside the specification for that traffic type—traffic is moved to another link, which can affect the applications that moved, as well as the applications that used the links that are still functioning. This analysis may require some low-level commands to access detailed data.
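To reason about whether packets are taking the appropriate path, it can help to write down the decision you expect the device to make. The Python sketch below is a simplified model, not any vendor's algorithm: it assumes each link has measured latency, loss, and jitter, and each application policy has limits plus a preferred link order. All field names are hypothetical.

```python
def within_policy(link, policy):
    """True if the link's measurements satisfy the policy limits."""
    return (link["latency_ms"] <= policy["max_latency_ms"]
            and link["loss_pct"] <= policy["max_loss_pct"]
            and link["jitter_ms"] <= policy["max_jitter_ms"])


def select_link(links, policy):
    """Return the first preferred link that is up and within policy,
    or None if no link qualifies."""
    by_name = {link["name"]: link for link in links}
    for name in policy["preferred_order"]:
        link = by_name.get(name)
        if link is not None and link["up"] and within_policy(link, policy):
            return name
    return None


# Example measurements and a voice policy (illustrative values only).
links = [
    {"name": "mpls", "up": True, "latency_ms": 30, "loss_pct": 0.1, "jitter_ms": 3},
    {"name": "broadband", "up": True, "latency_ms": 60, "loss_pct": 0.5, "jitter_ms": 12},
]
voice = {"max_latency_ms": 150, "max_loss_pct": 1.0, "max_jitter_ms": 30,
         "preferred_order": ["mpls", "broadband"]}
```

If the model and the device disagree about which link an application should be using, either the policy or the measurements deserve a closer look.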

The command line interface is useful when you need low-level details. These commands will include show commands for checking system status and test commands such as ping and traceroute. Learn how to apply them to the testing of individual links as well as application flows.

Packet captures may be necessary to diagnose application problems that would otherwise be inexplicable. Wireshark's TCP sequence-space graphs, which are built from packet capture files, are a useful tool for this.

WAN carrier link problems

You need to understand the link characteristics of packet loss, latency, and jitter. Do they comply with the policies you define? Does the link perform according to the service level agreement (SLA) defined by the link provider? An MPLS link may have an SLA, while a cheap broadband link may not.

A divide-and-conquer approach may be necessary here. Selectively enable only one physical link at a time and verify that the link is functioning properly. Then, try combinations of links, eventually getting to a point where all links are functioning. Don't forget to check that the policy is correct. Link characteristics may change, rendering those links unacceptable for any policy.

A good approach is to generate a weekly report on link characteristics and usage. For large SD-WAN implementations, the report itself would be too large to be useful, so filter the results to show only those links whose characteristics do not match any policy.

Check for MTU mismatches. Applications that use small packets may work, but those that require larger packets may not. If ping and terminal connections succeed, but file transfers, backups, and database synchronization fail, consider an MTU issue.
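One way to confirm an MTU issue is to find the largest packet that crosses the path without fragmentation. The Python sketch below covers only the arithmetic and the search. It assumes you supply a packet_fits function (hypothetical) that sends a don't-fragment probe of a given total size and reports whether it got through. The search bounds of 576 and 1500 bytes are illustrative defaults.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def find_path_mtu(packet_fits, low=576, high=1500):
    """Binary-search the largest packet size for which packet_fits(size) is True.

    packet_fits is a caller-supplied probe. Returns None if even the
    smallest size fails, which points to a problem other than MTU.
    """
    if not packet_fits(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2
        if packet_fits(mid):
            low = mid
        else:
            high = mid - 1
    return low
```

For a standard 1,500-byte Ethernet MTU, max_ping_payload gives 1,472 bytes. If the search settles well below the interface MTU, tunnel overhead or a misconfigured hop is a likely suspect.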

Duplex mismatch. Check the interface statistics to determine if there is a duplex mismatch, even if you cannot check the configuration of each interface on the Ethernet link. Full-duplex interfaces will show runt packets received, and half-duplex interfaces will show late collisions. These counters should normally hold small values; if there is a mismatch, they will keep incrementing on an active link.
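Because it is the increase that matters rather than the absolute value, compare two snapshots of the counters taken a few minutes apart. This Python sketch assumes you have already collected the counters into dictionaries. The key names are hypothetical and will differ by platform.

```python
def duplex_mismatch_hint(previous, current):
    """Compare two snapshots of interface counters taken some time apart.

    Returns a short hint about which side of a suspected mismatch this
    interface is on, or None if neither counter increased.
    """
    runts = current["runts"] - previous["runts"]
    late = current["late_collisions"] - previous["late_collisions"]
    if late > 0:
        # Late collisions are the symptom seen on the half-duplex side.
        return "half-duplex side of a possible mismatch"
    if runts > 0:
        # Runts received are the symptom seen on the full-duplex side.
        return "full-duplex side of a possible mismatch"
    return None
```

A hint from this check is only a starting point. Confirm it by checking the negotiated speed and duplex on whichever end of the link you can reach.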

In conclusion

Troubleshooting is half art and half science. I recommend learning how a specific SD-WAN product works and what SD-WAN troubleshooting tools exist during the initial proof-of-concept phase. Create a simple text document that describes the basic steps to take for a specific SD-WAN vendor. This will simplify the SD-WAN troubleshooting process when problems arise in the network.

Original link:

https://searchnetworking.techtarget.com/feature/A-deep-dive-into-SD-WAN-troubleshooting-and-monitoring
