About the Author: Lightworker, a network technology expert at Ctrip, focuses on optical fiber communication and DCI transmission technology.

1. Background

An Optical Transport Network (OTN) is a communication network built on fiber optic technology: it uses optical fiber as the transmission medium and carries information as light. With DWDM (Dense Wavelength Division Multiplexing) and protection switching, it provides high-bandwidth, low-latency, highly reliable data transmission, so it is widely used to interconnect multiple data centers. Large Internet companies in China and abroad can greatly reduce the cost of data transmission between IDCs by leasing operators' optical fibers and building their own transmission networks. Ctrip likewise has its own optical transmission network (TOTN), which mainly carries cross-data-center backbone traffic and IT office Internet traffic.

As the underlying physical network, TOTN sits directly on the operators' optical cables and has to cope with frequent cable failures. Domestic infrastructure is still being built out, and operator optical cables are often cut during construction work. According to statistics from the US operator Level3, its optical fiber network is interrupted once per thousand kilometers per year; China Telecom has more than 50 trunk optical cable interruptions per year; and in India there are several or even dozens of interruptions almost every day. These figures suggest that the number of optical cable interruptions is closely related to the level of local social and economic development. Since Ctrip TOTN was established, it has detected more than 20 optical cable outages per year on average.
Therefore, if the optical network can, while providing large transmission capacity, switch automatically when an optical cable fails, so that service bandwidth is unaffected and the failure is not even noticed, network reliability improves greatly.

Figure 1 The optical cable was cut

2. Overall Architecture

Ctrip's transmission network is designed with dual planes plus protection. Each IDC deploys two completely independent sets of transmission equipment, connected to two optical fibers that follow different routes, forming two completely independent transmission planes.

Figure 2 TOTN topology diagram

Under normal conditions, services are carried on direct links. When the primary optical cable is cut, the transmission system switches services to the backup channel. The switching time between the primary and backup channels complies with the ITU-T G.783 and ITU-T G.841 standards and is less than 50 ms.

Figure 3 Optical network protection
Figure 4 Service flow when optical cable fails

This protection mechanism switches services automatically when an optical cable is cut, with no bandwidth loss, and can withstand the extreme case of cables being cut at two locations simultaneously. One problem, however, has kept bothering us: when the transmission system switches, the ports of the network devices at both ends flap, which causes errors in business services.

3. Problem Analysis

The time a network device interface takes to go from down to up varies with the device and the optical module, and Layer 2 and Layer 3 convergence time is uncertain because it depends on the network architecture (it is usually considered a second-level interruption). As a result, every transmission switch causes a period of service unavailability, which typically shows up as errors in sensitive services such as Redis.
As an in-memory database, Redis is very sensitive to network jitter and notices almost every optical cable interruption and switch. For example, at 12:00 on March 17, the optical fiber on transmission plane A was cut, and packets were lost in the CSR direction of the backbone network.

Figure 5 Backbone network error

In another case, at 19:44 on September 11, the optical cable of plane B was cut, and Redis reported a large number of errors during transmission switching, as shown in the following figure:

Figure 6 Redis error

The industry has no mature standard solution to network device port flapping caused by transmission switching. From our survey of other Internet companies, the more common approach is to configure link-delay on the switch interface: after the network device receives the link interruption signal, it waits for a period before setting the link status to down. If the link recovers within that period, the link status stays up and no down event is generated, which avoids frequent link flapping. We tried this method too, but ran into problems such as devices not supporting it and configurations not taking effect, and could not achieve the expected result. The reason is that link-delay is not an IEEE standard, and network devices from different vendors support it differently.

For this reason, we could only distribute transmission services across different optical cable routes, so that at least half of the services are unaffected when one cable is interrupted. This still does not stop the affected services from noticing the failure. For example, if 200G of service is required from end A to end Z, it must be split across the two planes, with each 100G service taking part in the switching of its own plane.
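The link-delay behavior described above can be sketched as a simple hold-down timer. This is a toy model for illustration only, not any vendor's implementation: the class name `LinkDelayPort`, the `hold_ms` parameter, and the 100 ms value are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LinkDelayPort:
    """Toy model of an interface with a link-delay (hold-down) timer."""

    hold_ms: int  # hypothetical hold time before the link is declared down

    def down_events(self, outages: List[Tuple[int, Optional[int]]]) -> List[int]:
        """Return the times (ms) at which the port is declared down.

        Each outage is (start_ms, end_ms); end_ms=None means the signal
        never comes back.
        """
        events = []
        for start, end in outages:
            # The port goes down only if the signal is still lost when
            # the hold timer expires.
            if end is None or end - start >= self.hold_ms:
                events.append(start + self.hold_ms)
        return events


port = LinkDelayPort(hold_ms=100)
# A 50 ms protection switch is absorbed by the hold timer.
print(port.down_events([(0, 50)]))       # []
# A permanent loss is still reported, but only after the hold time.
print(port.down_events([(1000, None)]))  # [1100]
```

The second call shows the cost of the approach: a real failure is reported later by exactly the hold time, so a longer hold also means slower detection.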
Figure 7 Business allocation diagram

In addition, our survey found that some companies set the delay to 2 s to make link-delay effective. Although this setting lets transmission protection switching complete unnoticed, once the protection mechanism fails, switching at the routing level loses 2 s of valuable time.

4. Technical Research

In 2023, TOTN introduced a DCI product that supports 5 ms switching. It reduces the transmission switching time from 50 ms to 5 ms in two ways. First, it uses a magneto-optical switch. A magneto-optical switch relies on the Faraday rotation effect: changing the external magnetic field changes how the magneto-optical crystal rotates the polarization plane of the incident polarized light, which switches the optical path. Because there are no mechanical moving parts, reliability is high and switching is fast. Second, the optical cable parameters of the backup channel are loaded into the DSP chip in advance, which saves the time of recalculating them during switching.

Figure 8 Optical switch principle

We hoped to solve the port flapping of network equipment by shortening the optical switching time. In practice, however, even with the transmission switching time compressed to 5 ms, the ports still flapped. After studying and debugging the product parameters, we found that when the optical cable is interrupted, the transmission optical layer sends an AIS signal to the electrical-layer boards at both ends. On receiving the AIS signal, the electrical-layer boards send a Local_Fault alarm to the network device, and when the network device receives the alarm, the port goes down (IEEE 802.3ae).
By configuring the transmission system to delay sending this signal (default 4*50ms), the signal is never sent to the network device as long as transmission switching completes within the delay window, so the port does not flap.

Figure 9 Schematic diagram of fault signal transmission

After the DCI product achieved imperceptible switching, we looked for similar adjustable parameters in our existing traditional products. Because delayed alarm forwarding is independent of the 5 ms switching time, making the network device port unaware of optical cable jitter would greatly improve service stability even when the switching time is 50 ms.

5. Optimization Plan

To achieve imperceptible switching on the traditional products, we consulted the vendor. The conclusion was that the 100GE service mapping mode had to be changed from BIT transparent mapping to MAC transparent mapping (a change that interrupts service), and the alarm parameter then set to delay forwarding by 200 ms. Since TOTN had never used MAC transparent mapping, we worked with the equipment vendor to test MAC mapping against BIT mapping in the laboratory. The tests showed no difference in throughput between the two modes, but a difference in latency. With BIT mapping, latency is 24 us for frames of 64 to 9600 bytes. With MAC mapping, latency grows with frame length but reaches only 25 us at the maximum frame length of 9600 bytes, a difference that can be ignored.

Figure 10 Experimental environment topology
Figure 11 RFC2544 test results

We therefore drew up an optimization plan: adjust transmission plane A first, then adjust plane B after a period of gray-release operation.

6. Verification Effect

On August 18, transmission plane A was optimized: 100GE services were mapped with MAC transparent mapping, and the alarm parameter delay was set to 200 ms. Testing verified that active/standby switching of the transmission optical cable is now invisible to both the network device ports and Redis. Real faults confirmed this as well: at 15:13 on September 7, the optical cable on transmission plane A was cut, and Redis showed no abnormal error spike.

Figure 12 Redis error after optimization

After a month of gray-release verification, we optimized transmission plane B on September 15 and further shortened the alarm delay from 200 ms to 100 ms. Tests again verified that Redis noticed nothing.

7. Future Planning

To keep the architecture uniform, we will redefine the technical standards for Ctrip's optical network equipment and require newly connected OTN equipment to support alarm delay insertion under BIT mapping. We will also encourage all suppliers to fully support this function, making it a best practice for optical cable failure scenarios.

Withstanding optical cable failures is a recognized problem in the industry, and leading Internet companies have all run into it. Through the practices above, we have reached a leading level in withstanding optical cable failures. Optical network operation and maintenance is a long-term effort, and imperceptible switching is only a small part of it. More important are alarm discovery, performance monitoring, and optical cable route identification to avoid primary and backup paths sharing the same route.
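As a recap of the alarm-delay reasoning in sections 4 to 6, the condition for a port flap can be written as a one-line check. This is a minimal sketch under a stated assumption: the Local_Fault alarm is forwarded only if the fault still exists when the delay expires. The function name `port_flaps` is hypothetical, and a delay of 0 ms stands in for "no alarm delay configured".

```python
def port_flaps(switch_ms: float, alarm_delay_ms: float) -> bool:
    """Return True if the Local_Fault alarm reaches the network device.

    Assumption: a protection switch that finishes within the alarm
    delay window is never seen by the port.
    """
    return switch_ms > alarm_delay_ms


# Switching times and delays taken from the article; 0 ms is an assumed
# stand-in for a system that forwards the alarm immediately.
for switch_ms in (5, 50):
    for delay_ms in (0, 100, 200):
        state = "flaps" if port_flaps(switch_ms, delay_ms) else "stays up"
        print(f"switch={switch_ms}ms delay={delay_ms}ms -> port {state}")
```

Under this model, both the 5 ms and the 50 ms switch are hidden by a 100 ms or 200 ms delay, while without a delay even the 5 ms switch takes the port down, which matches the behavior reported above.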