Preface

The exploration of DCI technology has become very popular recently, especially since SDN-WAN began to attract wide attention, and articles introducing DCI technology have appeared one after another. This article focuses on the technology and current state of the transmission-network part of the DCI network, and hopefully it will be of some help to readers.
1. The Origin of the DCI Network

Data centers were originally quite simple: a spare room with a few cabinets, a few high-power air conditioners, a single feed of ordinary mains power, and a few UPSs made a data center. A data center of this type, however, is small in scale and has low reliability. Since the late 1990s the Internet has developed rapidly and demand for data centers has skyrocketed, exposing problems this type of data center cannot solve: insufficient space, insufficient power supply, no redundancy, and no SLA guarantee. Users therefore began looking for a second data center in which to deploy their services. At that point the new and old data centers needed network interconnection, giving rise to the first DCI (Data Center Interconnect) networks, which involve technologies at both the physical-network and logical-network levels. The earliest DCI connections ran directly over the Internet; later, encryption was added for security, dedicated lines were leased for quality of service, and direct fiber connections were used for bandwidth.

2. Development of the DCI Network

The DCI network has developed from Internet interconnection, to leased lines of a few megabits, to today's wavelength-division interconnects carrying tens of terabits. This did not actually take long; objectively it was a response to the growth of the Internet. Initially, users transmitted their services directly over the public network through VPN tunnels. Because this method is affected by the public-network environment (backbone congestion, inferior routing, line jitter, link resets, firewalls, and so on) and by cost constraints, it is suitable only for small traffic volumes with low requirements on bandwidth quality, real-time behavior, and security.
Later, services in the data center received more and more attention and were deployed at scale, with server counts growing rapidly. Once large numbers of services were deployed, the traffic they sent across the network had a growing impact on companies and whole industries, so the requirements on the network rose, first of all in bandwidth and link stability. Data center users therefore began to lease operator circuit dedicated lines; MSTP dedicated lines carried on SDH networks sold well thanks to their high stability, larger bandwidth, and the operators' high degree of infrastructure reuse. As business continued to grow explosively, traffic between data centers began to carry latency and redundancy requirements; financial customers in particular had very strict latency requirements, so expectations for dedicated lines rose further. Users began to demand larger granularity, such as 2.5G and 10G single-link bandwidth, as well as dual-route protection to keep the SLA at four to five nines. Even so, the momentum of Internet growth was astonishing: workloads such as log transmission and database synchronization expanded rapidly. Weighing cost, delivery time, and service quality, leading companies (especially big-data Internet companies such as Google and FB) began building their own DCI networks by leasing dark fiber, without operator involvement. In the early days of dark fiber, a single signal ran over a single fiber pair; for example, a fiber pair with 10G ZR modules can reach 80 kilometers, enough for transmission between typical data centers in the same city. The disadvantage of this method is that, on the one hand, fiber consumption grows linearly with bandwidth and cost rises accordingly.
On the other hand, the bandwidth utilization of a single fiber is too low (and, for operations, resource and route management of dark fiber has always been a long-standing headache). Moreover, 10G per fiber could no longer keep up with business growth, so the DCI network entered the WDM era. In the WDM era, two approaches appeared in DCI networks: coarse wavelength division (CWDM) and dense wavelength division (DWDM). Initially, for cost reasons, some users built DCI interconnects with 10G CWDM optical modules, but such a system supports at most 16 waves of 10G, EDFAs cannot amplify most of the wavelengths CWDM uses, and its unamplified reach is quite limited. As demand for large-scale data transmission kept growing, users had to move to DWDM systems with higher capacity and performance.

DCI network structure of a pure DWDM system: DWDM is the most important large-bandwidth transmission system in today's communication networks. Weighing business volume, cost, and operations, many companies initially inserted 10G colored optical modules into their switches and fed the colored Ethernet signals directly into passive DWDM equipment. Such a system is simple to operate and maintain; at a controllable cost it can generally carry 40 waves of 10GE for a total system bandwidth of 400G, without excessive network-operations overhead, and ordinary IP network engineers can run it after inexpensive training. It was once widely used by companies that needed it. But the Internet kept booming, and single-wave 10GE signals soon could not meet demand.
At that point a higher single-wave rate was needed, such as single-wave 40GE or 100GE. However, 40GE/100GE colored optical modules that could be placed in a switch had not yet appeared on the market, and remained expensive for a long time after they did; the business could not wait, so another path had to be found. Thus OTN, a heavyweight of telecommunications networks, appeared in the DCI networks of Internet companies such as Google and FB. When the Internet industry began to use OTN, it basically started at 10GE. By then the 10G colored-optics-plus-passive-DWDM approach could no longer meet business growth, nor could it satisfy the operational requirements of networks at scale, especially long-haul networks, because it offered no optical-layer management. In addition, after 100G OTN launched and matured for several years, especially after several rounds of centralized procurement by the operators, its cost dropped significantly. For these reasons, the 10G-colored-optics-plus-DWDM approach was slowly replaced by OTN systems. Thanks to technological progress and falling costs, 40G OTN was essentially skipped in Internet-industry DCI networks, which upgraded directly from 10G to 100G systems. It was recognized that a 100G system widens the blast radius of any single failure, but business growth remained the first priority, so the line side (the signals facing the transmission fiber) was upgraded directly from 10G to 100G, ensuring that the line-side WDM system could meet long-term bandwidth growth.
As for the client side (the signals facing the switches), there were many existing 10G client systems; to protect that investment it was necessary to stay compatible with 10G-granularity services while allowing a later upgrade to 100G on the client side. To let 10G and 100G systems coexist during DCI upgrades, a service card mapping 10 ODU2s into 1 ODU4, with a 10x10G client side and a 1x100G line side, appeared and has been widely used. The OTN network, with its rich management overhead, high reliability, diverse protection schemes, centralized professional NMS platform, and large bandwidth, has indeed played an extremely important role in the development of the Internet, making network operations more professional and specialized and, most importantly, meeting the rapid growth of Internet services. (Figure: typical topology of a point-to-point DCI network using OTN.) OTN is no longer a proprietary technology of telecommunications networks; the rise of the Internet has brought this traditional telecom technology into the DCI industry.

3. Current Operational Practice of DCI Networks

After OTN was introduced into the DCI network, a whole new discipline was added to operations. The traditional data center network is an IP network, a logical-network technology, whereas the OTN in DCI is a physical-layer technology; making it cooperate smoothly and conveniently with the IP layer is a long road for operations. The purpose of OTN-based operations is the same as for every subsystem in the data center: maximize the return on the expensive resources invested in the infrastructure and provide the best possible support for the businesses above.
Operations work improves the stability of the underlying system, supports efficient maintenance, and assists in the rational allocation of resources, so that invested resources contribute more and uninvested resources are allocated sensibly. OTN operations mainly involve: operational data management, asset management, configuration management, alarm management, performance management, and DCN management.

3.1 Operational Data

Fault data is collected and classified into human faults, hardware faults, software faults, and third-party faults. Statistical analysis of the fault types with higher rates drives targeted handling plans, and once standardized this paves the way for automated fault handling. Fault-data analysis also feeds back into architecture design and equipment selection, reducing later operations costs. For OTN, fault statistics cover optical amplifiers, boards, modules, multiplexers, demultiplexers, cross-facility fiber jumpers, trunk fibers, the DCN network, and so on, and the data is analyzed along multiple dimensions, including per vendor and per third party, so that it accurately reflects the state of the network. Change data is collected to distinguish the complexity and impact of each change and to assign personnel accordingly. Changes follow a process of demand analysis, change plan, window scheduling, user notification, execution, and review. Different changes can then be assigned to different windows, or even scheduled during the day, making staffing more reasonable, reducing pressure on work and life, and improving the happiness of operations engineers.
The final statistics can also serve as a reference for personnel efficiency and competence, while pushing routine changes toward standardization and automation and reducing overall cost. Statistics on OTN service distribution track network usage, keeping the network and service layout of the whole system under control as business volume grows. At a coarse level, you can know which service uses each channel, such as the external network, internal network, HPC network, or cloud-service network; combined with a full-traffic analysis system, you can examine the actual traffic of specific services, allocate bandwidth costs to different business departments, help them optimize their traffic, reclaim and readjust under-used channels at any time, and expand heavily-used ones. Stability statistics are the main reference data for the SLA and the sword of Damocles hanging over every operations engineer. For OTN, stability statistics are complicated by protection. For example: if a single route is cut but total IP-level bandwidth is unaffected, should it count against the SLA? If IP bandwidth is halved but services are unaffected, should it count? If a single channel fails, should it count? If the latency of the protection path increases, affecting services even though bandwidth is untouched, should it count? The common practice is to inform the business side before construction of risks such as jitter and latency changes, and later to compute the SLA by taking the number of faulty channels times the bandwidth of a single faulty channel as the basis, dividing by the total capacity (the total number of channels times their per-channel bandwidth), and then multiplying by the impact time.
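The calculation basis just described can be sketched in a few lines; the function name and all numbers below are purely illustrative:

```python
def sla_impact(faulty_channels, faulty_channel_bw_g, total_capacity_g, impact_hours):
    """Capacity-weighted outage: (faulty channels x per-channel bandwidth)
    divided by total deployed capacity, multiplied by the impact time."""
    affected_fraction = faulty_channels * faulty_channel_bw_g / total_capacity_g
    return affected_fraction * impact_hours

# Two faulty 100G channels out of 40 x 100G total, down for 4 hours.
weighted_hours = sla_impact(2, 100, 40 * 100, 4)
print(weighted_hours)  # 0.2 capacity-weighted outage hours

# The corresponding availability figure over a 30-day month.
availability = 1 - weighted_hours / (30 * 24)
print(round(availability * 100, 4))  # 99.9722
```

The point of weighting by capacity is that a partial outage on a protected, multi-channel system should not count the same as a total one.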
The value obtained is then used as the SLA figure.

3.2 Asset Management

OTN equipment also needs lifecycle management (arrival, bring-up and decommissioning, scrapping, and fault handling), but unlike servers, network switches, and similar equipment, OTN equipment has a more complex structure and involves a large number of functional boards, so a model must be designed to enable full asset management. The main IP asset-management platform in a data center is built around servers and switches and defines master/slave device levels; OTN likewise needs hierarchical master/slave management, but with more levels. The management hierarchy is mainly network element -> subrack -> board -> module.
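A minimal data model for this hierarchy might look like the following sketch; the class and field names are illustrative, not any vendor's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Module:
    serial: str
    model: str                      # e.g. the pluggable optic type

@dataclass
class Board:
    slot: int
    board_type: str                 # tributary card, line card, amplifier, ...
    modules: list = field(default_factory=list)

@dataclass
class Subrack:
    shelf_id: int
    boards: list = field(default_factory=list)

@dataclass
class NetworkElement:
    ne_name: str
    site: str
    subracks: list = field(default_factory=list)

# Building one network element down to a single pluggable module.
ne = NetworkElement("NE-A01", "IDC-A", [
    Subrack(1, [Board(3, "10x10G tributary", [Module("SN1234", "10G-LR")])]),
])
print(ne.subracks[0].boards[0].modules[0].serial)  # SN1234
```

In practice each level would carry its own lifecycle state, and the tree would be populated from the NMS northbound interface rather than by hand.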
All of this information can be collected through the northbound interface of the NMS, and asset accuracy can be maintained by combining online collection with offline verification. OTN also involves optical attenuators, short jumper fibers, and the like; these can simply be managed as consumables.

3.3 Configuration Management

Provisioning a channel involves configuring the service, the optical-layer logical links, and the link's virtual topology; if the channel also has a protection path, both the configuration and its management become more complicated. A dedicated service table is needed to track channel routing on its own, with working and protection directions distinguished in the table by solid and dotted lines. When OTN channels are managed in correspondence with IP links, especially with OTN protection in place, one IP link maps to multiple OTN channels, which multiplies the management workload, typically in yet another Excel spreadsheet; fully describing a single service involves up to 15 elements. When an engineer needs to work on a link, he has to find the spreadsheet, then locate the corresponding object in the vendor's NMS, and only then operate, which requires keeping the information on both sides in sync. Since the OTN NMS data and the engineer's spreadsheet are both maintained by hand, they easily drift apart; any error breaks the correspondence between recorded and actual service relationships and may affect services during changes and adjustments.
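The two-source drift problem can be avoided by regenerating the IP-link-to-channel mapping from data collected over the northbound interfaces instead of maintaining it by hand. A minimal sketch, with all field names and the description convention purely hypothetical:

```python
# Hypothetical records collected via northbound interfaces: each OTN channel
# carries a description that embeds the IP device and port it serves.
otn_channels = [
    {"channel": "CH-01", "och": "192.10THz", "desc": "to:swA:Eth1/1"},
    {"channel": "CH-02", "och": "192.15THz", "desc": "to:swA:Eth1/2"},
]
ip_ports = [
    {"device": "swA", "port": "Eth1/1", "link": "IDC-A<->IDC-B #1"},
    {"device": "swA", "port": "Eth1/2", "link": "IDC-A<->IDC-B #2"},
]

# Join the two views on the device:port key, so the mapping is always
# derived from live data rather than a hand-maintained spreadsheet.
by_port = {f'{p["device"]}:{p["port"]}': p["link"] for p in ip_ports}
mapping = {c["channel"]: by_port.get(c["desc"].removeprefix("to:"))
           for c in otn_channels}
print(mapping["CH-01"])  # IDC-A<->IDC-B #1
```

Whatever the actual key is, the design choice is the same: one collected source of truth, with the correspondence computed rather than typed.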
Therefore, vendor equipment data is collected into a management platform through the northbound interface, and IP link information is matched on that platform, so the records adjust automatically as live services change, guaranteeing centralized management with a single accurate source and keeping configuration data correct. When provisioning OTN services, prepare a description for each interface, collect OTN information through the northbound interface provided by the OTN NMS, and pair those descriptions with the port information collected from the IP devices through their own northbound interface; OTN channels and IP links can then be managed on one platform with no manual updates. When building a DCI transmission network, avoid electrical cross-connect service configurations wherever possible: they are extremely complex to manage and unsuited to the DCI model, and they can be designed out from the start.

3.4 Alarm Management

Because of the complex management overhead, signal monitoring over long-haul transmission, and the multiplexing and nesting of different service granularities, a single OTN fault may raise dozens or even hundreds of alarm messages. Although vendors classify alarms into four severity levels, every alarm has a different name; from the engineer's point of view it remains extremely complicated, and experienced staff are needed to identify the root cause promptly.
Traditional OTN equipment reports faults mainly through SMS modems or email push, but both fit poorly with the alarm-management platforms Internet companies have already integrated into their infrastructure, and separate development is costly. A more standard northbound interface is therefore needed to collect alarm information, extend functionality while keeping the company's existing platforms, and then push alarms to the operations engineers. What operations staff need is for the platform to automatically converge the flood of alarms an OTN fault generates and deliver only the distilled result. So alarm classification is first configured on the OTN NMS, and sending and filtering happen on the final alarm-management platform. Common practice is for the NMS to push all category-1 and category-2 alarms to the alarm platform, which then pushes single-service-interruption alarms, main-optical-path-interruption alarms, and (where present) protection-switching alarms to the engineers, organized by time, summary, recipient scope, and other dimensions. With these three kinds of information, a fault can be diagnosed and handled. When configuring delivery, telephone notification can be reserved for major alarms such as aggregate-signal loss, which only a fiber cut produces. For pushing alarm information, the NMS northbound interfaces in common use include the XML interfaces currently supported by Huawei, ZTE, and Alcatel-Lucent.
3.5 Performance Management

The stability of an OTN system depends heavily on performance data from every part of the system, such as trunk-fiber optical power, per-channel optical power within the aggregate signal, and system OSNR margin. These items should be folded into the company's network monitoring so that system performance is visible at all times and can be optimized promptly to keep the network stable. In addition, long-term monitoring of fiber performance can detect changes in fiber routing, preventing fiber suppliers from rerouting without notice and thereby creating operational blind spots and routing risk; this, however, requires a large amount of data for model training before route changes can be detected accurately.

3.6 DCN Management

The DCN here is the management communication network of the OTN equipment, responsible for the networking structure that manages each OTN network element. The OTN topology also affects the scale and complexity of the DCN. There are two general approaches to building the DCN.
At the start of DCN construction, network-element planning and IP address allocation should be done carefully, and the NMS server in particular should be isolated from other networks as much as possible. Otherwise the network will later accumulate too many meshed links, jitter will be common during maintenance, ordinary network elements will fail to reach gateway network elements, and production-network and DCN addresses will easily overlap, affecting the production network.

4. DCI Network Development Direction

When building interconnection between data centers, data center owners mainly want large bandwidth, low latency, high density, fast deployment, easy operations, and high reliability. Today's mainstream large-bandwidth OTN technology is controlled by a few large telecom equipment vendors (chips are another matter), such as Huawei, ZTE, and Alcatel-Lucent. Their main customers are traditional telecom operators, so OTN product features are designed around those operators' business characteristics. Precisely because of this, OTN's application in Internet-industry DCI networks shows more and more friction. The very characteristics of OTN equipment become DCI's problems: rich service overhead, strong OAM capability on the network, scheduling and multiplexing of many bandwidth granularities, fault tolerance on long-haul lines, low-voltage DC power supply, and poor power-consumption efficiency.
At present, our DCI network mainly provides a pipe for data between data centers. The business model's main characteristics are: uniform, single-granularity bandwidth requirements; large bandwidth; low latency for cross-data-center services (especially multi-active IDC and big-data services); and high demands on network stability. At the same time, because the Internet industry lacks specialists in this field, DCI operations must be "simple", "simple", and "simple" (important things are said three times; though which network is different?). The explosive growth of the Internet has compressed construction and expansion cycles (an operator's OTN expansion cycle is generally six months to a year, while the Internet's own DCI expansion must finish in one to three months), so time must be squeezed out of every stage. OTN therefore gives DCI a usable solution, but by no means the most suitable one. As DCI networks boom, more suitable solutions are needed for problems ranging from cost to construction to operations, problems that come down to the six DCI requirements (large bandwidth, low latency, high density, fast deployment, easy operations, and high reliability).
Based on these characteristics, there are currently two conventional DCI solutions.
In addition, first-tier DCI builders are mainly working on decoupling the DCI transmission network, including decoupling the layer-0 optical layer from the layer-1 electrical layer, and decoupling the NMS from traditional vendors' hardware. The traditional approach is that one vendor's electrical-layer equipment must be paired with the same vendor's optical-layer equipment, and the hardware must be managed with that vendor's proprietary NMS software. This approach has several major drawbacks.
Optical-electrical decoupling is therefore the new direction for DCI transmission networks. In the foreseeable future, the optical layer of a DCI transmission network can be an SDN built from ROADMs plus north/southbound interfaces, able to open, schedule, and reclaim channels at will; electrical-layer devices from multiple vendors will be mixed in one system, and even Ethernet interfaces and OTN interfaces will coexist on the same optical system. By then, expansion and modification will be far more efficient, the optical and electrical layers will be easier to separate, network logic will be clearer to manage, and cost will drop greatly. For SDN, the core premise is centralized management and allocation of network resources. So what DWDM transmission resources can be managed on today's DCI transmission network? Three things: channels, paths, and bandwidth (frequency). The optical side of optical+IP collaboration is in fact managed and allocated around these three points. IP and DWDM channels are decoupled: if the correspondence between an IP logical link and a DWDM channel is configured up front and later needs to change, an OXC can switch channels in milliseconds, leaving the IP layer unaware. Managing the OXC gives centralized control of transmission-channel resources at each site, which in turn supports service-level SDN. Decoupling a single channel from IP is only a small part; if bandwidth is adjusted along with the channel, the problem of different services needing different bandwidth at different times can be solved, greatly improving utilization of the built-out capacity.
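The decoupling idea, in which the IP layer keeps its logical link while the OXC swaps the channel underneath it, can be modeled as nothing more than updating a mapping; the names below are hypothetical:

```python
# IP logical links stay fixed; only the DWDM channel they ride on changes.
link_to_channel = {
    "IDC-A<->IDC-B #1": "CH-01",
    "IDC-A<->IDC-B #2": "CH-02",
}

def oxc_remap(link, new_channel):
    """Swap the DWDM channel under an IP link. On real hardware this is a
    millisecond-scale optical switch; the IP layer sees no topology change."""
    old = link_to_channel[link]
    link_to_channel[link] = new_channel
    return old  # the freed channel can be reclaimed for another link

freed = oxc_remap("IDC-A<->IDC-B #1", "CH-41")
print(freed, link_to_channel["IDC-A<->IDC-B #1"])  # CH-01 CH-41
```

The controller's real work is in computing which remaps are safe and sequencing them; the state it manipulates is essentially this table.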
Therefore, while the OXC adjusts channels, multiplexers/demultiplexers combined with flexible-grid technology free a single channel from a fixed central wavelength, letting it cover a scalable frequency range and thus flexibly adjust its bandwidth. Moreover, when a network topology carries multiple services, this further improves the DWDM system's spectral utilization and squeezes the most out of existing resources. On top of these two dynamic capabilities, path management gives the whole topology higher stability. Since, by the nature of a transmission network, each path has independent channel resources, unified management and allocation of the channels on every path is highly valuable: it provides optimal path selection for multi-path services and maximizes use of channel resources across all paths, much as ASON divides services into gold, silver, and bronze classes to guarantee the stability of the highest class. For example, take a ring network of three data centers A, B, and C. Service S1 (say, intranet big data) runs from A through B to C, occupying waves 1 to 5 of the ring, each wave 100G with 50GHz spacing; service S2 (extranet traffic) runs from A through B to C, occupying waves 6 to 9, each wave 100G with 50GHz spacing. Normally this allocation of bandwidth and channels meets demand. Sometimes, though, for example when a new data center comes online and a database must be migrated in a short time, intranet bandwidth demand multiplies for a period: the original 500G (5 x 100G) now requires 2T.
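The flexible-grid spectrum arithmetic for this scenario can be sanity-checked with a short sketch; the slot widths assumed here are 50GHz per 100G carrier and 75GHz per 400G carrier, as in the example:

```python
def channels_needed(target_gbps, per_channel_gbps):
    """Smallest whole number of channels reaching the target bandwidth."""
    return -(-target_gbps // per_channel_gbps)  # ceiling division

def spectrum_ghz(channels, slot_ghz):
    """Total spectrum occupied by a group of flexible-grid channels."""
    return channels * slot_ghz

need = channels_needed(2000, 400)   # 2T of intranet demand at 400G per carrier
print(need)                          # 5 channels
print(spectrum_ghz(need, 75))        # 375 GHz for five 400G channels
print(spectrum_ghz(5, 50))           # 250 GHz used by the old five 100G channels
```

The upgrade quadruples capacity (500G to 2T) while widening the occupied spectrum only from 250GHz to 375GHz, which is the point of flexible grid.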
The transmission-level channels can then be recalculated and five 400G channels deployed at the optical layer, with each 400G channel's spacing widened from the original 50GHz to 75GHz. With flexible-grid ROADMs and multiplexers/demultiplexers, the end-to-end transmission path is opened up, and these five channels occupy 375GHz of spectrum. Once the transmission-level resources are ready, the OXC is adjusted through the centralized management platform, and within milliseconds the service signals on the original five 100G waves are moved onto the newly prepared five 400G channels. In this way, bandwidth and channels can be adjusted flexibly and in real time according to DCI business needs. Of course, the optics on the IP devices must support switching between 100G and 400G rates and tuning the optical frequency (wavelength), which will not be a problem. Perhaps in the near future, OTN will disappear from carrier-grade networks as well, leaving only DWDM.

Author profile: Li Yan has been responsible for the construction and operation of transmission networks in the Internet industry for many years, knows DCI transmission networks thoroughly, and is currently mainly responsible for operating the data center's basic network (L1~L4).