[Sharing] Project Practice of Network Automation: Scenarios, Tools and Solutions

[51CTO.com original article] Network automation means executing a series of network operations in batches on a group of network devices. It can be reduced to three components (command + operation + target device) and applies to two scenarios.

Scenario 1: Improve the efficiency of repetitive tasks, whether routine operation or incremental configuration of an existing network. It covers two kinds of work (taking Cisco equipment as an example):

  • Routine enable mode operations, such as collecting device information for a group of devices, checking interface status and descriptions, checking device versions, and verifying configuration file consistency.
  • Common incremental configuration changes across multiple devices already running on the existing network, such as adding or deleting an authorized login account, changing the AAA authentication server/log/NetFlow/NTP server address everywhere, updating the SNMP community string, or shutting down a given port on every device.

Scenario 2: Rapid deployment of newly built large-scale data centers. It relies on a baseline configuration template that pulls in variable files holding device-specific information such as Location/Type/Role/Loopback to generate a complete configuration file for each device.
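
The template-plus-variables idea in Scenario 2 can be sketched in a few lines of Python. This is only an illustration built on the standard library's string.Template; the template text, the variable names (hostname, location, role, loopback), and the sample values are assumptions made for the example, not taken from a real deployment.

```python
from string import Template

# Hypothetical baseline template; $-placeholders are filled in per device.
BASELINE = Template(
    "hostname $hostname\n"
    "snmp-server location $location\n"
    "interface Loopback0\n"
    " description $role loopback\n"
    " ip address $loopback 255.255.255.255\n"
)

# Hypothetical per-device variables (these would normally live in a variable file).
DEVICE_VARS = [
    {"hostname": "dc1-leaf-01", "location": "DC1", "role": "leaf", "loopback": "10.0.0.1"},
    {"hostname": "dc1-spine-01", "location": "DC1", "role": "spine", "loopback": "10.0.0.101"},
]


def render_configs(template, device_vars):
    """Return {hostname: rendered configuration text} for every device."""
    return {v["hostname"]: template.substitute(v) for v in device_vars}


if __name__ == "__main__":
    for name, config in render_configs(BASELINE, DEVICE_VARS).items():
        print("!", name)
        print(config)
```

The fewer values that differ between devices, the smaller DEVICE_VARS stays relative to the template, which is why this approach suits regular, symmetrical topologies.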

1. Popular Network Automation Tools and Methods

Based on my personal work experience and the projects I am responsible for, I have summarized the following solutions and listed some details:

1. Cisco Prime Infrastructure (PI), a closed system delivered as an OVA file that runs on a virtual server.

  • Built-in proprietary OS, providing a proprietary shell interface, not the industry-standard Linux shell.
  • There is no way to know where the system stores its inventory file (CSV format), configuration files, templates, or variables.
  • There is no way to know what scripting language the system is written in, and it cannot be modified or customized.
  • The system does not support any version control mechanism and cannot push local configuration files to GitHub.
  • The built-in closed web daemon provides the user GUI and uses an Oracle DB, but users are given neither a DB root account nor a Prime user account, so the database cannot be edited or modified.
  • Cisco PI is a comprehensive system that integrates network monitoring, log services, NetFlow collection and analysis, configuration backup, and network automation. However, each individual function is weaker than a system dedicated to it: for example, RANCID for configuration backup, Splunk for syslog, Scrutinizer for NetFlow, and Observium/Nagios/Check_MK for network monitoring.

2. RANCID, in my company’s deployment, is implemented through integration with Observium:

  • The router.db is dynamically generated by reading the Observium inventory file (also related to the Linux server's /etc/hosts) through RANCID's own script (generate-rancid.custom.php), but the description granularity of the device model is not fine enough.
  • Version control uses CVS or SVN, which is slightly older than Git.
  • Use the clogin -c "show clock" "show version" 'ls *aus*' command to sequentially execute a set of enable mode operations on all network devices with 'aus' in the hostname.
  • Use the clogin -x commands.txt 'ls *sv4*' command to execute a set of configuration commands in sequence on all devices with 'sv4' in the hostname.
  • There is no interaction with the user during script execution. The user can only log in to the device using rancid's own account to operate, and the audit function is weak.
  • It is not as convenient and flexible as Python script calling command-set and device-group.
  • There is no exception handling mechanism (for example, if a batch operation is performed on 100 devices and the 30th device fails to log in due to a malfunction, the script will stop).
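
For contrast with the last point, here is a minimal sketch of what selecting devices by hostname pattern and carrying on past a failed device can look like in plain Python. It is not RANCID code: select_devices, run_batch, the hostnames, and the stand-in operation are all hypothetical names invented for the illustration.

```python
from fnmatch import fnmatch


def select_devices(hostnames, pattern):
    """Return the hostnames matching a shell-style pattern such as '*aus*'."""
    return [name for name in hostnames if fnmatch(name, pattern)]


def run_batch(hostnames, operation):
    """Apply operation(hostname) to every device, continuing past failures.

    Returns (results, failures): results maps hostname -> return value,
    failures maps hostname -> error message.
    """
    results, failures = {}, {}
    for name in hostnames:
        try:
            results[name] = operation(name)
        except Exception as exc:  # one bad device must not stop the batch
            failures[name] = str(exc)
    return results, failures


if __name__ == "__main__":
    # Hypothetical inventory and a stand-in operation that fails on one device.
    inventory = ["aus-sw-01", "aus-sw-02", "sv4-rtr-01"]

    def fake_show_clock(name):
        if name == "aus-sw-02":
            raise RuntimeError("login failed")
        return "03:42:24 UTC"

    ok, failed = run_batch(select_devices(inventory, "*aus*"), fake_show_clock)
    print("succeeded:", sorted(ok))
    print("failed:", failed)
```

Collecting the failures instead of stopping means that one unreachable device in a batch of 100 costs a single retry rather than a rerun of the whole job.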

3. Ansible, initially developed to automate server deployment

(1) The implementation is complex and introduces many new concepts, including the following:

  • Role/Playbook
  • Jinja2 template
  • YAML syntax
  • Host Inventory
  • Engine/Modules/Plugins

(2) Tracing a single Ansible run reveals the complete Ansible architecture. [Figure: Ansible architecture diagram]

(3) The seemingly simple command "ansible-playbook templates.yaml" actually requires four inputs (tasks, hosts, Jinja2 template, vars), each of which is device-specific; in other words, every device needs its own set of four. If the number of devices is not large enough, the effort spent maintaining the variable, template, and inventory files is likely to cancel out the efficiency gained from automation. Put simply: I saved 30 minutes by using Ansible, but I had to spend 2 hours beforehand preparing the data that Ansible needs at run time.

(4) In theory, it is more suitable for the application scenario of newly built large-scale data centers. The network topology is symmetrical and hierarchical, and there are rules to follow. It is easy to develop a universal baseline template by building it according to a fixed pattern. Only a few variables are needed to generate device configuration files of different sites, types, and roles.

(5) The company I work for now mainly runs a corporate network between headquarters and branches, and each remote site is built much like a campus network. Taking the LAN switch of one branch as an example, 57 site-specific elements would have to be extracted as variables, which is a lot of work. Although there are only 6 VLAN ID variables, they are scattered randomly across 288 interfaces, so the Jinja2 template must reserve a VLAN ID variable for every interface. The interface descriptions are even more complicated. As a LAN switch, it connects to WLC/VMW/NetApp/WAP/Zoom/Camera/Monitor systems, and those ports carry specific descriptions. The 46 interface description variables are also scattered randomly across the switch's 288 interfaces and have nothing in common with the LAN switches of other branches, which makes Ansible's variable files almost unmaintainable. The effort spent maintaining variables and templates far outweighs the efficiency gained from automation.
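
The maintenance burden described in (5) can be made concrete with a small sketch. It is written in plain Python rather than Jinja2, and the interface names, VLAN numbers, and descriptions are invented for the illustration. The point is that when every port has its own VLAN and description, the "variable file" grows at the same rate as the configuration it is supposed to abstract.

```python
def render_interfaces(interfaces):
    """Render one access-port stanza per interface from per-port variables."""
    lines = []
    for name, port in interfaces.items():
        lines.append("interface " + name)
        if port.get("description"):
            lines.append(" description " + port["description"])
        lines.append(" switchport access vlan " + str(port["vlan"]))
    return "\n".join(lines)


def count_variables(interfaces):
    """Count the individual values someone has to maintain by hand."""
    return sum(len(port) for port in interfaces.values())


if __name__ == "__main__":
    # Hypothetical excerpt; the branch switch in the text would need 288 such entries.
    interfaces = {
        "GigabitEthernet1/0/1": {"vlan": 10, "description": "WAP-lobby"},
        "GigabitEthernet1/0/2": {"vlan": 20},
        "GigabitEthernet1/0/3": {"vlan": 30, "description": "Camera-dock"},
    }
    print(render_interfaces(interfaces))
    print("values to maintain:", count_variables(interfaces))
```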

4. There are several technical solutions that I have not studied in depth. I saw authoritative statements from industry experts in the NANOG conference materials. The summary is as follows:

  • Python with NAPALM (Python module): On the Cisco side, only IOS-XR is currently supported, which does not suit enterprise networks.
  • NETCONF/YANG: An IETF standard from 12 years ago (RFC 4741). So far, only JUNOS supports it well.
  • XML via CLI: Schema: Not yet standardized.
  • RESTful API: Industry standards are still being developed.
  • OpenFlow: The latest version is 1.6, which only has one technical specification and is not enough to support the SDN ecosystem (IETF has 8000+ RFCs).

2. Differences Between Networks

The above discusses several solutions that are currently popular in the industry. First of all, we must have a consensus: there is no perfect solution for network automation so far. Different business models and company sizes lead to great differences in their networking modes and networking technologies.

Consider three types of network. An ISP network (taking China Telecom's ChinaNet as an example) consists mainly of a backbone (all POPs plus the transmission network) that covers the whole country or even the world, and metropolitan area networks whose access lines reach thousands of households. A cloud computing provider (taking AWS as an example) mainly builds its own large-scale Internet data centers around the world and interconnects them over wavelengths or dark fiber leased from ISPs. An enterprise network consists mainly of a relatively small production backbone/IDC and a corporate network that interconnects company headquarters and branches.

Available options at this stage

These three types of network owners have different needs for network automation. No single network automation solution can adapt to and solve all network architectures. The feasible solutions I propose here are recommended based on the specific reality of my company's network.

Our company's network is an enterprise network with the following characteristics:

  • The network structure is stable and changes infrequently, with an average of one new branch added each year.
  • Each network adjustment is simple. For example, when monitoring moved from SolarWinds to Observium, the SNMP community string had to be changed; when the NetFlow collector moved from NFSEN to Scrutinizer, the collector server address had to be changed.
  • The large public data center has been migrated to AWS. The company only maintains a small internal DC for internal enterprise services (personnel, legal, financial data, source code, email, AD/LDAP/RSA, etc.), with a network scale of 24 racks and fewer than 40 switches.
  • The network is composed of equipment from multiple manufacturers, including Cisco, Juniper, F5, CheckPoint, etc.

Based on the above analysis, the biggest feature of our daily network operation and maintenance is to log in to multiple devices in sequence to perform repetitive operations, or to incrementally configure some functions based on the stable operation of the existing network. We believe that the use of Python combined with the Netmiko module is feasible and meets our needs at this stage.

3. Introduction to Python and Netmiko

This solution is the best current practice of our company.

Python is a very popular scripting language in the industry. It is highly readable, has rich modules and functions, and has many active communities contributing to network automation.

Netmiko is an open source Python module maintained by Kirk Byers (https://pynet.twb-tech.com/). All source code and scripts, JSON files, technical documentation and application examples are available for reference and download on GitHub (https://github.com/ktbyers/netmiko).

Netmiko simplifies SSH management of network devices in the following ways:

1. It establishes SSH connections successfully to devices from different manufacturers, and to devices from the same manufacturer on different platforms.

  • Based on the differences between the Cisco and Juniper CLIs, enable and conf t are entered automatically and implicitly.
  • For Cisco WLC, login character input is automatically blocked.

2. In multi-vendor environments, all enable mode operations can be executed successfully, and the device output is returned to the machine running the script.

3. In multi-vendor environments, all configure mode operations can be executed successfully, and the device output is returned to the machine running the script.

Based on the above characteristics and the specific needs of our company, we focused on implementing the network automation of scenario 1, which can be distilled into the following structure:

  Python script + command-set + device-group

We prepare two Python scripts, enable.py and configure.py, which run commands in enable mode and configure mode respectively. Each script takes two inputs: the command set (command-set), a group of operation instructions to execute, and the device group (device-group), the devices the script runs against. We have also made the scripts interactive. The user only needs to type the script name; the script lists the available command sets in the Linux shell and asks which one to execute, then lists the device groups and asks which one to target. Once both selections are made, Python runs the commands in order and prints the results on the Linux server.

To make the two scripts easier to use, we also adopted some file naming conventions. A command set is essentially a text file; the suffixes .enab and .conf distinguish enable mode command sets from configure mode command sets. A device group is a list of Python dictionaries stored in JSON format, so it uses the .json suffix. The naming convention mainly ensures that the script shows the right list of files for the user to choose from during the interactive prompts in the Linux shell.
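
As a sketch of how the suffix convention can drive the interactive menus, the following helper lists the candidate files by suffix using only the standard library. The function name list_choices is an assumption made for this illustration; the scripts in this article list the files with a shell find instead.

```python
import glob
import os


def list_choices(directory, suffix):
    """Return the sorted file names in directory that end with suffix."""
    pattern = os.path.join(directory, "*" + suffix)
    return sorted(os.path.basename(path) for path in glob.glob(pattern))


if __name__ == "__main__":
    # Show the command sets for each mode and the available device groups.
    for label, suffix in (("enable", ".enab"), ("configure", ".conf"), ("device-group", ".json")):
        print(label, "files:", list_choices(".", suffix))
```

Doing the listing in Python rather than through a shell command keeps the menu independent of the tools installed on the server and returns the names as a list that the script can validate the user's answer against.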

The script runs on a Linux server with pip and Netmiko installed. It uses modules such as json, netmiko, sys, signal, and os to implement the required functions. In addition, getpass is imported so that the user logs in to the target devices with his or her AD account while the password stays hidden from the terminal output. Two nested for loops log in to the devices one by one and execute the command set line by line on each.

Here is the source code of enable.py:

    #!/usr/bin/env python
    """Run a set of enable mode commands on every device in a device group."""

    from __future__ import absolute_import, division, print_function

    from getpass import getpass
    import json
    import os
    import signal
    import sys

    import netmiko

    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # IOError: Broken pipe
    signal.signal(signal.SIGINT, signal.SIG_DFL)   # KeyboardInterrupt: Ctrl-C


    def get_input(prompt=''):
        """Read a line from the user on both Python 2 and Python 3."""
        try:
            line = raw_input(prompt)
        except NameError:
            line = input(prompt)
        return line


    def get_credentials():
        """Prompt for and return a username and password."""
        username = get_input('Username(Please input your adm credentials): ')
        password = getpass()
        return username, password


    # Connection errors that should skip one device instead of aborting the run.
    netmiko_exceptions = (netmiko.ssh_exception.NetMikoTimeoutException,
                          netmiko.ssh_exception.NetMikoAuthenticationException)

    username, password = get_credentials()

    # List the available command sets (*.enab) and ask which one to run.
    os.system('find *.enab')
    os.system('echo')
    os.system('echo "^^^^^^^^^^^^^^^^^^^^^^^^^^^^"')
    commandfile = get_input("Please select what command you want to run: \n")

    # List the available device groups (*.json) and ask which one to target.
    os.system('find *.json')
    os.system('echo')
    os.system('echo "^^^^^^^^^^^^^^^^^^^^^^^^^^^^"')
    devicegroup = get_input("Please select what device-group you want to apply to: \n")

    with open(commandfile) as cmd_file:
        # One command per line; drop trailing newlines and blank lines.
        commands = [line.strip() for line in cmd_file if line.strip()]

    with open(devicegroup) as dev_file:
        devices = json.load(dev_file)

    for device in devices:
        device['username'] = username
        device['password'] = password
        try:
            print('~' * 80)
            print('Connecting to device:', device['ip'])
            connection = netmiko.ConnectHandler(**device)
            for command in commands:
                print(connection.send_command(command))
            # Keep two blank lines between the output of two devices.
            print()
            print()
            connection.disconnect()
        except netmiko_exceptions as e:
            print('Failed to connect to', device['ip'], e)

Once a Python script is created, it does not need to be frequently updated. The only thing that needs to be maintained is the command-set and device-group information, because each task is different, including the different operation instructions that need to be executed in batches, and the devices that need to be operated each time are also task-specific. This requires the specific executor of the script to write a separate command set and device group file for his or her own operation.
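
The article lists only enable.py. A minimal sketch of the core of a configure mode counterpart is shown below, assuming it mirrors enable.py and uses Netmiko's send_config_set method to push the command set. The function name push_config and the connect parameter are inventions for this illustration (the latter exists only so the function can be exercised without real devices); they are not taken from the author's configure.py.

```python
def push_config(devices, commands, username, password, connect=None):
    """Send the same configuration commands to every device in the group.

    connect defaults to netmiko.ConnectHandler; it is a parameter only so the
    function can be exercised without real devices.
    """
    if connect is None:
        import netmiko  # imported lazily; requires `pip install netmiko`
        connect = netmiko.ConnectHandler
    outputs = {}
    for device in devices:
        params = dict(device, username=username, password=password)
        try:
            connection = connect(**params)
            # send_config_set enters and leaves configuration mode itself.
            outputs[device["ip"]] = connection.send_config_set(commands)
            connection.disconnect()
        except Exception as exc:  # keep going if one device is unreachable
            outputs[device["ip"]] = "FAILED: {}".format(exc)
    return outputs
```

With real devices, a call such as push_config(devices, ["ntp server 192.0.2.10"], username, password) would return a dictionary mapping each device to the text it echoed back, which can then be printed in the same way as the enable.py output. The NTP address here is a documentation placeholder.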

For example, suppose we want enable.py to run a set of commands on all devices in the lab in sequence. The device-group file is as follows (a JSON list that Python loads as a list of dictionaries):

    [
      {
        "ip": "lab-wan-isr4431-1",
        "device_type": "cisco_xe"
      },
      {
        "ip": "lab-wan-isr4431-2",
        "device_type": "cisco_xe"
      },
      {
        "ip": "lab-wan-c3650-1",
        "device_type": "cisco_xe"
      },
      {
        "ip": "lab-wan-c3650-2",
        "device_type": "cisco_xe"
      },
      {
        "ip": "lab-lan-c3850ss-1",
        "device_type": "cisco_xe"
      }
    ]
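
Because a device-group file is written by hand for each task, it may be worth checking it before any connection is attempted. The following is a small sketch of such a check; the function name load_device_group and the choice of required keys are assumptions based on the example file above, not part of the author's scripts.

```python
import json

REQUIRED_KEYS = ("ip", "device_type")


def load_device_group(text):
    """Parse device-group JSON and check that every entry is usable."""
    devices = json.loads(text)
    if not isinstance(devices, list):
        raise ValueError("device-group must be a JSON list")
    for index, device in enumerate(devices):
        if not isinstance(device, dict):
            raise ValueError("entry {} is not an object".format(index))
        missing = [key for key in REQUIRED_KEYS if key not in device]
        if missing:
            raise ValueError("entry {} lacks {}".format(index, ", ".join(missing)))
    return devices


if __name__ == "__main__":
    sample = '[{"ip": "lab-wan-isr4431-1", "device_type": "cisco_xe"}]'
    print(load_device_group(sample))
```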

Assume that the show clock command is to be executed on each of these lab devices in turn. The contents of the .enab file are as follows (plain text):

    show clock

The results are printed to standard output (stdout) on the Linux server where the script runs:

  1. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  2. Connecting to device: lab-wan-isr4431-1
  3. 03:42:24.324 UTC Sat Apr 28 2018
  4.  
  5. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  6. Connecting to device: lab-wan-isr4431-2
  7. 03:42:29.472 UTC Sat Apr 28 2018
  8.  
  9. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  10. Connecting to device: lab-wan-c3650-1
  11. *03:40:15.780 UTC Sat Apr 28 2018
  12.  
  13. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  14. Connecting to device: lab-wan-c3650-2
  15. *03:42:05.544 UTC Sat Apr 28 2018
  16.  
  17. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  18. Connecting to device: lab-lan-c3850ss-1
  19. *03:42:45.277 UTC Sat Apr 28 2018

In this example the command set contains only one show clock command, but it can actually consist of any combination of CLI commands, which is very flexible. I have predefined some command sets and device groups in the project and pushed them to the GitHub repository together with the Python scripts. I also installed the PyCharm IDE, the GNS3 network device simulation environment, and the GitHub Desktop client on my laptop. With these tools, users can edit scripts and supporting files locally, verify them in a virtual network environment, and keep all files under version control.

For the device group, we initially hoped to retrieve it automatically from the existing network management systems through an API or script. After some research, I feel this is not feasible for the time being. Our company currently runs several systems: RANCID, PI, Observium, and IPAM. First, their inventory data formats are inconsistent: RANCID uses TXT, PI uses CSV, Observium uses JSON, and IPAM uses a proprietary format. Beyond the data formats, the most critical information for Netmiko is the device type, which is what enables Netmiko's cross-platform, multi-vendor support: everything Netmiko does after the SSH login is driven by the explicitly specified device type.

The aforementioned systems either do not specify the device type (for example, PI defaults to Cisco, IPAM does not care about the device type at all, and Observium never logs in to devices over SSH, so it holds no device type information), or record it too coarsely (RANCID only has "Cisco" for Cisco devices, whereas Cisco devices actually divide into IOS-XE, NX-OS, IOS-XR, and IOS). For these reasons, we have given up the idea of automatically creating the device-group dictionary file for now. Taking a step back, even if a script generated the device-group files in JSON format automatically, someone would still have to specify by hand which devices belong to which group, so the amount of manual work would not go down and the efficiency gain would be limited.
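
To illustrate why the device type is the sticking point, here is a hedged sketch of converting a CSV inventory export into device-group entries. The column names (hostname, os) and the mapping table are assumptions made for the example; the device_type strings on the right-hand side are the platform names Netmiko uses. Any row whose OS is recorded too coarsely ends up in the list that still needs a human decision.

```python
import csv
import io
import json

# Assumed mapping from an inventory "os" column to Netmiko device_type values.
OS_TO_DEVICE_TYPE = {
    "ios": "cisco_ios",
    "ios-xe": "cisco_xe",
    "nx-os": "cisco_nxos",
    "ios-xr": "cisco_xr",
    "junos": "juniper_junos",
}


def build_device_group(csv_text):
    """Convert inventory rows to device-group entries.

    Returns (devices, unmapped): unmapped lists hostnames whose OS is unknown
    and therefore still needs a human decision.
    """
    devices, unmapped = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        device_type = OS_TO_DEVICE_TYPE.get(row["os"].strip().lower())
        if device_type is None:
            unmapped.append(row["hostname"])
        else:
            devices.append({"ip": row["hostname"], "device_type": device_type})
    return devices, unmapped


if __name__ == "__main__":
    inventory = "hostname,os\nlab-wan-isr4431-1,IOS-XE\nlab-core-1,cisco\n"
    devices, unmapped = build_device_group(inventory)
    print(json.dumps(devices, indent=2))
    print("needs manual classification:", unmapped)
```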

4. Other Issues to Consider

To sum up, this is the network automation our company has achieved so far. Because new sites are added infrequently, about one per year, we have decided for now not to use Ansible/Jinja2/Task/Variable to deploy them, mainly because it is not worth the effort. We have a simpler and faster way: take the configuration file of an existing site as a template, modify the site-specific elements such as host name, loopback address, SNMP location, BGP peer, OSPF area, and interface IP, load it onto the equipment, ship the equipment to the site, and power it on, without complicating a simple problem.

In addition, it should be recognized that the first purpose of network automation is to improve efficiency. There are only a limited number of VPN gateways in the entire network, load balancers exist only in the data center, and the firewall security policies at each site have nothing in common, so there is no need to automate those devices.

Finally, I have some concerns about the potential risks of network automation. After all, a script performs a series of operations on a whole group of devices. If the command set is wrong, it may paralyze the entire network. There is currently no perfect rollback mechanism, and if this happens the result will be catastrophic. This is why I installed GNS3 on my laptop: every command set must be verified there before it is executed on the live network. The supporting management process also needs to keep up. The change work order must include the details of the command set and the device group, so that if a rollback is needed it is clear which operations to roll back on which devices.
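
The change work order information described above could be assembled by a small helper before the batch is run. This is only a sketch: the function name build_change_record, the field names, the operator name, and the sample NTP commands are hypothetical, and the rollback commands still have to be written by the engineer.

```python
import json
from datetime import datetime, timezone


def build_change_record(operator, commands, devices, rollback_commands):
    """Bundle what will be run, where, and how to undo it, for a change ticket."""
    return {
        "created_utc": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "operator": operator,
        "devices": [device["ip"] for device in devices],
        "commands": list(commands),
        "rollback_commands": list(rollback_commands),
    }


if __name__ == "__main__":
    record = build_change_record(
        operator="adm-example",
        commands=["ntp server 192.0.2.10"],
        devices=[{"ip": "lab-wan-isr4431-1", "device_type": "cisco_xe"}],
        rollback_commands=["no ntp server 192.0.2.10"],
    )
    print(json.dumps(record, indent=2))
```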

The above is a summary of my work practice and project research. Network automation is the trend of the network industry. Whether network engineers like it or not, they can only follow the trend, otherwise they will be in danger of being eliminated. Technical solutions are still improving, and there will be new developments in the future. Even for existing means, different companies will have different recommendations and implementations. I hope that my article can serve as a starting point for discussion. I also hope to communicate with experts in the same industry. If you have different opinions on the content of the article, you are welcome to correct it, learn from each other, and improve together.

Hu Jie has worked for China Netcom (CNC), Verizon, Juniper and China Telecom in China. He has participated in the technical solution writing and metropolitan area network optimization and transformation project of China Telecom CN2 project, as well as China Telecom IPv6 research and live network testing. Since June 2014, he has been working as a network engineer in an Internet company in San Francisco.

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
