NVIDIA® Cumulus Linux is the first full-featured Debian Buster-based Linux operating system for the networking industry.
This user guide provides in-depth documentation on the Cumulus Linux installation process, system configuration and management, network solutions, and monitoring and troubleshooting recommendations. In addition, the quick start guide provides an end-to-end setup process to get you started.
Cumulus Linux 5.5 includes the NVIDIA NetQ agent and CLI. You can use NetQ to monitor and manage your data center network infrastructure and operational health. Refer to the NVIDIA NetQ documentation for details.
For a list of the new features in this release, see What's New. For bug fixes and known issues present in this release, refer to the Cumulus Linux 5.5 Release Notes.
Try It Pre-built Demos
The Cumulus Linux documentation includes pre-built Try It demos for certain Cumulus Linux features. The Try It demos run a simulation in NVIDIA Air, a cloud-hosted platform that works exactly like a real-world production deployment. Use the Try It demos to examine switch configuration for a feature. For more information, see Try It Pre-built Demos.
Open Source Contributions
To implement various Cumulus Linux features, NVIDIA has forked various software projects, like CFEngine Netdev and some Puppet Labs packages. Some of the forked code resides in the NVIDIA Networking GitHub repository and some is available as part of the Cumulus Linux repository as Debian source packages.
NVIDIA has also developed and released new applications as open source. The list of open source projects is on the Cumulus Linux packages page.
Download the User Guide
Use one of the following methods to download the Cumulus Linux user guide and view it offline:
Host the documentation on a local host using hugo.
For a fully functional copy of the user guide, download a zip file of an HTML documentation build for offline use. Download the desired version, extract it locally, then open cumulus-linux-55.html in your web browser.
To view this user guide as a single page to print to a PDF with limited functionality, click here.
Click on the link one time and use the web browser print-to-PDF option to save the PDF locally.
What's New
This document supports the Cumulus Linux 5.5 release, and lists new platforms, features, and enhancements.
Cumulus Linux 5.5.1 provides a new SDK and firmware version, and includes a bug fix to resolve a link degradation issue.
What’s New in Cumulus Linux 5.5.0
Cumulus Linux 5.5.0 supports new platforms, contains several new features and improvements, and provides bug fixes.
Early access features are now called beta features.
Platforms
NVIDIA SN3750-SX (100G Spectrum-2) continues to be in beta
The NVIDIA SN3750-SX switch is available for beta and open to customer feedback. Do not use this switch in production; it is not supported through NVIDIA networking support.
New Features and Enhancements
1G support for all NVIDIA Spectrum-2 and Spectrum-3 switches now generally available
Cumulus Linux 5.5 includes the NVUE object model. After you upgrade to Cumulus Linux 5.5, running NVUE configuration commands might override configuration for features that are now configurable with NVUE and remove configuration you added manually to files or with automation tools like Ansible, Chef, or Puppet. To keep your configuration, you can do one of the following:
Use Linux and FRR (vtysh) commands instead of NVUE for all switch configuration.
Cumulus Linux 3.7, 4.3, and 4.4 continue to support NCLU. For more information, contact your NVIDIA Spectrum platform sales representative.
Quick Start Guide
This quick start guide provides an end-to-end setup process for installing and running Cumulus Linux.
Prerequisites
This guide assumes you have intermediate-level Linux knowledge. You need to be familiar with basic text editing, Unix file permissions, and process monitoring. A variety of text editors are pre-installed, including vi and nano.
You must have access to a Linux or UNIX shell. If you are running Windows, use a Linux environment like Cygwin as your command line tool for interacting with Cumulus Linux.
Get Started
Cumulus Linux is installed on the switch by default. To upgrade to a different Cumulus Linux release or re-install Cumulus Linux, refer to Installation Management. To show the Cumulus Linux release installed on the switch, run the NVUE nv show system command.
When starting Cumulus Linux for the first time, the management port makes a DHCPv4 request. To determine the IP address of the switch, you can cross reference the MAC address of the switch with your DHCP server. The MAC address is typically located on the side of the switch or on the box in which the unit ships.
To get started:
Log in to Cumulus Linux on the switch and change the default credentials.
Configure Cumulus Linux. This quick start guide provides instructions on changing the hostname of the switch, setting the date and time, and configuring switch ports and a loopback interface.
You can choose to configure Cumulus Linux either with NVUE commands or Linux commands (with vtysh or by manually editing configuration files). Do not run both NVUE configuration commands (such as nv set, nv unset, nv action, and nv config) and Linux commands to configure the switch. NVUE commands replace the configuration in files such as /etc/network/interfaces and /etc/frr/frr.conf, and remove any configuration you add manually or with automation tools like Ansible, Chef, or Puppet.
If you choose to configure Cumulus Linux with NVUE, you can configure features that do not yet support the NVUE Object Model by creating snippets. See NVUE Snippets.
Login Credentials
The default installation includes two accounts:
The system account (root) has full system privileges. Cumulus Linux locks the root account password by default (which prohibits login).
The user account (cumulus) has sudo privileges. The cumulus account uses the default password cumulus.
When you log in for the first time with the cumulus account, Cumulus Linux prompts you to change the default password. After you provide a new password, the SSH session disconnects and you have to reconnect with the new password.
In this quick start guide, you use the cumulus account to configure Cumulus Linux.
All accounts except root can use remote SSH login; you can use sudo to grant a non-root account root-level access. Commands that change the system configuration require this elevated level of access.
NVIDIA recommends you perform management and configuration over the network, either in band or out of band. A serial console is fully supported.
Typically, switches ship from the manufacturer with a mating DB9 serial cable. Switches with ONIE are always set to a 115200 baud rate.
Wired Ethernet Management
A Cumulus Linux switch always provides at least one dedicated Ethernet management port called eth0. This interface is specifically for out-of-band management use. The management interface uses DHCPv4 for addressing by default.
To set a static IP address:
cumulus@switch:~$ nv set interface eth0 ip address 192.0.2.42/24
cumulus@switch:~$ nv set interface eth0 ip gateway 192.0.2.1
cumulus@switch:~$ nv config apply
The command prompt in the terminal does not reflect the new hostname until you either log out of the switch or start a new shell.
Configure the Time Zone
The default time zone on the switch is UTC (Coordinated Universal Time). Change the time zone on your switch to be the time zone for your location.
To update the time zone:
Run the nv set system timezone <timezone> command. To see all the available time zones, run nv set system timezone and press the Tab key. The following example sets the time zone to US/Eastern:
cumulus@switch:~$ nv set system timezone US/Eastern
cumulus@switch:~$ nv config apply
Alternatively, to set the time zone with Linux commands, run the following command in a terminal:
cumulus@switch:~$ sudo dpkg-reconfigure tzdata
Follow the on-screen menu options to select the geographic area and region.
Programs that are already running (including log files) and logged-in users do not see time zone changes. To set the time zone for all services and daemons, reboot the switch.
Verify the System Time
Verify that the date and time on the switch are correct with the Linux date command:
cumulus@switch:~$ date
Mon 21 Nov 2022 06:30:37 PM UTC
If the date and time are incorrect, the switch does not synchronize with automation tools, such as Puppet, and returns errors after you restart switchd.
To set the software clock according to the configured time zone, run the Linux sudo date -s command; for example:
cumulus@switch:~$ sudo date -s "Tue Jan 26 00:37:13 2021"
NTP starts at boot by default on the switch and the NTP configuration includes default servers. To customize NTP, see NTP.
PTP is off by default on the switch. To configure PTP, see PTP.
Configure Breakout Ports with Splitter Cables
If you are using 4x10G DAC or AOC cables, or you want to break out 100G or 40G switch ports, configure the breakout ports. For more details, see Switch Port Attributes.
Test Cable Connectivity
By default, Cumulus Linux disables all data plane ports (every Ethernet port except the management interface, eth0). To test cable connectivity, administratively enable physical ports.
To administratively enable a port:
cumulus@switch:~$ nv set interface swp1
cumulus@switch:~$ nv config apply
To administratively enable all physical ports on a switch that has ports numbered from swp1 to swp52:
cumulus@switch:~$ nv set interface swp1-52
cumulus@switch:~$ nv config apply
To view link status, run the nv show interface command.
To administratively enable a port with Linux commands:
cumulus@switch:~$ sudo ip link set swp1 up
To administratively enable all physical ports, run the following bash script:
cumulus@switch:~$ sudo su -
root@switch:~# for i in /sys/class/net/*; do iface=$(basename "$i"); if [[ $iface == swp* ]]; then ip link set "$iface" up; fi; done
To view link status, run the ip link show command.
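Before running the loop above as root, it can help to preview which interfaces it would touch. The following is a minimal sketch and not part of Cumulus Linux: the helper name list_swp_ports is hypothetical, and it only prints interface names without changing any link state.

```shell
# list_swp_ports [DIR]
# Hypothetical helper: print every entry under DIR (default /sys/class/net)
# whose name starts with "swp", one per line. It reads directory entries
# only and never calls ip link.
list_swp_ports() (
  dir="${1:-/sys/class/net}"
  for path in "$dir"/*; do
    iface=$(basename "$path")
    # Same filter as the enable loop: keep front panel ports only.
    case "$iface" in
      swp*) echo "$iface" ;;
    esac
  done
)
```

Run list_swp_ports with no arguments on the switch to list the ports the enable loop would bring up.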
Configure Layer 2 Ports
Cumulus Linux does not put all ports into a bridge by default. To create a bridge and configure one or more front panel ports as members of the bridge:
The following configuration example places the front panel port swp1 into the default bridge called br_default:
...
auto br_default
iface br_default
bridge-ports swp1
...
To put a range of ports into a bridge, use the glob keyword. For example, to add swp1 through swp10, swp12, and swp14 through swp20 to the bridge called br_default:
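A minimal sketch of such a stanza in the /etc/network/interfaces file, assuming the ifupdown2 glob syntax in which each glob keyword expands the port range that immediately follows it:

```
auto br_default
iface br_default
    bridge-ports glob swp1-10 swp12 glob swp14-20
```

In this sketch swp12 is listed on its own, so it needs no glob keyword.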
Configure Layer 3 Ports
You can configure a front panel port or bridge interface as a layer 3 port.
The following configuration example configures the front panel port swp1 as a layer 3 access port:
cumulus@switch:~$ nv set interface swp1 ip address 10.0.0.0/31
cumulus@switch:~$ nv config apply
To add an IP address to a bridge interface, you must put it into a VLAN interface. If you want to use a VLAN other than the native one, set the bridge PVID:
cumulus@switch:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 10
cumulus@switch:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@switch:~$ nv set bridge domain br_default untagged 1
cumulus@switch:~$ nv config apply
The following configuration example configures the front panel port swp1 as a layer 3 access port:
auto swp1
iface swp1
address 10.0.0.0/31
To add an IP address to a bridge interface, include the address under the iface stanza in the /etc/network/interfaces file. If you want to use a VLAN other than the native one, set the bridge PVID:
If there are no errors, run the following command:
cumulus@switch:~$ sudo ifup -a
Configure a Loopback Interface
Cumulus Linux has a preconfigured loopback interface. When the switch boots up, the loopback interface, called lo, is up and assigned an IP address of 127.0.0.1.
The loopback interface lo must always exist on the switch and must always be up. To check the status of the loopback interface, run the NVUE nv show interface lo command or the Linux ip addr show lo command.
To add an IP address to a loopback interface, configure the lo interface:
cumulus@switch:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@switch:~$ nv config apply
Add the IP address directly under the iface lo inet loopback definition in the /etc/network/interfaces file:
auto lo
iface lo inet loopback
address 10.10.10.1
If you configure an IP address without a subnet mask, it becomes a /32 IP address. For example, 10.10.10.1 is 10.10.10.1/32.
If you run NVUE commands to configure the switch, run the nv config save command before you reboot. The command saves the applied configuration to the startup configuration so that the changes persist after the reboot.
cumulus@switch:~$ nv config save
Show Platform and System Settings
To show the hostname of the switch, the time zone, and the version of Cumulus Linux running on the switch, run the NVUE nv show system command.
To show switch platform information, such as the ASIC model, CPU, hard disk drive size, RAM size, and port layout, run the NVUE nv show platform hardware command.
Next Steps
You are now ready to configure the switch according to your needs. This guide provides separate sections that describe how to configure system, layer 1, layer 2, layer 3, and network virtualization settings. Each section includes example configurations and pre-built demos.
For a deep dive into the NVUE object model that provides a CLI to simplify configuration, see NVUE.
Installation Management
This section describes how to manage, install, and upgrade Cumulus Linux on your switch.
Managing Cumulus Linux Disk Images
The Cumulus Linux operating system resides on a switch as a disk image. This section discusses how to manage the image.
Reprovisioning the system deletes all system data from the switch.
To stage an ONIE installer from the network (where ONIE automatically locates the installer), run the onie-select -i command. You must reboot the switch to start the install process.
cumulus@switch:~$ sudo onie-select -i
WARNING:
WARNING: Operating System install requested.
WARNING: This will wipe out all system data.
WARNING:
Are you sure (y/N)? y
Enabling install at next reboot...done.
Reboot required to take effect.
To cancel a pending reinstall operation, run the onie-select -c command:
cumulus@switch:~$ sudo onie-select -c
Cancelling pending install at next reboot...done.
To stage an installer located in a specific location, run the onie-install -i <location> command. You can specify a local absolute or relative path, an HTTP or HTTPS server, or an SCP or FTP server. You can also stage a Zero Touch Provisioning (ZTP) script along with the installer.
You typically use the onie-install command with the -a option to activate installation. If you do not specify the -a option, you must reboot the switch to start the installation process.
The following example stages the installer located at http://203.0.113.10/image-installer together with the ZTP script located at http://203.0.113.10/ztp-script and activates installation and ZTP:
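A sketch of that staged sequence, issued as three separate commands. It reuses only the -i, -z, and -a options shown in this section and assumes that onie-install keeps the staged installer and ZTP script between invocations:

```
cumulus@switch:~$ sudo onie-install -i http://203.0.113.10/image-installer
cumulus@switch:~$ sudo onie-install -z http://203.0.113.10/ztp-script
cumulus@switch:~$ sudo onie-install -a
```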
You can also specify these options together in the same command. For example:
cumulus@switch:~$ sudo onie-install -i http://203.0.113.10/image-installer -z http://203.0.113.10/ztp-script -a
To see more onie-install options, run man onie-install.
Migrate from Cumulus Linux to ONIE (Uninstall All Images and Remove the Configuration)
To remove all installed images and configurations, and return the switch to its factory defaults, run the onie-select -k command.
The onie-select -k command takes a long time to run as it overwrites the entire NOS section of the flash. Only use this command if you want to erase all NOS data and take the switch out of service.
cumulus@switch:~$ sudo onie-select -k
WARNING:
WARNING: Operating System uninstall requested.
WARNING: This will wipe out all system data.
WARNING:
Are you sure (y/N)? y
Enabling uninstall at next reboot...done.
Reboot required to take effect.
You must reboot the switch to start the uninstallation process.
To cancel a pending uninstall operation, run the onie-select -c command:
cumulus@switch:~$ sudo onie-select -c
Cancelling pending uninstall at next reboot...done.
Boot Into Rescue Mode
If your system becomes unresponsive, you can correct certain issues by booting into ONIE rescue mode, which uses unmounted file systems. You can use various Cumulus Linux utilities to try and resolve a problem.
To reboot the system into ONIE rescue mode, run the onie-select -r command:
cumulus@switch:~$ sudo onie-select -r
WARNING:
WARNING: Rescue boot requested.
WARNING:
Are you sure (y/N)? y
Enabling rescue at next reboot...done.
Reboot required to take effect.
You must reboot the system to boot into rescue mode.
To cancel a pending rescue boot operation, run the onie-select -c command:
cumulus@switch:~$ sudo onie-select -c
Cancelling pending rescue at next reboot...done.
Inspect the Image File
The Cumulus Linux image file is executable. From a running switch, you can display, extract, and verify the contents of the image file.
To display the contents of the Cumulus Linux image file, pass the info option to the image file. For example, to display the contents of an image file called onie-installer located in the /var/lib/cumulus/installer directory:
To extract the contents of the image file, use the extract <path> option. For example, to extract an image file called onie-installer located in the /var/lib/cumulus/installer directory to the mypath directory:
cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer extract mypath
total 181860
-rw-r--r-- 1 4000 4000 308 May 16 19:04 control
drwxr-xr-x 5 4000 4000 4096 Apr 26 21:28 embedded-installer
-rw-r--r-- 1 4000 4000 13273936 May 16 19:04 initrd
-rw-r--r-- 1 4000 4000 4239088 May 16 19:04 kernel
-rw-r--r-- 1 4000 4000 168701528 May 16 19:04 sysroot.tar
To verify the contents of the image file, use the verify option. For example, to verify the contents of an image file called onie-installer located in the /var/lib/cumulus/installer directory:
cumulus@switch:~$ sudo /var/lib/cumulus/installer/onie-installer verify
Verifying image checksum ...OK.
Preparing image archive ... OK.
./cumulus-linux-bcm-amd64.bin.1: 161: ./cumulus-linux-bcm-amd64.bin.1: onie-sysinfo: not found
Verifying image compatibility ...OK.
Verifying system ram ...OK.
The default password for the cumulus user account is cumulus. The first time you log into Cumulus Linux, you must change this default password. Be sure to update any automation scripts before installing a new image. Cumulus Linux provides command line options to change the default password automatically during the installation process. Refer to ONIE Installation Options.
You can install a new Cumulus Linux image using ONIE, an open source project (equivalent to PXE on servers) that enables the installation of network operating systems (NOS) on bare metal switches.
Before you install Cumulus Linux, the switch can be in two different states:
The switch does not contain an image (the switch is only running ONIE).
Cumulus Linux is already on the switch but you want to use ONIE to reinstall Cumulus Linux or upgrade to a newer version.
The sections below describe some of the different ways you can install the Cumulus Linux image. Steps show how to install directly from ONIE (if no image is on the switch) and from Cumulus Linux (if the image is already on the switch). For additional methods to find and install the Cumulus Linux image, see the ONIE Design Specification.
Installing the Cumulus Linux image is destructive. Configuration files on the switch are not saved; copy them to a different server before installing.
In the following procedures:
You can name your Cumulus Linux image using any of the ONIE naming schemes mentioned here.
Run the sudo onie-install -h command to show the ONIE installer options.
Install Using a DHCP/Web Server With DHCP Options
To install Cumulus Linux using a DHCP or web server with DHCP options, set up a DHCP/web server on your laptop and connect the eth0 management port of the switch to your laptop. After you connect the cable, the installation proceeds as follows:
The switch boots up and requests an IP address (DHCP request).
The DHCP server acknowledges and responds with DHCP option 114 and the location of the installation image.
ONIE downloads the Cumulus Linux image, installs, and reboots.
You are now running Cumulus Linux.
The most common way is to send DHCP option 114 with the entire URL to the web server (this can be the same system). However, there are other ways you can use DHCP even if you do not have full control over DHCP. See the ONIE user guide for information on partial installer URLs and advanced DHCP options; both articles list more supported DHCP options.
Here is an example DHCP configuration with an ISC DHCP server:
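A minimal dhcpd.conf sketch follows. The subnet, address range, and installer file name are placeholder assumptions chosen to match the 10.0.1.251 web server used elsewhere in this section; replace them with your own values. ISC DHCP exposes option 114 under the name default-url:

```
subnet 10.0.1.0 netmask 255.255.255.0 {
  range 10.0.1.20 10.0.1.200;
  option default-url = "http://10.0.1.251/onie-installer-x86_64";
}
```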
Place the Cumulus Linux image in a directory on the web server.
From the Cumulus Linux command prompt, run the onie-install command, then reboot the switch.
cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/path/to/cumulus-install-x86_64.bin
Install Using a Web Server With no DHCP
Follow the steps below if you can log into the switch on a serial console (ONIE), or you can log in on the console or with ssh (Install from Cumulus Linux) but no DHCP server is available.
You need a console connection to access the switch; you cannot perform this procedure remotely.
ONIE is in discovery mode. You must disable discovery mode with the following command:
onie# onie-discovery-stop
On older ONIE versions, if the onie-discovery-stop command is not supported, run:
onie# /etc/init.d/discover.sh stop
Assign a static address to eth0 with the ip addr add command:
ONIE:/ # ip addr add 10.0.1.252/24 dev eth0
Place the Cumulus Linux image in a directory on your web server.
Run the installer manually (because there are no DHCP options):
From the Cumulus Linux command prompt, run the onie-install command, then reboot the switch.
cumulus@switch:~$ sudo onie-install -a -i /path/to/local/file/cumulus-install-x86_64.bin
Install Using a USB Drive
Follow the steps below to install the Cumulus Linux image using a USB drive.
Installing Cumulus Linux using a USB drive is fine for an occasional single switch but does not scale. Unlike USB installs, DHCP can scale to hundreds of switch installs with zero manual input.
From a computer, prepare your USB drive by formatting it using one of the supported formats: FAT32, vFAT or EXT2.
Optional: Prepare a USB Drive inside Cumulus Linux
a. Insert your USB drive into the USB port on the switch running Cumulus Linux and log in to the switch. Examine output from cat /proc/partitions and sudo fdisk -l [device] to determine the location of your USB drive. For example, sudo fdisk -l /dev/sdb.
These instructions assume your USB drive is the /dev/sdb device, which is typical if you insert the USB drive after the machine is already booted. However, if you insert the USB drive during the boot process, it is possible that your USB drive is the /dev/sda device. Make sure to modify the commands below to use the proper device for your USB drive.
b. Create a new partition table on the USB drive. If the parted utility is not on the system, install it with sudo -E apt-get install parted.
sudo parted /dev/sdb mklabel msdos
c. Create a new partition on the USB drive:
sudo parted /dev/sdb -a optimal mkpart primary 0% 100%
d. Format the partition to your filesystem of choice using one of the examples below:
When using a Mac or Windows computer to rename the installation file, the file extension can still be present. Make sure you remove the file extension so that ONIE can detect the file.
Insert the USB drive into the switch, then prepare the switch for installation:
If the switch is offline, connect to the console and power on the switch.
If the switch is already online in ONIE, use the reboot command.
SSH sessions to the switch get dropped after this step. To complete the remaining instructions, connect to the console of the switch. Cumulus Linux switches display their boot process to the console; you need to monitor the console specifically to complete the next step.
Monitor the console and select the ONIE option from the first GRUB screen shown below.
Cumulus Linux on x86 uses GRUB chainloading to present a second GRUB menu specific to the ONIE partition. No action is necessary in this menu to select the default option ONIE: Install OS.
The switch recognizes the USB drive and mounts it automatically. Cumulus Linux installation begins.
After installation completes, the switch automatically reboots into the newly installed instance of Cumulus Linux.
ONIE Installation Options
You can run several installer command line options from ONIE to perform basic switch configuration automatically after installation completes and Cumulus Linux boots for the first time. These options enable you to:
Set a unique password for the cumulus user
Provide an initial network configuration
Execute a ZTP script to perform necessary configuration
The onie-nos-install command does not allow you to specify command line parameters. You must access the switch from the console and transfer a disk image to the switch. You must then make the disk image executable and install the image directly from the ONIE command line with the options you want to use.
The following example commands transfer a disk image to the switch, make the image executable, and install the image with the --password option to change the default cumulus user password:
You can run more than one option in the same command.
Set the cumulus User Password
The default cumulus user account password is cumulus. When you log into Cumulus Linux for the first time, you must provide a new password for the cumulus account, then log back into the system.
To automate this process, you can specify a new password from the command line of the installer with the --password '<clear text-password>' option. For example, to change the default cumulus user password to MyP4$$word:
To provide a hashed password instead of a clear text password, use the --hashed-password '<hash>' option. An encrypted hash maintains a secure management network.
Generate a sha-512 password hash with the following openssl command. The example command generates a sha-512 password hash for the password MyP4$$word.
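As a sketch of that step, assuming OpenSSL 1.1.1 or later (where openssl passwd accepts the -6 option for SHA-512 crypt), you can generate the hash on any Linux host. The variable name hashed_password is illustrative; the single quotes keep the shell from expanding $$ in the password.

```shell
# Generate a SHA-512 crypt hash for the example password.
# OpenSSL picks a random salt, so the output differs on every run.
hashed_password=$(openssl passwd -6 'MyP4$$word')
echo "$hashed_password"
```

The resulting string, which begins with $6$, is the value to pass to the --hashed-password '<hash>' option.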
If you specify both the --password and --hashed-password options, the --hashed-password option takes precedence and the switch ignores the --password option.
Provide Initial Network Configuration
To provide initial network configuration automatically when Cumulus Linux boots for the first time after installation, use the --interfaces-file <filename> option. For example, to copy the contents of a file called network.intf into the /etc/network/interfaces file and run the ifreload -a command:
To run a ZTP script that contains commands to execute after Cumulus Linux boots for the first time after installation, use the --ztp <filename> option. For example, to run a ZTP script called initial-conf.ztp:
The ZTP script must contain the CUMULUS-AUTOPROVISIONING string near the beginning of the file and must reside on the ONIE filesystem. Refer to Zero Touch Provisioning - ZTP.
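A quick pre-flight check can catch a missing marker before you copy the script to the switch. The sketch below is illustrative only: the helper name has_ztp_marker is hypothetical, and the 10-line window is an assumption, because this guide says only that the string must be near the beginning of the file.

```shell
# has_ztp_marker FILE [LINES]
# Hypothetical helper: succeed if CUMULUS-AUTOPROVISIONING appears in the
# first LINES lines of FILE (default 10, an assumed window).
has_ztp_marker() {
  head -n "${2:-10}" "$1" | grep -q 'CUMULUS-AUTOPROVISIONING'
}
```

For example, has_ztp_marker initial-conf.ztp && echo "marker found" reports whether the script named in the earlier example carries the string.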
If you use the --ztp option together with any of the other command line options, the ZTP script takes precedence and the switch ignores other command line options.
Change the Default BIOS Password
To provide a layer of security and to prevent unauthorized access to the switch, NVIDIA recommends you change the default BIOS password. The default BIOS password is admin.
To change the default BIOS password:
During system boot, press Ctrl+B through the serial console while the BIOS version prints.
From the Security menu, select Administrator Password.
Follow the prompts.
Edit the Cumulus Linux Image (Advanced)
The Cumulus Linux disk image file contains a BASH script that includes a set of variables. You can set these variables to be able to install a fully configured system with a single image file.
To edit the image
Example Image File
The Cumulus Linux disk image file is a self-extracting executable. The executable part of the file is a BASH script at the beginning of the file. Towards the beginning of this BASH script is a set of variables with empty strings:
CL_INSTALLER_PASSWORD
Defines the clear text password. This variable is equivalent to the ONIE installer command line option --password.
CL_INSTALLER_HASHED_PASSWORD
Defines the hashed password. This variable is equivalent to the ONIE installer command line option --hashed-password. If you set both the CL_INSTALLER_PASSWORD and CL_INSTALLER_HASHED_PASSWORD variables, CL_INSTALLER_HASHED_PASSWORD takes precedence.
CL_INSTALLER_INTERFACES_FILENAME
Defines the name of the file on the ONIE filesystem you want to use as the /etc/network/interfaces file. This variable is equivalent to the ONIE installer command line option --interfaces-file.
CL_INSTALLER_INTERFACES_CONTENT
Describes the network interfaces available on your system and how to activate them. Setting this variable defines the contents of the /etc/network/interfaces file. There is no equivalent ONIE installer command line option. If you set both the CL_INSTALLER_INTERFACES_FILENAME and CL_INSTALLER_INTERFACES_CONTENT variables, the CL_INSTALLER_INTERFACES_FILENAME takes precedence.
CL_INSTALLER_ZTP_FILENAME
Defines the name of the ZTP file on the ONIE filesystem you want to execute at first boot after installation. This variable is equivalent to the ONIE installer command line option --ztp.
Edit the Image File
Because the Cumulus Linux image file is a binary file, you cannot use standard text editors to edit the file directly. Instead, you must split the file into two parts, edit the first part, then put the two parts back together.
Copy the first 20 lines to an empty file:
head -20 cumulus-linux-4.4.0-mlx-amd64.bin > cumulus-linux-4.4.0-mlx-amd64.bin.1
Remove the first 20 lines of the image, then copy the remaining lines into another empty file:
sed -e '1,20d' cumulus-linux-4.4.0-mlx-amd64.bin > cumulus-linux-4.4.0-mlx-amd64.bin.2
The original file is now split, with the first 20 lines in cumulus-linux-4.4.0-mlx-amd64.bin.1 and the remaining lines in cumulus-linux-4.4.0-mlx-amd64.bin.2.
Use a text editor to change the variables in cumulus-linux-4.4.0-mlx-amd64.bin.1.
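Put the two parts back together with cat:
cat cumulus-linux-4.4.0-mlx-amd64.bin.1 cumulus-linux-4.4.0-mlx-amd64.bin.2 > cumulus-linux-4.4.0-mlx-amd64.bin.final
Calculate the new checksum and update the CL_INSTALLER_PAYLOAD_SHA256 variable:
sed -e '1,/^exit_marker$/d' "cumulus-linux-4.4.0-mlx-amd64.bin.final" | sha256sum | awk '{ print $1 }'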
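You can rehearse the split, edit, and rejoin workflow on a throwaway file before touching a real image. The sketch below makes several assumptions that are not from this guide: demo-image.bin is a hypothetical stand-in built on the spot, its header is exactly 20 lines and ends with the exit_marker line, and the password value example is a placeholder.

```shell
workdir=$(mktemp -d)
image="$workdir/demo-image.bin"   # hypothetical stand-in, not a real Cumulus Linux image

# Build the stand-in: a 20-line text header that ends with exit_marker,
# followed by a payload that must survive the edit unchanged.
{
  echo '#!/bin/bash'
  echo "CL_INSTALLER_PASSWORD=''"
  n=3
  while [ "$n" -le 19 ]; do echo "# header line $n"; n=$((n + 1)); done
  echo 'exit_marker'
  printf 'payload line 1\npayload line 2\n'
} > "$image"

# Split the header from the payload.
head -20 "$image" > "$image.1"
sed -e '1,20d' "$image" > "$image.2"

# Edit a variable in the header part (a text editor works equally well).
sed -e "s|^CL_INSTALLER_PASSWORD=''|CL_INSTALLER_PASSWORD='example'|" "$image.1" > "$image.1.new"
mv "$image.1.new" "$image.1"

# Put the two parts back together.
cat "$image.1" "$image.2" > "$image.final"

# Checksum of everything after the exit_marker line.
payload_sha256=$(sed -e '1,/^exit_marker$/d' "$image.final" | sha256sum | awk '{ print $1 }')
echo "$payload_sha256"
```

Because only the header changes, the payload checksum printed here matches the checksum of the original payload, which is a useful sanity check that the rejoin did not corrupt the binary part.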
The following example shows a modified image file:
...
CL_INSTALLER_PAYLOAD_SHA256='d14a028c2a3a2bc9476102bb288234c415a2b01f828ea62ac332e42f'
CL_INSTALLER_PASSWORD='MyP4$$word'
CL_INSTALLER_HASHED_PASSWORD=''
CL_INSTALLER_LICENSE='customer@datacenter.com|4C3YMCACDiK0D/EnrxlXpj71FBBNAg4Yrq+brza4ZtJFCInvalid'
CL_INSTALLER_INTERFACES_FILENAME=''
CL_INSTALLER_INTERFACES_CONTENT='# This file describes the network interfaces available on your system and how to activate them.
source /etc/network/interfaces.d/*.intf
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet dhcp
vrf mgmt
auto bridge
iface bridge
bridge-ports swp1 swp2
bridge-pvid 1
bridge-vids 10 11
bridge-vlan-aware yes
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
'
CL_INSTALLER_ZTP_FILENAME=''
...
You can install this edited image file in the usual way, by using the ONIE install waterfall or the onie-nos-install command.
If you install the modified installation image and specify installer command line parameters, the command line parameters take precedence over the variables modified in the image.
Secure Boot
Secure Boot validates each binary image loaded during system boot with key signatures that correspond to a stored trusted key in firmware.
Secure Boot is available only on the NVIDIA SN3700C-S switch.
Secure Boot settings are in the BIOS Security menu. To access BIOS, press Ctrl+B through the serial console during system boot while the BIOS version prints:
To access the BIOS menu, use admin, which is the default BIOS password:
NVIDIA recommends changing the default BIOS password; navigate to Security and select Administrator Password.
To validate or change the Secure Boot mode, navigate to Security and select Secure Boot:
In the Secure Boot menu, you can enable and disable Secure Boot mode. To install an unsigned version of Cumulus Linux or access ONIE without a prompt for a username and password, set Secure Boot to disabled:
To access ONIE when Secure Boot is enabled, authentication is necessary. The default username and password are both root:
ONIE: Rescue Mode ...
Platform : x86_64-mlnx_x86-r0
Version : 2021.02-5.3.0006-rc3-115200
Build Date: 2021-05-20T14:27+03:00
Info: Mounting kernel filesystems... done.
Info: Mounting ONIE-BOOT on /mnt/onie-boot ...
[ 17.011057] ext4 filesystem being mounted at /mnt/onie-boot supports timestamps until 2038 (0x7fffffff)
Info: Mounting EFI System on /boot/efi ...
Info: BIOS mode: UEFI
Info: Using eth0 MAC address: b8:ce:f6:3c:62:06
Info: eth0: Checking link... up.
Info: Trying DHCPv4 on interface: eth0
ONIE: Using DHCPv4 addr: eth0: 10.20.84.226 / 255.255.255.0
Starting: klogd... done.
Starting: dropbear ssh daemon... done.
Starting: telnetd... done.
discover: Rescue mode detected. Installer disabled.
Please press Enter to activate this console. To check the install status inspect /var/log/onie.log.
Try this: tail -f /var/log/onie.log
** Rescue Mode Enabled **
login: root
Password: root
ONIE:~ #
To validate the Secure Boot status of a system from Cumulus Linux, run the mokutil --sb-state command.
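If you want to check the Secure Boot state from a script, you can wrap the mokutil output in a small test. This is a sketch under one assumption: that mokutil --sb-state prints a line containing the text SecureBoot enabled when Secure Boot is on. The function name is hypothetical.

```shell
#!/bin/sh
# Returns 0 when the text passed in reports Secure Boot as enabled.
# Assumption: mokutil prints a line containing "SecureBoot enabled".
secure_boot_enabled() {
    case "$1" in
        *"SecureBoot enabled"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a switch (not run here):
#   if secure_boot_enabled "$(mokutil --sb-state)"; then
#       echo "Secure Boot is on"
#   fi
```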
The default password for the cumulus user account is cumulus. The first time you log into Cumulus Linux, you must change this default password. Be sure to update any automation scripts before you upgrade. You can use ONIE command line options to change the default password automatically during the Cumulus Linux image installation process. Refer to ONIE Installation Options.
This topic describes how to upgrade Cumulus Linux on your switch.
Consider deploying, provisioning, configuring, and upgrading switches using automation, even with small networks or test labs. During the upgrade process, you can upgrade dozens of devices in a repeatable manner. Using tools like Ansible, Chef, or Puppet for configuration management greatly increases the speed and accuracy of the next major upgrade; these tools also enable you to quickly swap failed switch hardware.
Understanding the location of configuration data is important for successful upgrades, migrations, and backup. As with other Linux distributions, the /etc directory is the primary location for all configuration data in Cumulus Linux. The following list contains the files you need to back up and migrate to a new release. Make sure you examine any changed files. Make the following files and directories part of a backup strategy.
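File Name and Location    Description                                           Cumulus Linux Documentation    Debian Documentation
------------------------  ----------------------------------------------------  -----------------------------  --------------------
/etc/frr/                 Routing application (responsible for BGP and OSPF)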
If you are using the root user account, consider including /root/.
If you have custom user accounts, consider including /home/<username>/.
Run the net show configuration files | grep -B 1 "===" command and back up the files listed in the command output.
File Name and Location    Description
------------------------  ----------------------------------------------------------------------------------
/etc/mlx/                 Per-platform hardware configuration directory, created on first boot. Do not copy.
/etc/default/clagd        Created and managed by ifupdown2. Do not copy.
/etc/default/grub         Grub init table. Do not modify manually.
/etc/default/hwclock      Platform hardware-specific file. Created during first boot. Do not copy.
/etc/init                 Platform initialization files. Do not copy.
/etc/init.d/              Platform initialization files. Do not copy.
/etc/fstab                Static information on filesystem. Do not copy.
/etc/image-release        System version data. Do not copy.
/etc/os-release           System version data. Do not copy.
/etc/lsb-release          System version data. Do not copy.
/etc/lvm/archive          Filesystem files. Do not copy.
/etc/lvm/backup           Filesystem files. Do not copy.
/etc/modules              Created during first boot. Do not copy.
/etc/modules-load.d/      Created during first boot. Do not copy.
/etc/sensors.d            Platform-specific sensor data. Created during first boot. Do not copy.
/root/.ansible            Ansible tmp files. Do not copy.
/home/cumulus/.ansible    Ansible tmp files. Do not copy.
The following commands verify which files have changed compared to the previous Cumulus Linux install. Be sure to back up any changed files.
Run the sudo dpkg --verify command to show a list of changed files.
Run the egrep -v '^$|^#|=""$' /etc/default/isc-dhcp-* command to see if any of the generated /etc/default/isc-* files have changed.
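To turn the list of changed files into an archive, you can pipe the dpkg --verify output into tar. The sketch below makes two assumptions: that the path is the last whitespace-separated field on each output line, and that paths contain no spaces. The function names and the archive location are hypothetical.

```shell
#!/bin/sh
# Reads dpkg --verify style lines on stdin and prints the path column.
# Assumption: the path is the last field on the line and has no spaces.
changed_paths() {
    awk 'NF { print $NF }'
}

# Archives the listed paths that still exist into the given tar file.
backup_changed() {
    archive=$1
    changed_paths | while read -r path; do
        [ -e "$path" ] && printf '%s\n' "$path"
    done | tar -czf "$archive" -T -
}

# Usage on a switch (not run here):
#   sudo dpkg --verify | backup_changed /tmp/changed-config.tar.gz
```

Copy the resulting archive off the switch together with the other backup files.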
Back Up and Restore Configuration with NVUE
To back up and restore the configuration on the switch with NVUE, you can either:
Back up and restore the NVUE configuration file (available when upgrading from 4.4 and later).
Back up and restore the NVUE configuration commands (available when upgrading from 5.0 and later).
You can back up and restore the configuration with NVUE only if you used NVUE commands to configure the switch you want to upgrade.
To back up and restore the configuration file:
Save the configuration to the /etc/nvue.d/startup.yaml file with the nv config save command:
cumulus@switch:~$ nv config save
saved
Copy the /etc/nvue.d/startup.yaml file off the switch to a different location.
After upgrade is complete, restore the configuration. Copy the /etc/nvue.d/startup.yaml file to the switch, then run the nv config apply startup command:
cumulus@switch:~$ nv config apply startup
applied
To back up and restore the configuration commands:
Run the nv config show -o commands > backup.config command to save the commands to the backup.config file:
cumulus@switch:~$ nv config show -o commands > backup.config
Copy the backup.config file off the switch to a different location.
After upgrade is complete, restore the configuration. Copy the backup.config file to the switch, then run the source backup.config command to run all the commands in the file.
cumulus@switch:~$ source backup.config
If the backup configuration contains an obfuscated password, you need to reconfigure the password after you run the source backup.config command; otherwise authentication fails.
Verify the configuration on the switch, then run the nv config save command to save the configuration to the /etc/nvue.d/startup.yaml file.
If NVUE introduces new syntax for the feature that a snippet configures, you must remove the snippet before upgrading.
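Because source backup.config runs every line of the file as a shell command, you might want to inspect the file before you restore it. The following sketch flags any line that is not blank, a comment, or an nv command. It assumes that nv config show -o commands writes one nv set or nv unset command per line. The function names are hypothetical.

```shell
#!/bin/sh
# Prints any line of the file that is not blank, a comment, or an nv command.
# Assumption: the backup holds one "nv set ..." or "nv unset ..." per line.
unexpected_lines() {
    grep -v -E '^[[:space:]]*(#.*)?$|^nv (set|unset) ' "$1"
}

# Succeeds only when the file contains nothing unexpected.
backup_looks_safe() {
    [ -z "$(unexpected_lines "$1")" ]
}

# Usage on a switch (not run here):
#   backup_looks_safe backup.config && source backup.config
```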
Create a cl-support File
Before and after you upgrade the switch, run the cl-support script to create a cl-support archive file. The file is a compressed archive of useful information for troubleshooting. If you experience any issues during upgrade, you can send this archive file to the Cumulus Linux support team to investigate.
Create the cl-support archive file with the cl-support command:
cumulus@switch:~$ sudo cl-support
Copy the cl-support file off the switch to a different location.
After upgrade is complete, run the cl-support command again to create a new archive file:
cumulus@switch:~$ sudo cl-support
Upgrade Cumulus Linux
ONIE is an open source project (equivalent to PXE on servers) that enables the installation of network operating systems (NOS) on a bare metal switch.
You can upgrade Cumulus Linux in one of two ways:
Install a Cumulus Linux image of the new release, using ONIE.
Upgrade only the changed packages using the sudo -E apt-get update and sudo -E apt-get upgrade commands.
Cumulus Linux also provides ISSU to upgrade an active switch with minimal disruption to the network. See ISSU.
To upgrade to Cumulus Linux 5.5.0 from Cumulus Linux 4.x or 3.x, you must install a disk image of the new release using ONIE. You cannot upgrade packages with the apt-get upgrade command.
Upgrading an MLAG pair requires additional steps. If you are using MLAG to dual connect two Cumulus Linux switches in your environment, follow the steps in Upgrade Switches in an MLAG Pair below to ensure a smooth upgrade.
Install a Cumulus Linux Image or Upgrade Packages?
The decision to upgrade Cumulus Linux by either installing a Cumulus Linux image or upgrading packages depends on your environment and your preferences. Here are some recommendations for each upgrade method.
Install a Cumulus Linux image if you are performing a rolling upgrade in a production environment and if you are using up-to-date and comprehensive automation scripts. This upgrade method enables you to choose the exact release to which you want to upgrade and is the only method available to upgrade your switch to a new release train (for example, from 4.4.3 to 5.5.0).
Be aware of the following when installing the Cumulus Linux image:
Installing a Cumulus Linux image is destructive: the installation does not save any configuration files on the switch. Copy them to a different server before you start the Cumulus Linux image install.
You must move configuration data to the new OS using ZTP or automation while the OS is first booted, or soon afterwards using out-of-band management.
Merge conflicts with configuration file changes in the new release sometimes go undetected.
If configuration files do not restore correctly, you cannot ssh to the switch from in-band management. Use out-of-band connectivity (eth0 or console).
You must reinstall and reconfigure third-party applications after upgrade.
Run package upgrade if you are upgrading from Cumulus Linux 5.0.0 to a later 5.x release, or if you use third-party applications (package upgrade does not replace or remove third-party applications, unlike the Cumulus Linux image install).
Be aware of the following when upgrading packages:
You cannot upgrade the switch to a new release train. For example, you cannot upgrade the switch from 4.x to 5.x.
Package upgrade supports a maximum of two releases beyond the installed image; for example, you can package upgrade a switch running the Cumulus Linux 5.3 image to 5.4 or 5.5 (5.3 plus two releases).
The sudo -E apt-get upgrade command might restart or stop services as part of the upgrade process.
The sudo -E apt-get upgrade command might disrupt core services by changing core service dependency packages.
After you upgrade, account UIDs and GIDs created by packages might be different on different switches, depending on the configuration and package installation history.
Cumulus Linux does not support the sudo -E apt-get dist-upgrade command. Be sure to use sudo -E apt-get upgrade when upgrading packages.
The supported upgrade path is the base image plus two. For example, if the starting image is Cumulus Linux 5.2, the latest release that package upgrade supports is Cumulus Linux 5.4 (5.2 + 2 = 5.4).
You can check the base image with the grep RELEASE /etc/image-release command.
Occasionally, a release contains a base OS upgrade and does not support package upgrade; release notes indicate when a release does not support package upgrade.
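The "base image plus two" rule is simple arithmetic on the minor version number. The following sketch expresses it as two hypothetical shell functions. It assumes version strings of the form major.minor or major.minor.patch, and it does not account for releases whose release notes exclude package upgrade.

```shell
#!/bin/sh
# Prints the highest release that package upgrade supports for a base image,
# following the "base image plus two" rule (for example, 5.2 -> 5.4).
max_package_upgrade() {
    major=${1%%.*}
    rest=${1#*.}
    minor=${rest%%.*}
    echo "$major.$((minor + 2))"
}

# Succeeds when target is reachable from base by package upgrade:
# same major version, and a minor version between base and base plus two.
can_package_upgrade() {
    base=$1 target=$2
    b_major=${base%%.*};   b_rest=${base#*.};   b_minor=${b_rest%%.*}
    t_major=${target%%.*}; t_rest=${target#*.}; t_minor=${t_rest%%.*}
    [ "$b_major" -eq "$t_major" ] &&
        [ "$t_minor" -ge "$b_minor" ] &&
        [ "$t_minor" -le $((b_minor + 2)) ]
}

# Usage (values typed in by hand after checking /etc/image-release):
#   can_package_upgrade 5.3.0 5.5.0 && echo "package upgrade is an option"
```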
Cumulus Linux Image Install (ONIE)
To upgrade the switch:
Back up the configurations off the switch.
Download the Cumulus Linux image.
Install the Cumulus Linux image with the onie-install -a -i <image-location> command, which boots the switch into ONIE. The following example command installs the image from a web server, then reboots the switch. There are additional ways to install the Cumulus Linux image, such as using FTP, a local file, or a USB drive. For more information, see Installing a New Cumulus Linux Image.
cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/cumulus-linux-4.1.0-mlx-amd64.bin && sudo reboot
Restore the configuration files to the new release (restoring files with automation is not recommended).
Verify correct operation with the old configurations on the new release.
Reinstall third party applications and associated configurations.
Package Upgrade
NVUE deprecated the port split command options (2x10G, 2x25G, 2x40G, 2x50G, 2x100G, 2x200G, 4x10G, 4x25G, 4x50G, 4x100G, 8x50G) available in Cumulus Linux 5.3 and earlier. If you use NVUE to configure port breakout speeds in Cumulus 5.3 or earlier, NVUE automatically updates the configuration during upgrade to Cumulus Linux 5.5 and later to use the new format (2x, 4x, 8x).
Cumulus Linux continues to support the old port split format in the /etc/cumulus/ports.conf file; however NVIDIA recommends that you use the new format.
Cumulus Linux completely embraces the Linux and Debian upgrade workflow, where you use an installer to install a base image, then perform any upgrades within that release train with sudo -E apt-get update and sudo -E apt-get upgrade commands. Any packages that have changed after the base install get upgraded in place from the repository. All switch configuration files remain untouched, or in rare cases merged (using the Debian merge function) during the package upgrade.
When you use package upgrade to upgrade your switch, configuration data stays in place during the upgrade. If the new release updates a previously changed configuration file, the upgrade process prompts you to either specify the version you want to use or evaluate the differences.
To upgrade the switch using package upgrade:
Back up the configurations from the switch.
Fetch the latest update metadata from the repository.
cumulus@switch:~$ sudo -E apt-get update
Review potential upgrade issues (in some cases, upgrading new packages might also upgrade additional existing packages due to dependencies).
Upgrade all the packages to the latest distribution.
cumulus@switch:~$ sudo -E apt-get upgrade
If you do not need to reboot the switch after the upgrade completes, the upgrade ends, restarts all upgraded services, and logs messages in the /var/log/syslog file similar to the ones shown below. In the examples below, the process only upgrades the frr package.
Policy: Service frr.service action stop postponed
Policy: Service frr.service action start postponed
Policy: Restarting services: frr.service
Policy: Finished restarting services
Policy: Removed /usr/sbin/policy-rc.d
Policy: Upgrade is finished
If the upgrade process encounters changed configuration files that have new versions in the release to which you are upgrading, you see a message similar to this:
Configuration file '/etc/frr/daemons'
==> Modified (by you or by a script) since installation.
==> Package distributor has shipped an updated version.
What would you like to do about it ? Your options are:
Y or I : install the package maintainer's version
N or O : keep your currently-installed version
D : show the differences between the versions
Z : start a shell to examine the situation
The default action is to keep your current version.
*** daemons (Y/I/N/O/D/Z) [default=N] ?
To see the differences between the currently installed version and the new version, type D.
To keep the currently installed version, type N. The new package version installs with the suffix .dpkg-dist (for example, /etc/frr/daemons.dpkg-dist). When the upgrade completes and before you reboot, merge your changes with the changes from the newly installed file.
To install the new version, type I. Your currently installed version has the suffix .dpkg-old.
Cumulus Linux includes /etc/apt/sources.list in the cumulus-archive-keyring package. During upgrade, you must select whether you want the new version from the package or the existing file.
When the upgrade is complete, you can search for the files with the sudo find / -mount -type f -name '*.dpkg-*' command.
If you see errors for expired GPG keys that prevent you from upgrading packages, follow the steps in Upgrading Expired GPG Keys.
Reboot the switch if the upgrade messages indicate that you need to perform a system restart.
cumulus@switch:~$ sudo -E apt-get upgrade
... upgrade messages here ...
*** Caution: Service restart prior to reboot could cause unpredictable behavior
*** System reboot required ***
cumulus@switch:~$ sudo reboot
Verify correct operation with the old configurations on the new version.
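When you merge configuration files after the upgrade, it helps to pair each leftover .dpkg-dist or .dpkg-old file with the live file it belongs to. The following is a minimal sketch that builds on the find command shown in the steps above. The function name is hypothetical, and the sketch assumes that paths contain no spaces.

```shell
#!/bin/sh
# Lists leftover .dpkg-* files under a directory and, next to each one,
# the path of the live file it belongs to.
list_dpkg_leftovers() {
    find "$1" -mount -type f -name '*.dpkg-*' | sort | while read -r leftover; do
        live=${leftover%.dpkg-*}
        printf '%s %s\n' "$leftover" "$live"
    done
}

# Usage on a switch (not run here), showing a unified diff for each pair:
#   list_dpkg_leftovers /etc | while read -r leftover live; do
#       diff -u "$live" "$leftover"
#   done
```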
The first time you run the NVUE nv config apply command after upgrading to Cumulus Linux 5.5, NVUE might override certain existing configuration for features that are now configurable with NVUE. Immediately after you reboot the switch to complete the upgrade, NVIDIA recommends you either:
Package upgrade always updates to the latest available release in the Cumulus Linux repository. For example, if you are currently running Cumulus Linux 5.0.0 and run the sudo -E apt-get upgrade command on that switch, the packages upgrade to the versions in the latest 5.x release.
Because Cumulus Linux is a collection of different Debian Linux packages, be aware of the following:
The /etc/os-release and /etc/lsb-release files update to the currently installed Cumulus Linux release when you upgrade the switch using either package upgrade or Cumulus Linux image install. For example, if you run sudo -E apt-get upgrade and the latest Cumulus Linux release on the repository is 5.5.0, these two files display the release as 5.5.0 after the upgrade.
The /etc/image-release file updates only when you run a Cumulus Linux image install. Therefore, if you run a Cumulus Linux image install of Cumulus Linux 5.0.0, followed by a package upgrade to 5.5.0 using sudo -E apt-get upgrade, the /etc/image-release file continues to display Cumulus Linux 5.0.0, which is the originally installed base image.
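Comparing the two files tells you whether a switch has been package upgraded since its last image install. The sketch below assumes that /etc/os-release carries a VERSION_ID key and that /etc/image-release carries an IMAGE_RELEASE key, both in KEY=VALUE form. Check the files on your switch before you rely on these key names. The function names are hypothetical.

```shell
#!/bin/sh
# Prints the value of KEY from a KEY=VALUE file, with quotes removed.
read_key() {
    sed -n "s/^$1=//p" "$2" | head -n 1 | tr -d "\"'"
}

# Reports whether the running release differs from the installed base image.
# Assumptions: os-release has VERSION_ID and image-release has IMAGE_RELEASE.
report_upgrade_state() {
    running=$(read_key VERSION_ID "$1")
    base=$(read_key IMAGE_RELEASE "$2")
    if [ "$running" = "$base" ]; then
        echo "running $running (matches base image)"
    else
        echo "running $running (package upgraded from base image $base)"
    fi
}

# Usage on a switch (not run here):
#   report_upgrade_state /etc/os-release /etc/image-release
```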
Upgrade Switches in an MLAG Pair
If you are using MLAG to dual connect two switches in your environment, follow the steps below to upgrade the switches.
You must upgrade both switches in the MLAG pair to the same release of Cumulus Linux.
Cumulus Linux supports different software versions between MLAG peer switches only during the upgrade process. After you upgrade the first MLAG switch in the pair, run the clagctl showtimers command to monitor the init-delay timer. When the timer expires, make the upgraded MLAG switch the primary, then upgrade the peer to the same version of Cumulus Linux.
NVIDIA has not tested running different versions of Cumulus Linux on MLAG peer switches outside of the upgrade time period; you might see unexpected results.
Verify the switch is in the secondary role:
cumulus@switch:~$ nv show mlag
Shut down the core uplink layer 3 interfaces. The following example shuts down swp1:
cumulus@switch:~$ nv set interface swp1 link state down
cumulus@switch:~$ nv config apply
Shut down the peer link:
cumulus@switch:~$ nv set interface peerlink link state down
cumulus@switch:~$ nv config apply
To boot the switch into ONIE, run the onie-install -a -i <image-location> command. The following example command installs the image from a web server. There are additional ways to install the Cumulus Linux image, such as using FTP, a local file, or a USB drive. For more information, see Installing a New Cumulus Linux Image.
cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/downloads/cumulus-linux-4.1.0-mlx-amd64.bin
To upgrade the switch with package upgrade instead of booting into ONIE, run the sudo -E apt-get update and sudo -E apt-get upgrade commands; see Package Upgrade.
Save the changes to the NVUE configuration from steps 2-3 and reboot the switch:
cumulus@switch:~$ nv config save
cumulus@switch:~$ nv action reboot system
If you installed a new image on the switch, restore the configuration files to the new release. If you performed an upgrade with apt, bring up the uplink and peer link interfaces that you shut down in steps 2-3:
cumulus@switch:~$ nv set interface swp1 link state up
cumulus@switch:~$ nv set interface peerlink link state up
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv config save
Verify STP convergence across both switches with the Linux mstpctl showall command. NVUE does not provide an equivalent command.
cumulus@switch:~$ mstpctl showall
Verify core uplinks and peer links are UP:
cumulus@switch:~$ nv show interface
Verify MLAG convergence:
cumulus@switch:~$ nv show mlag
Make this secondary switch the primary:
cumulus@switch:~$ nv set mlag priority 2048
Verify the other switch is now in the secondary role.
Repeat steps 2-9 on the new secondary switch.
Remove the priority 2048 and restore the priority back to 32768 on the current primary switch:
cumulus@switch:~$ nv set mlag priority 32768
To perform the same procedure with Linux commands instead of NVUE commands, follow these steps.
Verify the switch is in the secondary role:
cumulus@switch:~$ clagctl status
Shut down the core uplink layer 3 interfaces:
cumulus@switch:~$ sudo ip link set <switch-port> down
Shut down the peer link:
cumulus@switch:~$ sudo ip link set peerlink down
To boot the switch into ONIE, run the onie-install -a -i <image-location> command. The following example command installs the image from a web server. There are additional ways to install the Cumulus Linux image, such as using FTP, a local file, or a USB drive. For more information, see Installing a New Cumulus Linux Image.
cumulus@switch:~$ sudo onie-install -a -i http://10.0.1.251/downloads/cumulus-linux-4.1.0-mlx-amd64.bin
To upgrade the switch with package upgrade instead of booting into ONIE, run the sudo -E apt-get update and sudo -E apt-get upgrade commands; see Package Upgrade.
Reboot the switch:
cumulus@switch:~$ sudo reboot
If you installed a new image on the switch, restore the configuration files to the new release.
Verify STP convergence across both switches:
cumulus@switch:~$ mstpctl showall
Verify that core uplinks and peer links are UP:
cumulus@switch:~$ ip addr show
Verify MLAG convergence:
cumulus@switch:~$ clagctl status
Make this secondary switch the primary:
cumulus@switch:~$ clagctl priority 2048
Verify the other switch is now in the secondary role.
Repeat steps 2-9 on the new secondary switch.
Remove the priority 2048 and restore the priority back to 32768 on the current primary switch:
cumulus@switch:~$ clagctl priority 32768
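If a switch has several core uplinks, you can script the shutdown steps so that no port is missed. The following sketch wraps the ip link commands from the Linux procedure above in a hypothetical helper that only prints the commands unless you set DRY_RUN=0. The port names swp1 and swp2 are examples.

```shell
#!/bin/sh
# Prints each command when DRY_RUN is 1 (the default here); runs it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Shuts down the given uplink ports, then the peer link, before an upgrade.
mlag_isolate() {
    for port in "$@"; do
        run sudo ip link set "$port" down
    done
    run sudo ip link set peerlink down
}

mlag_isolate swp1 swp2
```

Review the printed commands first, then rerun with DRY_RUN=0 on the secondary switch.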
Roll Back a Cumulus Linux Installation
Even the most well planned and tested upgrades can result in unforeseen problems, and sometimes the best solution is to roll back to the previous state. The two main strategies both require detailed planning and execution:
Flatten and rebuild. If the OS becomes unusable, you can use orchestration tools to reinstall the previous OS release from scratch and then rebuild the configuration automatically.
Restore to a previous state using a backup configuration captured before the upgrade.
The method you employ is specific to your deployment strategy. Providing detailed steps for each scenario is outside the scope of this document.
Third Party Packages
If you install any third party applications on a Cumulus Linux switch, configuration data is typically installed in the /etc directory, but it is not guaranteed. It is your responsibility to understand the behavior and configuration file information of any third party packages installed on the switch.
After you upgrade using a full Cumulus Linux image install, you need to reinstall any third party packages or any Cumulus Linux add-on packages.
To manage additional applications in the form of packages and to install the latest updates, use the Advanced Packaging Tool (apt).
Updating, upgrading, and installing packages with apt causes disruptions to network services:
Upgrading a package can cause services to restart or stop.
Installing a package sometimes disrupts core services by changing core service dependency packages. In some cases, installing new packages also upgrades additional existing packages due to dependencies.
If services stop, you need to reboot the switch to restart the services.
Update the Package Cache
To work correctly, apt relies on a local cache listing of the available packages. You must populate the cache initially, then periodically update it with sudo -E apt-get update:
cumulus@switch:~$ sudo -E apt-get update
Use the -E option with sudo whenever you run any apt-get command. This option preserves your environment variables (such as HTTP proxies) before you install new packages or upgrade your distribution.
List Available Packages
After the cache populates, use the apt-cache command to search the cache and find the packages of interest or to get information about an available package.
Here are examples of the search and show sub-commands:
cumulus@switch:~$ apt-cache search tcp
collectd-core - statistics collection and monitoring daemon (core system)
fakeroot - tool for simulating superuser privileges
iperf - Internet Protocol bandwidth measuring tool
iptraf-ng - Next Generation Interactive Colorful IP LAN Monitor
libfakeroot - tool for simulating superuser privileges - shared libraries
libfstrm0 - Frame Streams (fstrm) library
libibverbs1 - Library for direct userspace use of RDMA (InfiniBand/iWARP)
libnginx-mod-stream - Stream module for Nginx
libqt4-network - Qt 4 network module
librtr-dev - Small extensible RPKI-RTR-Client C library - development files
librtr0 - Small extensible RPKI-RTR-Client C library
libwiretap8 - network packet capture library -- shared library
libwrap0 - Wietse Venema's TCP wrappers library
libwrap0-dev - Wietse Venema's TCP wrappers library, development files
netbase - Basic TCP/IP networking system
nmap-common - Architecture independent files for nmap
nuttcp - network performance measurement tool
openssh-client - secure shell (SSH) client, for secure access to remote machines
openssh-server - secure shell (SSH) server, for secure access from remote machines
openssh-sftp-server - secure shell (SSH) sftp server module, for SFTP access from remote machines
python-dpkt - Python 2 packet creation / parsing module for basic TCP/IP protocols
rsyslog - reliable system and kernel logging daemon
socat - multipurpose relay for bidirectional data transfer
tcpdump - command-line network traffic analyzer
cumulus@switch:~$ apt-cache show tcpdump
Package: tcpdump
Version: 4.9.3-1~deb10u1
Installed-Size: 1109
Maintainer: Romain Francoise <rfrancoise@debian.org>
Architecture: amd64
Replaces: apparmor-profiles-extra (<< 1.12~)
Depends: libc6 (>= 2.14), libpcap0.8 (>= 1.5.1), libssl1.1 (>= 1.1.0)
Suggests: apparmor (>= 2.3)
Breaks: apparmor-profiles-extra (<< 1.12~)
Size: 400060
SHA256: 3a63be16f96004bdf8848056f2621fbd863fadc0baf44bdcbc5d75dd98331fd3
SHA1: 2ab9f0d2673f49da466f5164ecec8836350aed42
MD5sum: 603baaf914de63f62a9f8055709257f3
Description: command-line network traffic analyzer
This program allows you to dump the traffic on a network. tcpdump
is able to examine IPv4, ICMPv4, IPv6, ICMPv6, UDP, TCP, SNMP, AFS
BGP, RIP, PIM, DVMRP, IGMP, SMB, OSPF, NFS and many other packet
types.
.
It can be used to print out the headers of packets on a network
interface, filter packets that match a certain expression. You can
use this tool to track down network problems, to detect attacks
or to monitor network activities.
Description-md5: f01841bfda357d116d7ff7b7a47e8782
Homepage: http://www.tcpdump.org/
Multi-Arch: foreign
Section: net
Priority: optional
Filename: pool/upstream/t/tcpdump/tcpdump_4.9.3-1~deb10u1_amd64.deb
The search command looks for the search terms not only in the package name but also in other parts of the package information, so the search can match more packages than you expect.
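To narrow the results to package names, apt-cache search accepts the --names-only option. If you already have search output saved, you can filter it afterwards. The following sketch assumes the name - description line layout shown in the example output above. The function name is hypothetical.

```shell
#!/bin/sh
# Keeps only the search results whose package name contains the term.
# Input: "name - description" lines, as printed by apt-cache search.
match_names_only() {
    awk -F ' - ' -v term="$1" 'index($1, term) > 0'
}

# Usage on a switch (not run here):
#   apt-cache search tcp | match_names_only tcp
```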
List Packages Installed on the System
The apt-cache command shows information about all the packages available in the repository. To see which packages are installed on your system and their versions, run the NVUE nv show platform software installed command or the Linux dpkg -l command.
cumulus@switch:~$ nv show platform software installed
description package version
------------------------------------- ---------------------------------------------------------------------------------------------------------------------------- ------------------------------------- ----------------------------------------------
acpi displays information on ACPI devices acpi 1.7-1.1
acpi-support-base scripts for handling base ACPI events such as the power button acpi-support-base 0.142-8
acpid Advanced Configuration and Power Interface event daemon acpid 1:2.0.31-1
...
cumulus@switch:~$ dpkg -l
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===================-=========================-============-=================================
ii acpi 1.7-1.1 amd64 displays information on ACPI devices
ii acpi-support-base 0.142-8 all scripts for handling base ACPI events such as th
ii acpid 1:2.0.31-1 amd64 Advanced Configuration and Power Interface event
ii adduser 3.118 all add and remove users and groups
ii apt 1.8.2 amd64 commandline package manager
ii arping 2.19-6 amd64 sends IP and/or ARP pings (to the MAC address)
ii arptables 0.0.4+snapshot20181021-4 amd64 ARP table administration
...
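For scripts, the dpkg -l listing is easier to consume once you reduce it to name and version pairs. The following sketch keeps only rows whose status column is ii (installed) and prints the second and third columns. The function name is hypothetical.

```shell
#!/bin/sh
# Prints "name version" for every fully installed package ("ii" rows).
installed_versions() {
    awk '$1 == "ii" { print $2, $3 }'
}

# Usage on a switch (not run here):
#   dpkg -l | installed_versions
```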
Show the Version of a Package
To show the version of a specific package installed on the system:
The following example command shows which version of the vrf package is on the system:
cumulus@switch:~$ nv show platform software installed vrf
running applied pending description
----------- ------------------- ------- ------- -----------
description Linux tools for VRF Description
package vrf Package
version 1.0-cl4.4.0u0 Version
The following example command shows the same information with the Linux dpkg -l command:
cumulus@switch:~$ dpkg -l vrf
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-==========-============-============-=================================
ii vrf 1.0-cl4.4.0u0 amd64 Linux tools for VRF
Upgrade Packages
To upgrade all the packages installed on the system to their latest versions, run the following commands:
cumulus@switch:~$ sudo -E apt-get update
cumulus@switch:~$ sudo -E apt-get upgrade
The system lists the packages for upgrade and prompts you to continue.
The above commands upgrade all installed versions with their latest versions but do not install any new packages.
Add New Packages
To add a new package, first ensure the package is not already on the system:
cumulus@switch:~$ dpkg -l | grep <name of package>
If the package is already on the system, you can update the package from the Cumulus Linux repository as part of the package upgrade process, which upgrades all packages on the system. See Upgrade Packages above.
If the package is not already on the system, add it by running sudo -E apt-get install <name of package>. This retrieves the package from the Cumulus Linux repository and installs it on your system together with any other dependent packages. The following example adds the tcpreplay package to the system:
cumulus@switch:~$ sudo -E apt-get update
cumulus@switch:~$ sudo -E apt-get install tcpreplay
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
tcpreplay
0 upgraded, 1 newly installed, 0 to remove and 1 not upgraded.
Need to get 436 kB of archives.
After this operation, 1008 kB of additional disk space will be used
...
You can install several packages at the same time:
cumulus@switch:~$ sudo -E apt-get install <package 1> <package 2> <package 3>
In some cases, installing a new package also upgrades additional existing packages due to dependencies. To view these additional packages before you install, run the apt-get install --dry-run command.
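The dry-run output can be long, so a filter that shows only the packages that would be upgraded is useful. The following sketch assumes that the simulated output marks an upgrade with a line of the form Inst name [old version] (new version ...) and a new install with the same line without the bracketed old version. The function name is hypothetical.

```shell
#!/bin/sh
# Reads simulated apt-get output on stdin and prints the packages that would
# be upgraded. Assumption: upgrade lines look like
# "Inst <name> [<old>] (<new> ...)"; new installs omit the bracketed version.
would_upgrade() {
    awk '$1 == "Inst" && $3 ~ /^\[/ { print $2 }'
}

# Usage on a switch (not run here):
#   sudo -E apt-get install --dry-run tcpreplay | would_upgrade
```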
Add Packages From Another Repository
As shipped, Cumulus Linux searches the Cumulus Linux repository for available packages. You can add additional repositories to search by adding them to the list of sources that apt-get consults. See man sources.list for more information.
NVIDIA adds features or makes bug fixes to certain packages; do not replace these packages with versions from other repositories.
If you want to install packages that are not in the Cumulus Linux repository, the procedure is the same as above, but with one additional step.
NVIDIA does not test and Cumulus Linux Technical Support does not support packages that are not part of the Cumulus Linux repository.
Installing packages outside of the Cumulus Linux repository requires the use of sudo -E apt-get; however, depending on the package, you can use easy-install and other commands.
To install a new package, complete the following steps:
Run the dpkg command to ensure that the package is not already installed on the system:
cumulus@switch:~$ dpkg -l | grep <name of package>
If the package is already on the system, ensure it is the version you need. If it is an older version, update the package from the Cumulus Linux repository as part of the package upgrade process (see Upgrade Packages above).
If the package is not on the system, the package source location is most likely not in the /etc/apt/sources.list file. Edit the file and add the appropriate source. For example, add the following if you want a package from the Debian repository that is not in the Cumulus Linux repository:
deb http://http.us.debian.org/debian buster main
deb http://security.debian.org/ buster/updates main
If /etc/apt/sources.list already lists the repository but comments it out, uncomment the repository: remove the # at the start of the line, then save the file.
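Run sudo -E apt-get update, then install the package and upgrade:
cumulus@switch:~$ sudo -E apt-get update
cumulus@switch:~$ sudo -E apt-get install <name of package>
cumulus@switch:~$ sudo -E apt-get upgrade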
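As a rough sketch of the sources.list edit in the steps above, the following helper handles the three cases: the entry is already active, it is present but commented out, or it is missing. The function name enable_repo and its arguments are hypothetical and are not part of Cumulus Linux. Try it on a copy of the file first.

```shell
# Hypothetical helper (not a Cumulus Linux command).
# Usage: enable_repo <sources-file> <repo-line>
enable_repo() {
    local file="$1" entry="$2" tmp rc

    # Case 1: the entry is already present and active; nothing to do.
    if grep -qxF -- "$entry" "$file"; then
        return 0
    fi

    # Case 2: the entry exists but is commented out; strip the leading "#".
    if sed 's/^#[[:space:]]*//' "$file" | grep -qxF -- "$entry"; then
        tmp=$(mktemp) || return 1
        awk -v e="$entry" '{
            s = $0
            sub(/^#[ \t]*/, "", s)          # the line without its comment marker
            if (s == e) print e; else print $0
        }' "$file" > "$tmp" && cat "$tmp" > "$file"
        rc=$?
        rm -f "$tmp"
        return "$rc"
    fi

    # Case 3: the entry is missing; append it.
    printf '%s\n' "$entry" >> "$file"
}
```

For example, running enable_repo /etc/apt/sources.list 'deb http://http.us.debian.org/debian buster main' as root, followed by sudo -E apt-get update, makes the Debian repository from the example available to apt-get.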
Cumulus Linux contains a local archive embedded in the Cumulus Linux image. This archive, cumulus-local-apt-archive, contains the packages you need to install ifplugd, LDAP, RADIUS or TACACS+ without a network connection.
The archive contains the following packages:
audisp-tacplus
ifplugd
libdaemon0
libnss-ldapd
libnss-mapuser
libnss-tacplus
libpam-ldapd
libpam-radius-auth
libpam-tacplus
libtac2
libtacplus-map1
nslcd
Add these packages with apt-get update && apt-get install, as described above.
man pages for apt-get, dpkg, sources.list, apt_preferences
Zero Touch Provisioning - ZTP
Use ZTP to deploy network devices in large-scale environments. On first boot, Cumulus Linux runs ZTP, which executes the provisioning automation that deploys the device for its intended role in the network.
The provisioning framework allows you to execute a one-time, user-provided script. You can develop this script using a variety of automation tools and scripting languages. You can also use it to add the switch to a configuration management (CM) platform such as Puppet, Chef, CFEngine or a custom, proprietary tool.
While developing and testing the provisioning logic, you can use the ztp command in Cumulus Linux to run your provisioning script manually on a device.
ZTP in Cumulus Linux can run automatically in one of the following ways, in this order:
Through a local file
Using a USB drive inserted into the switch (ZTP-USB)
Through DHCP
Use a Local File
ZTP only looks one time for a ZTP script on the local file system when the switch boots. ZTP searches for an install script that matches an ONIE-style waterfall in /var/lib/cumulus/ztp, looking for the most specific name first, and ending at the most generic:
You can also trigger the ZTP process manually by running the ztp --run <URL> command, where the URL is the path to the ZTP script.
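The waterfall search for a local script can be sketched as follows. The candidate names mirror the ZTP LOCAL lines in the journalctl sample output later in this section (for example, cumulus-ztp-x86_64-dell_s6010_s1220-rUNKNOWN). The function names and the way the architecture, vendor, model, and revision are passed in are assumptions made for illustration; this is not the ztp implementation.

```shell
# Print candidate script names, most specific first.
# Usage: ztp_waterfall <arch> <vendor> <model> <revision>
ztp_waterfall() {
    local arch="$1" vendor="$2" model="$3" rev="$4"
    printf '%s\n' \
        "cumulus-ztp-${arch}-${vendor}_${model}-r${rev}" \
        "cumulus-ztp-${arch}-${vendor}_${model}" \
        "cumulus-ztp-${arch}-${vendor}" \
        "cumulus-ztp-${arch}" \
        "cumulus-ztp"
}

# Print the first candidate that exists in <dir>; return 1 if none exists.
# Usage: find_ztp_script <dir> <arch> <vendor> <model> <revision>
find_ztp_script() {
    local dir="$1" name
    shift
    # Candidate names contain no whitespace, so word splitting is safe here.
    for name in $(ztp_waterfall "$@"); do
        if [ -f "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}
```

Calling find_ztp_script /var/lib/cumulus/ztp x86_64 dell s6010_s1220 UNKNOWN returns the most specific script present in that directory, which is a quick way to check which file name to use for a given switch.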
Use a USB Drive
NVIDIA tests this feature only with thumb drives, not with large external USB hard drives.
If the ztp process does not discover a local script, it tries one time to locate an inserted but unmounted USB drive. If it discovers one, it begins the ZTP process.
Cumulus Linux supports the use of a FAT32, FAT16, or VFAT-formatted USB drive as an installation source for ZTP scripts. You must plug in the USB drive before you power up the switch.
At minimum, the script must:
Install the Cumulus Linux operating system.
Copy over a basic configuration to the switch.
Restart the switch or the relevant services to get switchd up and running with that configuration.
Follow these steps to perform ZTP using a USB drive:
Copy the installation image to the USB drive.
The ztp process searches the root filesystem of the newly mounted drive for filenames matching an ONIE-style waterfall (see the patterns and examples above), looking for the most specific name first, and ending at the most generic.
ZTP parses the contents of the script to ensure it contains the CUMULUS-AUTOPROVISIONING flag (see example scripts).
The USB drive mounts to a temporary directory under /tmp (for example, /tmp/tmpigGgjf/). To reference files on the USB drive, use the environment variable ZTP_USB_MOUNTPOINT to refer to the USB root partition.
ZTP Over DHCP
If the ztp process does not discover a local ONIE script or applicable USB drive, it checks DHCP every ten seconds for up to five minutes for the presence of a ZTP URL specified in /var/run/ztp.dhcp. The URL can be any of HTTP, HTTPS, FTP, or TFTP.
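The polling behavior described above (check every ten seconds, give up after five minutes) follows a common retry pattern, sketched here. The function name is hypothetical, and the sketch assumes that the file simply contains the URL once DHCP provides it; the real format of /var/run/ztp.dhcp might differ.

```shell
# Poll for a non-empty file and print its contents when it appears.
# Usage: wait_for_ztp_url <file> [interval-seconds] [timeout-seconds]
wait_for_ztp_url() {
    local file="$1" interval="${2:-10}" timeout="${3:-300}" waited=0
    while :; do
        if [ -s "$file" ]; then
            cat "$file"                    # assumption: the file holds the URL
            return 0
        fi
        # Stop after the timeout elapses without finding the file.
        [ "$waited" -ge "$timeout" ] && return 1
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

With the defaults, wait_for_ztp_url /var/run/ztp.dhcp checks up to 31 times over five minutes before returning a non-zero status.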
For ZTP using DHCP, provisioning initially takes place over the management network and initiates through a DHCP hook. A DHCP option specifies a configuration script. The ZTP process requests this script from the Web server and the script executes locally.
The ZTP process over DHCP follows these steps:
The first time you boot Cumulus Linux, eth0 makes a DHCP request. By default, Cumulus Linux sends DHCP option 60 (the vendor class identifier) with the value cumulus-linux x86_64 to identify itself to the DHCP server.
The DHCP server offers a lease to the switch.
If option 239 is in the response, the ZTP process starts.
The ZTP process requests the contents of the script from the URL, sending additional HTTP headers containing details about the switch.
ZTP parses the contents of the script to ensure it contains the CUMULUS-AUTOPROVISIONING flag (see example scripts).
If provisioning is necessary, the script executes locally on the switch with root privileges.
ZTP examines the return code of the script. If the return code is 0, ZTP marks the provisioning state as complete in the autoprovisioning configuration file.
Trigger ZTP Over DHCP
If you have not yet provisioned the switch, you can trigger the ZTP process over DHCP when eth0 uses DHCP and one of the following events occurs:
The switch boots.
You plug a cable into or unplug a cable from the eth0 port.
You disconnect, then reconnect the switch power cord.
You can also run the ztp --run <URL> command, where the URL is the path to the ZTP script.
Configure the DHCP Server
During the DHCP process over eth0, Cumulus Linux requests DHCP option 239. This option specifies the custom provisioning script.
For example, the /etc/dhcp/dhcpd.conf file for an ISC DHCP server looks like:
Do not use an underscore (_) in the hostname; underscores are not permitted in hostnames.
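As an illustration of the DHCP server side, the following sketch prints an ISC dhcpd.conf fragment that defines option 239 as a text option and hands out a provisioning URL. The option label cumulus-provision-url is an arbitrary name chosen for this sketch (ISC DHCP lets you name custom options freely), and the function names, subnet, range, and URL are placeholder assumptions. The small valid_hostname check reflects the underscore restriction noted above.

```shell
# Print a dhcpd.conf fragment for ZTP (placeholder values, adjust to your network).
# Usage: render_dhcpd_conf <subnet> <netmask> <range-start> <range-end> <script-url>
render_dhcpd_conf() {
    local subnet="$1" netmask="$2" range_start="$3" range_end="$4" url="$5"
    cat <<EOF
option cumulus-provision-url code 239 = text;

subnet ${subnet} netmask ${netmask} {
  range ${range_start} ${range_end};
  option cumulus-provision-url "${url}";
}
EOF
}

# Return 0 if the hostname contains no underscore, 1 otherwise.
valid_hostname() {
    case "$1" in
        *_*) return 1 ;;
        *)   return 0 ;;
    esac
}
```

For example, render_dhcpd_conf 192.0.2.0 255.255.255.0 192.0.2.100 192.0.2.200 http://192.0.2.1/demo.sh prints a fragment that you can review before merging it into /etc/dhcp/dhcpd.conf on the DHCP server.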
DHCP on Front Panel Ports
ZTP runs DHCP on all the front panel switch ports and on any active interface. ZTP assesses the list of active ports on every retry cycle. When it receives the DHCP lease and option 239 is present in the response, ZTP starts to execute the script.
Inspect HTTP Headers
ZTP sends the following HTTP headers in the request to the web server when it retrieves the provisioning script:
Header                  Value               Example
------                  -----               -------
User-Agent                                  CumulusLinux-AutoProvision/0.4
CUMULUS-ARCH            CPU architecture    x86_64
CUMULUS-BUILD                               5.1.0
CUMULUS-MANUFACTURER                        odm
CUMULUS-PRODUCTNAME                         switch_model
CUMULUS-SERIAL                              XYZ123004
CUMULUS-BASE-MAC                            44:38:39:FF:40:94
CUMULUS-MGMT-MAC                            44:38:39:FF:00:00
CUMULUS-VERSION                             5.1.0
CUMULUS-PROV-COUNT                          0
CUMULUS-PROV-MAX                            32
Write ZTP Scripts
You must include the following line in any of the supported scripts that you expect to run using the autoprovisioning framework.
# CUMULUS-AUTOPROVISIONING
The script must contain the CUMULUS-AUTOPROVISIONING flag. You can include this flag in a comment or remark; you do not need to echo or write the flag to stdout.
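Before you publish a script, a quick check for the flag can save a failed provisioning run. The following is a minimal sketch; the function name has_ztp_marker is hypothetical, and the check only looks for the string anywhere in the file, which is an assumption about how loosely the flag can be placed.

```shell
# Return 0 if the file contains the CUMULUS-AUTOPROVISIONING flag, 1 otherwise.
# Usage: has_ztp_marker <script-file>
has_ztp_marker() {
    grep -q 'CUMULUS-AUTOPROVISIONING' "$1"
}
```

For example, has_ztp_marker ./ztp.sh || echo "flag missing" reports a script that ZTP would not accept.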
You can write the script in any language that Cumulus Linux supports, such as:
Perl
Python
Ruby
Shell
The script must return an exit code of 0 upon success to mark the process as complete in the autoprovisioning configuration file.
The following script installs Cumulus Linux from a USB drive and applies a configuration:
#!/bin/bash
function error() {
    echo -e "\e[0;33mERROR: The ZTP script failed while running the command $BASH_COMMAND at line $BASH_LINENO.\e[0m" >&2
    exit 1
}
# Log all output from this script
exec >> /var/log/autoprovision 2>&1
date "+%FT%T ztp starting script $0"
trap error ERR
#Add Debian Repositories
echo "deb http://http.us.debian.org/debian buster main" >> /etc/apt/sources.list
echo "deb http://security.debian.org/ buster/updates main" >> /etc/apt/sources.list
#Update Package Cache
apt-get update -y
#Load interface config from usb
cp "${ZTP_USB_MOUNTPOINT}/interfaces" /etc/network/interfaces
#Load port config from usb
# (if breakout cables are used for certain interfaces)
cp "${ZTP_USB_MOUNTPOINT}/ports.conf" /etc/cumulus/ports.conf
#Reload interfaces to apply loaded config
ifreload -a
# CUMULUS-AUTOPROVISIONING
exit 0
Continue Provisioning
Typically ZTP exits after executing the script locally and does not continue. To continue with provisioning so that you do not have to intervene manually or embed an Ansible callback into the script, you can add the CUMULUS-AUTOPROVISION-CASCADE directive.
Best Practices
ZTP scripts come in different forms and frequently perform the same tasks. Because Bash is the most common language for ZTP scripts, use the following Bash snippets to perform common tasks with robust error checking.
Set the Default Cumulus User Password
The default cumulus user account password is cumulus. When you log into Cumulus Linux for the first time, you must provide a new password for the cumulus account, then log back into the system.
Add the following function to your ZTP script to change the default cumulus user account password to a clear-text password. The example changes the password cumulus to MyP4$$word.
function set_password(){
    # Unexpire the cumulus account
    passwd -x 99999 cumulus
    # Set the password
    echo 'cumulus:MyP4$$word' | chpasswd
}
set_password
If you have an insecure management network, set the password with an encrypted hash instead of a clear-text password.
First, generate a sha-512 password hash with the following python commands. The example commands generate a sha-512 password hash for the password MyP4$$word.
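If Python is not convenient, one alternative is to generate the SHA-512 crypt hash with OpenSSL. This is a swapped-in technique, not the method this guide describes, and it assumes a workstation with OpenSSL 1.1.1 or later. The function name hash_password and the example salt are hypothetical. Keep the password in single quotes so that the shell does not expand $$.

```shell
# Print a SHA-512 crypt hash ($6$...) for a clear-text password.
# Usage: hash_password <password> [salt]
hash_password() {
    if [ -n "$2" ]; then
        # A fixed salt makes the result reproducible (useful for testing only).
        printf '%s\n' "$1" | openssl passwd -6 -salt "$2" -stdin
    else
        # Without -salt, openssl picks a random salt.
        printf '%s\n' "$1" | openssl passwd -6 -stdin
    fi
}
```

For example, hash_password 'MyP4$$word' prints a string beginning with $6$ that you can pass to usermod -p.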
Then, add the following function to the ZTP script to change the default cumulus user account password:
function set_password(){
    # Unexpire the cumulus account
    passwd -x 99999 cumulus
    # Set the password
    usermod -p '$6$hs7OPmnrfvLNKfoZ$iB3hy5N6Vv6koqDmxixpTO6lej6VaoKGvs5E8p5zNo4tPec0KKqyQnrFMII3jGxVEYWntG9e7Z7DORdylG5aR/' cumulus
}
set_password
Test DNS Name Resolution
DNS names are frequently used in ZTP scripts. The ping_until_reachable function tests that each DNS name resolves into a reachable IP address. Call this function with each DNS target used in your script before you use the DNS name elsewhere in your script.
The following example shows how to call the ping_until_reachable function in the context of a larger task.
function ping_until_reachable(){
    last_code=1
    max_tries=30
    tries=0
    # Retry until a ping succeeds or the attempt limit is reached
    while [ "0" != "$last_code" ] && [ "$tries" -lt "$max_tries" ]; do
        tries=$((tries+1))
        echo "$(date) INFO: ( Attempt $tries of $max_tries ) Pinging $1 Target Until Reachable."
        ping -c2 "$1" &> /dev/null
        last_code=$?
        sleep 1
    done
    if [ "$tries" -eq "$max_tries" ] && [ "$last_code" -ne "0" ]; then
        echo "$(date) ERROR: Reached maximum number of attempts to ping the target $1 ."
        exit 1
    fi
}
Check the Cumulus Linux Release
The following script segment demonstrates how to check which Cumulus Linux release is running and upgrade the node if the release is not the target release. If the release is the target release, normal ZTP tasks execute. This script calls the ping_until_reachable function (described above) to make sure the server holding the image and the ZTP script is reachable.
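A minimal sketch of the release check might look like the following. It assumes that /etc/lsb-release contains a line of the form DISTRIB_RELEASE=<release>; the function names and the target release value are hypothetical. The image installation step is left as a comment because the installer command and its options depend on your environment.

```shell
# Print the running release (assumes a DISTRIB_RELEASE=... line in the file).
# Usage: current_release [lsb-release-file]
current_release() {
    sed -n 's/^DISTRIB_RELEASE=//p' "${1:-/etc/lsb-release}" | tr -d '"'
}

# Return 0 if the current release differs from the target release.
# Usage: needs_upgrade <target-release> <current-release>
needs_upgrade() {
    [ "$1" != "$2" ]
}

# Example flow (commented out so that loading these functions has no side effects):
#   TARGET=5.5.0
#   if needs_upgrade "$TARGET" "$(current_release)"; then
#       ping_until_reachable "$IMAGE_SERVER_HOSTNAME"
#       # install the target image here, then reboot
#   else
#       # run the normal ZTP tasks here
#   fi
```

Comparing with a plain string inequality means that any mismatch, including a newer running release, triggers the upgrade path; adjust the comparison if you only want to move forward.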
If you apply a management VRF in your script, either apply it last or reboot instead. If you do not apply a management VRF last, you need to prepend any commands that require eth0 to communicate out with /usr/bin/ip vrf exec mgmt; for example, /usr/bin/ip vrf exec mgmt apt-get update -y.
Perform Ansible Provisioning Callbacks
After initially configuring a node with ZTP, use Provisioning Callbacks to inform Ansible Tower or AWX that the node is ready for more detailed provisioning. The following example demonstrates how to use a provisioning callback:
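As a hedged sketch, a provisioning callback is an HTTP POST to the callback endpoint of a job template, carrying the host configuration key of the template. The host name, job template ID, and key are placeholders, and the function names are hypothetical; check the endpoint path against your Ansible Tower or AWX version.

```shell
# Build the callback URL for a job template.
# Usage: callback_url <tower-host> <job-template-id>
callback_url() {
    printf 'https://%s/api/v2/job_templates/%s/callback/\n' "$1" "$2"
}

# Build the JSON body (assumes the key contains no quotes or backslashes).
# Usage: callback_payload <host-config-key>
callback_payload() {
    printf '{"host_config_key": "%s"}\n' "$1"
}

# Send the callback. Returns non-zero if the server rejects the request.
# Usage: request_provisioning <tower-host> <job-template-id> <host-config-key>
request_provisioning() {
    curl --fail --silent --show-error \
        -H 'Content-Type: application/json' \
        -X POST --data "$(callback_payload "$3")" \
        "$(callback_url "$1" "$2")"
}
```

Calling request_provisioning near the end of the ZTP script, after the management interface is up, lets the automation platform take over the detailed configuration.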
Make sure to disable the DHCP hostname override setting in your script.
function set_hostname(){
    # Remove DHCP Setting of Hostname
    sed -i 's/SETHOSTNAME="yes"/SETHOSTNAME="no"/g' /etc/dhcp/dhclient-exit-hooks.d/dhcp-sethostname
    hostnamectl set-hostname "$1"
}
Test ZTP Scripts
Use these commands to test and debug your ZTP scripts.
You can use verbose mode to debug your script and see where your script fails. Include the -v option when you run ZTP:
cumulus@switch:~$ sudo ztp -v -r http://192.0.2.1/demo.sh
Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
Broadcast message from root@dell-s6010-01 (ttyS0) (Tue May 10 22:44:17 2016):
ZTP: Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
ZTP Manual: URL response code 200
ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
ZTP Manual: Executing http://192.0.2.1/demo.sh
error: ZTP Manual: Payload returned code 1
error: Script returned failure
To see results of the most recent ZTP execution, you can run the ztp -s command.
cumulus@switch:~$ ztp -s
ZTP INFO:
State enabled
Version 1.0
Result Script Failure
Date Mon 20 May 2019 09:31:27 PM UTC
Method ZTP DHCP
URL http://192.0.2.1/demo.sh
If ZTP runs when the switch boots and not manually, you can run systemctl -l status ztp.service, then journalctl -l -u ztp.service, to see if any failures occurred:
cumulus@switch:~$ sudo systemctl -l status ztp.service
● ztp.service - Cumulus Linux ZTP
Loaded: loaded (/lib/systemd/system/ztp.service; enabled)
Active: failed (Result: exit-code) since Wed 2016-05-11 16:38:45 UTC; 1min 47s ago
Docs: man:ztp(8)
Process: 400 ExecStart=/usr/sbin/ztp -b (code=exited, status=1/FAILURE)
Main PID: 400 (code=exited, status=1/FAILURE)
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Device not found
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: URL response code 200
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Payload returned code 1
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Script returned failure
May 11 16:38:45 dell-s6010-01 systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
May 11 16:38:45 dell-s6010-01 systemd[1]: Unit ztp.service entered failed state.
cumulus@switch:~$
cumulus@switch:~$ sudo journalctl -l -u ztp.service --no-pager
-- Logs begin at Wed 2016-05-11 16:37:42 UTC, end at Wed 2016-05-11 16:40:39 UTC. --
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/lib/cumulus/ztp: State Directory does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/run/ztp.lock: Lock File does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: /var/lib/cumulus/ztp/ztp_state.log: State File does not exist. Creating it...
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Looking for ZTP local Script
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220-rUNKNOWN
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Looking for unmounted USB devices
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Parsing partitions
May 11 16:37:45 cumulus ztp[400]: ztp [400]: ZTP USB: Device not found
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: URL response code 200
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: ZTP DHCP: Payload returned code 1
May 11 16:38:45 dell-s6010-01 ztp[400]: ztp [400]: Script returned failure
May 11 16:38:45 dell-s6010-01 systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
May 11 16:38:45 dell-s6010-01 systemd[1]: Unit ztp.service entered failed state.
Instead of running journalctl, you can see the log history by running:
cumulus@switch:~$ cat /var/log/syslog | grep ztp
2016-05-11T16:37:45.132583+00:00 cumulus ztp [400]: /var/lib/cumulus/ztp: State Directory does not exist. Creating it...
2016-05-11T16:37:45.134081+00:00 cumulus ztp [400]: /var/run/ztp.lock: Lock File does not exist. Creating it...
2016-05-11T16:37:45.135360+00:00 cumulus ztp [400]: /var/lib/cumulus/ztp/ztp_state.log: State File does not exist. Creating it...
2016-05-11T16:37:45.185598+00:00 cumulus ztp [400]: ZTP LOCAL: Looking for ZTP local Script
2016-05-11T16:37:45.485084+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220-rUNKNOWN
2016-05-11T16:37:45.486394+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell_s6010_s1220
2016-05-11T16:37:45.488385+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64-dell
2016-05-11T16:37:45.489665+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp-x86_64
2016-05-11T16:37:45.490854+00:00 cumulus ztp [400]: ZTP LOCAL: Waterfall search for /var/lib/cumulus/ztp/cumulus-ztp
2016-05-11T16:37:45.492296+00:00 cumulus ztp [400]: ZTP USB: Looking for unmounted USB devices
2016-05-11T16:37:45.493525+00:00 cumulus ztp [400]: ZTP USB: Parsing partitions
2016-05-11T16:37:45.636422+00:00 cumulus ztp [400]: ZTP USB: Device not found
2016-05-11T16:38:43.372857+00:00 cumulus ztp [1805]: Found ZTP DHCP Request
2016-05-11T16:38:45.696562+00:00 cumulus ztp [400]: ZTP DHCP: Looking for ZTP Script provided by DHCP
2016-05-11T16:38:45.698598+00:00 cumulus ztp [400]: Attempting to provision via ZTP DHCP from http://192.0.2.1/demo.sh
2016-05-11T16:38:45.816275+00:00 cumulus ztp [400]: ZTP DHCP: URL response code 200
2016-05-11T16:38:45.817446+00:00 cumulus ztp [400]: ZTP DHCP: Found Marker CUMULUS-AUTOPROVISIONING
2016-05-11T16:38:45.818402+00:00 cumulus ztp [400]: ZTP DHCP: Executing http://192.0.2.1/demo.sh
2016-05-11T16:38:45.834240+00:00 cumulus ztp [400]: ZTP DHCP: Payload returned code 1
2016-05-11T16:38:45.835488+00:00 cumulus ztp [400]: Script returned failure
2016-05-11T16:38:45.876334+00:00 cumulus systemd[1]: ztp.service: main process exited, code=exited, status=1/FAILURE
2016-05-11T16:38:45.879410+00:00 cumulus systemd[1]: Unit ztp.service entered failed state.
If you see that the issue is a script failure, you can modify the script and then run ZTP manually using ztp -v -r <URL/path to that script>, as above.
cumulus@switch:~$ sudo ztp -v -r http://192.0.2.1/demo.sh
Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
Broadcast message from root@dell-s6010-01 (ttyS0) (Tue May 10 22:44:17 2019):
ZTP: Attempting to provision via ZTP Manual from http://192.0.2.1/demo.sh
ZTP Manual: URL response code 200
ZTP Manual: Found Marker CUMULUS-AUTOPROVISIONING
ZTP Manual: Executing http://192.0.2.1/demo.sh
error: ZTP Manual: Payload returned code 1
error: Script returned failure
cumulus@switch:~$ sudo ztp -s
State enabled
Version 1.0
Result Script Failure
Date Mon 20 May 2019 09:31:27 PM UTC
Method ZTP Manual
URL http://192.0.2.1/demo.sh
Use the following command to check syslog for information about ZTP:
Errors in syslog for ZTP like those shown above often occur if you create or edit the script on a Windows machine. Check to make sure that the \r\n characters are not present in the end-of-line encodings.
Use the cat -v ztp.sh command to view the contents of the script and search for any hidden characters.
root@oob-mgmt-server:/var/www/html# cat -v ./ztp_oob_windows.sh
#!/bin/bash^M
^M
###################^M
# ZTP Script^M
###################^M
^M
/usr/cumulus/bin/cl-license -i http://192.168.0.254/license.txt^M
^M
# Clean method of performing a Reboot^M
nohup bash -c 'sleep 2; shutdown now -r "Rebooting to Complete ZTP"' &^M
^M
exit 0^M
^M
# The line below is required to be a valid ZTP script^M
#CUMULUS-AUTOPROVISIONING^M
root@oob-mgmt-server:/var/www/html#
The ^M characters in the cat -v output of your ZTP script, as shown above, indicate the presence of Windows end-of-line encodings that you need to remove.
Use the translate (tr) command on any Linux system to remove the '\r' characters from the file.
root@oob-mgmt-server:/var/www/html# tr -d '\r' < ztp_oob_windows.sh > ztp_oob_unix.sh
root@oob-mgmt-server:/var/www/html# cat -v ./ztp_oob_unix.sh
#!/bin/bash
###################
# ZTP Script
###################
/usr/cumulus/bin/cl-license -i http://192.168.0.254/license.txt
# Clean method of performing a Reboot
nohup bash -c 'sleep 2; shutdown now -r "Rebooting to Complete ZTP"' &
exit 0
# The line below is required to be a valid ZTP script
#CUMULUS-AUTOPROVISIONING
root@oob-mgmt-server:/var/www/html#
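To automate the same check, for example on the web server that hosts your scripts, the following sketch detects carriage returns and writes a cleaned copy with the tr command shown above. The function names are hypothetical.

```shell
# Return 0 if the file contains at least one carriage return (\r).
# Usage: has_crlf <file>
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of <input> to <output> with all carriage returns removed.
# Usage: strip_crlf <input> <output>
strip_crlf() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, has_crlf ztp_oob_windows.sh && strip_crlf ztp_oob_windows.sh ztp_oob_unix.sh converts the file only when it needs it.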
Manually Use the ztp Command
To enable ZTP, use the -e option:
cumulus@switch:~$ sudo ztp -e
When you enable ZTP, it tries to run the next time the switch boots. However, if ZTP already ran on a previous boot up or if there is a manual configuration, ZTP exits without trying to look for a script.
ZTP checks for these manual configurations when the switch boots:
Password changes
Users and groups changes
Packages changes
Interfaces changes
When the switch boots for the first time, ZTP records the state of important files that can update after you configure the switch. After a reboot, ZTP compares the recorded state to the current state of these files. If they do not match, ZTP considers the switch as already provisioned and exits. ZTP only deletes these files after a reset.
To reset ZTP to its original state, use the -R option. This removes the ztp directory and ZTP runs the next time the switch reboots.
cumulus@switch:~$ sudo ztp -R
To disable ZTP, use the -d option:
cumulus@switch:~$ sudo ztp -d
To force provisioning to occur and ignore the status listed in the configuration file, use the -r option:
cumulus@switch:~$ sudo ztp -r cumulus-ztp.sh
To see the current ZTP state, use the -s option:
cumulus@switch:~$ sudo ztp -s
ZTP INFO:
State disabled
Version 1.0
Result success
Date Mon May 20 21:51:04 2019 UTC
Method Switch manually configured
URL None
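When you check many switches, it can help to pull a single field out of the ztp -s output. The sketch below assumes the layout shown in the sample output above, where each line starts with the field name followed by its value; the function name ztp_field is hypothetical.

```shell
# Read `ztp -s` output on stdin and print the value of one field.
# Usage: sudo ztp -s | ztp_field <field-name>
ztp_field() {
    awk -v key="$1" '
        $1 == key {
            $1 = ""                 # drop the field name
            sub(/^[ \t]+/, "")      # trim the separator left behind
            print
            exit
        }'
}
```

For example, sudo ztp -s | ztp_field Result prints only the result of the most recent ZTP run.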
Considerations
While you are writing a provisioning script, you sometimes need to reboot the switch.
You can use the Cumulus Linux onie-select -i command to reprovision the switch and install a network operating system again using ONIE.
System Configuration
This section describes how to configure the following system settings:
NVUE is an object-oriented, schema-driven model of a complete Cumulus Linux system (hardware and software) providing a robust API that allows for multiple interfaces to both view (show) and configure (set and unset) any element within a system running the NVUE software.
NVUE Object Model
The NVUE object model definition uses the OpenAPI specification (OAS). Similar to YANG (RFC 6020 and RFC 7950), OAS is a data definition, manipulation, and modeling language (DML) that lets you build model-driven interfaces for both humans and machines. Although the computer networking and telecommunications industry commonly uses YANG (standardized by IETF) as a DML, the adoption of OpenAPI is broader, spanning cloud to compute to storage to IoT and even social media. The OpenAPI Initiative (OAI) consortium, a chartered project under the Linux Foundation, leads OpenAPI standardization.
The OAS schema forms the management plane model with which you configure, monitor, and manage the Cumulus Linux switch. The v3.0.2 version of OAS defines the NVUE data model.
Like other systems that use OpenAPI, the NVUE OAS schema defines the endpoints (paths) exposed as RESTful APIs. With these REST APIs, you can perform various create, retrieve, update, delete, and eXecute (CRUDX) operations. The OAS schema also describes the API inputs and outputs (data models).
You can use the NVUE object model in these two ways:
Through the NVUE REST API, where you run the GET, PATCH, DELETE, and other REST APIs on the NVUE object model endpoints to configure, monitor, and manage the switch. Because of the large user community and maturity of OAS, you can use several popular tools and libraries to create client-side bindings to use the NVUE REST API.
Through the NVUE CLI, where you configure, monitor and manage the Cumulus Linux network elements. The CLI commands translate to their equivalent REST APIs, which Cumulus Linux then runs on the NVUE object model.
The CLI and the REST API are equivalent in functionality; you can run all management operations from the REST API or the CLI. The NVUE object model drives both the REST API and the CLI management operations. All operations are consistent; for example, the CLI nv show commands reflect any PATCH operation (create) you run through the REST API.
NVUE follows a declarative model, removing context-specific commands and settings. It is structured as a big tree that represents the entire state of a Cumulus Linux instance. At the base of the tree are high level branches representing objects, such as router and interface. Under each of these branches are further branches. As you navigate through the tree, you gain a more specific context. At the leaves of the tree are actual attributes, represented as key-value pairs. The path through the tree is similar to a filesystem path.
Cumulus Linux installs NVUE by default and enables the NVUE service nvued.
NVUE CLI
The NVUE CLI has a flat structure instead of a modal structure. Therefore, you can run all commands from the primary prompt instead of only in a specific mode.
You can choose to configure Cumulus Linux either with NVUE commands or Linux commands (with vtysh or by manually editing configuration files). Do not run both NVUE configuration commands (such as nv set, nv unset, nv action, and nv config) and Linux commands to configure the switch. NVUE commands replace the configuration in files such as /etc/network/interfaces and /etc/frr/frr.conf, and remove any configuration you add manually or with automation tools like Ansible, Chef, or Puppet.
If you choose to configure Cumulus Linux with NVUE, you can configure features that do not yet support the NVUE Object Model by creating snippets. See NVUE Snippets.
Command Syntax
NVUE commands all begin with nv and fall into one of four syntax categories:
Configuration (nv set and nv unset)
Monitoring (nv show)
Configuration management (nv config)
Action commands (nv action)
Command Completion
As you enter commands, you can get help with the valid keywords or options using the tab key. For example, using tab completion with nv set displays the possible options for the command and returns you to the command prompt to complete the command.
cumulus@switch:~$ nv set <<press tab>>
acl evpn mlag qos service vrf
bridge interface nve router system
cumulus@switch:~$ nv set
Command Question Mark
You can type a question mark (?) after a command to display required information quickly and concisely. When you type ?, NVUE specifies the value type, range, and options with a brief description of each; for example:
cumulus@switch:~$ nv set interface swp1 link state ?
[Enter]
down The interface is not ready
up The interface is ready
cumulus@switch:~$ nv set interface swp1 link mtu ?
<arg> (integer:552 - 9216)
cumulus@switch:~$ nv set interface swp1 link speed ?
<arg> (string | enum:10M,100M,1G,10G,25G,40G,50G,100G,200G,400G,800G,auto)
NVUE also indicates if you need to provide specific values for the command:
NVUE supports command abbreviation, where you can type a certain number of characters instead of a whole command to speed up CLI interaction. For example, instead of typing nv show interface, you can type nv sh int.
If the command you type is ambiguous, NVUE shows the reason for the ambiguity so that you can correct the shortcut. For example:
cumulus@switch:~$ nv s i
Ambiguous Command:
set interface
show interface
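The idea behind command abbreviation is unique-prefix matching: a shortcut is valid only if exactly one keyword starts with it. The following is a generic sketch of that idea, not the NVUE implementation; the function name expand_abbrev and its return codes are assumptions.

```shell
# Expand a typed prefix against a list of valid keywords.
# Prints the keyword on a unique match; returns 2 if ambiguous, 1 if unknown.
# Usage: expand_abbrev <prefix> <keyword>...
expand_abbrev() {
    local prefix="$1" matches="" count=0 word
    shift
    for word in "$@"; do
        # An exact match always wins, even if it is a prefix of another keyword.
        if [ "$word" = "$prefix" ]; then
            echo "$word"
            return 0
        fi
        case "$word" in
            "$prefix"*)
                matches="$matches $word"
                count=$((count + 1))
                ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        echo "${matches# }"
    elif [ "$count" -gt 1 ]; then
        echo "ambiguous:${matches}" >&2
        return 2
    else
        echo "unknown: $prefix" >&2
        return 1
    fi
}
```

With the keywords set, unset, show, config, and action, the prefix sh expands to show, while s matches both set and show and is reported as ambiguous, which matches the nv s i example above.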
Command Help
As you enter commands, you can get help with command syntax by entering -h or --help at various points within a command entry. For example, to examine the options available for nv set interface, enter nv set interface -h or nv set interface --help.
cumulus@switch:~$ nv set interface -h
usage:
nv [options] set interface <interface-id>
Description:
interface Update all interfaces
Identifiers:
<interface-id> Interface (interface-name)
Output Options:
-o <format>, --output <format>
Supported formats: json, yaml, auto, constable, end-table, commands (default:auto)
--color (on|off|auto)
Toggle coloring of output (default: auto)
--paginate (on|off|auto)
Whether to send output to a pager (default: off)
General Options:
-h, --help Show help.
Command List
You can list all the NVUE commands by running nv list-commands. See List All NVUE Commands below.
Command History
At the command prompt, press the Up Arrow and Down Arrow keys to move back and forth through the list of commands you entered previously. When you find the command you want to use, you can run the command by pressing Enter. You can also modify the command before you run it.
Command Categories
The NVUE CLI has a flat structure; however, the commands are in four functional categories:
Configuration
Monitoring
Configuration Management
Action
Configuration Commands
The NVUE configuration commands modify switch configuration. You can set and unset configuration options.
The nv set and nv unset commands are in the following categories. Each command group includes subcommands. Use command completion (press the tab key) to list the subcommands.
Command Group
Description
nv set acl nv unset acl
Configures ACLs.
nv set bridge nv unset bridge
Configures a bridge domain. This is where you configure bridge attributes, such as the bridge type (VLAN-aware), the STP state and priority, and VLANs.
nv set evpn nv unset evpn
Configures EVPN. This is where you enable and disable the EVPN control plane, and set EVPN route advertise, multihoming, and duplicate address detection options.
nv set interface <interface-id> nv unset interface <interface-id>
Configures the switch interfaces. Use this command to configure bond and bridge interfaces, interface IP addresses and descriptions, VLAN IDs, and links (MTU, FEC, speed, duplex, and so on).
nv set mlag nv unset mlag
Configures MLAG. This is where you configure the backup IP address or interface, MLAG system MAC address, peer IP address, MLAG priority, and the delay before bonds come up.
nv set nve nv unset nve
Configures network virtualization (VXLAN) settings. This is where you configure the UDP port for VXLAN frames, control dynamic MAC learning over VXLAN tunnels, enable and disable ARP and ND suppression, and configure how Cumulus Linux handles BUM traffic in the overlay.
nv set qos nv unset qos
Configures QoS RoCE.
nv set router nv unset router
Configures router policies (prefix list rules and route maps), global BGP options (enable and disable, ASN and router ID, BGP graceful restart and shutdown), global OSPF options (enable and disable, router ID, and OSPF timers), PIM, IGMP, PBR, VRR, and VRRP.
nv set service nv unset service
Configures DHCP relays and servers, NTP, PTP, LLDP, SNMP servers, DNS, and syslog.
nv set system nv unset system
Configures system settings, such as the hostname of the switch, pre-login and post-login messages, reboot options (warm, cold, fast), and the time zone, and global system settings, such as the anycast ID, the system MAC address, and the anycast MAC address. This is also where you configure SPAN and ERSPAN sessions and set how configuration apply operations work (which files to ignore and which files to overwrite; see Configure NVUE to Ignore Linux Files).
nv set vrf <vrf-id> nv unset vrf <vrf-id>
Configures VRFs. This is where you configure VRF-level configuration for PTP, BGP, OSPF, and EVPN.
Monitoring Commands
The NVUE monitoring commands show various parts of the network configuration. For example, you can show the complete network configuration or only interface configuration. The monitoring commands are in the following categories. Each command group includes subcommands. Use command completion (press the tab key) to list the subcommands.
Command Group
Description
nv show acl
Shows ACL configuration.
nv show action
Shows information about the action commands that reset counters and remove conflicts.
nv show bridge
Shows bridge domain configuration.
nv show evpn
Shows EVPN configuration.
nv show interface
Shows interface configuration and counters.
nv show mlag
Shows MLAG configuration.
nv show nve
Shows network virtualization configuration, such as VXLAN-specific MLAG configuration and VXLAN flooding.
nv show platform
Shows platform configuration, such as hardware and software components.
nv show qos
Shows QoS RoCE configuration.
nv show router
Shows router configuration, such as router policies, global BGP and OSPF configuration, PBR, PIM, IGMP, VRR, and VRRP configuration.
nv show service
Shows DHCP relays and server, NTP, PTP, LLDP, and syslog configuration.
nv show system
Shows global system settings, such as the reserved routing table range for PBR and the reserved VLAN range for layer 3 VNIs. You can also see system login messages and switch reboot history.
nv show vrf
Shows VRF configuration.
The following example shows the nv show router commands after pressing the tab key, then shows the output of the nv show router bgp command.
cumulus@leaf01:mgmt:~$ nv show router <<tab>>
adaptive-routing igmp ospf pim ptm vrrp
bgp nexthop-group pbr policy vrr
cumulus@leaf01:mgmt:~$ nv show router bgp
operational applied pending description
------------------------------ ----------- ------- ----------- ----------------------------------------------------------------------
enable off on Turn the feature 'on' or 'off'. The default is 'off'.
autonomous-system none ASN for all VRFs, if a single AS is in use. If "none", then ASN mu...
graceful-shutdown off Graceful shutdown enable will initiate the GSHUT community to be an...
policy-update-timer 5 Wait time in seconds before processing updates to policies to ensur...
router-id none BGP router-id for all VRFs, if a common one is used. If "none", th...
wait-for-install off bgp waits for routes to be installed into kernel/asic before advert...
convergence-wait
establish-wait-time 0 Maximum time to wait to establish BGP sessions. Any peers which do...
time 0 Time to wait for peers to send end-of-RIB before router performs pa...
graceful-restart
mode helper-only Role of router during graceful restart. helper-only, router is in h...
path-selection-deferral-time 360 Used by the restarter as an upper-bounds for waiting for peering es...
restart-time 120 Amount of time taken to restart by router. It is advertised to the...
stale-routes-time 360 Specifies an upper-bounds on how long we retain routes from a resta...
cumulus@leaf01:mgmt:~$
If there are no pending or applied configuration changes, the nv show command only shows the running configuration (under operational).
Additional options are available for the nv show commands. For example, you can choose the configuration you want to show (pending, applied, startup, or operational). You can also turn on colored output, and paginate specific output.
Option
Description
--view
Shows these different views: acl-statistics, brief, detail, lldp, mac, mlag-cc, pluggables, qos-profile, and small. This option is available for the nv show interface command only. For example, the nv show interface --view=small command shows a list of the interfaces on the switch and the nv show interface --view=brief command shows information about each interface on the switch, such as the interface type, speed, remote host and port. The nv show interface --view=mac command shows the MAC address of each interface and the nv show interface --view=qos-profile command shows the QoS profile for the interfaces on the switch. Note: The description column only shows in the output when you use the --view=detail option.
--rev <revision>
Shows a detached pending configuration. See the nv config detach configuration management command below. For example, nv show --rev 1. You can also show only applied or only operational information in the nv show output. For example, to show only the applied settings for swp1 configuration, run the nv show interface swp1 --rev=applied command. To show only the operational settings for swp1 configuration, run the nv show interface swp1 --rev=operational command.
--applied
Shows configuration applied with the nv config apply command. For example, nv show --applied interface bond1.
--operational
Shows the running configuration (the actual system state). For example, nv show --operational interface bond1 shows the running configuration for bond1. The running and applied configuration should be the same. If different, inspect the logs.
--pending
Shows the last applied configuration and any pending set or unset configuration that you have not yet applied. For example, nv show --pending interface bond1.
--startup
Shows configuration saved with the nv config save command. This is the configuration after the switch boots.
--output
Shows command output in table (auto), json, or yaml format. For example: nv show --output auto interface bond1, nv show --output json interface bond1, or nv show --output yaml interface bond1.
--color
Turns colored output on or off. For example, nv show --color on interface bond1.
--paginate
Paginates the output. For example, nv show --paginate on interface bond1.
--help
Shows help for the NVUE commands.
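Because the --output option can emit JSON, you can post-process nv show output in a script. The following Python sketch is illustrative only. It builds the nv show --output json interface bond1 command line from the options table and reads values from a parsed JSON document. The helper names and the sample JSON (including the type, link, and mtu keys) are assumptions made for the demonstration and do not represent actual NVUE output; inspect the real output on your switch before you rely on specific keys.

```python
import json
import subprocess


def build_show_command(path, output="json"):
    """Return the argv list for an nv show command, e.g. path=["interface", "bond1"]."""
    if output not in ("auto", "json", "yaml"):
        raise ValueError(f"unsupported output format: {output}")
    return ["nv", "show", "--output", output, *path]


def run_show(path):
    """Run nv show on a switch and return the parsed JSON (requires the nv CLI)."""
    result = subprocess.run(
        build_show_command(path), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)


def get_nested(data, *keys, default=None):
    """Walk nested dictionaries; return default if any key is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# Hypothetical sample used only for the demonstration below.
SAMPLE = '{"type": "bond", "link": {"mtu": 9000}}'

if __name__ == "__main__":
    print(build_show_command(["interface", "bond1"]))
    print(get_nested(json.loads(SAMPLE), "link", "mtu"))
```

Keeping the command construction separate from the parsing lets you test the parsing logic on a workstation that does not have the nv CLI installed.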
The following example shows pending BGP graceful restart configuration:
cumulus@switch:~$ nv show router bgp graceful-restart --pending
4 description
---------------------------- ----------------- ----------------------------------------------------------------------
mode helper-only Role of router during graceful restart. helper-only, router is in h...
path-selection-deferral-time 360 Used by the restarter as an upper-bounds for waiting for peering es...
restart-time 120 Amount of time taken to restart by router. It is advertised to the...
stale-routes-time 360 Specifies an upper-bounds on how long we retain routes from a resta...
Net Show Commands
In addition to the nv show commands, Cumulus Linux continues to provide a subset of the NCLU net show commands. Use these commands to get additional views of various parts of your network configuration.
cumulus@leaf01:mgmt:~$ net show
bfd : Bidirectional forwarding detection
bgp : Border Gateway Protocol
bridge : a layer2 bridge
clag : Multi-Chassis Link Aggregation
commit : apply the commit buffer to the system
configuration : settings, configuration state, etc
counters : net show counters
debugs : Debugs
dhcp-snoop : DHCP snooping for IPv4
dhcp-snoop6 : DHCP snooping for IPv6
dot1x : Configure, Enable, Delete or Show IEEE 802.1X EAPOL
evpn : Ethernet VPN
hostname : local hostname
igmp : Internet Group Management Protocol
interface : An interface, such as swp1, swp2, etc.
ip : Internet Protocol version 4/6
ipv6 : Internet Protocol version 6
lldp : Link Layer Discovery Protocol
mpls : Multiprotocol Label Switching
mroute : Static unicast routes in MRIB for multicast RPF lookup
msdp : Multicast Source Discovery Protocol
neighbor : A BGP, OSPF, PIM, etc neighbor
ospf : Open Shortest Path First (OSPFv2)
ospf6 : Open Shortest Path First (OSPFv3)
package : A Cumulus Linux package name
pbr : Policy Based Routing
pim : Protocol Independent Multicast
port-mirror : port-mirror
port-security : Port security
ptp : Precision Time Protocol
roce : Enable RoCE on all interfaces, default mode is lossless
rollback : revert to a previous configuration state
route : EVPN route information
route-map : Route-map
snmp-server : Configure the SNMP server
system : System
time : Time
version : Version number
vrf : Virtual routing and forwarding
vrrp : Virtual Router Redundancy Protocol
Configuration Management Commands
The NVUE configuration management commands manage and apply configurations.
Command
Description
nv config apply
Applies the pending configuration to become the applied configuration. You can also use these prompt options:
--y or --assume-yes to automatically reply yes to all prompts.
--assume-no to automatically reply no to all prompts.
Cumulus Linux applies but does not save the configuration; the configuration does not persist after a reboot.
You can also use these apply options: --confirm applies the configuration change, but you must confirm the applied configuration. If you do not confirm within ten minutes, the configuration rolls back automatically. You can change the default time with the apply --confirm <time> command. For example, apply --confirm 60 requires you to confirm within one hour. --confirm-status shows the amount of time left before the automatic rollback. To save the pending configuration to the startup configuration automatically when you run nv config apply, so that you do not have to run the nv config save command, enable auto save.
nv config detach
Detaches the configuration from the current pending configuration and uses an integer to identify it; for example, 4. To list all the current detached pending configurations, run nv config diff <<press tab>>.
nv config diff <revision> <revision>
Shows differences between configurations, such as the pending configuration and the applied configuration, or the detached configuration and the pending configuration.
nv config history <revision>
Shows the apply history for the revision.
nv config patch <nvue-file>
Updates the pending configuration with the specified YAML configuration file.
nv config replace <nvue-file>
Replaces the pending configuration with the specified YAML configuration file.
nv config save
Overwrites the startup configuration with the applied configuration by writing to the /etc/nvue.d/startup.yaml file. The configuration persists after a reboot.
nv config show
Shows the currently applied configuration in yaml format. This command also shows NVUE version information.
nv config show -o commands
Shows the currently applied configuration commands.
nv config diff -o commands
Shows differences between two configuration revisions as a list of commands.
You can use the NVUE configuration management commands to back up and restore configuration when you upgrade Cumulus Linux on the switch. Refer to Upgrading Cumulus Linux.
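The --confirm option of nv config apply, described in the table above, starts a timer after which the configuration rolls back unless you confirm it. The following Python sketch shows the arithmetic only. It assumes the <time> value is in minutes, which is inferred from the statement that 60 corresponds to one hour; the function names are hypothetical and are not part of NVUE.

```python
from datetime import datetime, timedelta

DEFAULT_CONFIRM_MINUTES = 10  # default confirm window stated in this guide


def rollback_deadline(applied_at, confirm_minutes=DEFAULT_CONFIRM_MINUTES):
    """Return the time at which an unconfirmed apply rolls back."""
    if confirm_minutes <= 0:
        raise ValueError("confirm_minutes must be positive")
    return applied_at + timedelta(minutes=confirm_minutes)


def time_left(applied_at, now, confirm_minutes=DEFAULT_CONFIRM_MINUTES):
    """Return the time remaining before rollback, never less than zero."""
    remaining = rollback_deadline(applied_at, confirm_minutes) - now
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    applied = datetime(2023, 1, 1, 12, 0)
    # 45 minutes into a 60 minute confirm window
    print(time_left(applied, datetime(2023, 1, 1, 12, 45), confirm_minutes=60))
```

On a switch, the --confirm-status option reports the remaining time directly; the sketch is only a way to reason about the window when you plan a change.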
Action Commands
The NVUE action commands clear counters, and provide system reboot and TACACS user disconnect options.
The reboot action command reboots the switch in the configured restart mode (fast, cold, or warm). You must specify the no-confirm option with this command.
List All NVUE Commands
To show the full list of NVUE commands, run nv list-commands. For example:
cumulus@switch:~$ nv list-commands
nv show platform
nv show platform hardware
nv show platform hardware component
nv show platform hardware component <component-id>
nv show platform software
nv show platform software installed
nv show platform software installed <installed-id>
nv show platform capabilities
nv show platform environment
...
You can show the list of commands for a command grouping. For example, to show the list of interface commands:
cumulus@switch:~$ nv list-commands interface
nv show interface
nv show interface <interface-id>
nv show interface <interface-id> ip
nv show interface <interface-id> ip address
nv show interface <interface-id> ip address <ip-prefix-id>
nv show interface <interface-id> ip gateway
nv show interface <interface-id> ip gateway <ip-address-id>
...
Use the tab key to get help for the command lists you want to see. For example, to show the list of command options available for swp1, run the nv list-commands interface swp1 command and press the tab key:
cumulus@switch:~$ nv list-commands interface swp1 <<press tab>>
acl counters link ptp storm-control
bond evpn lldp qos synce
bridge ip pluggable router tunnel
NVUE Configuration File
When you save network configuration, NVUE writes the configuration to the /etc/nvue.d/startup.yaml file.
You can edit or replace the contents of the /etc/nvue.d/startup.yaml file. NVUE applies the configuration in the /etc/nvue.d/startup.yaml file during system boot only if the nvue-startup.service is running. If this service is not running, the switch reboots with the same configuration that was running before the reboot.
When you apply a configuration with nv config apply, NVUE also writes to underlying Linux files such as /etc/network/interfaces and /etc/frr/frr.conf. You can view these configuration files; however, do not manually edit them while using NVUE. If you need to configure certain network settings manually or use automation such as Ansible to configure the switch, see Configure NVUE to Ignore Linux Files below.
Configuration Files that NVUE Manages
NVUE manages the following configuration files:
File
Description
/etc/network/interfaces
Configures the network interfaces available on your system.
/etc/frr/frr.conf
Configures FRRouting.
/etc/cumulus/switchd.conf
Configures switchd options.
/etc/cumulus/switchd.d/ptp.conf
Configures PTP time stamping.
/etc/frr/daemons
Configures FRRouting services.
/etc/hosts
Configures the hostname of the switch.
/etc/default/isc-dhcp-relay-default
Configures DHCP relay options.
/etc/dhcp/dhclient-exit-hooks.d/dhcp-sethostname
Configures DHCP client options.
/etc/dhcp/dhcpd.conf
Configures DHCP server options.
/etc/hostname
Configures the hostname of the switch.
/etc/cumulus/datapath/qos/qos_features.conf
Configures QoS settings, such as traffic marking, shaping and flow control.
/etc/mlx/datapath/qos/qos_infra.conf
Configures QoS platform specific configurations, such as buffer allocations and Alpha values.
/etc/cumulus/switchd.d/qos.conf
Configures QoS settings.
/etc/cumulus/ports.conf
Configures port breakouts.
/etc/ntp.conf
Configures NTP settings.
/etc/ptp4l.conf
Configures PTP settings.
/etc/snmp/snmpd.conf
Configures SNMP settings.
When you configure the switch with NVUE commands, NVUE overwrites the settings in any file it manages. Do not run NVUE commands and manually edit the configuration files at the same time to configure the switch. Either configure the switch with NVUE commands only or manually edit the configuration files.
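If you use automation alongside NVUE, it can help to check whether a file is in the list above before a script edits it. The following Python sketch copies the file paths from the table in this section; the constant and function names are hypothetical, and the list reflects only what this section documents, so treat it as a starting point and not as an authoritative inventory.

```python
import os.path

# Paths copied from the "Configuration Files that NVUE Manages" table.
NVUE_MANAGED_FILES = frozenset({
    "/etc/network/interfaces",
    "/etc/frr/frr.conf",
    "/etc/cumulus/switchd.conf",
    "/etc/cumulus/switchd.d/ptp.conf",
    "/etc/frr/daemons",
    "/etc/hosts",
    "/etc/default/isc-dhcp-relay-default",
    "/etc/dhcp/dhclient-exit-hooks.d/dhcp-sethostname",
    "/etc/dhcp/dhcpd.conf",
    "/etc/hostname",
    "/etc/cumulus/datapath/qos/qos_features.conf",
    "/etc/mlx/datapath/qos/qos_infra.conf",
    "/etc/cumulus/switchd.d/qos.conf",
    "/etc/cumulus/ports.conf",
    "/etc/ntp.conf",
    "/etc/ptp4l.conf",
    "/etc/snmp/snmpd.conf",
})


def is_nvue_managed(path):
    """Return True if the normalized path is in the documented list."""
    return os.path.normpath(path) in NVUE_MANAGED_FILES


if __name__ == "__main__":
    for candidate in ("/etc/frr/frr.conf", "/etc/motd"):
        print(candidate, is_nvue_managed(candidate))
```

A script can call is_nvue_managed before it writes a file and skip or warn when the result is True.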
Search for a Specific Configuration
To search for a specific portion of the NVUE configuration, run the nv config find <search string> command. The search shows all items above and below the search string. For example, to search the entire NVUE object model configuration for any mention of ptm, run the nv config find ptm command.
Configure NVUE to Ignore Linux Files
You can configure NVUE to ignore certain underlying Linux files when applying configuration changes. For example, if you push certain configuration to the switch using Ansible and Jinja2 file templates or you want to use custom configuration for a particular service such as PTP, you can ensure that NVUE never writes to those configuration files.
The following example configures NVUE to ignore the Linux /etc/ptp4l.conf file when applying configuration changes and saves the configuration so it persists after a reboot.
cumulus@switch:~$ nv set system config apply ignore /etc/ptp4l.conf
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv config save
Configure Auto Save
By default, when you run the nv config apply command to apply a configuration setting, NVUE applies the pending configuration to become the applied configuration but does not update the startup configuration file (/etc/nvue.d/startup.yaml). To save the applied configuration to the startup configuration so that the changes persist after the reboot, you must run the nv config save command. The auto save option lets you save the pending configuration to the startup configuration automatically when you run nv config apply so that you do not have to run the nv config save command.
To enable auto save:
cumulus@switch:~$ nv set system config auto-save enable on
cumulus@switch:~$ nv config apply
To disable auto save, run the nv set system config auto-save enable off command.
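The relationship between the pending, applied, and startup configurations can be easier to follow as a small model. The following Python sketch is a toy illustration, not NVUE code: the class and method names are hypothetical, configuration is reduced to a flat dictionary, and the reboot method assumes that the startup configuration loads at boot (that is, that nvue-startup.service is running).

```python
import copy


class ConfigModel:
    """Toy model of the pending, applied, and startup configuration states."""

    def __init__(self, auto_save=False):
        self.pending = {}
        self.applied = {}
        self.startup = {}
        self.auto_save = auto_save

    def set(self, key, value):
        """Stage a change; only the pending configuration is modified."""
        self.pending[key] = value

    def unset(self, key):
        """Stage the removal of a setting from the pending configuration."""
        self.pending.pop(key, None)

    def apply(self):
        """Make the pending configuration the applied one; save if auto save is on."""
        self.applied = copy.deepcopy(self.pending)
        if self.auto_save:
            self.save()

    def save(self):
        """Overwrite the startup configuration with the applied configuration."""
        self.startup = copy.deepcopy(self.applied)

    def reboot(self):
        """Assumption: the switch comes back up with the startup configuration."""
        self.applied = copy.deepcopy(self.startup)
        self.pending = copy.deepcopy(self.startup)


if __name__ == "__main__":
    switch = ConfigModel(auto_save=False)
    switch.set("hostname", "leaf01")
    switch.apply()
    switch.reboot()
    print(switch.applied)  # the change is gone because it was never saved
```

With auto_save=True, the same sequence keeps the hostname after the reboot, which mirrors the behavior that the auto save option provides.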
Add Configuration Apply Messages
When you run the nv config apply command, you can add a message that describes the configuration updates you make. You can see the message when you run the nv config history command.
To add a configuration apply message, run the nv config apply -m <message> command. If the message includes more than one word, enclose the message in quotes.
cumulus@switch:~$ nv config apply -m "this is my message"
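If a script generates the apply command, quoting the message programmatically avoids errors with multi-word messages or embedded quotes. The following Python sketch uses the standard shlex module; the apply_command helper is hypothetical and only formats the command string, it does not run it.

```python
import shlex


def apply_command(message=None):
    """Return an nv config apply command line, quoting the message if needed."""
    argv = ["nv", "config", "apply"]
    if message is not None:
        argv += ["-m", message]
    # shlex.join quotes each argument so that a POSIX shell sees it as one word.
    return shlex.join(argv)


if __name__ == "__main__":
    print(apply_command("this is my message"))
    print(apply_command("update"))
```

shlex.join emits single quotes where the example above uses double quotes; a POSIX shell passes the same message to the command in both cases.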
Reset NVUE Configuration to Default Values
To reset the NVUE configuration on the switch back to the default values, run the following command:
cumulus@switch:~$ nv config apply empty
Example Configuration Commands
This section provides examples of how to configure a Cumulus Linux switch using NVUE commands.
Configure the System Hostname
The example below shows the NVUE commands required to change the hostname for the switch to leaf01:
cumulus@switch:~$ nv set system hostname leaf01
cumulus@switch:~$ nv config apply
Configure the System DNS Server
The example below shows the NVUE commands required to define the DNS server for the switch:
cumulus@switch:~$ nv set service dns mgmt server 192.168.200.1
cumulus@switch:~$ nv config apply
Configure an Interface
The example below shows the NVUE commands required to bring up swp1.
cumulus@switch:~$ nv set interface swp1
cumulus@switch:~$ nv config apply
Configure a Bond
The example below shows the NVUE commands required to configure the front panel port interfaces swp1 through swp4 as members of bond0.
cumulus@switch:~$ nv set interface bond0 bond member swp1-4
cumulus@switch:~$ nv config apply
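The swp1-4 argument in the bond example is a range that covers swp1, swp2, swp3, and swp4. The following Python sketch shows one way a script could expand such a value into individual names. It is a simplified illustration: the expand function is hypothetical, and it handles only the simple forms that appear in this section (a prefix followed by a numeric range, and comma-separated lists such as 10,20), not the full range syntax that NVUE accepts.

```python
import re

# An optional alphabetic prefix followed by "<start>-<end>", for example swp1-4.
_RANGE = re.compile(r"^(?P<prefix>[A-Za-z_]*)(?P<start>\d+)-(?P<end>\d+)$")


def expand(spec):
    """Expand values such as 'swp1-4' or '10,20' into a list of names."""
    names = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        match = _RANGE.match(part)
        if match is None:
            # Not a simple range; keep the value unchanged.
            names.append(part)
            continue
        start, end = int(match["start"]), int(match["end"])
        if start > end:
            raise ValueError(f"descending range: {part}")
        names.extend(f"{match['prefix']}{n}" for n in range(start, end + 1))
    return names


if __name__ == "__main__":
    print(expand("swp1-4"))
    print(expand("10,20"))
```

Values that do not match the simple pattern are returned unchanged, so the helper does not guess at forms that it does not understand.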
Configure a Bridge
The example below shows the NVUE commands required to create a VLAN-aware bridge that contains two switch ports (swp1 and swp2) and includes three VLANs: tagged VLANs 10 and 20, and an untagged (native) VLAN 1.
With NVUE, there is a default bridge called br_default, which has no ports assigned to it. The example below configures this default bridge.
cumulus@switch:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 10,20
cumulus@switch:~$ nv set bridge domain br_default untagged 1
cumulus@switch:~$ nv config apply
Configure MLAG
The example below shows the NVUE commands required to configure MLAG on leaf01. The commands:
Place swp1 into bond1 and swp2 into bond2.
Configure the MLAG ID to 1 for bond1 and to 2 for bond2.
Add bond1 and bond2 to the default bridge (br_default).
Create the inter-chassis bond (swp49 and swp50) and the peer link (peerlink).
Set the peer link IP address to linklocal, the MLAG system MAC address to 44:38:39:BE:EF:AA, and the backup IP address to 10.10.10.2.
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv set interface bond1-2 bridge domain br_default
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv config apply
Configure BGP Unnumbered
The example below shows the NVUE commands required to configure BGP unnumbered on leaf01. The commands:
Set the ASN for this BGP node to 65101.
Set the router ID to 10.10.10.1.
Distribute routing information to the peer on swp51.
Originate the prefix 10.10.10.1/32 from this BGP node.
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.1/32
cumulus@leaf01:~$ nv config apply
Example Monitoring Commands
This section provides monitoring command examples.
Show Installed Software
The following example command lists the software installed on the switch:
cumulus@switch:~$ nv show platform software
Installed Software
=====================
Installed software description package version
--------------------------- --------------------------- -------------------------- -----------------------------
acpi displays information on ACPI acpi 1.7-1.1
devices
acpi-support-base scripts for handling base acpi-support-base 0.142-8
ACPI events such as the
power button
acpid Advanced Configuration and acpid 1:2.0.31-1
Power Interface event daemon
adduser add and remove users and adduser 3.118
groups
apt commandline package manager apt 1.8.2.3
...
Show Interface Configuration
The following example command shows the running, applied, and pending swp1 interface configuration.
cumulus@leaf01:~$ nv show interface swp1
operational applied pending
----------------------- ----------------- ------- -------
type swp
ip
[address]
link
auto-negotiate off
mtu 1500
state down
stats
carrier-transitions 2
in-bytes 0 Bytes
in-drops 0
in-errors 0
in-pkts 0
out-bytes 0 Bytes
out-drops 0
out-errors 0
out-pkts 0
mac 48:b0:2d:16:d8:82
...
Example Configuration Management Commands
This section provides examples of how to use the configuration management commands to apply, save, and detach configurations.
Apply and Save a Configuration
The following example command configures the front panel port interfaces swp1 through swp4 as members of bond0. The configuration is only in a pending state; NVUE has not applied it and has not yet made any changes to the running configuration.
cumulus@switch:~$ nv set interface bond0 bond member swp1-4
To apply the pending configuration to the running configuration, run the nv config apply command. The configuration does not persist after a reboot.
cumulus@switch:~$ nv config apply
To save the applied configuration to the startup configuration, run the nv config save command. This command overwrites the startup configuration with the applied configuration by writing to the /etc/nvue.d/startup.yaml file. The configuration persists after a reboot.
cumulus@switch:~$ nv config save
Detach a Pending Configuration
The following example configures the IP address of the loopback interface, then detaches the configuration from the current pending configuration. Cumulus Linux saves the detached configuration to a file with a numerical value to distinguish it from other pending configurations.
cumulus@switch:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@switch:~$ nv config detach
View Differences Between Configurations
To view differences between configurations, run the nv config diff command.
To view differences between two detached pending configurations, run the nv config diff <<press tab>> command to list all the current detached pending configurations, then run the nv config diff command with the pending configurations you want to diff.
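To replace the pending configuration with the contents of a YAML configuration file, run the nv config replace <nvue-file> command. For example, you can replace the pending configuration with the contents of the nv-02/13/2021.yaml file located in the /deps directory.
To patch the pending configuration (run the set or unset commands from a YAML configuration file), run the nv config patch <nvue-file> command. For example, you can patch the pending configuration with the nv-02/13/2021.yaml file located in the /deps directory.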
A patch contains a single request to the NVUE service. Ordering of parameters within a patch is not guaranteed; NVUE does not support both unset and set commands for the same object in a single patch.
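Because a single patch cannot both set and unset the same object, a script that assembles patches can check for that conflict before it sends the request. The following Python sketch is a minimal illustration under stated assumptions: operations are represented as hypothetical (operation, object path) pairs, and two operations conflict only when their paths are exactly equal. It does not model how NVUE itself evaluates a patch.

```python
def conflicting_objects(operations):
    """Return the object paths that appear in both set and unset operations.

    operations is an iterable of (op, path) pairs, where op is "set" or "unset".
    """
    set_paths = set()
    unset_paths = set()
    for op, path in operations:
        if op == "set":
            set_paths.add(path)
        elif op == "unset":
            unset_paths.add(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return sorted(set_paths & unset_paths)


if __name__ == "__main__":
    patch = [
        ("unset", "interface lo ip address"),
        ("set", "interface lo ip address"),
        ("set", "system hostname"),
    ]
    print(conflicting_objects(patch))
```

If the function returns any paths, one option is to split the work into two patches, one with the unset operations and one with the set operations.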
NVUE and FRR Restart
NVUE restarts the FRR service when you:
Change the /etc/frr/daemons file.
Change the BGP ASN.
Remove the default instance.
Disable the SNMP server with agentx configured.
Restarting FRR restarts all the routing protocol daemons that you enable and that are running, which might impact traffic.
This section shows you how to list all the NVUE commands and see help for the commands.
To view the NVUE command reference for Cumulus Linux, which describes all the NVUE CLI commands and provides examples, go to the NVUE Command Reference.
List All NVUE Commands
To see a list of all the NVUE nv show, nv set, nv unset, nv action, and nv config commands, run nv list-commands.
The following is only a small sample of the NVUE command list. To see the full and most up to date list of commands, run nv list-commands on your switch.
cumulus@leaf01:mgmt:~$ nv list-commands
nv show platform
nv show platform hardware
nv show platform hardware component
nv show platform hardware component <component-id>
nv show platform software
nv show platform software installed
nv show platform software installed <installed-id>
nv show platform capabilities
...
Show the NVUE Command Help
To see a description for a command, type the command with -h at the end:
cumulus@leaf01:mgmt:~$ nv set mlag backup -h
usage:
nv [options] set mlag backup <backup-ip>
Description:
backup Set of MLAG backups
Identifiers:
<backup-ip> Backup IP of MLAG peer (ipv4 | ipv6)
Output Options:
-o <format>, --output <format>
Supported formats: json, yaml, auto, constable, end-table, commands (default:auto)
--color (on|off|auto)
Toggle coloring of output (default: auto)
--paginate (on|off|auto)
Whether to send output to a pager (default: off)
General Options:
-h, --help Show help.
When you use -h, replace any variables in the command with a value. For example, for the nv set vrf <vrf-id> router pim command, type nv set vrf default router pim -h:
cumulus@leaf01:mgmt:~$ nv set vrf default router pim -h
usage:
nv [options] set vrf <vrf-id> router pim [address-family ...]
nv [options] set vrf <vrf-id> router pim [ecmp ...]
nv [options] set vrf <vrf-id> router pim [enable ...]
nv [options] set vrf <vrf-id> router pim [msdp-mesh-group ...]
nv [options] set vrf <vrf-id> router pim [timers ...]
Description:
pim PIM VRF configuration.
Identifiers:
<vrf-id> VRF (vrf-name)
Attributes:
address-family Address family specific configuration
ecmp Choose all available ECMP paths for a particular RPF. If 'off', the first nexthop found will be used. This is the default.
enable Turn the feature 'on' or 'off'. The default is 'off'.
msdp-mesh-group To connect multiple PIM-SM multicast domains using RPs.
timers Timers
Output Options:
-o <format>, --output <format>
Supported formats: json, yaml, auto, constable, end-table, commands (default:auto)
--color (on|off|auto)
Toggle coloring of output (default: auto)
--paginate (on|off|auto)
Whether to send output to a pager (default: off)
General Options:
-h, --help Show help.
NVUE Snippets
NVUE supports both traditional snippets and flexible snippets:
Use traditional snippets to add configuration to the /etc/network/interfaces, /etc/frr/frr.conf, /etc/frr/daemons, /etc/cumulus/switchd.conf, /etc/cumulus/datapath/traffic.conf or /etc/ssh/sshd_config files.
Use flexible snippets to manage any other text file on the system.
A snippet configures a single parameter associated with a specific configuration file.
You can only set or unset a snippet; you cannot modify, partially update, or change a snippet.
Setting the snippet value replaces any existing snippet value.
Cumulus Linux supports only one snippet for a configuration file.
Only certain configuration files support a snippet.
NVUE does not parse or validate the snippet content and does not validate the resulting file after you apply the snippet.
PATCH is only the method of applying snippets and does not refer to any snippet capabilities.
As NVUE supports more features and introduces new syntax, existing traditional and flexible snippets can become invalid. Before you upgrade Cumulus Linux to a new release, review What's New for new NVUE syntax and remove the snippet if NVUE introduces new syntax for the feature that the snippet configures.
Traditional Snippets
Use traditional snippets if you configure Cumulus Linux with NVUE commands, then want to configure a feature that does not yet support the NVUE Object Model. You create a snippet in yaml format, then add the configuration to the file with the nv config patch command.
The nv config patch command requires you to use the fully qualified path name to the snippet .yaml file; for example, you cannot use ./ with the nv config patch command.
/etc/frr/frr.conf Snippets
Example 1: Top Level Configuration
NVUE does not support configuring BGP to peer across the default route. The following example configures BGP to peer across the default route from the default VRF:
Create a .yaml file with the following traditional snippet:
Run the nv config apply command to apply the configuration:
cumulus@switch:~$ nv config apply
Verify that the configuration exists at the end of the /etc/frr/frr.conf file:
cumulus@switch:~$ sudo cat /etc/frr/frr.conf
...
! end of router ospf block
!---- CUE snippets ----
ip nht resolve-via-default
Example 2: Nested Configuration
NVUE does not support configuring EVPN route targets using auto derived values from RFC 8365. The following example configures BGP to enable RFC 8365 derived route targets:
Create a .yaml file with the following traditional snippet:
The traditional snippets for FRR write content to the /etc/frr/frr.conf file. When you apply the configuration and snippet with the nv config apply command, the FRR service goes through and reads in the /etc/frr/frr.conf file.
Example 3: EVPN Multihoming FRR Debugging
NVUE does not support configuring FRR debugging for EVPN multihoming. The following example configures FRR debugging:
Create a .yaml file and add the following traditional snippet:
The traditional snippets for FRR write content to the /etc/frr/frr.conf file. When you apply the configuration and snippet with the nv config apply command, the FRR service goes through and reads in the /etc/frr/frr.conf file.
/etc/network/interfaces Snippets
MLAG Timers Example
NVUE supports configuring only one of the MLAG service timeouts (initDelay). The following example configures the MLAG peer timeout to 400 seconds:
Create a .yaml file and add the following traditional snippet:
Traditional Bridge Example
NVUE does not support configuring traditional bridges. The following example configures a traditional bridge called br0 with the IP address 11.0.0.10/24. swp1 and swp2 are members of the bridge.
Create a .yaml file and add the following traditional snippet:
Run the nv config apply command to apply the configuration:
cumulus@switch:~$ nv config apply
Verify that the configuration exists at the end of the /etc/network/interfaces file:
cumulus@switch:~$ sudo cat /etc/network/interfaces
...
auto br0
iface br0
address 11.0.0.10/24
bridge-ports swp1 swp2
bridge-vlan-aware no
VLAN-aware RSTP Timers Example
NVUE does not support configuring RSTP timers on VLAN-aware bridges. The following example configures non-default RSTP timers for the NVUE default bridge br_default:
Create a .yaml file and add the following traditional snippet:
NVUE does not provide options to configure link flap detection settings. The following example configures the link flap window to 10 seconds and the link flap threshold to 5:
Create a .yaml file and add the following traditional snippet:
Flexible snippets are an extension of traditional snippets that let you manage any text file on the system. You can add content to an existing text file or create a new text file, then add content. Cumulus Linux runs flexible snippets as root.
To configure and manage flexible snippets, your user account must be in the sudo group, which includes the NVUE system-admin role, or you must be the root user.
Flexible snippets do not support:
Binary files.
Symbolic links.
More than 1MB of content.
More than one flexible snippet in the same destination file.
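The restrictions above can be checked before you patch a configuration. The following is a minimal sketch, not part of NVUE: the function name, the input layout (a dictionary of snippet names to their file and content values), the NUL-character test for binary data, and the reading of 1MB as 1024 * 1024 bytes are all assumptions made for illustration.

```python
import os

# Assumption: "1MB" is interpreted here as 1024 * 1024 bytes.
MAX_SNIPPET_BYTES = 1024 * 1024


def check_flexible_snippets(snippets):
    """Return a list of problems for {name: {"file": ..., "content": ...}}."""
    problems = []
    seen_files = {}
    for name, snippet in snippets.items():
        path = snippet["file"]
        content = snippet["content"]
        # Binary files are not supported; a NUL character is a simple heuristic.
        if "\x00" in content:
            problems.append(f"{name}: content looks binary")
        if len(content.encode("utf-8")) > MAX_SNIPPET_BYTES:
            problems.append(f"{name}: content exceeds 1 MB")
        # Symbolic links are not supported as destination files.
        if os.path.islink(path):
            problems.append(f"{name}: {path} is a symbolic link")
        # Only one flexible snippet can target a given destination file.
        if path in seen_files:
            problems.append(f"{name}: {path} is already used by {seen_files[path]}")
        else:
            seen_files[path] = name
    return problems
```

An empty list means that none of the listed restrictions were detected; it does not replace the validation that NVUE itself performs.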
Use caution when creating flexible snippets:
If you configure flexible snippets incorrectly, they might impact switch functionality. For example, even though flexible snippet validation only allows you to add textual content, Cumulus Linux does not prevent you from creating a flexible snippet that adds to sensitive text files, such as /boot/grub.cfg and /etc/fstab, or that adds corrupt content. Such snippets might render the switch unusable or create a potential security vulnerability, because the NVUE service (nvued) runs with superuser privileges.
Do not add flexible snippets to configuration files that NVUE already controls, such as the /etc/hosts, /etc/ntp.conf, or /etc/ptp4l.conf files. Cumulus Linux does not prevent you from creating and applying a flexible snippet to these files and does not show warnings or errors. Cumulus Linux might accept the snippet content without adding it in the file. For a list of the files that NVUE manages, refer to Configuration Files that NVUE Manages.
Do not manually update configuration files to which you add flexible snippets.
To create flexible snippets:
Create a file in yaml format and add each flexible snippet you want to apply in the format shown below. NVUE appends the flexible snippet at the end of an existing file. If the file does not exist, NVUE creates the file, then adds the content.
cumulus@leaf01:mgmt:~$ sudo nano <filename>.yaml
- set:
    system:
      config:
        snippet:
          <snippet-name>:
            file: "<filename>"
            permissions: "<umask-permissions>"
            content: |
              # This is my content
            services:
              <name>:
                service: <service-name>
                action: <action>
You can only set the umask permissions for a new file that you create. Adding the permissions: line is optional. The default umask permissions are 644.
You can add a service with an action, such as start, restart, or stop. Adding the services: lines is optional; however, if you add the service: line, you must specify at least one service.
Run the following command to patch the configuration:
cumulus@leaf01:mgmt:~$ nv config patch <path>/<filename>.yaml
Run the nv config apply command to apply the configuration:
cumulus@switch:~$ nv config apply
Verify the patched configuration.
The nv config patch command requires you to use the fully qualified path name to the snippet .yaml file; for example you cannot use ./ with the nv config patch command.
Flexible Snippet Examples
The following example flexible snippet called crontab-flex-snippet appends the single line @daily /opt/utils/run-backup.sh to the existing /etc/crontab file, then restarts the cron service.
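A snippet file for this kind of example can also be generated programmatically. The sketch below renders the flexible snippet format described earlier as text; the function name, the two-space nesting, and the cron key used to label the service entry are assumptions for illustration, not NVUE requirements taken from this guide.

```python
def render_flexible_snippet(name, path, content, permissions=None, services=None):
    """Render one flexible snippet as YAML text in the "- set:" format.

    services maps an arbitrary label to a (service-name, action) tuple.
    """
    lines = [
        "- set:",
        "    system:",
        "      config:",
        "        snippet:",
        f"          {name}:",
        f'            file: "{path}"',
    ]
    # permissions only applies to a new file, so it is optional.
    if permissions is not None:
        lines.append(f'            permissions: "{permissions}"')
    lines.append("            content: |")
    for text_line in content.splitlines():
        lines.append("              " + text_line)
    if services:
        lines.append("            services:")
        for label, (service, action) in services.items():
            lines.append(f"              {label}:")
            lines.append(f"                service: {service}")
            lines.append(f"                action: {action}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_flexible_snippet(
        "crontab-flex-snippet",
        "/etc/crontab",
        "@daily /opt/utils/run-backup.sh\n",
        services={"cron": ("cron", "restart")},
    ))
```

The permissions argument is left out here because /etc/crontab already exists.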
The following example flexible snippet called apt-flexible-snippet creates a new file /etc/apt/sources.list.d/microsoft-prod.list with 0644 permissions and adds multi-line text:
- set:
    system:
      config:
        snippet:
          apt-flexible-snippet:
            file: "/etc/apt/sources.list.d/microsoft-prod.list"
            content: |
              # Adding Microsoft SQL Server Sources
              deb [arch=amd64] https://packages.microsoft.com/debian/10/prod buster main
            permissions: "0644"
Remove a Snippet
To remove a traditional or flexible snippet, edit the snippet’s .yaml file to change set to unset, then patch and apply the configuration. Alternatively, you can use the REST API DELETE and PATCH methods.
Setting the time zone, and the date and time on the software clock requires root privileges; use sudo.
Set the Time Zone
You can use one of these methods to set the time zone on the switch:
Run NVUE commands.
Use the guided wizard.
Edit the /etc/timezone file.
Run the nv set system timezone <timezone> command. To see all the available time zones, run nv set system timezone and press the Tab key. The following example sets the time zone to US/Eastern:
cumulus@switch:~$ nv set system timezone US/Eastern
cumulus@switch:~$ nv config apply
In a terminal, run the following command:
cumulus@switch:~$ sudo dpkg-reconfigure tzdata
Follow the on-screen menu options to select the geographic area and region.
Edit the /etc/timezone file to add your desired time zone. You can see a list of valid time zones here.
cumulus@switch:~$ sudo vi /etc/timezone
US/Eastern
The switch contains a battery-backed hardware clock that maintains the time while the switch is powered off and between reboots. When the switch is running, the Cumulus Linux operating system maintains its own software clock.
During boot up, the switch copies the time from the hardware clock to the operating system software clock. The software clock takes care of all the timekeeping. During system shutdown, the switch copies the software clock back to the battery-backed hardware clock.
You can set the date and time on the software clock with the date command. First, determine your current time zone:
cumulus@switch:~$ date +%Z
If you need to reconfigure the current time zone, refer to the instructions above.
To set the software clock according to the configured time zone:
cumulus@switch:~$ sudo date -s "Tue Jan 26 00:37:13 2021"
You can write the current value of the software clock to the hardware clock using the hwclock command:
cumulus@switch:~$ sudo hwclock -w
In addition to the CLI, NVUE supports a REST API. Instead of accessing Cumulus Linux using SSH, you can interact with the switch using an HTTP client, such as cURL or a web browser.
The nvued service provides access to the NVUE REST API. Cumulus Linux exposes the HTTP endpoint internally, which makes the NVUE REST API accessible locally within the Cumulus Linux switch. The NVUE CLI also communicates with the nvued service using internal APIs. To provide external access to the NVUE REST API, Cumulus Linux uses an HTTP reverse proxy server, and supports HTTPS and TLS connections from external REST API clients.
The following illustration shows the NVUE REST API architecture and illustrates how Cumulus Linux forwards the requests internally.
Supported HTTP Methods
The NVUE REST API supports the following methods:
The GET method displays configuration and operational data, and is equivalent to the nv show commands.
The POST method creates and submits operations. You typically use this method for nv action commands and for the nv config command to create revisions.
The PATCH method replaces or unsets a configuration. You use this method for the nv set and nv config apply commands. You can either perform:
A targeted configuration patch to make a configuration change, where you run a specific NVUE REST API targeted at a particular OpenAPI end-point URI. Based on the NVUE schema definition, you need to direct the PATCH REST API request at a particular endpoint (for example, /nvue_v1/vrf/<vrf-id>/router/bgp) and provide the payload that conforms to the schema. With a targeted configuration patch, you can control individual resources.
A root patch, where you run the NVUE PATCH API on the root node of the schema so that a single PATCH operation can change one, some, or the entire configuration in a single payload. The payload of the PATCH method must be aware of the entire NVUE object model schema because you make the configuration changes relative to the root node /nvue_v1. You typically perform a root patch to push all configurations to the switch in bulk; for example, if you use an SDN controller or a network management system to push the entire switch configuration every time you need to make a change, regardless of how small or large. A root patch can also make configuration changes with fewer round trips to the switch.
The input payload in a PATCH request can have either a set or an unset JSON object for the same resource, but not both. The order in which the API executes the set and unset objects is not deterministic, so combining them is not supported.
The DELETE method deletes a configuration and is equivalent to the nv unset commands.
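The relationship between NVUE CLI commands and HTTP methods listed above can be summarized in a small lookup. This is an illustrative sketch only; the function name is hypothetical, and treating every nv config subcommand other than apply as a POST is a simplification of the list above.

```python
# CLI verb -> HTTP method, following the list of supported methods.
CLI_TO_HTTP = {
    "show": "GET",
    "action": "POST",
    "set": "PATCH",
    "unset": "DELETE",
}


def http_method_for(command):
    """Return the HTTP method that corresponds to an NVUE CLI command."""
    words = command.split()
    if len(words) < 2 or words[0] != "nv":
        raise ValueError("not an NVUE command: " + command)
    verb = words[1]
    if verb == "config":
        # nv config apply maps to PATCH; creating a revision maps to POST.
        # Simplification: other nv config subcommands are not distinguished.
        return "PATCH" if words[2:3] == ["apply"] else "POST"
    return CLI_TO_HTTP[verb]


if __name__ == "__main__":
    for cmd in ("nv show system", "nv set system hostname switch01", "nv config apply"):
        print(cmd, "->", http_method_for(cmd))
```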
Secure the API
The NVUE REST API supports HTTP basic authentication, and the same underlying authentication methods for username and password that the NVUE CLI supports. User accounts work the same on both the API and the CLI.
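HTTP basic authentication sends the username and password in an Authorization header as a Base64-encoded string. The sketch below shows how a client builds that header with the Python standard library; the function name is hypothetical and the cumulus credentials are placeholders.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


if __name__ == "__main__":
    # Placeholder credentials; use the account configured on your switch.
    print(basic_auth_header("cumulus", "cumulus"))
```

Base64 is an encoding, not encryption, so the credentials are only protected when the request travels over HTTPS.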
Certificates
Cumulus Linux includes a self-signed certificate and private key to use on the server so that it works out of the box. The switch generates the self-signed certificate and private key when it boots for the first time. The X.509 certificate with the public key is in /etc/ssl/certs/cumulus.pem and the corresponding private key is in /etc/ssl/private/cumulus.key.
NVIDIA recommends you use your own certificates and keys. Certificates must be in PEM format. For the steps to generate self-signed certificates and keys, and to install them on the switch, refer to the Ubuntu Certificates and Security documentation.
To use your own certificate chain:
Import the certificate and private key onto the Cumulus Linux switch using secure channels, such as SCP or SFTP.
Store the certificate and private key on the filesystem in a location of your choice or use the same location; for example, /etc/ssl/certs and /etc/ssl/private.
Update the /etc/nginx/sites-enabled/nvue.conf file to set the ssl_certificate and the ssl_certificate_key values to your keys.
Restart NGINX with the sudo systemctl restart nginx command.
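The edit to the nvue.conf file in the steps above amounts to changing two directive values. The following sketch rewrites them in a string; the function name is hypothetical, and the sample file contents in the usage section are an assumed layout, not a copy of the packaged nvue.conf file.

```python
import re


def set_certificate_paths(conf_text, cert_path, key_path):
    """Point ssl_certificate and ssl_certificate_key at new files."""
    # The first pattern requires whitespace after "ssl_certificate", so it
    # does not match the ssl_certificate_key directive.
    conf_text = re.sub(
        r"(?m)^([ \t]*ssl_certificate[ \t]+)[^;]+;",
        lambda m: m.group(1) + cert_path + ";",
        conf_text,
    )
    conf_text = re.sub(
        r"(?m)^([ \t]*ssl_certificate_key[ \t]+)[^;]+;",
        lambda m: m.group(1) + key_path + ";",
        conf_text,
    )
    return conf_text


if __name__ == "__main__":
    # Assumed sample layout for illustration only.
    sample = (
        "server {\n"
        "    listen localhost:8765 ssl;\n"
        "    ssl_certificate /etc/ssl/certs/cumulus.pem;\n"
        "    ssl_certificate_key /etc/ssl/private/cumulus.key;\n"
        "}\n"
    )
    print(set_certificate_paths(sample, "/etc/ssl/certs/my-cert.pem",
                                "/etc/ssl/private/my-key.key"))
```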
API-only User
To create an API-only user without SSH permissions, use Linux group permissions. You can create the API-only user in the ZTP script.
# Create the dedicated automation user
adduser --disabled-password --gecos "Automation User,,,," --shell /usr/bin/nologin automation
# Set the password
echo 'automation:password!' | chpasswd
# Add the user to nvapply group to make NVUE config changes
adduser automation nvapply
Control Plane ACLs
You can secure the API with control plane ACLs. The following example allows users from the management subnet and the local switch to communicate with the switch using REST APIs, and restricts all other access.
cumulus@switch:~$ nv set acl API-PROTECT type ipv4
cumulus@switch:~$ nv set acl API-PROTECT rule 10 action permit
cumulus@switch:~$ nv set acl API-PROTECT rule 10 match ip .protocol tcp .dest-port 8765 .source-ip 192.168.200.0/24
cumulus@switch:~$ nv set acl API-PROTECT rule 10 remark "Allow the Management Subnet to talk to API"
cumulus@switch:~$ nv set acl API-PROTECT rule 20 action permit
cumulus@switch:~$ nv set acl API-PROTECT rule 20 match ip .protocol tcp .dest-port 8765 .source-ip 127.0.0.1
cumulus@switch:~$ nv set acl API-PROTECT rule 20 remark "Allow the local switch to talk to the API"
cumulus@switch:~$ nv set acl API-PROTECT rule 30 action deny
cumulus@switch:~$ nv set acl API-PROTECT rule 30 match ip .protocol tcp .dest-port 8765
cumulus@switch:~$ nv set acl API-PROTECT rule 30 remark "Block everyone else from talking to the API"
cumulus@switch:~$ nv set system control-plane acl API-PROTECT inbound
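To reason about which clients the ACL above admits, you can model its three rules in a few lines. This sketch assumes that rules are evaluated in ascending rule number and that the first match wins; the function and variable names are hypothetical, and the code is a model for reasoning only, not how the switch implements ACLs.

```python
import ipaddress

API_PORT = 8765

# (rule number, action, source network or None for any source)
API_PROTECT_RULES = [
    (10, "permit", ipaddress.ip_network("192.168.200.0/24")),
    (20, "permit", ipaddress.ip_network("127.0.0.1/32")),
    (30, "deny", None),
]


def evaluate(source_ip, dest_port, protocol="tcp"):
    """Return (rule number, action) for a packet, or None if no rule matches."""
    # Every rule in this ACL matches only TCP traffic to the API port.
    if protocol != "tcp" or dest_port != API_PORT:
        return None
    address = ipaddress.ip_address(source_ip)
    for number, action, network in sorted(API_PROTECT_RULES, key=lambda r: r[0]):
        if network is None or address in network:
            return number, action
    return None


if __name__ == "__main__":
    for ip in ("192.168.200.15", "127.0.0.1", "10.1.1.1"):
        print(ip, evaluate(ip, API_PORT))
```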
Supported Objects
The NVUE object model supports most features on the Cumulus Linux switch. The following list shows the supported objects. The NVUE API supports more objects within each of these objects. You can find a full listing of the supported API endpoints here.
High-level Objects  Description
------------------  -----------
acl                 Access control lists.
bridge              Bridge domain configuration.
evpn                EVPN configuration.
interface           Interface configuration.
mlag                MLAG configuration.
nve                 Network virtualization configuration, such as VXLAN-specific MLAG configuration and VXLAN flooding.
platform            Platform configuration, such as hardware and software components.
qos                 QoS RoCE configuration.
router              Router configuration, such as router policies, global BGP and OSPF configuration, PBR, PIM, IGMP, VRR, and VRRP configuration.
service             DHCP relays and server, NTP, PTP, LLDP, and syslog configuration.
system              Global system settings, such as the reserved routing table range for PBR and the reserved VLAN range for layer 3 VNIs, system login messages and switch reboot history.
vrf                 VRF configuration.
Use the API
The NVUE CLI and the REST API are equivalent in functionality; you can run all management operations from the REST API or from the CLI. The NVUE object model drives both the REST API and the CLI management operations. All operations are consistent; for example, the CLI nv show commands reflect any PATCH operation (create and update) you run through the REST API.
NVUE follows a declarative model, removing context-specific commands and settings. The structure of NVUE is like a big tree that represents the entire state of a Cumulus Linux instance. At the base of the tree are high level branches representing objects, such as router and interface. Under each of these branches are more branches. As you navigate through the tree, you gain a more specific context. At the leaves of the tree are actual attributes, represented as key-value pairs. The path through the tree is similar to a filesystem path.
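Because a path through the tree works like a filesystem path, a payload aimed at one branch can be rewritten as a payload relative to the root by wrapping it in one dictionary per path element. The helper below is a minimal sketch of that idea; the function name is hypothetical.

```python
def to_root_patch(path, payload):
    """Wrap a payload for a specific path so that it is relative to the root."""
    keys = [key for key in path.strip("/").split("/") if key]
    # Wrap from the innermost path element outwards.
    for key in reversed(keys):
        payload = {key: payload}
    return payload


if __name__ == "__main__":
    # A payload for /interface/lo/ip/address expressed relative to the root.
    print(to_root_patch("/interface/lo/ip/address", {"99.99.99.99/32": {}}))
```

Keys that contain a slash, such as 99.99.99.99/32, belong in the payload in this sketch and not in the path.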
Enable the NVUE REST API
To enable the NVUE REST API, run these commands on the switch:
To access the NVUE REST API from a front panel port (swp) on the switch:
Ensure that the nvue.conf file is present in the /etc/nginx/sites-enabled directory.
Either copy the packaged template file nvue.conf from the /etc/nginx/sites-available directory to the /etc/nginx/sites-enabled directory or create a symbolic link.
Edit the nvue.conf file and add the listen directive with the IPv4 or IPv6 address of the swp interface you want to use.
The default nvue.conf file includes a single listen localhost:8765 ssl; entry. Add an entry for each swp interface with its IP address. Make sure to use an accessible HTTP (TCP) port (subject to any ACL or firewall rules). For information on the NGINX listen directive, see the NGINX documentation.
The swp interfaces must be part of the default VRF on the Cumulus Linux switch or virtual appliance.
To access the REST API from the switch running curl locally, invoke the REST API client from the default VRF from the Cumulus Linux shell by prefixing the command with ip vrf exec default curl.
To access the NVUE REST API from a client on a peer Cumulus Linux switch or virtual appliance, or any other off-the-shelf Linux server or virtual machine, make sure the switch or appliance has the correct IP routing configuration so that the REST API HTTP packets arrive on the correct target interface and VRF.
Run cURL Commands
You can run the cURL commands from the command line. Use the username and password for the switch. For example:
The following examples show the primary API use cases.
View a Configuration
Use the following example to obtain the current applied configuration on the switch. Change the rev argument to view any revision. Possible options for the rev argument include startup, pending, operational, and applied.
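The rev argument travels as a URL query parameter. As a minimal sketch, the helper below builds such a URL with the Python standard library; the function name is hypothetical, and the endpoint is the local address used in the other examples in this section.

```python
from urllib.parse import urlencode

NVUE_ENDPOINT = "https://127.0.0.1:8765/nvue_v1"


def config_url(path, rev="applied"):
    """Build a URL that selects a revision with the rev query parameter."""
    return NVUE_ENDPOINT + "/" + path.strip("/") + "?" + urlencode({"rev": rev})


if __name__ == "__main__":
    for rev in ("startup", "pending", "operational", "applied"):
        print(config_url("/", rev=rev))
```

urlencode also percent-encodes characters such as slashes in the revision value.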
cumulus@switch:~$ nv show system
          operational          applied
--------  -------------------  -------
hostname  switch01             cumulus
build     Cumulus Linux 5.4.0
uptime    0:12:59
timezone  Etc/UTC
cumulus@switch:~$ nv show bridge domain br_default vlan 10
                 operational  applied  pending  description
---------------  -----------  -------  -------  ------------------------------------------------------
[vni]            10           10       10       L2 VNI
multicast
  snooping
    querier
      source-ip  0.0.0.0      0.0.0.0  0.0.0.0  Source IP to use when sending IGMP/MLD queries.
ptp
  enable         off          off      off      Turn the feature 'on' or 'off'. The default is 'off'.
#!/usr/bin/env python3

import requests
from requests.auth import HTTPBasicAuth
import json
import time

auth = HTTPBasicAuth(username="cumulus", password="password")
nvue_end_point = "https://127.0.0.1:8765/nvue_v1"
mime_header = {"Content-Type": "application/json"}

DUMMY_SLEEP = 5  # In seconds
POLL_APPLIED = 1  # in seconds
RETRIES = 10

def print_request(r: requests.Request):
    print("=======Request=======")
    print("URL:", r.url)
    print("Headers:", r.headers)
    print("Body:", r.body)

def print_response(r: requests.Response):
    print("=======Response=======")
    print("Headers:", r.headers)
    print("Body:", json.dumps(r.json(), indent=2))

def create_nvue_changest():
    r = requests.post(url=nvue_end_point + "/revision",
                      auth=auth,
                      verify=False)
    print_request(r.request)
    print_response(r)
    response = r.json()
    changeset = response.popitem()[0]
    return changeset

def apply_nvue_changeset(changeset):
    apply_payload = {"state": "apply", "auto-prompt": {"ays": "ays_yes"}}
    url = nvue_end_point + "/revision/" + requests.utils.quote(changeset,
                                                               safe="")
    r = requests.patch(url=url,
                       auth=auth,
                       verify=False,
                       data=json.dumps(apply_payload),
                       headers=mime_header)
    print_request(r.request)
    print_response(r)

def is_config_applied(changeset) -> bool:
    # Check if the configuration was indeed applied
    global RETRIES
    global POLL_APPLIED
    retries = RETRIES
    while retries > 0:
        r = requests.get(url=nvue_end_point + "/revision/" + requests.utils.quote(changeset, safe=""),
                         auth=auth,
                         verify=False)
        response = r.json()
        print(response)
        if response["state"] == "applied":
            return True
        retries -= 1
        time.sleep(POLL_APPLIED)
    return False

def apply_new_config(path, payload):
    # Create a new revision ID
    changeset = create_nvue_changest()
    print("Using NVUE Changeset: '{}'".format(changeset))
    # Delete existing configuration
    query_string = {"rev": changeset}
    r = requests.delete(url=nvue_end_point + path,
                        auth=auth,
                        verify=False,
                        params=query_string,
                        headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Patch the new configuration
    query_string = {"rev": changeset}
    r = requests.patch(url=nvue_end_point + path,
                       auth=auth,
                       verify=False,
                       data=json.dumps(payload),
                       params=query_string,
                       headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Apply the changes to the new revision changeset
    apply_nvue_changeset(changeset)
    # Check if the changeset was applied
    is_config_applied(changeset)

def nvue_get(path):
    r = requests.get(url=nvue_end_point + path,
                     auth=auth,
                     verify=False)
    print_request(r.request)
    print_response(r)

if __name__ == "__main__":
    payload = {
        "99.99.99.99/32": {}
    }
    apply_new_config("/interface/lo/ip/address", payload)
    time.sleep(DUMMY_SLEEP)
    nvue_get("/interface/lo/ip/address")
cumulus@switch:~$ nv show interface lo ip address
-------------
99.99.99.99/32
127.0.0.1/8
::1/128
Troubleshoot Configuration Changes
When a configuration change fails, you see an error in the change request.
Configuration Fails Because of a Dependency
If you stage a configuration but it fails because of a dependency, the failure shows the reason. In the following example, the change fails because the BGP router ID is not set.
cumulus@switch:~$ curl -u 'cumulus:cumulus' --insecure https://127.0.0.1:8765/nvue_v1/revision/6
{
"state": "invalid",
"transition": {
"issue": {
"0": {
"code": "config_invalid",
"data": {
"location": "router.bgp.enable",
"reason": "BGP requires router-id to be set globally or in the VRF.\n"
},
"message": "Config invalid at router.bgp.enable: BGP requires router-id to be set globally or in the VRF.\n",
"severity": "error"
}
},
"progress": "Invalid config"
}
}
To resolve this issue, observe the failures or errors, then inspect the configuration that you are trying to apply. After you resolve the errors, retry the API. If you prefer to overlook the errors and force an apply, add "auto-prompt":{"ays": "ays_yes"} to the configuration apply.
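When you automate this check, you can extract the failure details from the revision response instead of reading the raw JSON. The sketch below walks the transition and issue objects in the shape of the response shown above; the function name is hypothetical, and it assumes that the issue keys are numeric strings, as in that example.

```python
def summarize_issues(revision):
    """Return (severity, location, reason) tuples from a revision response."""
    issues = revision.get("transition", {}).get("issue", {})
    summary = []
    # Assumption: issue keys are numeric strings such as "0", "1", and so on.
    for index in sorted(issues, key=int):
        issue = issues[index]
        data = issue.get("data", {})
        summary.append((
            issue.get("severity"),
            data.get("location"),
            data.get("reason", "").strip(),
        ))
    return summary


if __name__ == "__main__":
    response = {
        "state": "invalid",
        "transition": {
            "issue": {
                "0": {
                    "code": "config_invalid",
                    "data": {
                        "location": "router.bgp.enable",
                        "reason": "BGP requires router-id to be set globally or in the VRF.\n",
                    },
                    "severity": "error",
                }
            },
            "progress": "Invalid config",
        },
    }
    for severity, location, reason in summarize_issues(response):
        print(f"{severity}: {location}: {reason}")
```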
To save an applied configuration change to the startup configuration file (/etc/nvue.d/startup.yaml) so that the changes persist after a reboot, use a PATCH to the applied revision with the save state.
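As a sketch of what such a save request could look like, the code below builds (but does not send) a PATCH request with the Python standard library. The payload is an assumption modeled on the apply payload used in the scripts in this section, with the state set to save; the function name is hypothetical, and authentication and TLS settings are left out.

```python
import json
import urllib.request

NVUE_ENDPOINT = "https://127.0.0.1:8765/nvue_v1"


def build_save_request():
    """Build a PATCH request that targets the applied revision with the save state."""
    # Assumption: the body mirrors the apply payload, with "save" as the state.
    body = json.dumps({"state": "save", "auto-prompt": {"ays": "ays_yes"}})
    return urllib.request.Request(
        url=NVUE_ENDPOINT + "/revision/applied",
        data=body.encode("utf-8"),
        method="PATCH",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_save_request()
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```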
When you unset a change, you must still use the PATCH action. A null value indicates removal of the entry; for example, send the data {"vlan100":null} with the PATCH action.
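In Python, a JSON null corresponds to None, so you can build an unset payload from a list of keys. A minimal sketch with a hypothetical function name:

```python
import json


def unset_payload(*keys):
    """Return a JSON body that marks each key for removal with null."""
    return json.dumps({key: None for key in keys})


if __name__ == "__main__":
    print(unset_payload("vlan100"))
```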
Use the API for Active Monitoring
The example below fetches the counters for interface swp1.
#!/usr/bin/env python3

import requests
from requests.auth import HTTPBasicAuth
import json
import time

auth = HTTPBasicAuth(username="cumulus", password="password")
nvue_end_point = "https://127.0.0.1:8765/nvue_v1"
mime_header = {"Content-Type": "application/json"}

if __name__ == "__main__":
    r = requests.get(url=nvue_end_point + "/interface/swp1/link/stats",
                     auth=auth,
                     verify=False)
    print("=======Interface swp1 Statistics=======")
    print(json.dumps(r.json(), indent=2))
cumulus@switch:~$ nv show interface swp1 link stats
                     operational  applied  pending  description
-------------------  -----------  -------  -------  ----------------------------------------------------------------------
carrier-transitions  6                              Number of times the interface state has transitioned between up and...
in-bytes             280.15 MB                      total number of bytes received on the interface
in-drops             0                              number of received packets dropped
in-errors            0                              number of received packets with errors
in-pkts              2321659                        total number of packets received on the interface
out-bytes            349.10 MB                      total number of bytes transmitted out of the interface
out-drops            0                              The number of outbound packets that were chosen to be discarded eve...
out-errors           0                              The number of outbound packets that could not be transmitted becaus...
out-pkts             3536508                        total number of packets transmitted out of the interface
Convert CLI Changes to Use the API
You can take a configuration change from the CLI and use the API to configure the same set of changes.
Make your configuration changes on the system with the NVUE CLI.
cumulus@switch:~$ nv set system hostname switch01
cumulus@switch:~$ nv set interface lo ip address 99.99.99.99/32
cumulus@switch:~$ nv set interface eth0 ip address 192.168.200.6/24
cumulus@switch:~$ nv set interface bond0 bond member swp1-4
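A CLI range such as swp1-4 corresponds to one key per port in the API payload. The sketch below expands a simple range and builds the bond portion of a root payload; both function names are hypothetical, and only the plain prefix plus first-last form is handled, not every range syntax that the CLI accepts.

```python
import re


def expand_port_range(spec):
    """Expand a simple range such as 'swp1-4' into individual port names."""
    match = re.fullmatch(r"([A-Za-z_]+)(\d+)-(\d+)", spec)
    if not match:
        # Not a simple range; return the name unchanged.
        return [spec]
    prefix = match.group(1)
    first, last = int(match.group(2)), int(match.group(3))
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def bond_member_payload(bond, spec):
    """Build the root payload that makes the expanded ports members of a bond."""
    members = {port: {} for port in expand_port_range(spec)}
    return {"interface": {bond: {"bond": {"member": members}, "type": "bond"}}}


if __name__ == "__main__":
    print(bond_member_payload("bond0", "swp1-4"))
```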
#!/usr/bin/env python3

import requests
from requests.auth import HTTPBasicAuth
import json
import time

auth = HTTPBasicAuth(username="cumulus", password="password")
nvue_end_point = "https://127.0.0.1:8765/nvue_v1"
mime_header = {"Content-Type": "application/json"}

DUMMY_SLEEP = 5  # In seconds
POLL_APPLIED = 1  # in seconds
RETRIES = 10

def print_request(r: requests.Request):
    print("=======Request=======")
    print("URL:", r.url)
    print("Headers:", r.headers)
    print("Body:", r.body)

def print_response(r: requests.Response):
    print("=======Response=======")
    print("Headers:", r.headers)
    print("Body:", json.dumps(r.json(), indent=2))

def create_nvue_changest():
    r = requests.post(url=nvue_end_point + "/revision",
                      auth=auth,
                      verify=False)
    print_request(r.request)
    print_response(r)
    response = r.json()
    changeset = response.popitem()[0]
    return changeset

def apply_nvue_changeset(changeset):
    # apply_payload = {"state": "apply"}
    apply_payload = {"state": "apply", "auto-prompt": {"ays": "ays_yes"}}
    url = nvue_end_point + "/revision/" + requests.utils.quote(changeset,
                                                               safe="")
    r = requests.patch(url=url,
                       auth=auth,
                       verify=False,
                       data=json.dumps(apply_payload),
                       headers=mime_header)
    print_request(r.request)
    print_response(r)

def is_config_applied(changeset) -> bool:
    # Check if the configuration was indeed applied
    global RETRIES
    global POLL_APPLIED
    retries = RETRIES
    while retries > 0:
        r = requests.get(url=nvue_end_point + "/revision/" + requests.utils.quote(changeset, safe=""),
                         auth=auth,
                         verify=False)
        response = r.json()
        print(response)
        if response["state"] == "applied":
            return True
        retries -= 1
        time.sleep(POLL_APPLIED)
    return False

def apply_new_config(path, payload):
    # Create a new revision ID
    changeset = create_nvue_changest()
    print("Using NVUE Changeset: '{}'".format(changeset))
    # Delete existing configuration
    query_string = {"rev": changeset}
    r = requests.delete(url=nvue_end_point + path,
                        auth=auth,
                        verify=False,
                        params=query_string,
                        headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Patch the new configuration
    query_string = {"rev": changeset}
    r = requests.patch(url=nvue_end_point + path,
                       auth=auth,
                       verify=False,
                       data=json.dumps(payload),
                       params=query_string,
                       headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Apply the changes to the new revision changeset
    apply_nvue_changeset(changeset)
    # Check if the changeset was applied
    is_config_applied(changeset)

def nvue_get(path):
    r = requests.get(url=nvue_end_point + path,
                     auth=auth,
                     verify=False)
    print_request(r.request)
    print_response(r)

if __name__ == "__main__":
    payload = {
        "interface": {
            "bond0": {
                "bond": {
                    "member": {
                        "swp1": {},
                        "swp2": {},
                        "swp3": {},
                        "swp4": {}
                    }
                },
                "type": "bond"
            },
            "lo": {
                "ip": {
                    "address": {
                        "99.99.99.99/32": {}
                    }
                }
            }
        },
        "system": {
            "hostname": "switch01"
        }
    }
    apply_new_config("/", payload)
    time.sleep(DUMMY_SLEEP)
    nvue_get("/interface/bond0")
    nvue_get("/interface/lo")
    nvue_get("/system")
API Examples
The following section provides practical API examples.
Configure the System
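To set the system hostname, pre-login or post-login message, and time zone on the switch, send a targeted API request to /nvue_v1/system or, as in the following examples, a root patch to /nvue_v1 with the settings under the system object.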
cumulus@switch:~$ curl -u 'cumulus:cumulus' -d '{"system": {"hostname":"switch01","timezone":"America/Los_Angeles","message":{"pre-login":"Welcome to NVIDIA Cumulus Linux","post-login":"You have successfully logged in to switch01"}}}' -k -X PATCH https://127.0.0.1:8765/nvue_v1/?rev=4
#!/usr/bin/env python3

import requests
from requests.auth import HTTPBasicAuth
import json
import time

auth = HTTPBasicAuth(username="cumulus", password="password")
nvue_end_point = "https://127.0.0.1:8765/nvue_v1"
mime_header = {"Content-Type": "application/json"}

DUMMY_SLEEP = 5  # In seconds
POLL_APPLIED = 1  # in seconds
RETRIES = 10

def print_request(r: requests.Request):
    print("=======Request=======")
    print("URL:", r.url)
    print("Headers:", r.headers)
    print("Body:", r.body)

def print_response(r: requests.Response):
    print("=======Response=======")
    print("Headers:", r.headers)
    print("Body:", json.dumps(r.json(), indent=2))

def create_nvue_changest():
    r = requests.post(url=nvue_end_point + "/revision",
                      auth=auth,
                      verify=False)
    print_request(r.request)
    print_response(r)
    response = r.json()
    changeset = response.popitem()[0]
    return changeset

def apply_nvue_changeset(changeset):
    # apply_payload = {"state": "apply"}
    apply_payload = {"state": "apply", "auto-prompt": {"ays": "ays_yes"}}
    url = nvue_end_point + "/revision/" + requests.utils.quote(changeset,
                                                               safe="")
    r = requests.patch(url=url,
                       auth=auth,
                       verify=False,
                       data=json.dumps(apply_payload),
                       headers=mime_header)
    print_request(r.request)
    print_response(r)

def is_config_applied(changeset) -> bool:
    # Check if the configuration was indeed applied
    global RETRIES
    global POLL_APPLIED
    retries = RETRIES
    while retries > 0:
        r = requests.get(url=nvue_end_point + "/revision/" + requests.utils.quote(changeset, safe=""),
                         auth=auth,
                         verify=False)
        response = r.json()
        print(response)
        if response["state"] == "applied":
            return True
        retries -= 1
        time.sleep(POLL_APPLIED)
    return False

def apply_new_config(path, payload):
    # Create a new revision ID
    changeset = create_nvue_changest()
    print("Using NVUE Changeset: '{}'".format(changeset))
    # Delete existing configuration
    query_string = {"rev": changeset}
    r = requests.delete(url=nvue_end_point + path,
                        auth=auth,
                        verify=False,
                        params=query_string,
                        headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Patch the new configuration
    query_string = {"rev": changeset}
    r = requests.patch(url=nvue_end_point + path,
                       auth=auth,
                       verify=False,
                       data=json.dumps(payload),
                       params=query_string,
                       headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Apply the changes to the new revision changeset
    apply_nvue_changeset(changeset)
    # Check if the changeset was applied
    is_config_applied(changeset)

def nvue_get(path):
    r = requests.get(url=nvue_end_point + path,
                     auth=auth,
                     verify=False)
    print_request(r.request)
    print_response(r)

if __name__ == "__main__":
    payload = {
        "system": {
            "hostname": "switch01",
            "timezone": "America/Los_Angeles",
            "message": {
                "pre-login": "Welcome to NVIDIA Cumulus Linux",
                "post-login": "You have successfully logged in to switch01"
            }
        }
    }
    apply_new_config("/", payload)  # Root patch
    time.sleep(DUMMY_SLEEP)
    nvue_get("/system")
cumulus@switch:~$ nv set system hostname switch01
cumulus@switch:~$ nv set system timezone America/Los_Angeles
cumulus@switch:~$ nv set system message pre-login "Welcome to NVIDIA Cumulus Linux"
cumulus@switch:~$ nv set system message post-login "You have successfully logged into switch01"
Configure Services
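To set up NTP, DNS, and syslog on the switch, send a targeted API request to /nvue_v1/service or, as in the following example, a root patch to /nvue_v1 with the settings under the service object.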
#!/usr/bin/env python3

import requests
from requests.auth import HTTPBasicAuth
import json
import time

auth = HTTPBasicAuth(username="cumulus", password="password")
nvue_end_point = "https://127.0.0.1:8765/nvue_v1"
mime_header = {"Content-Type": "application/json"}

DUMMY_SLEEP = 5  # In seconds
POLL_APPLIED = 1  # in seconds
RETRIES = 10

def print_request(r: requests.Request):
    print("=======Request=======")
    print("URL:", r.url)
    print("Headers:", r.headers)
    print("Body:", r.body)

def print_response(r: requests.Response):
    print("=======Response=======")
    print("Headers:", r.headers)
    print("Body:", json.dumps(r.json(), indent=2))

def create_nvue_changest():
    r = requests.post(url=nvue_end_point + "/revision",
                      auth=auth,
                      verify=False)
    print_request(r.request)
    print_response(r)
    response = r.json()
    changeset = response.popitem()[0]
    return changeset

def apply_nvue_changeset(changeset):
    # apply_payload = {"state": "apply"}
    apply_payload = {"state": "apply", "auto-prompt": {"ays": "ays_yes"}}
    url = nvue_end_point + "/revision/" + requests.utils.quote(changeset,
                                                               safe="")
    r = requests.patch(url=url,
                       auth=auth,
                       verify=False,
                       data=json.dumps(apply_payload),
                       headers=mime_header)
    print_request(r.request)
    print_response(r)

def is_config_applied(changeset) -> bool:
    # Check if the configuration was indeed applied
    global RETRIES
    global POLL_APPLIED
    retries = RETRIES
    while retries > 0:
        r = requests.get(url=nvue_end_point + "/revision/" + requests.utils.quote(changeset, safe=""),
                         auth=auth,
                         verify=False)
        response = r.json()
        print(response)
        if response["state"] == "applied":
            return True
        retries -= 1
        time.sleep(POLL_APPLIED)
    return False

def apply_new_config(path, payload):
    # Create a new revision ID
    changeset = create_nvue_changest()
    print("Using NVUE Changeset: '{}'".format(changeset))
    # Delete existing configuration
    query_string = {"rev": changeset}
    r = requests.delete(url=nvue_end_point + path,
                        auth=auth,
                        verify=False,
                        params=query_string,
                        headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Patch the new configuration
    query_string = {"rev": changeset}
    r = requests.patch(url=nvue_end_point + path,
                       auth=auth,
                       verify=False,
                       data=json.dumps(payload),
                       params=query_string,
                       headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Apply the changes to the new revision changeset
    apply_nvue_changeset(changeset)
    # Check if the changeset was applied
    is_config_applied(changeset)

def nvue_get(path):
    r = requests.get(url=nvue_end_point + path,
                     auth=auth,
                     verify=False)
    print_request(r.request)
    print_response(r)

if __name__ == "__main__":
    payload = {
        "service": {
            "ntp": {
                "default": {
                    "server": {
                        "4.cumulusnetworks.pool.ntp.org": {
                            "iburst": "on"
                        }
                    }
                }
            },
            "dns": {
                "mgmt": {
                    "server": {
                        "192.168.1.100": {}
                    }
                }
            },
            "syslog": {
                "mgmt": {
                    "server": {
                        "192.168.1.120": {
                            "port": 8000
                        }
                    }
                }
            }
        }
    }
    apply_new_config("/", payload)  # Root patch
    time.sleep(DUMMY_SLEEP)
    nvue_get("/service/ntp")
    nvue_get("/service/dns")
    nvue_get("/service/syslog")
cumulus@switch:~$ nv set service ntp default server 4.cumulusnetworks.pool.ntp.org iburst on
cumulus@switch:~$ nv set service dns mgmt server 192.168.1.100
cumulus@switch:~$ nv set service syslog mgmt server 192.168.1.120 port 8000
Configure Users
The following example creates a new user, then deletes the user.
#!/usr/bin/env python3
import requests
from requests.auth import HTTPBasicAuth
import json
import time

auth = HTTPBasicAuth(username="cumulus", password="password")
nvue_end_point = "https://127.0.0.1:8765/nvue_v1"
mime_header = {"Content-Type": "application/json"}
DUMMY_SLEEP = 5 # In seconds
POLL_APPLIED = 1 # in seconds
RETRIES = 10

def print_request(r: requests.Request):
    print("=======Request=======")
    print("URL:", r.url)
    print("Headers:", r.headers)
    print("Body:", r.body)

def print_response(r: requests.Response):
    print("=======Response=======")
    print("Headers:", r.headers)
    print("Body:", json.dumps(r.json(), indent=2))

def create_nvue_changest():
    r = requests.post(url=nvue_end_point + "/revision",
                      auth=auth,
                      verify=False)
    print_request(r.request)
    print_response(r)
    response = r.json()
    changeset = response.popitem()[0]
    return changeset

def apply_nvue_changeset(changeset):
    # apply_payload = {"state": "apply"}
    apply_payload = {"state": "apply", "auto-prompt": {"ays": "ays_yes"}}
    url = nvue_end_point + "/revision/" + requests.utils.quote(changeset,
                                                               safe="")
    r = requests.patch(url=url,
                       auth=auth,
                       verify=False,
                       data=json.dumps(apply_payload),
                       headers=mime_header)
    print_request(r.request)
    print_response(r)

def is_config_applied(changeset) -> bool:
    # Check if the configuration was indeed applied
    global RETRIES
    global POLL_APPLIED
    retries = RETRIES
    while retries > 0:
        r = requests.get(url=nvue_end_point + "/revision/" + requests.utils.quote(changeset, safe=""),
                         auth=auth,
                         verify=False)
        response = r.json()
        print(response)
        if response["state"] == "applied":
            return True
        retries -= 1
        time.sleep(POLL_APPLIED)
    return False

def apply_new_config(path,payload):
    # Create a new revision ID
    changeset = create_nvue_changest()
    print("Using NVUE Changeset: '{}'".format(changeset))
    # Delete existing configuration
    query_string = {"rev": changeset}
    r = requests.delete(url=nvue_end_point + path,
                        auth=auth,
                        verify=False,
                        params=query_string,
                        headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Patch the new configuration
    query_string = {"rev": changeset}
    r = requests.patch(url=nvue_end_point + path,
                       auth=auth,
                       verify=False,
                       data=json.dumps(payload),
                       params=query_string,
                       headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Apply the changes to the new revision changeset
    apply_nvue_changeset(changeset)
    # Check if the changeset was applied
    is_config_applied(changeset)

def delete_config(path):
    # Create an NVUE changeset
    changeset = create_nvue_changest()
    print("Using NVUE Changeset: '{}'".format(changeset))
    # Equivalent to JSON `null`
    payload = None
    # Stage the change
    query_string = {"rev": changeset}
    r = requests.delete(url=nvue_end_point + path,
                        auth=auth,
                        verify=False,
                        data=json.dumps(payload),
                        params=query_string,
                        headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Apply the staged changeset
    apply_nvue_changeset(changeset)
    # Check if the changeset was applied
    is_config_applied(changeset)

def nvue_get(path):
    r = requests.get(url=nvue_end_point + path,
                     auth=auth,
                     verify=False)
    print_request(r.request)
    print_response(r)

if __name__ == "__main__":
    # Need to create a hashed password - The supported password
    # hashes are documented here:
    # https://docs.nvidia.com/networking-ethernet-software/cumulus-linux-55/System-Configuration/Authentication-Authorization-and-Accounting/User-Accounts/#hashed-passwords # noqa
    # Here in this example, we use SHA-512
    # Note: the crypt module was removed from the standard library in Python 3.13.
    import crypt
    hashed_password = crypt.crypt("hello$world#2023", salt=crypt.METHOD_SHA512)
    payload = {
        "system": {
            "aaa": {
                "user": {
                    "test1": {
                        "hashed-password": hashed_password,
                        "role": "nvue-monitor",
                        "enable": "on",
                        "full-name": "Test User",
                    }
                }
            }
        }
    }
    apply_new_config("/",payload) # Root patch
    time.sleep(DUMMY_SLEEP)
    nvue_get("/system/aaa/user")
    # Delete an existing user account using the AAA API.
    delete_config("/system/aaa/user/test1")
    time.sleep(DUMMY_SLEEP)
    nvue_get("/system/aaa/user")
This example creates a new user test1.
cumulus@switch:~$ nv set system aaa user test1
cumulus@switch:~$ nv set system aaa user test1 full-name "Test User"
cumulus@switch:~$ nv set system aaa user test1 password "abcd@test"
cumulus@switch:~$ nv set system aaa user test1 role nvue-monitor
cumulus@switch:~$ nv set system aaa user test1 enable on
#!/usr/bin/env python3
import requests
from requests.auth import HTTPBasicAuth
import json
import time

auth = HTTPBasicAuth(username="cumulus", password="password")
nvue_end_point = "https://127.0.0.1:8765/nvue_v1"
mime_header = {"Content-Type": "application/json"}
DUMMY_SLEEP = 5 # In seconds
POLL_APPLIED = 1 # in seconds
RETRIES = 10

def print_request(r: requests.Request):
    print("=======Request=======")
    print("URL:", r.url)
    print("Headers:", r.headers)
    print("Body:", r.body)

def print_response(r: requests.Response):
    print("=======Response=======")
    print("Headers:", r.headers)
    print("Body:", json.dumps(r.json(), indent=2))

def create_nvue_changest():
    r = requests.post(url=nvue_end_point + "/revision",
                      auth=auth,
                      verify=False)
    print_request(r.request)
    print_response(r)
    response = r.json()
    changeset = response.popitem()[0]
    return changeset

def apply_nvue_changeset(changeset):
    # apply_payload = {"state": "apply"}
    apply_payload = {"state": "apply", "auto-prompt": {"ays": "ays_yes"}}
    url = nvue_end_point + "/revision/" + requests.utils.quote(changeset,
                                                               safe="")
    r = requests.patch(url=url,
                       auth=auth,
                       verify=False,
                       data=json.dumps(apply_payload),
                       headers=mime_header)
    print_request(r.request)
    print_response(r)

def is_config_applied(changeset) -> bool:
    # Check if the configuration was indeed applied
    global RETRIES
    global POLL_APPLIED
    retries = RETRIES
    while retries > 0:
        r = requests.get(url=nvue_end_point + "/revision/" + requests.utils.quote(changeset, safe=""),
                         auth=auth,
                         verify=False)
        response = r.json()
        print(response)
        if response["state"] == "applied":
            return True
        retries -= 1
        time.sleep(POLL_APPLIED)
    return False

def apply_new_config(path,payload):
    # Create a new revision ID
    changeset = create_nvue_changest()
    print("Using NVUE Changeset: '{}'".format(changeset))
    # Delete existing configuration
    query_string = {"rev": changeset}
    r = requests.delete(url=nvue_end_point + path,
                        auth=auth,
                        verify=False,
                        params=query_string,
                        headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Patch the new configuration
    query_string = {"rev": changeset}
    r = requests.patch(url=nvue_end_point + path,
                       auth=auth,
                       verify=False,
                       data=json.dumps(payload),
                       params=query_string,
                       headers=mime_header)
    print_request(r.request)
    print_response(r)
    # Apply the changes to the new revision changeset
    apply_nvue_changeset(changeset)
    # Check if the changeset was applied
    is_config_applied(changeset)

def nvue_get(path):
    r = requests.get(url=nvue_end_point + path,
                     auth=auth,
                     verify=False)
    print_request(r.request)
    print_response(r)

if __name__ == "__main__":
    rt_payload = {
        "bgp":
        {
            "autonomous-system": 65101,
            "router-id":"10.10.10.1"
        }
    }
    apply_new_config("/router",rt_payload)
    vrf_payload = {
        "bgp":
        {
            "neighbor":
            {
                "swp51":
                {
                    "remote-as":"external"
                }
            },
            "address-family":
            {
                "ipv4-unicast":
                {
                    "network":
                    {
                        "10.10.10.1/32":{}
                    }
                }
            }
        }
    }
    apply_new_config("/vrf/default/router",vrf_payload)
    time.sleep(DUMMY_SLEEP)
    nvue_get("/router")
    nvue_get("/vrf/default/router")
cumulus@switch:~$ nv set router bgp autonomous-system 65101
cumulus@switch:~$ nv set router bgp router-id 10.10.10.1
cumulus@switch:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@switch:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.1/32
Action Operations
The NVUE action operations are ephemeral operations that do not modify the state of the configuration; they reset counters for interfaces, BGP, QoS buffers and pools, and remove conflicts from protodown MLAG bonds.
In the following Python example, the full_config_example() method sets the system pre-login message, enables BGP globally, and changes a few other configuration settings in a single bulk operation. The API end-point goes to the root node /nvue_v1. The bridge_config_example() method performs a targeted API request to /nvue_v1/bridge/domain/<domain-id> to set the vlan-vni-offset attribute.
To try out the NVUE REST API, use the NVUE API Lab available on NVIDIA Air. The lab provides a basic example to help you get started. You can also try out the other examples in this document.
Unlike the NVUE CLI, the NVUE API does not support configuring a plain text password for a user account; you must configure a hashed password for a user account with the NVUE API.
If you need to make multiple updates on the switch, NVIDIA recommends you use a root patch, which can make configuration changes with fewer round trips to the switch. Running many specific NVUE PATCH APIs to set or unset objects requires many round trips to the switch to set up the HTTP connection, transfer payload and responses, manage network utilization, and so on.
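As a rough illustration of the root patch recommendation, the following sketch combines several path-specific payloads into one payload for the root node. The helper names (nest_under_path, deep_merge, build_root_payload) are hypothetical and not part of NVUE, and the sketch assumes that each URL path segment maps to one nesting level in the root payload, as in the user account example above.

```python
def nest_under_path(path, payload):
    """Wrap a payload in nested dicts, one level per path segment."""
    wrapped = payload
    # Walk the segments from right to left so the first segment ends up outermost.
    for segment in reversed([s for s in path.split("/") if s]):
        wrapped = {segment: wrapped}
    return wrapped


def deep_merge(base, extra):
    """Recursively merge extra into base and return base."""
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base


def build_root_payload(changes):
    """Combine {path: payload} pairs into a single payload for the root node."""
    root = {}
    for path, payload in changes.items():
        deep_merge(root, nest_under_path(path, payload))
    return root


root_payload = build_root_payload({
    "/router": {"bgp": {"autonomous-system": 65101, "router-id": "10.10.10.1"}},
    "/vrf/default/router": {"bgp": {"neighbor": {"swp51": {"remote-as": "external"}}}},
})
print(root_payload)
```

The combined payload can then go to the switch in one PATCH request against the root node instead of one request per path.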
The ntpd daemon running on the switch implements the NTP protocol. It synchronizes the system time with time servers in the /etc/ntp.conf file. The ntpd daemon starts at boot by default.
If you intend to run this service within a VRF, including the management VRF, follow these steps to configure the service.
Configure NTP Servers
The default NTP configuration includes the following servers, which are in the /etc/ntp.conf file:
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
To add the NTP servers you want to use, run the following commands. Include the iburst option to increase the sync speed.
The NVUE command requires a VRF. The following command adds the NTP servers in the default VRF.
cumulus@switch:~$ nv set service ntp default server 4.cumulusnetworks.pool.ntp.org iburst on
cumulus@switch:~$ nv config apply
Edit the /etc/ntp.conf file to add or update NTP server information:
cumulus@switch:~$ sudo nano /etc/ntp.conf
# pool.ntp.org maps to about 1000 low-stratum NTP servers. Your server will
# pick a different set every time it starts up. Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 0.cumulusnetworks.pool.ntp.org iburst
server 1.cumulusnetworks.pool.ntp.org iburst
server 2.cumulusnetworks.pool.ntp.org iburst
server 3.cumulusnetworks.pool.ntp.org iburst
server 4.cumulusnetworks.pool.ntp.org iburst
To set the initial date and time with NTP before starting the ntpd daemon, run the ntpd -q command. Be aware that ntpd -q can hang if the time servers are not reachable.
cumulus@switch:~$ nv show service ntp default server
cumulus@switch:~$ ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
+ec2-34-225-6-20 129.6.15.30 2 u 73 1024 377 70.414 -2.414 4.110
+lax1.m-d.net 132.163.96.1 2 u 69 1024 377 11.676 0.155 2.736
*69.195.159.158 199.102.46.72 2 u 133 1024 377 48.047 -0.457 1.856
-2.time.dbsinet. 198.60.22.240 2 u 1057 1024 377 63.973 2.182 2.692
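If you want to check the ntpq -p output from a script, a minimal sketch such as the following can parse it. The first character of each peer line is the tally code: * marks the peer that the system currently synchronizes to, + marks a candidate, and - marks an outlier. The function names are hypothetical, and the sketch assumes the ten-column layout shown above.

```python
TALLY_MEANING = {
    "*": "system peer",
    "+": "candidate",
    "-": "outlier",
}


def parse_ntpq_peers(output):
    """Parse `ntpq -p` text into a list of dicts, skipping header lines."""
    peers = []
    for line in output.splitlines():
        if not line.strip() or line.startswith("=") or line.split()[0] == "remote":
            continue
        # The tally code, if present, is the first character of the line.
        tally = line[0] if line[0] in TALLY_MEANING else ""
        fields = line[len(tally):].split()
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "refid": fields[1],
            "stratum": int(fields[2]),
            "offset_ms": float(fields[8]),
        })
    return peers


def system_peer(peers):
    """Return the peer marked with '*', or None if the clock is not synchronized."""
    for peer in peers:
        if peer["tally"] == "*":
            return peer
    return None


sample = """\
remote refid st t when poll reach delay offset jitter
==============================================================================
+ec2-34-225-6-20 129.6.15.30 2 u 73 1024 377 70.414 -2.414 4.110
+lax1.m-d.net 132.163.96.1 2 u 69 1024 377 11.676 0.155 2.736
*69.195.159.158 199.102.46.72 2 u 133 1024 377 48.047 -0.457 1.856
-2.time.dbsinet. 198.60.22.240 2 u 1057 1024 377 63.973 2.182 2.692
"""
peers = parse_ntpq_peers(sample)
print(system_peer(peers))
```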
The following example commands remove some of the default NTP servers:
cumulus@switch:~$ nv unset service ntp default server 0.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv unset service ntp default server 1.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv unset service ntp default server 2.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv unset service ntp default server 3.cumulusnetworks.pool.ntp.org
cumulus@switch:~$ nv config apply
Edit the /etc/ntp.conf file to delete NTP servers.
cumulus@switch:~$ sudo nano /etc/ntp.conf
...
# pool.ntp.org maps to about 1000 low-stratum NTP servers. Your server will
# pick a different set every time it starts up. Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
server 4.cumulusnetworks.pool.ntp.org iburst
...
Specify the NTP Source Interface
By default, the source interface that NTP uses is eth0. The following example command configures the NTP source interface to be swp10.
cumulus@switch:~$ nv set service ntp default listen swp10
cumulus@switch:~$ nv config apply
Edit the /etc/ntp.conf file and modify the entry under the Specify interfaces comment.
You can use DHCP to specify your NTP servers. Ensure that the DHCP-generated configuration file /run/ntp.conf.dhcp exists. The /etc/dhcp/dhclient-exit-hooks.d/ntp script generates this file, which is a copy of the default /etc/ntp.conf file with a modified server list from the DHCP server. If this file does not exist and you plan on using DHCP in the future, you can copy your current /etc/ntp.conf file to the location of the DHCP file.
To use DHCP to specify your NTP servers, run the sudo -E systemctl edit ntp.service command and add the ExecStart= line:
The sudo -E systemctl edit ntp.service command always updates the base ntp.service even if you use ntp@mgmt.service. The ntp@mgmt.service is re-generated automatically.
To validate your configuration, run these commands:
If the state is not Active, or the alternate configuration file does not appear in the ntp command line, it is likely that you made a configuration mistake. Correct the mistake and rerun the commands above to verify.
Configure NTP with Authorization Keys
For added security, you can configure NTP to use authorization keys.
Configure the NTP Server
Create a .keys file, such as /etc/ntp.keys. Specify a key identifier (a number between 1 and 65535), an encryption method (M for MD5), and the password. The following provides an example:
#
# PLEASE DO NOT USE THE DEFAULT VALUES HERE.
#
#65535 M akey
#1 M pass
1 M CumulusLinux!
In the /etc/ntp.conf file, add a pointer to the /etc/ntp.keys file you created above and specify the key identifier. For example:
Restart NTP with the sudo systemctl restart ntp command.
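Before you restart NTP, you might want to sanity-check the keys file. The following is a minimal sketch, assuming the three-column layout described above (key identifier, method, password); parse_ntp_keys is a hypothetical helper, not a tool that ships with Cumulus Linux.

```python
def parse_ntp_keys(text):
    """Return {key_id: (method, password)} for the uncommented lines."""
    keys = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key_id_str, method, password = line.split(None, 2)
        key_id = int(key_id_str)
        if not 1 <= key_id <= 65535:
            raise ValueError(f"key identifier out of range: {key_id}")
        keys[key_id] = (method, password)
    return keys


sample = """\
#
# PLEASE DO NOT USE THE DEFAULT VALUES HERE.
#
#65535 M akey
#1 M pass
1 M CumulusLinux!
"""
print(parse_ntp_keys(sample))
```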
Configure the NTP Client
The NTP client is the Cumulus Linux switch.
Create the same .keys file you created on the NTP server (/etc/ntp.keys). For example:
cumulus@switch:~$ sudo nano /etc/ntp.keys
#
# DO NOT USE THE DEFAULT VALUES HERE.
#
#65535 M akey
#1 M pass
1 M CumulusLinux!
Edit the /etc/ntp.conf file to specify the server you want to use, the key identifier, and a pointer to the /etc/ntp.keys file you created in step 1. For example:
cumulus@switch:~$ sudo nano /etc/ntp.conf
...
# You do need to talk to an NTP server or two (or three).
#pool ntp.your-provider.example
# OR
#server ntp.your-provider.example
# pool.ntp.org maps to about 1000 low-stratum NTP servers. Your server will
# pick a different set every time it starts up. Please consider joining the
# pool: <http://www.pool.ntp.org/join.html>
#server 0.cumulusnetworks.pool.ntp.org iburst
#server 1.cumulusnetworks.pool.ntp.org iburst
#server 2.cumulusnetworks.pool.ntp.org iburst
#server 3.cumulusnetworks.pool.ntp.org iburst
server 10.50.23.121 key 1
#keys
keys /etc/ntp.keys
trustedkey 1
controlkey 1
requestkey 1
...
Restart NTP in the active VRF (default or management). For example:
Wait a few minutes, then run the ntpq -c as command to verify the configuration:
cumulus@switch:~$ ntpq -c as
ind assid status conf reach auth condition last_event cnt
===========================================================
1 40828 f014 yes yes ok reject reachable 1
After a successful authorization, you see the following command output:
cumulus@switch:~$ ntpq -c as
ind assid status conf reach auth condition last_event cnt
===========================================================
1 40828 f61a yes yes ok sys.peer sys_peer 1
Considerations
NTP in Cumulus Linux uses the /usr/share/zoneinfo/leap-seconds.list file, which expires periodically and results in generated log messages about the expiration. When the file expires, update it from https://www.ietf.org/timezones/data/leap-seconds.list or upgrade the tzdata package to the newest version.
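To find out when the leap seconds file expires, you can read its expiration line. The sketch below assumes the usual leap-seconds.list layout, where the line that starts with #@ holds the expiration time as seconds since 1900-01-01 (the NTP epoch). The function names and the sample text are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# NTP timestamps count seconds from 1900-01-01 00:00:00 UTC.
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)


def leap_file_expiry(text):
    """Return the expiration time from the '#@' line, or None if it is missing."""
    for line in text.splitlines():
        if line.startswith("#@"):
            ntp_seconds = int(line[2:].split()[0])
            return NTP_EPOCH + timedelta(seconds=ntp_seconds)
    return None


def is_expired(text, now=None):
    """Report whether the file has expired relative to `now` (default: current UTC time)."""
    expiry = leap_file_expiry(text)
    if expiry is None:
        raise ValueError("no '#@' expiration line found")
    now = now or datetime.now(timezone.utc)
    return now >= expiry


# Illustrative content only; read /usr/share/zoneinfo/leap-seconds.list on a real switch.
sample = "# comment line\n#@\t3944332800\n"
print(leap_file_expiry(sample).isoformat())
```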
Cumulus Linux supports IEEE 1588-2008 Precision Timing Protocol (PTPv2), which defines the algorithm and method for synchronizing clocks of various devices across packet-based networks, including Ethernet switches and IP routers.
PTP is capable of sub-microsecond accuracy. The clocks are in a master-slave hierarchy, where the slaves synchronize to their masters, which can be slaves to their own masters. The best master clock (BMC) algorithm, which runs on every clock, creates and updates the hierarchy automatically. The grandmaster clock is the top-level master. To provide a high degree of accuracy, a Global Positioning System (GPS) time source typically synchronizes the grandmaster clock.
In the following example:
Boundary clock 2 receives time from Master 1 (the grandmaster) on a PTP slave port, sets its clock and passes the time down from the PTP master port to Boundary clock 1.
Boundary clock 1 receives the time on a PTP slave port, sets its clock and passes the time down the hierarchy through the PTP master ports to the hosts that receive the time.
Cumulus Linux and PTP
PTP in Cumulus Linux uses the linuxptp package that includes the following programs:
ptp4l provides the PTP protocol and state machines
phc2sys provides PTP Hardware Clock and System Clock synchronization
timemaster provides System Clock and PTP synchronization
Cumulus Linux supports:
PTP boundary clock mode only (the switch provides timing to downstream servers; it is a slave to a higher-level clock and a master to downstream clocks).
UDPv4, UDPv6, and 802.3 encapsulation.
Only a single PTP domain per network.
PTP on layer 3 interfaces, layer 3 bonds, trunk ports, and switch ports belonging to a VLAN.
Multicast, unicast, and mixed message mode.
End-to-End delay mechanism only. Cumulus Linux does not support Peer-to-Peer.
Two-step clock correction mode, where PTP notes the time when the packet goes out of the port and sends the time in a separate (follow-up) message. Cumulus Linux does not support one-step mode.
Hardware time stamping for PTP packets. This allows PTP to avoid inaccuracies caused by message transfer delays and improves the accuracy of time synchronization.
You cannot run both PTP and NTP on the switch.
PTP supports the default VRF only.
1G links might have a lower accuracy for PTP due to hardware limitations. If your application needs high accuracy from PTP, use higher link speeds.
Basic Configuration
Basic PTP configuration requires you to:
Enable PTP on the switch.
Configure PTP on at least one interface; this can be a layer 3 routed port, switch port, or trunk port. You do not need to specify which is a master interface and which is a slave interface; the PTP Best Master Clock Algorithm (BMCA) determines the master and slave.
If you configure PTP with Linux commands, you must also enable PTP timestamping; see step 1 of the Linux procedure below. NVUE enables timestamping when you enable PTP on the switch.
The basic configuration shown below uses the default PTP settings:
The clock mode is Boundary. This is the only clock mode that Cumulus Linux supports.
The delay mechanism is End-to-End (E2E), where the slave measures the delay between itself and the master. The master and slave send delay request and delay response messages between each other to measure the delay.
The hardware packet time stamping mode is two-step.
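The End-to-End measurement in the list above can be sketched with the standard four-timestamp calculation. This is an illustration, not the ptp4l implementation; it assumes a symmetric path (equal delay in both directions), and the function name and sample timestamps are made up.

```python
def e2e_offset_and_delay(t1, t2, t3, t4):
    """Compute (offset_from_master, mean_path_delay) from four timestamps.

    t1: master sends Sync (master clock; carried in the follow-up message in two-step mode)
    t2: slave receives Sync (slave clock)
    t3: slave sends Delay Request (slave clock)
    t4: master receives Delay Request (master clock; returned in the Delay Response)
    """
    master_to_slave = t2 - t1  # path delay plus clock offset
    slave_to_master = t4 - t3  # path delay minus clock offset
    mean_path_delay = (master_to_slave + slave_to_master) / 2
    offset_from_master = (master_to_slave - slave_to_master) / 2
    return offset_from_master, mean_path_delay


# Timestamps in nanoseconds: the slave clock runs 500 ns ahead and the one-way delay is 200 ns.
offset, delay = e2e_offset_and_delay(t1=1000, t2=1700, t3=2000, t4=1700)
print("offset:", offset, "ns, mean path delay:", delay, "ns")
```

The slave then adjusts its clock by the computed offset.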
To configure other settings, such as the PTP profile, domain, priority, and DSCP, the PTP interface transport mode and timers, and PTP monitoring, see the Optional Configuration sections below.
The NVUE nv set service ptp commands require an instance number (1 in the example command below) for management purposes.
When you enable the PTP service with the nv set service ptp <instance> enable on command, NVUE restarts the switchd service, which causes all network ports to reset in addition to resetting the switch hardware configuration.
cumulus@switch:~$ nv set service ptp 1 enable on
cumulus@switch:~$ nv set interface swp1 ip address 10.0.0.9/32
cumulus@switch:~$ nv set interface swp2 ip address 10.0.0.10/32
cumulus@switch:~$ nv set interface swp1 ptp enable on
cumulus@switch:~$ nv set interface swp2 ptp enable on
cumulus@switch:~$ nv config apply
The configuration writes to the /etc/ptp4l.conf file.
cumulus@switch:~$ nv set service ptp 1 enable on
cumulus@switch:~$ nv set bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default type vlan-aware
cumulus@switch:~$ nv set bridge domain br_default vlan 10-30
cumulus@switch:~$ nv set bridge domain br_default vlan 10 ptp enable on
cumulus@switch:~$ nv set interface vlan10 type svi
cumulus@switch:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@switch:~$ nv set interface vlan10 ptp enable on
cumulus@switch:~$ nv set interface swp1 bridge domain br_default
cumulus@switch:~$ nv set interface swp1 bridge domain br_default vlan 10
cumulus@switch:~$ nv set interface swp1 ptp enable on
cumulus@switch:~$ nv config apply
You can configure only one address, either IPv4 or IPv6.
For IPv6, set the trunk port transport mode to ipv6.
The configuration writes to the /etc/ptp4l.conf file.
cumulus@switch:~$ nv set service ptp 1 enable on
cumulus@switch:~$ nv set bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default type vlan-aware
cumulus@switch:~$ nv set bridge domain br_default vlan 10-30
cumulus@switch:~$ nv set bridge domain br_default vlan 10 ptp enable on
cumulus@switch:~$ nv set interface vlan10 type svi
cumulus@switch:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@switch:~$ nv set interface swp2 bridge domain br_default
cumulus@switch:~$ nv set interface swp2 bridge domain br_default access 10
cumulus@switch:~$ nv set interface swp2 ptp enable on
cumulus@switch:~$ nv config apply
You can configure only one address, either IPv4 or IPv6.
For IPv6, set the trunk port transport mode to ipv6.
The configuration writes to the /etc/ptp4l.conf file.
Edit the /etc/cumulus/switchd.d/ptp.conf file to set the ptp.timestamping parameter to TRUE:
Edit the Default interface options section of the /etc/ptp4l.conf file to configure the interfaces on the switch that you want to use for PTP.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
[global]
#
# Default Data Set
#
slaveOnly 0
priority1 128
priority2 128
domainNumber 0
dscp_event 46
dscp_general 46
network_transport L2
dataset_comparison G.8275.x
G.8275.defaultDS.localPriority 128
ptp_dst_mac 01:80:C2:00:00:0E
#
# Port Data Set
#
logAnnounceInterval -3
logSyncInterval -4
logMinDelayReqInterval -4
announceReceiptTimeout 3
delay_mechanism E2E
offset_from_master_min_threshold -50
offset_from_master_max_threshold 50
mean_path_delay_threshold 200
tsmonitor_num_ts 100
tsmonitor_num_log_sets 3
tsmonitor_num_log_entries 4
tsmonitor_log_wait_seconds 1
#
# Run time options
#
logging_level 6
path_trace_enabled 0
use_syslog 1
verbose 0
summary_interval 0
#
# servo parameters
#
pi_proportional_const 0.000000
pi_integral_const 0.000000
pi_proportional_scale 0.700000
pi_proportional_exponent -0.300000
pi_proportional_norm_max 0.700000
pi_integral_scale 0.300000
pi_integral_exponent 0.400000
pi_integral_norm_max 0.300000
step_threshold 0.000002
first_step_threshold 0.000020
max_frequency 900000000
sanity_freq_limit 0
#
# Default interface options
#
time_stamping software
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
[swp2]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
For a trunk VLAN, add the VLAN configuration to the switch port stanza: set l2_mode to trunk, vlan_intf to the VLAN interface, and src_ip to the IP address of the VLAN interface:
For a switch port VLAN, add the VLAN configuration to the switch port stanza: set l2_mode to access, vlan_intf to the VLAN interface, and src_ip to the IP address of the VLAN interface:
Cumulus Linux provides several ways to modify the default basic global configuration. You can:
Use profiles.
Modify the parameters directly with NVUE commands.
Modify the Linux /etc/ptp4l.conf file.
When a predefined profile is set, NVUE does not allow you to configure global parameters. Do not edit the Linux /etc/ptp4l.conf file to modify the global parameters when a predefined profile is in use. For information about profiles, see PTP Profiles.
Clock Domains
PTP domains allow different independent timing systems to be present in the same network without confusing each other. A PTP domain is a network or a portion of a network within which all the clocks synchronize. Every PTP message contains a domain number. A PTP instance works in only one domain and ignores messages that contain a different domain number. Cumulus Linux supports only one domain in the system.
A network can contain multiple PTP clock domains; PTP isolates each domain from the others so that each domain is a separate PTP network. On the switch, specify a single domain number between 0 and 127.
The following example commands configure domain 3 when a profile is not set:
cumulus@switch:~$ nv set service ptp 1 domain 3
cumulus@switch:~$ nv config apply
Edit the Default Data Set section of the /etc/ptp4l.conf file to change the domainNumber setting, then restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
[global]
#
# Default Data Set
#
slaveOnly 0
priority1 128
priority2 128
domainNumber 3
...
The BMC selects the PTP master according to the criteria in the following order:
Priority 1
Clock class
Clock accuracy
Clock variance
Priority 2
Port ID
Use the PTP priority to select the best master clock. You can set priority 1 and 2:
Priority 1 overrides the clock class and quality selection criteria to select the best master clock.
Priority 2 identifies primary and backup clocks among identical redundant Grandmasters.
The range for both priority1 and priority2 is between 0 and 255. The default priority is 128. For the boundary clock, use a number above 128. A lower value takes precedence.
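The selection order listed above can be sketched as a tuple comparison in which the lowest value wins at each step. This is a simplified illustration, not the full BMC algorithm from IEEE 1588; the ClockInfo class and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockInfo:
    name: str
    priority1: int
    clock_class: int
    clock_accuracy: int
    clock_variance: int
    priority2: int
    port_id: str


def comparison_key(clock):
    """Order the fields as in the list above; Python compares tuples left to right."""
    return (clock.priority1, clock.clock_class, clock.clock_accuracy,
            clock.clock_variance, clock.priority2, clock.port_id)


def best_master(clocks):
    """Return the clock with the lowest comparison key."""
    return min(clocks, key=comparison_key)


grandmaster_a = ClockInfo("gm-a", 128, 6, 0x21, 0x4E5D, 128, "00:00:00:aa")
grandmaster_b = ClockInfo("gm-b", 128, 6, 0x21, 0x4E5D, 127, "00:00:00:bb")
boundary = ClockInfo("bc-1", 200, 248, 0xFE, 0xFFFF, 200, "00:00:00:cc")

print(best_master([boundary, grandmaster_a, grandmaster_b]).name)
```

In this sample, the two grandmasters tie on every field up to priority 2, so the lower priority2 value decides between them, and the boundary clock loses on priority 1.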
The following example commands set priority 1 and priority 2 to 200 when a profile is not set:
cumulus@switch:~$ nv set service ptp 1 priority1 200
cumulus@switch:~$ nv set service ptp 1 priority2 200
cumulus@switch:~$ nv config apply
Edit the Default Data Set section of the /etc/ptp4l.conf file to change the priority1 setting, the priority2 setting, or both, then restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
[global]
#
# Default Data Set
#
slaveOnly 0
priority1 200
priority2 200
domainNumber 3
...
Use the local priority when you create a custom profile based on a Telecom profile (ITU 8275-1 or ITU 8275-2). Modify the local priority in a custom profile to set the local priority of the local clock. You can set a value between 0 and 255. The default priority is 128.
The following example command configures the local priority to 10 for the custom profile called CUSTOM1, which is based on ITU 8275-2:
cumulus@switch:~$ nv set service ptp 1 profile CUSTOM1 local-priority 10
cumulus@switch:~$ nv config apply
Edit the G.8275.defaultDS.localPriority option in the /etc/ptp4l.conf file. After you save the /etc/ptp4l.conf file, restart the ptp4l service.
Optional global PTP configuration includes configuring the DiffServ code point (DSCP). You can configure the DSCP value for all PTP IPv4 packets originated locally. You can set a value between 0 and 63.
cumulus@switch:~$ nv set service ptp 1 ip-dscp 22
cumulus@switch:~$ nv config apply
Edit the Default Data Set section of the /etc/ptp4l.conf file to change the dscp_event setting for PTP messages that trigger a timestamp read from the clock and the dscp_general setting for PTP messages that carry commands, responses, information, or timestamps.
After you save the /etc/ptp4l.conf file, restart the ptp4l service.
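When you look at PTP packets in a capture, the DSCP value appears in the upper six bits of the IPv4 ToS (traffic class) byte. The following sketch shows the conversion; dscp_to_tos is a hypothetical helper used only for illustration.

```python
def dscp_to_tos(dscp):
    """Return the ToS byte for a DSCP value, assuming the two ECN bits are zero."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be between 0 and 63, got {dscp}")
    return dscp << 2


for dscp in (22, 46):
    print(f"DSCP {dscp} -> ToS byte 0x{dscp_to_tos(dscp):02x}")
```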
Cumulus Linux provides several ways to modify the default basic interface configuration. You can:
Use profiles
Modify the parameters directly with NVUE commands
Modify the Linux /etc/ptp4l.conf configuration file.
When a profile is in use, avoid configuring the following interface configuration parameters with NVUE or in the Linux configuration file so that the interface retains its profile settings.
Transport Mode
By default, Cumulus Linux encapsulates PTP messages in UDP IPv4 frames. To encapsulate PTP messages on an interface in UDP IPv6 frames:
cumulus@switch:~$ nv set interface swp1 ptp transport ipv6
cumulus@switch:~$ nv config apply
Edit the Default interface options section of the /etc/ptp4l.conf file to change the network_transport setting for the interface, then restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping hardware
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
network_transport UDPv6
[swp2]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
network_transport UDPv6
...
Cumulus Linux supports the following PTP message modes:
Multicast, where the ports subscribe to two multicast addresses, one for event messages with timestamps and the other for general messages without timestamps. The Sync message that the master sends is a multicast message; all slave ports receive this message because the slaves need the time from the master. The slave ports in turn generate a Delay Request to the master. This is a multicast message that the intended master for the message and other slave ports receive. Similarly, all slave ports in addition to the intended slave port receive the master’s Delay Response. The slave ports receiving the unintended Delay Requests and Responses need to drop the packets. This can affect network bandwidth if there are hundreds of slave ports.
Mixed, where Sync and Announce messages are multicast messages but Delay Request and Response messages are unicast. This avoids the issue seen in multicast message mode where every slave port sees Delay Requests and Responses from every other slave port.
Unicast, where you configure the port as a unicast client or server. See Unicast Mode.
Multicast mode is the default setting; when you enable PTP on an interface, the message mode is multicast.
To change the message mode to mixed on swp1:
cumulus@switch:~$ nv set interface swp1 ptp mixed-multicast-unicast on
cumulus@switch:~$ nv config apply
To change the message mode back to the default setting of multicast on swp1:
cumulus@switch:~$ nv set interface swp1 ptp mixed-multicast-unicast off
cumulus@switch:~$ nv config apply
Edit the Default interface options section of the /etc/ptp4l.conf file to add the hybrid_e2e 1 line under the interface, then restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping hardware
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
hybrid_e2e 1
...
To change the message mode back to the default setting of multicast, remove the hybrid_e2e line under the interface, then restart the ptp4l service.
PTP Interface Timers
You can set the following timers for PTP messages.
Timer | Description
announce-interval | The average interval between successive Announce messages. Specify the value as a power of two in seconds; for example, -3 means 2^-3 = 0.125 seconds.
announce-timeout | The number of announce intervals that have to occur without receiving an Announce message before a timeout occurs. Make sure that this value is longer than the announce-interval in your network.
delay-req-interval | The minimum average time interval allowed between successive Delay Request messages.
sync-interval | The interval between PTP synchronization messages on an interface. Specify the value as a power of two in seconds.
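Because the interval timers are exponents of two, it can help to convert them to seconds. The following sketch does the arithmetic for the default values shown in the /etc/ptp4l.conf example earlier in this section; the function names are hypothetical.

```python
def log_interval_to_seconds(log_value):
    """Convert a log2 interval setting to seconds (for example, -3 is 2**-3)."""
    return 2.0 ** log_value


def announce_timeout_seconds(log_announce_interval, receipt_timeout):
    """Time without Announce messages before a timeout: intervals multiplied by interval length."""
    return receipt_timeout * log_interval_to_seconds(log_announce_interval)


defaults = {
    "logAnnounceInterval": -3,
    "logSyncInterval": -4,
    "logMinDelayReqInterval": -4,
}
for name, log_value in defaults.items():
    seconds = log_interval_to_seconds(log_value)
    print(f"{name} {log_value}: {seconds} s ({1 / seconds:g} messages per second)")

print("announce timeout:", announce_timeout_seconds(-3, 3), "s")
```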
To set the timers with NVUE, run the nv set interface <interface> ptp timers <timer> <value> command.
To set the timers with Linux commands, edit the /etc/ptp4l.conf file and set the timers in the Default interface options section.
The following example sets the announce interval between successive Announce messages on swp1 to -1.
Edit the Default interface options section of the /etc/ptp4l.conf file:
To set the announce interval between successive Announce messages on swp1 to -1, change the logAnnounceInterval setting for the interface to -1.
To set the mean sync-interval for multicast messages on swp1 to -5, change the logSyncInterval setting for the interface to -5.
After you edit the /etc/ptp4l.conf file, restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping hardware
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
logAnnounceInterval -1
logSyncInterval -5
udp_ttl 20
masterOnly 1
delay_mechanism E2E
...
Set the local priority on an interface for a profile that uses ITU 8275-1 or ITU 8275-2. You can set a value between 0 and 255. The default priority is 128.
The following example sets the local priority on swp1 to 10.
By default, PTP ports are in auto mode, where the BMC algorithm determines the state of the port.
You can configure Forced Master mode on a PTP port so that it is always in a master state and the BMC algorithm does not run for this port. This port ignores any Announce messages it receives.
cumulus@switch:~$ nv set interface swp1 ptp forced-master on
cumulus@switch:~$ nv config apply
Edit the Default interface options section of the /etc/ptp4l.conf file to change the masterOnly setting for the interface, then restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping hardware
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
udp_ttl 1
masterOnly 1
delay_mechanism E2E
...
Edit the Default interface options section of the /etc/ptp4l.conf file to change the udp_ttl setting for the interface, then restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping hardware
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
udp_ttl 20
masterOnly 1
delay_mechanism E2E
...
Cumulus Linux supports unicast mode so that a unicast client can perform Unicast Discovery and Negotiation with servers. In the default multicast mode, both the server (master) and the client (slave) start sending out announce requests and discover each other. In unicast mode, the client starts by sending out requests for unicast transmission to every server address in its Unicast Master Table. The server responds to each request with an accept or a deny.
Global Unicast Configuration
Unicast clients need a unicast master table for unicast negotiation; you must configure at least one unicast master table on the switch.
To configure unicast globally:
Set the unicast table ID; a unique ID that identifies the unicast master table.
Set the unicast master address. You can set more than one unicast master address, which can be an IPv4, IPv6, or MAC address.
Optional: Set the unicast master query interval, which is the mean interval between requests for Announce messages. Specify this value as a power of two in seconds. You can specify a value between -3 and 4. The default value is 0 (2 to the power of 0, which is 1 second).
cumulus@switch:~$ nv set service ptp 1 unicast-master 1 address 10.10.10.1
cumulus@switch:~$ nv set service ptp 1 unicast-master 1 query-interval 4
cumulus@switch:~$ nv set interface swp1 ptp unicast-master-table-id 1
cumulus@switch:~$ nv config apply
Add the following lines at the end of the # Default interface options section of the /etc/ptp4l.conf file:
For interface unicast configuration, in addition to enabling PTP on an interface, you also need to configure the PTP interface to be either a unicast client or a unicast server.
When configuring multiple PTP interfaces on the switch to be unicast clients, you must configure a unicast table ID on every interface set as a unicast client. Each client must have a different table ID.
To configure a PTP interface to be the unicast client:
To show the unicast master table configuration on the switch, run the nv show service ptp <instance-id> unicast-master <table-id> command.
Optional Unicast Interface Configuration
You can set the unicast request duration for unicast clients, which is the service time in seconds requested by the unicast client during unicast negotiation. The default value is 300 seconds.
PTP profiles are a standardized set of configurations and rules intended to meet the requirements of a specific application. Profiles define required, allowed, and restricted PTP options, network restrictions, and performance requirements.
Cumulus Linux supports the following predefined profiles:
Profile                 IEEE 1588                 ITU 8275-1        ITU 8275-2
----------------------  ------------------------  ----------------  ----------------
Application             Enterprise                Mobile Networks   Mobile Networks
Transport               Layer 2 and Layer 3       Layer 2           Layer 3
Encapsulation           802.3, UDPv4, or UDPv6    802.3             UDPv4 or UDPv6
Transmission            Unicast and Multicast     Multicast         Unicast
Supported Clock Types   Boundary Clock            Boundary Clock    Boundary Clock
You cannot modify the predefined profiles. If you want to set a parameter to a different value in a predefined profile, you need to create a custom profile. You can modify a custom profile within the range applicable to the profile type.
You cannot set the current profile to a profile not yet created.
You cannot set global PTP parameters in a profile currently in use.
PTP profiles do not support VLANs or bonds.
If you set a predefined or custom profile, do not change any global PTP settings, such as the DSCP or the clock domain.
For better performance in a high scale network with PTP on multiple interfaces, configure a higher system policer rate with the nv set system control-plane policer lldp burst <value> and nv set system control-plane policer lldp rate <value> commands. The switch uses the LLDP policer for PTP protocol packets. The default value for the LLDP policer is 2500. When you use the ITU 8275.1 profile with higher sync rates, use higher policer values.
Set a Predefined Profile
To set a predefined profile:
To set the ITU 8275.1 profile, run the nv set service ptp <instance-id> current-profile default-itu-8275-1 command.
To set the ITU 8275.2 profile, run the nv set service ptp <instance-id> current-profile default-itu-8275-2 command.
The following example sets the profile to ITU 8275.1:
cumulus@switch:~$ nv set service ptp 1 current-profile default-itu-8275-1
cumulus@switch:~$ nv config apply
To set the IEEE 1588 profile:
cumulus@switch:~$ nv set service ptp 1 current-profile default-1588
cumulus@switch:~$ nv config apply
To set the predefined ITU 8275.1 profile, edit the /etc/ptp4l.conf file and set the parameters shown below, then restart the ptp4l service:
Set the profile type on which to base the new profile (itu-g-8275-1, itu-g-8275-2, or ieee-1588).
Update any of the profile settings you want to change (announce-interval, delay-req-interval, priority1, sync-interval, announce-timeout, domain, priority2, transport, delay-mechanism, local-priority).
Set the custom profile to be the current profile.
The following example commands create a custom profile called CUSTOM1 based on the predefined profile ITU 8275-1. The commands set the domain to 28 and the announce-timeout to 3, then set CUSTOM1 to be the current profile:
cumulus@switch:~$ nv set service ptp 1 profile CUSTOM1
cumulus@switch:~$ nv set service ptp 1 profile CUSTOM1 profile-type itu-g-8275-1
cumulus@switch:~$ nv set service ptp 1 profile CUSTOM1 domain 28
cumulus@switch:~$ nv set service ptp 1 profile CUSTOM1 announce-timeout 3
cumulus@switch:~$ nv set service ptp 1 current-profile CUSTOM1
cumulus@switch:~$ nv config apply
The following example /etc/ptp4l.conf file creates a custom profile based on the predefined profile ITU 8275-1 and sets the domain to 28 and the announce-timeout to 3.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
[global]
#
# Default Data Set
#
slaveOnly 0
priority1 128
priority2 128
domainNumber 28
dscp_event 46
dscp_general 46
network_transport L2
dataset_comparison G.8275.x
G.8275.defaultDS.localPriority 128
ptp_dst_mac 01:80:C2:00:00:0E
#
# Port Data Set
#
logAnnounceInterval 5
logSyncInterval -4
logMinDelayReqInterval -4
announceReceiptTimeout 3
delay_mechanism E2E
offset_from_master_min_threshold -50
offset_from_master_max_threshold 50
mean_path_delay_threshold 200
tsmonitor_num_ts 100
tsmonitor_num_log_sets 3
tsmonitor_num_log_entries 4
tsmonitor_log_wait_seconds 1
#
# Run time options
#
logging_level 6
path_trace_enabled 0
use_syslog 1
verbose 0
summary_interval 0
#
# servo parameters
#
pi_proportional_const 0.000000
pi_integral_const 0.000000
pi_proportional_scale 0.700000
pi_proportional_exponent -0.300000
pi_proportional_norm_max 0.700000
pi_integral_scale 0.300000
pi_integral_exponent 0.400000
pi_integral_norm_max 0.300000
step_threshold 0.000002
first_step_threshold 0.000020
max_frequency 900000000
sanity_freq_limit 0
#
# Default interface options
#
time_stamping software
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
[swp2]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
To show the current PTP profile setting, run the nv show service ptp <ptp-instance> command:
cumulus@switch:~$ nv show service ptp 1
operational applied description
--------------------------- ----------- ------------------ --------------------------------------------------------------------
enable on on Turn the feature 'on' or 'off'. The default is 'off'.
current-profile default-itu-8275-1 Current PTP profile index
domain 24 0 Domain number of the current syntonization
ip-dscp 46 46 Sets the Diffserv code point for all PTP packets originated locally.
priority1 128 128 Priority1 attribute of the local clock
priority2 128 128 Priority2 attribute of the local clock
...
To show the settings for a profile, run the nv show service ptp <instance> profile <profile-name> command:
The acceptable master table option is a security feature that prevents a rogue player from pretending to be the grandmaster clock to take over the PTP network. To use this feature, you configure the clock IDs of known grandmaster clocks in the acceptable master table and set the acceptable master table option on a PTP port. The BMC algorithm checks if the grandmaster clock received in the Announce message is in this table before proceeding with the master selection. Cumulus Linux disables this option by default on PTP ports.
The following example command adds the grandmaster clock ID 24:8a:07:ff:fe:f4:16:06 to the acceptable master table:
cumulus@switch:~$ nv set service ptp 1 acceptable-master 24:8a:07:ff:fe:f4:16:06
cumulus@switch:~$ nv config apply
You can also configure an alternate priority 1 value for the Grandmaster:
cumulus@switch:~$ nv set service ptp 1 acceptable-master 24:8a:07:ff:fe:f4:16:06 alt-priority 2
To enable the PTP acceptable master table option for swp1:
cumulus@switch:~$ nv set interface swp1 ptp acceptable-master on
cumulus@switch:~$ nv config apply
Edit the Default interface options section of the /etc/ptp4l.conf file to add acceptable_master_clockIdentity 248a07.fffe.f41606.
To enable the PTP acceptable master table option for swp1, add acceptable_master on under [swp1].
...
# Default interface options
#
time_stamping hardware
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
udp_ttl 20
masterOnly 1
delay_mechanism E2E
acceptable_master on
...
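The NVUE example writes the clock ID as colon-separated bytes (24:8a:07:ff:fe:f4:16:06), while the /etc/ptp4l.conf example writes the same ID in dotted form (248a07.fffe.f41606). The following is a minimal sketch of that conversion, assuming the dotted form always groups the 16 hex digits as 6, 4, and 6; clock_id_to_ptp4l is a hypothetical helper name, not a Cumulus Linux command.

```shell
#!/bin/sh
# Convert a colon-separated clock ID to the dotted form shown in the
# ptp4l.conf example: strip the colons, lowercase the hex digits, then
# regroup the 16 digits as 6.4.6.
clock_id_to_ptp4l() {
  printf '%s\n' "$1" | tr -d ':' | tr 'A-F' 'a-f' |
    awk '{ print substr($0, 1, 6) "." substr($0, 7, 4) "." substr($0, 11, 6) }'
}

clock_id_to_ptp4l 24:8a:07:ff:fe:f4:16:06   # prints 248a07.fffe.f41606
```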
Cumulus Linux provides the following optional PTP monitoring configuration.
Configure Clock Correction and Path Delay Thresholds
Cumulus Linux monitors clock correction and path delay against thresholds, and generates counters when PTP reaches the set thresholds. You can see the counters in the NVUE nv show command output and in log messages.
You can configure the following monitor settings:
Command
Description
nv set service ptp <instance> monitor min-offset-threshold
Sets the minimum difference allowed between the master and slave time. You can set a value between -1000000000 and 0 nanoseconds. The default value is -50 nanoseconds.
nv set service ptp <instance> monitor max-offset-threshold
Sets the maximum difference allowed between the master and slave time. You can set a value between 0 and 1000000000 nanoseconds. The default value is 50 nanoseconds.
nv set service ptp <instance> monitor path-delay-threshold
Sets the mean time that PTP packets take to travel between the master and slave. You can set a value between 0 and 1000000000 nanoseconds. The default value is 200 nanoseconds.
nv set service ptp <instance> monitor max-timestamp-entries
Sets the maximum number of timestamp entries allowed. Cumulus Linux updates the timestamps continuously. You can specify a value between 100 and 200. The default value is 100 entries.
The following example sets the minimum offset threshold to -1000, the maximum offset threshold to 1000, and the path delay threshold to 300:
cumulus@switch:~$ nv set service ptp 1 monitor min-offset-threshold -1000
cumulus@switch:~$ nv set service ptp 1 monitor max-offset-threshold 1000
cumulus@switch:~$ nv set service ptp 1 monitor path-delay-threshold 300
cumulus@switch:~$ nv config apply
You can configure the following monitor settings manually in the /etc/ptp4l.conf file. Be sure to run the sudo systemctl restart ptp4l.service to apply the settings.
Parameter
Description
offset_from_master_min_threshold
Sets the minimum difference allowed between the master and slave time. You can set a value between -1000000000 and 0 nanoseconds. The default value is -50 nanoseconds.
offset_from_master_max_threshold
Sets the maximum difference allowed between the master and slave time. You can set a value between 0 and 1000000000 nanoseconds. The default value is 50 nanoseconds.
mean_path_delay_threshold
Sets the mean time that PTP packets take to travel between the master and slave. You can set a value between 0 and 1000000000 nanoseconds. The default value is 200 nanoseconds.
The following example sets the minimum offset threshold to -1000, the maximum offset threshold to 1000, and the path delay threshold to 300:
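One way to make this edit without opening an editor is sketched below. It assumes each parameter already exists in the file as a single "key value" line, as in the example /etc/ptp4l.conf listings in this guide. set_ptp4l_option is a hypothetical helper, not part of Cumulus Linux, and the sketch runs against a scratch file rather than the real configuration.

```shell
#!/bin/sh
# Replace the value on an existing "key value" line in a
# ptp4l.conf-style file. The file is rewritten in place through a
# temporary copy so that its permissions stay unchanged.
set_ptp4l_option() {
  file=$1 key=$2 value=$3
  tmp=$(mktemp) || return 1
  sed "s/^${key}[[:space:]].*/${key} ${value}/" "$file" > "$tmp" &&
    cat "$tmp" > "$file" &&
    rm -f "$tmp"
}

# Scratch file holding the three monitor parameters at their defaults.
conf=$(mktemp)
cat > "$conf" <<'EOF'
offset_from_master_min_threshold   -50
offset_from_master_max_threshold   50
mean_path_delay_threshold          200
EOF

set_ptp4l_option "$conf" offset_from_master_min_threshold -1000
set_ptp4l_option "$conf" offset_from_master_max_threshold 1000
set_ptp4l_option "$conf" mean_path_delay_threshold 300
cat "$conf"
```

On a switch, the same function would need to run with sudo against /etc/ptp4l.conf, followed by sudo systemctl restart ptp4l.service to apply the settings.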
A log set contains the log entries for clock correction and path delay violations at different times. You can set the number of entries to log and the interval between successive violation logs.
Command
Description
nv set service ptp 1 monitor max-violation-log-sets
Sets the maximum number of log sets allowed. You can specify a value between 2 and 4. The default value is 3.
nv set service ptp 1 monitor max-violation-log-entries
Sets the maximum number of log entries allowed in a log set. You can specify a value between 4 and 8. The default value is 4.
nv set service ptp 1 monitor violation-log-interval
Sets the number of seconds to wait before logging back-to-back violations. You can specify a value between 0 and 60. The default value is 1.
The following example sets the maximum number of log sets allowed to 4, the maximum number of log entries allowed to 6, and the violation log interval to 10:
cumulus@switch:~$ nv set service ptp 1 monitor max-violation-log-sets 4
cumulus@switch:~$ nv set service ptp 1 monitor max-violation-log-entries 6
cumulus@switch:~$ nv set service ptp 1 monitor violation-log-interval 10
cumulus@switch:~$ nv config apply
You can configure the following monitor settings manually in the /etc/ptp4l.conf file. Be sure to run the sudo systemctl restart ptp4l.service to apply the settings.
Parameter
Description
tsmonitor_num_log_sets
Sets the maximum number of log sets allowed. You can specify a value between 2 and 4. The default value is 3.
tsmonitor_num_log_entries
Sets the maximum number of log entries allowed in a log set. You can specify a value between 4 and 8. The default value is 4.
tsmonitor_log_wait_seconds
Sets the number of seconds to wait before logging back-to-back violations. You can specify a value between 0 and 60. The default value is 1.
The following example sets the maximum number of log sets allowed to 4, the maximum number of log entries allowed to 6, and the violation log interval to 10:
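Because each of these parameters has a documented range, checking a value before writing it to the file can catch mistakes early. The following is a minimal sketch under that assumption; emit_if_in_range is a hypothetical helper that prints the configuration line only when the value falls inside the range listed in the table above.

```shell
#!/bin/sh
# Print "key value" if value is an integer within [min, max];
# otherwise report the problem on stderr and return non-zero.
emit_if_in_range() {
  key=$1 value=$2 min=$3 max=$4
  if [ "$value" -ge "$min" ] 2>/dev/null && [ "$value" -le "$max" ] 2>/dev/null; then
    printf '%s %s\n' "$key" "$value"
  else
    printf '%s: %s is outside %s-%s\n' "$key" "$value" "$min" "$max" >&2
    return 1
  fi
}

# Values from the example; ranges from the parameter table.
emit_if_in_range tsmonitor_num_log_sets 4 2 4
emit_if_in_range tsmonitor_num_log_entries 6 4 8
emit_if_in_range tsmonitor_log_wait_seconds 10 0 60
```

The three printed lines are the settings to place in /etc/ptp4l.conf before restarting the ptp4l service.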
To delete PTP configuration, delete the PTP master and slave interfaces. The following example commands delete the PTP interfaces swp1, swp2, and swp3.
Edit the /etc/ptp4l.conf file to remove the interfaces from the Default interface options section, then restart the ptp4l service.
cumulus@switch:~$ sudo nano /etc/ptp4l.conf
...
# Default interface options
#
time_stamping hardware
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
You can drill down with the following nv show service ptp <instance> commands:
nv show service ptp <instance> acceptable-master shows acceptable master configuration.
nv show service ptp <instance> clock-quality shows the clock quality status.
nv show service ptp <instance> current shows the local states learned during PTP message exchange.
nv show service ptp <instance> domain shows the domain configuration.
nv show service ptp <instance> ip-dscp shows PTP DSCP configuration.
nv show service ptp <instance> monitor shows PTP monitor configuration.
nv show service ptp <instance> profile shows PTP profile configuration.
nv show service ptp <instance> parent shows the local states learned during PTP message exchange.
nv show service ptp <instance> priority1 shows PTP priority1 configuration.
nv show service ptp <instance> priority2 shows PTP priority2 configuration.
nv show service ptp <instance> status shows the status of all PTP interfaces.
nv show service ptp <instance> time-properties shows the clock time attributes.
nv show service ptp <instance> unicast-master shows the unicast master configuration.
Show PTP Interface Configuration
To check configuration for a PTP interface, run the nv show interface <interface> ptp command.
cumulus@switch:~$ nv show interface swp1 ptp
operational applied description
------------------------- ----------- ---------- ----------------------------------------------------------------------
enable on Turn the feature 'on' or 'off'. The default is 'off'.
acceptable-master off Determines if acceptable master check is enabled for this interface.
delay-mechanism end-to-end end-to-end Mode in which PTP message is transmitted.
forced-master off off Configures PTP interfaces to forced master state.
instance 1 PTP instance number.
mixed-multicast-unicast off Enables Multicast for Announce, Sync and Followup and Unicast for D...
transport ipv4 ipv4 Transport method for the PTP messages.
ttl 1 1 Maximum number of hops the PTP messages can make before it gets dro...
unicast-request-duration 300 The service time in seconds to be requested during discovery.
timers
announce-interval 0 0 Mean time interval between successive Announce messages. It's spec...
announce-timeout 3 3 The number of announceIntervals that have to pass without receipt o...
delay-req-interval -3 -3 The minimum permitted mean time interval between successive Delay R...
sync-interval -3 -3 The mean SyncInterval for multicast messages. It's specified as a...
peer-mean-path-delay 0 An estimate of the current one-way propagation delay on the link wh...
port-state master State of the port
protocol-version 2 The PTP version in use on the port
Show PTP Counters
To show PTP counters for an interface, run the nv show interface <interface> counters ptp command:
To show the status of all PTP interfaces, run the nv show service ptp <instance> status command.
The command output shows the PTP enabled ports, the PTP port mode (unicast or multicast), the state of the port based on BMCA, the unicast state, and identifies the server address to which the client connects.
cumulus@switch:~$ nv show service ptp 1 status
Port Mode State Ustate Server
----- ----- ------- ------------------------------- -------
swp9 Ucast SLAVE Sync and Delay Granted (H_SYDY) 9.9.9.2
swp10 Ucast PASSIVE Initial State (WAIT)
swp11 Ucast PASSIVE Initial State (WAIT)
swp12 Ucast PASSIVE Initial State (WAIT)
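When many ports run PTP, filtering the status output by port state can be quicker than reading the table. The following is a minimal sketch that assumes the column layout shown above (two header lines, with the state in the third column); ports_in_state is a hypothetical helper, and the sample text is copied from the example output rather than read from a switch.

```shell
#!/bin/sh
# List the ports whose State column matches the given state.
# Reads status text on stdin and skips the two header lines.
ports_in_state() {
  awk -v state="$1" 'NR > 2 && $3 == state { print $1 }'
}

status_output='Port   Mode   State    Ustate                           Server
-----  -----  -------  -------------------------------  -------
swp9   Ucast  SLAVE    Sync and Delay Granted (H_SYDY)  9.9.9.2
swp10  Ucast  PASSIVE  Initial State (WAIT)
swp11  Ucast  PASSIVE  Initial State (WAIT)
swp12  Ucast  PASSIVE  Initial State (WAIT)'

printf '%s\n' "$status_output" | ports_in_state SLAVE     # prints swp9
printf '%s\n' "$status_output" | ports_in_state PASSIVE   # prints swp10, swp11, swp12
```

On a switch, the output of nv show service ptp 1 status could be piped into the function in the same way.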
Show the List of NVUE PTP Commands
To see a full list of NVUE show commands for PTP, run the nv list-commands service ptp command.
To show a full list of show commands for a PTP interface, run the nv list-commands | grep 'nv show interface <interface-id> ptp' command.
cumulus@switch:~$ nv list-commands service ptp
nv show service ptp
nv show service ptp <instance-id>
nv show service ptp <instance-id> status
nv show service ptp <instance-id> domain
nv show service ptp <instance-id> priority1
nv show service ptp <instance-id> priority2
nv show service ptp <instance-id> ip-dscp
nv show service ptp <instance-id> acceptable-master
...
cumulus@switch:~$ nv list-commands | grep 'nv show interface <interface-id> ptp'
...
nv show interface <interface-id> ptp
nv show interface <interface-id> ptp timers
nv show interface <interface-id> ptp shaper
...
Example Configuration
In the following example, the boundary clock on the switch receives time from Master 1 (the grandmaster) on PTP slave port swp1, sets its clock and passes the time down through PTP master ports swp2, swp3, and swp4 to the hosts that receive the time.
The following example configuration assumes that you have already configured the layer 3 routed interfaces (swp1, swp2, swp3, and swp4) you want to use for PTP.
cumulus@switch:~$ nv set service ptp 1 enable on
cumulus@switch:~$ nv set service ptp 1 priority2 254
cumulus@switch:~$ nv set service ptp 1 priority1 254
cumulus@switch:~$ nv set service ptp 1 domain 3
cumulus@switch:~$ nv set interface swp1 ptp enable on
cumulus@switch:~$ nv set interface swp2 ptp enable on
cumulus@switch:~$ nv set interface swp3 ptp enable on
cumulus@switch:~$ nv set interface swp4 ptp enable on
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
interface:
lo:
ip:
address:
10.10.10.1/32: {}
type: loopback
swp1:
ptp:
enable: on
type: swp
swp2:
ptp:
enable: on
type: swp
swp3:
ptp:
enable: on
type: swp
swp4:
ptp:
enable: on
type: swp
service:
ptp:
'1':
domain: 3
enable: on
priority1: 254
priority2: 254
cumulus@switch:~$ sudo cat /etc/ptp4l.conf
...
[global]
#
# Default Data Set
#
slaveOnly 0
priority1 254
priority2 254
domainNumber 3
dscp_event 46
dscp_general 46
offset_from_master_min_threshold -50
offset_from_master_max_threshold 50
mean_path_delay_threshold 200
tsmonitor_num_ts 100
tsmonitor_num_log_sets 2
tsmonitor_num_log_entries 4
tsmonitor_log_wait_seconds 1
#
# Run time options
#
logging_level 6
path_trace_enabled 0
use_syslog 1
verbose 0
summary_interval 0
#
# servo parameters
#
pi_proportional_const 0.000000
pi_integral_const 0.000000
pi_proportional_scale 0.700000
pi_proportional_exponent -0.300000
pi_proportional_norm_max 0.700000
pi_integral_scale 0.300000
pi_integral_exponent 0.400000
pi_integral_norm_max 0.300000
step_threshold 0.000002
first_step_threshold 0.000020
max_frequency 900000000
sanity_freq_limit 0
#
# Default interface options
#
time_stamping software
# Interfaces in which ptp should be enabled
# these interfaces should be routed ports
# if an interface does not have an ip address
# the ptp4l will not work as expected.
[swp1]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
network_transport UDPv4
[swp2]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
network_transport UDPv4
[swp3]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
network_transport UDPv4
[swp4]
udp_ttl 1
masterOnly 0
delay_mechanism E2E
network_transport UDPv4
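To confirm which interfaces a ptp4l.conf file enables, you can list its section headers. The following is a minimal sketch, assuming every interface appears as a [name] header on its own line, as in the listing above; ptp4l_interfaces is a hypothetical helper and the sketch reads a scratch file instead of /etc/ptp4l.conf.

```shell
#!/bin/sh
# Print the name of every [section] header in a ptp4l.conf-style file,
# except [global].
ptp4l_interfaces() {
  awk '/^\[.*\]$/ && $0 != "[global]" { print substr($0, 2, length($0) - 2) }' "$1"
}

# Scratch file with a shortened version of the example configuration.
conf=$(mktemp)
cat > "$conf" <<'EOF'
[global]
priority1 254
priority2 254
domainNumber 3
[swp1]
udp_ttl 1
[swp2]
udp_ttl 1
EOF

ptp4l_interfaces "$conf"   # prints swp1 and swp2, one per line
rm -f "$conf"
```

A header with a missing bracket does not match the pattern, so comparing the result against the expected port list is a simple way to catch typing errors in the file.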
Considerations
PTP Traffic Shaping
To improve performance on the NVIDIA Spectrum 1 switch for PTP-enabled ports with speeds lower than 100G, you can enable a pre-defined traffic shaping profile. For example, if you see that the PTP timing offset varies widely and does not stabilize, enable PTP shaping on all PTP enabled ports to reduce the bandwidth on the ports slightly and improve timing stabilization.
Switches with Spectrum-2 and later do not support PTP shaping.
Bonds do not support PTP shaping.
You cannot configure QoS traffic shaping and PTP traffic shaping on the same ports.
You must configure a strict priority for PTP traffic; for example:
cumulus@switch:~$ nv set qos egress-scheduler default-global traffic-class 0-5,7 mode dwrr
cumulus@switch:~$ nv set qos egress-scheduler default-global traffic-class 0-5,7 bw-percent 12
cumulus@switch:~$ nv set qos egress-scheduler default-global traffic-class 6 mode strict
For each PTP-enabled port on which you want to set traffic shaping, run the nv set interface <interface> ptp shaper enable on command.
cumulus@switch:~$ nv set interface swp1 ptp shaper enable on
cumulus@switch:~$ nv set interface swp2 ptp shaper enable on
cumulus@switch:~$ nv config apply
To see the PTP shaping setting for an interface, run the nv show interface <interface> ptp shaper command:
cumulus@switch:~$ nv show interface swp1 ptp shaper
operational applied
------ ----------- -------
enable on
In the /etc/cumulus/switchd.d/ptp_shaper.conf file, set the following parameters for the interfaces to which you want to apply traffic shaping and enable the traffic shaper. You must reload switchd for the changes to take effect.
PTP frames are affected by STP filtering; events, such as an STP topology change (where ports temporarily go into the blocking state), can cause interruptions to PTP communications.
If you configure PTP on bridge ports, NVIDIA recommends that the bridge ports are spanning tree edge ports or in a bridge domain where spanning tree is disabled.
SyncE
SyncE is currently in Beta.
SyncE is a standard for transmitting clock signals over the Ethernet physical layer to synchronize clocks across the network by propagating frequency using the transmission rate of symbols in the network. A dedicated Ethernet channel manages this synchronization.
The Cumulus Linux switch includes a SyncE controller and a SyncE daemon.
The SyncE controller reads performance counters to calculate the differences between transmit and receive ethernet symbols on the physical layer to fine tune the clock frequency.
The SyncE daemon (synced):
Manages transmitting and receiving SSMs on all SyncE enabled ports using the Ethernet Synchronization Messaging Channel (ESMC).
Manages the synchronization hierarchy and runs the master selection algorithm to choose the best reference clock from the QL in the SSM.
Cumulus Linux supports SyncE for the NVIDIA SN3750-SX switch only.
SyncE on 1G interfaces only supports 1000BASE-SX transceivers, 1000BASE-LX transceivers, and ADVA 5401 GrandMaster transceivers.
Basic Configuration
Basic SyncE configuration requires you to:
Enable SyncE on the switch.
Configure SyncE on at least one interface so that the interface is a timing source that passes to the selection algorithm.
The basic configuration shown below uses the default SyncE settings:
The amount of time SyncE waits after the interface comes up before using the interface for synchronization is set to 5 minutes.
cumulus@switch:~$ nv set service synce enable on
cumulus@switch:~$ nv set interface swp2 synce enable on
cumulus@switch:~$ nv config apply
Edit the /etc/synced/synced.conf file to configure the interface, then enable and start the SyncE service. Adding an interface section in the /etc/synced/synced.conf file enables SyncE on that interface.
The following example enables SyncE on swp1, swp2, and swp3.
The wait to restore time is the number of seconds SyncE waits for each port to be up before opening the Ethernet Synchronization Message Channel (ESMC) for messages. You can set a value between 0 and 720 seconds (12 minutes). The default value is 300 seconds (5 minutes).
The following command example sets the wait to restore time to 180 seconds (3 minutes):
cumulus@switch:~$ nv set service synce wait-to-restore-time 180
cumulus@switch:~$ nv config apply
Edit the /etc/synced/synced.conf file to change the twtr_seconds setting, then restart the SyncE service.
You can set the priority for the clock source. The lowest priority is 1 and the highest priority is 256. If two clock sources have the same priority, the switch uses the lowest clock source.
The following example command sets the priority to 256:
cumulus@switch:~$ nv set service synce provider-default-priority 256
cumulus@switch:~$ nv config apply
Edit the /etc/synced/synced.conf file to change the priority setting, then restart the SyncE service.
The clock selection algorithm uses the frequency source priority on an interface to choose between two sources that have the same QL. You can specify a value between 1 (the highest priority) and 256 (the lowest priority). The default value is 1.
The following command example sets the priority on swp1 to 10, on swp2 to 20, and on swp3 to 10:
cumulus@switch:~$ nv set interface swp1 synce provider-priority 10
cumulus@switch:~$ nv set interface swp2 synce provider-priority 20
cumulus@switch:~$ nv set interface swp3 synce provider-priority 10
cumulus@switch:~$ nv config apply
Edit the /etc/synced/synced.conf file to change the priority setting for the interface, then restart the SyncE service.
To show global SyncE configuration, run the NVUE nv show service synce command or the Linux syncectl show status command.
To show SyncE configuration for a specific interface, run the NVUE nv show interface <interface-id> synce command or the Linux syncectl show interface status <interface> command.
cumulus@switch:~$ nv show service synce
operational applied
------------------------- ----------------------------------------------------------------- -------
enable On on
log-level notice
provider-default-priority 10 10
wait-to-restore-time 40 40
clock-identity 0x849e00fffe00ca00
local-clock-quality eec1
network-type 1
summary Group #0: TRACKING holdover acquired on swp1. freq_diff: 77 (ppb)
To show SyncE statistics for a specific interface, run the NVUE nv show interface <interface-id> synce counters command or the Linux syncectl show interface counters <interface> command:
To clear counters for a specific SyncE interface, run the NVUE nv action clear interface <interface> synce counters command or the Linux syncectl clear interface counters <interface> command.
This section describes how to set up user accounts and ssh for remote access, and configure LDAP authentication, TACACS+, and RADIUS AAA.
SSH for Remote Access
Cumulus Linux uses the OpenSSH package to provide access to the system using the Secure Shell (SSH) protocol. With SSH, you can use key pairs instead of passwords to gain access to the system.
This section describes how to generate an SSH key pair on one system and install the key as an authorized key in another system.
Generate an SSH Key Pair
To generate an SSH key pair, run the ssh-keygen command and follow the prompts.
To configure the system without a password, do not enter a passphrase when prompted in the following step.
cumulus@host01:~$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/cumulus/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/cumulus/.ssh/id_rsa.
Your public key has been saved in /home/cumulus/.ssh/id_rsa.pub.
The key fingerprint is:
5a:b4:16:a0:f9:14:6b:51:f6:f6:c0:76:1a:35:2b:bb cumulus@leaf04
The key's randomart image is:
+---[RSA 2048]----+
| +.o o |
| o * o . o |
| o + o O o |
| + . = O |
| . S o . |
| + . |
| . E |
| |
| |
+-----------------+
Install an Authorized SSH Key
To install an authorized SSH key, you take the contents of an SSH public key and add it to the SSH authorized key file (~/.ssh/authorized_keys) of the user.
A public key is a text file with three space-separated fields:
<type> <key string> <comment>
Field | Description
<type> | The algorithm you want to use to hash the key. The algorithm can be ecdsa-sha2-nistp256, ecdsa-sha2-nistp384, ecdsa-sha2-nistp521, ssh-dss, ssh-ed25519, or ssh-rsa (the default value).
<key string> | A base64 format string for the key.
<comment> | A single word string. By default, this is the name of the system that generated the key. NVUE uses the <comment> field as the key name.
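As an illustration of the three-field format, the following shell sketch checks a public key file and appends the key to an authorized key file only if the key is not already there. The function name install_authorized_key and both of its arguments are hypothetical and are not part of Cumulus Linux or OpenSSH. The 0700 and 0600 modes are a common conservative choice assumed for this sketch, not values taken from this guide.

```shell
# install_authorized_key PUBKEY_FILE AUTH_FILE   (hypothetical helper)
# Appends the key in PUBKEY_FILE to AUTH_FILE unless it is already present.
install_authorized_key() {
    iak_pub=$1
    iak_auth=$2

    # Read the first line and split it into <type> <key string> <comment>.
    iak_line=$(head -n 1 "$iak_pub") || return 1
    read -r iak_type iak_key iak_comment <<EOF
$iak_line
EOF

    # Accept only the key types listed above.
    case $iak_type in
        ecdsa-sha2-nistp256|ecdsa-sha2-nistp384|ecdsa-sha2-nistp521) ;;
        ssh-dss|ssh-ed25519|ssh-rsa) ;;
        *) echo "unsupported key type: $iak_type" >&2; return 1 ;;
    esac
    [ -n "$iak_key" ] || { echo "missing key string" >&2; return 1; }

    # Create the directory and file with restrictive permissions (assumed modes).
    iak_dir=$(dirname "$iak_auth")
    mkdir -p "$iak_dir" && chmod 700 "$iak_dir" || return 1
    touch "$iak_auth" && chmod 600 "$iak_auth" || return 1

    # Append only if this key string is not in the file yet.
    if ! grep -qF -- "$iak_key" "$iak_auth"; then
        printf '%s\n' "$iak_line" >> "$iak_auth"
    fi
}
```

A call such as install_authorized_key id_rsa.pub "$HOME/.ssh/authorized_keys" can run repeatedly without duplicating the entry, because the helper matches on the key string and not on the comment.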
The procedure to install an authorized SSH key is different based on whether the user is an NVUE managed user, a non-NVUE managed user, or the root user.
NVUE Managed User
The following example adds an authorized key named prod_key to the user admin2. The content of the public key file is ssh-rsa 1234 prod_key.
cumulus@leaf01:~$ nv set system aaa user admin2 ssh authorized-key prod_key key XABDB3NzaC1yc2EAAAADAQABAAABgQCvjs/RFPhxLQMkckONg+1RE1PTIO2JQhzFN9TRg7ox7o0tfZ+IzSB99lr2dmmVe8FRWgxVjc...
cumulus@leaf01:~$ nv set system aaa user admin2 ssh authorized-key prod_key type ssh-rsa
cumulus@leaf01:~$ nv config apply
Non-NVUE Managed User
The following example adds an authorized key file from the account cumulus on a host to the cumulus account on the switch:
To copy a previously generated public key to the desired location, run the ssh-copy-id command and follow the prompts:
cumulus@host01:~$ ssh-copy-id -i /home/cumulus/.ssh/id_rsa.pub cumulus@leaf02
The authenticity of host 'leaf02 (192.168.0.11)' can't be established.
ECDSA key fingerprint is b1:ce:b7:6a:20:f4:06:3a:09:3c:d9:42:de:99:66:6e.
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
cumulus@leaf02's password:
Number of key(s) added: 1
The ssh-copy-id command does not work if the username on the remote switch is different from the username on the local switch. To work around this issue, use the scp command instead. This command overwrites any existing authorized_keys file in the remote account:
cumulus@host01:~$ scp .ssh/id_rsa.pub cumulus@leaf02:.ssh/authorized_keys
Enter passphrase for key '/home/cumulus/.ssh/id_rsa':
id_rsa.pub
Connect to the remote switch to confirm that the authentication keys are in place:
cumulus@leaf01:~$ ssh cumulus@leaf02
Welcome to Cumulus VX (TM)
Cumulus VX (TM) is a community supported virtual appliance designed for
experiencing, testing and prototyping the latest technology.
For any questions or technical support, visit our community site at:
http://community.cumulusnetworks.com
The registered trademark Linux (R) is used pursuant to a sublicense from LMI,
the exclusive licensee of Linus Torvalds, owner of the mark on a world-wide basis.
Last login: Thu Sep 29 16:56:54 2016
Root User
By default, the root account cannot use SSH to log in. To add an authorized SSH key to the root account:
As a privileged user (such as the cumulus user), either echo the public key contents and redirect the contents to the authorized key file or copy the public key file to the switch, then copy it to the root account (with privilege escalation).
To echo the public key contents and redirect the contents to the authorized key file:
cumulus@switch:~$ echo "<SSH public key contents>" | sudo tee -a /root/.ssh/authorized_keys
cumulus@switch:~$ sudo chmod 0644 /root/.ssh/authorized_keys
To copy the public key file to the switch, then copy it to the root account:
The SSH service runs in the default VRF on the switch but listens on all interfaces in all VRFs. To limit SSH to listen on just one VRF, you need to bind the SSH service to that VRF.
The following example configures SSH to listen only on the management VRF:
To configure SSH to listen to only one IP address or a subnet in a VRF, you need to bind the service to that VRF (as above), then set the ListenAddress parameter in the /etc/ssh/sshd_config file to the IP address or subnet in that VRF.
You can only run one SSH service on the switch at a time.
User Accounts
By default, Cumulus Linux has two user accounts: cumulus and root.
The cumulus account:
Uses the default password cumulus. You must change the default password when you log into Cumulus Linux for the first time.
Is a user account in the sudo group with sudo privileges.
Can log in to the system through all the usual channels, such as console and SSH.
Includes permissions to run NVUE nv show, nv set, nv unset, and nv apply commands.
The root account:
Has its password disabled by default, which prevents you from using SSH, telnet, FTP, and so on, to log in to the switch as root.
Has the standard Linux root user access to everything on the switch.
Add a New User Account
You can add additional user accounts as needed.
You control local user account access to NVUE commands by changing the group membership (role) for a user. Like the cumulus account, these accounts must be in the sudo group or include the NVUE system-admin role to execute privileged commands.
You can set a plain text password or a hashed password for the local user account. To access the switch without a password, you need to boot into single user mode.
You can provide a full name for the local user account (optional).
Use the following roles to set the permissions for local user accounts.
Role | Permissions
system-admin | Allows the user to use sudo to run commands as the privileged user, run nv show commands, run nv set and nv unset commands to stage configuration changes, and run nv apply commands to apply configuration changes.
nvue-admin | Allows the user to run nv show commands, run nv set and nv unset commands to stage configuration changes, and run nv apply commands to apply configuration changes.
nvue-monitor | Allows the user to run nv show commands only.
The following example:
Creates a new user account called admin2 and sets the role to system-admin (permissions for sudo, nv show, nv set and nv unset, and nv apply).
Sets a plain text password. NVUE hashes the plain text password and stores the value as a hashed password. To set a hashed password, see Hashed Passwords, below.
Adds the full name FIRST LAST. If the full name includes more than one name, either separate the names with a hyphen (FIRST-LAST) or enclose the full name in quotes ("FIRST LAST").
cumulus@switch:~$ nv set system aaa user admin2 role system-admin
cumulus@switch:~$ nv set system aaa user admin2 password
Enter new password:
Confirm password:
cumulus@switch:~$ nv set system aaa user admin2 full-name "FIRST LAST"
cumulus@switch:~$ nv config apply
You can also run the nv set system aaa user <user> password <plain-text-password> command to specify the plain text password inline. This command bypasses the Enter new password and Confirm password prompts but displays the plain text password as you type it.
If you are an NVUE-managed user, you can update your own password with the Linux passwd command.
Use the following groups to set permissions for local user accounts. To add users to these groups, use the useradd(8) or usermod(8) commands:
Group | Permissions
sudo | Allows the user to use sudo to run commands as the privileged user.
nvshow | Allows the user to run nv show commands only.
nvset | Allows the user to run nv show commands, and run nv set and nv unset commands to stage configuration changes.
nvapply | Allows the user to run nv show commands, run nv set and nv unset commands to stage configuration changes, and run nv apply commands to apply configuration changes.
The following example:
Creates a new user account called admin2, creates a home directory for the user, and adds the full name First Last.
Securely sets the password for the user with passwd.
Sets the group membership (role) to sudo and nvapply (permissions to use sudo, nv show, nv set, and nv apply).
When you use Linux commands to add a new user, you must create a home directory for the user with the -m option. NVUE commands create a home directory automatically.
Only the following user accounts can create, modify, and delete other system-admin accounts:
NVUE-managed users with the system-admin role.
The root user.
Non NVUE-managed users that are in the sudo group.
Hashed Passwords
Instead of a plain text password, you can provide a hashed password for a local user.
You must specify the hashed password in Linux crypt format; the password must be a minimum of 15 to 20 characters long and must include special characters, digits, lower case alphabetic letters, and more. Typically, the password format is set to $id$salt$hashed, where $id is the hashing algorithm. In GNU or Linux:
$1$ is MD5
$2a$ is Blowfish
$2y$ is Blowfish
$5$ is SHA-256
$6$ is SHA-512
To generate a hashed password on the switch, you can either run a python3 command or install and use the mkpasswd utility:
Run the following command on the switch or Linux host. When prompted, enter the plain text password you want to hash:
To generate a hashed password with SHA-512, SHA-256, or MD5 hashing, run the following command. When prompted, enter the plain text password you want to hash:
Hashed password strings contain characters, such as $, that have a special meaning in the Linux shell; you must enclose the hashed password in single quotes (').
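The following shell sketch illustrates both points: how the $id$ prefix identifies the algorithm, and why the quoting style matters. The helper name crypt_algorithm and the sample string are made up for illustration; the string is not a real hash. The sketch assumes the shell has no variables named examplesalt or examplehash and is not running with the nounset option.

```shell
# crypt_algorithm HASH   (hypothetical helper)
# Prints the algorithm named by the $id$ prefix of a crypt-format string.
crypt_algorithm() {
    case $1 in
        '$1$'*)           echo MD5 ;;
        '$2a$'*|'$2y$'*)  echo Blowfish ;;
        '$5$'*)           echo SHA-256 ;;
        '$6$'*)           echo SHA-512 ;;
        *)                echo unknown; return 1 ;;
    esac
}

# A made-up string in $id$salt$hashed form; it is not a real hash.
quoted='$6$examplesalt$examplehash'

# In double quotes the shell expands $6, $examplesalt, and $examplehash
# as variables, so the stored value no longer matches what you typed.
unquoted="$6$examplesalt$examplehash"

printf 'single quotes: %s (%s)\n' "$quoted" "$(crypt_algorithm "$quoted")"
printf 'double quotes: %s\n' "$unquoted"
```

The single-quoted value survives intact and is reported as SHA-512, while the double-quoted value loses every part that the shell treats as a variable reference.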
Delete a User Account
To delete a user account:
Run the nv unset system aaa user <user> command. The following example deletes the user account called admin2.
cumulus@switch:~$ nv unset system aaa user admin2
cumulus@switch:~$ nv config apply
Run the sudo userdel <user> command. The following example deletes the user account called admin2.
cumulus@switch:~$ sudo userdel admin2
Show User Accounts
To show the user accounts configured on the system, run the NVUE nv show system aaa command or the Linux sudo cat /etc/passwd command.
cumulus@switch:~$ nv show system aaa
Username Full-name Role enable
---------------- ---------------------------------- ------------ ------
Debian-snmp Unknown system
_apt Unknown system
_lldpd Unknown system
admin2 FIRST LAST system-admin on
...
To show information about a specific user account, run the NVUE nv show system aaa user <user> command:
cumulus@switch:~$ nv show system aaa user admin2
operational applied
--------------- ------------ ------------
full-name FIRST LAST FIRST LAST
hashed-password * *
role system-admin system-admin
enable on on
Enable the root User
The root user does not have a password and cannot log into a switch using SSH. This default account behavior is consistent with Debian.
Enable Console Access
To log into the switch using root from the console, you must set the password for the root account:
cumulus@switch:~$ sudo passwd root
Enter new password:
...
Enable SSH Access
To log into the switch using root with SSH, either:
By default, Cumulus Linux has two user accounts: root and cumulus. The cumulus account is a normal user and is in the group sudo.
You can add more user accounts as needed. Like the cumulus account, these accounts must use sudo to execute privileged commands.
sudo Basics
sudo allows you to execute a command as superuser or another user as specified by the security policy.
The default security policy is sudoers, which you configure in the /etc/sudoers file. Use /etc/sudoers.d/ to add to the default sudoers policy.
Use visudo only to edit the sudoers file; do not use another editor like vi or emacs.
When creating a new file in /etc/sudoers.d, use visudo -f. This option performs sanity checks before writing the file to avoid errors that prevent sudo from working.
Errors in the sudoers file can result in losing the ability to elevate privileges to root. You can fix this issue only by power cycling the switch and booting into single user mode. Before modifying sudoers, enable the root user by setting a password for the root user.
By default, users in the sudo group can use sudo to execute privileged commands. To add users to the sudo group, use the useradd(8) or usermod(8) command. To see which users belong to the sudo group, see /etc/group (man group(5)).
You can run any command as sudo, including su. You must enter a password.
The example below shows how to use sudo as a non-privileged user cumulus to bring up an interface:
cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br0 state DOWN mode DEFAULT qlen 500
link/ether 44:38:39:00:27:9f brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link set dev swp1 up
RTNETLINK answers: Operation not permitted
cumulus@switch:~$ sudo ip link set dev swp1 up
Password:
cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br0 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:27:9f brd ff:ff:ff:ff:ff:ff
sudoers Examples
The following examples show how you grant as few privileges as necessary to a user or group of users to allow them to perform the required task. Each example uses the system group noc; groups include the prefix %.
When an unprivileged user runs a command, the command must include the sudo prefix.
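As a minimal sketch of the least-privilege idea, the following shell fragment writes one candidate rule for the noc group to a scratch file and syntax-checks it before anything reaches /etc/sudoers.d. The rule itself, the interface swp1, and the /sbin/ip path are assumptions chosen for illustration (confirm the path with command -v ip); they are not examples taken from this guide.

```shell
# Write the candidate rule to a scratch file first.
rule_file=$(mktemp)
cat > "$rule_file" <<'EOF'
# Members of the noc group may bring swp1 up, and nothing else.
%noc ALL=(ALL) /sbin/ip link set dev swp1 up
EOF

# Syntax-check the scratch file if visudo is installed. A failed check
# only prints a warning here; it does not stop the script.
if command -v visudo >/dev/null 2>&1; then
    visudo -cf "$rule_file" || echo "visudo reported a problem" >&2
fi

echo "candidate rule written to $rule_file"
```

After the check passes, you would still create the real file with sudo visudo -f /etc/sudoers.d/<name>, as described above, so that the sanity checks also run at install time.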
Cumulus Linux uses Pluggable Authentication Modules (PAM) and Name Service Switch (NSS) for user authentication. NSS enables PAM to use LDAP to provide user authentication, group mapping, and information for other services on the system.
NSS specifies the order of the information sources that resolve names for each service. Using NSS with authentication and authorization provides the order and location for user lookup and group mapping on the system.
PAM handles the interaction between the user and the system, providing login handling, session setup, authentication of users, and authorization of user actions.
To configure LDAP authentication on Linux, you can use libnss-ldap, libnss-ldapd, or libnss-sss. This chapter describes libnss-ldapd only. From internal testing, this library worked best with Cumulus Linux and is the easiest to configure, automate, and troubleshoot.
Install libnss-ldapd
The libldap-2.4-2 and libldap-common LDAP packages are already installed on the Cumulus Linux image; however, you need to install these additional packages to use LDAP authentication:
libnss-ldapd
libpam-ldapd
ldap-utils
To install the additional packages, run the following command:
You can also install these packages even if the switch does not connect to the internet, as they are in the cumulus-local-apt-archive repository that is embedded in the Cumulus Linux image.
Follow the interactive prompts to specify the LDAP URI, search base distinguished name (DN), and services that must have LDAP lookups enabled. You need to select at least the passwd, group, and shadow services (press space to select a service). When done, select OK. This creates a basic LDAP configuration using anonymous bind and initiates user search under the base DN specified.
After the dialog closes, the install process prints information similar to the following:
/etc/nsswitch.conf: enable LDAP lookups for group
/etc/nsswitch.conf: enable LDAP lookups for passwd
/etc/nsswitch.conf: enable LDAP lookups for shadow
After the installation is complete, the local LDAP name service daemon (nslcd) runs. This service handles all the LDAP protocol interactions and caches information that returns from the LDAP server. The installation appends ldap to the /etc/nsswitch.conf file as the secondary information source for passwd, group, and shadow. The system references the local files (/etc/passwd, /etc/group, and /etc/shadow) first, as specified by the compat source.
Keep compat as the first source in NSS for passwd, group, and shadow. This prevents you from getting locked out of the system.
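One way to guard against an accidental reordering is a small check like the sketch below. The function name check_nss_order is hypothetical. The sketch also accepts files as a local first source, which is an assumption beyond the compat source this guide describes.

```shell
# check_nss_order FILE   (hypothetical helper)
# Succeeds only if passwd, group, and shadow each list a local source
# (compat, or files as an assumed equivalent) before any other source.
check_nss_order() {
    awk '
        $1 == "passwd:" || $1 == "group:" || $1 == "shadow:" {
            seen[$1] = 1
            if ($2 != "compat" && $2 != "files") {
                printf "%s first source is %s\n", $1, $2
                bad = 1
            }
        }
        END {
            # A missing entry also counts as a failure.
            if (!seen["passwd:"] || !seen["group:"] || !seen["shadow:"])
                bad = 1
            exit bad + 0
        }
    ' "$1"
}
```

Running check_nss_order /etc/nsswitch.conf after each edit, and before you log out of the current session, gives an early warning if ldap moved ahead of the local source.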
Entering incorrect information during the installation process produces configuration errors. You can correct the information after installation by editing certain configuration files.
Edit the /etc/nslcd.conf file to update the LDAP URI and search base DN (see Update the nslcd.conf File, below).
Edit the /etc/nsswitch.conf file to update the service selections.
Restart nvued.service and nginx.service after editing the files.
Alternative Installation Method Using debconf-utils
Instead of running the installer and following the interactive prompts, as described above, you can pre-seed the installer parameters using debconf-utils.
Run apt-get install debconf-utils and create the pre-seeded parameters using debconf-set-selections. Provide the appropriate answers.
Run debconf-show <pkg> to check the settings. Here is an example of how to pre-seed answers to the installer questions using debconf-set-selections:
root# debconf-set-selections <<'zzzEndOfFilezzz'
# LDAP database user. Leave blank will be populated later!
nslcd nslcd/ldap-binddn string
# LDAP user password. Leave blank!
nslcd nslcd/ldap-bindpw password
# LDAP server search base:
nslcd nslcd/ldap-base string ou=support,dc=rtp,dc=example,dc=test
# LDAP server URI. Using ldap over ssl.
nslcd nslcd/ldap-uris string ldaps://myadserver.rtp.example.test
# New to 0.9. restart cron, exim and others libraries without asking
nslcd libraries/restart-without-asking: boolean true
# LDAP authentication to use:
# Choices: none, simple, SASL
# Using simple because its easy to configure. Security comes by using LDAP over SSL
# keep /etc/nslcd.conf 'rw' to root for basic security of bindDN password
nslcd nslcd/ldap-auth-type select simple
# Don't set starttls to true
nslcd nslcd/ldap-starttls boolean false
# Check server's SSL certificate:
# Choices: never, allow, try, demand
nslcd nslcd/ldap-reqcert select never
# Choices: Ccreds credential caching - password saving, Unix authentication, LDAP Authentication , Create home directory on first time login, Ccreds credential caching - password checking
# This is where "mkhomedir" pam config is activated that allows automatic creation of home directory
libpam-runtime libpam-runtime/profiles multiselect ccreds-save, unix, ldap, mkhomedir , ccreds-check
# for internal use; can be preseeded
man-db man-db/auto-update boolean true
# Name services to configure:
# Choices: aliases, ethers, group, hosts, netgroup, networks, passwd, protocols, rpc, services, shadow
libnss-ldapd libnss-ldapd/nsswitch multiselect group, passwd, shadow
libnss-ldapd libnss-ldapd/clean_nsswitch boolean false
## define platform specific libnss-ldapd debconf questions/answers.
## For demo used amd64.
libnss-ldapd:amd64 libnss-ldapd/nsswitch multiselect group, passwd, shadow
libnss-ldapd:amd64 libnss-ldapd/clean_nsswitch boolean false
# libnss-ldapd:powerpc libnss-ldapd/nsswitch multiselect group, passwd, shadow
# libnss-ldapd:powerpc libnss-ldapd/clean_nsswitch boolean false
zzzEndOfFilezzz
Update the nslcd.conf File
After installation, update the main configuration file (/etc/nslcd.conf) to accommodate the expected LDAP server settings.
This section documents some of the more important options that relate to security and queries. For details on all the available configuration options, read the nslcd.conf man page.
After editing the /etc/nslcd.conf file or enabling LDAP in the /etc/nsswitch.conf file, you must restart the NVUE and nginx services with the sudo systemctl restart nvued.service command and the sudo systemctl restart nginx.service command. If you disable LDAP, you must also restart these two services.
Connection
The LDAP client starts a session by connecting to the LDAP server on TCP and UDP port 389 or on port 636 for LDAPS. Depending on the configuration, this connection establishes without authentication (anonymous bind); otherwise, the client must provide a bind user and password. The variables you use to define the connection to the LDAP server are the URI and bind credentials.
The URI is mandatory and specifies the LDAP server location using the FQDN or IP address. The URI also designates whether to use ldap:// for clear text transport, or ldaps:// for SSL/TLS encrypted transport. You can also specify an alternate port in the URI. In production environments, use the LDAPS protocol so that all communications are secure.
After the connection to the server is complete, the BIND operation authenticates the session. The BIND credentials are optional; if you do not specify the credentials, the switch assumes an anonymous bind. Configure authenticated (Simple) BIND by specifying the user (binddn) and password (bindpw) in the configuration. Another option is to use SASL (Simple Authentication and Security Layer) BIND, which provides authentication services using other mechanisms, like Kerberos. Contact your LDAP server administrator for this information as it depends on the configuration of the LDAP server and the credentials for the client device.
# The location at which the LDAP server(s) should be reachable.
uri ldaps://ldap.example.com
# The DN to bind with for normal lookups.
binddn cn=CLswitch,ou=infra,dc=example,dc=com
bindpw CuMuLuS
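To check a configuration file against the recommendation to use LDAPS in production, a sketch like the following lists every configured URI that does not use the ldaps:// scheme. The function name insecure_ldap_uris is hypothetical. The check is deliberately simple: it also flags ldap:// URIs that are protected by StartTLS, so treat its output as a prompt for review and not as a verdict.

```shell
# insecure_ldap_uris FILE   (hypothetical helper)
# Prints every URI on a "uri" line that does not start with ldaps://.
insecure_ldap_uris() {
    awk '$1 == "uri" {
        for (i = 2; i <= NF; i++)
            if ($i !~ /^ldaps:\/\//) print $i
    }' "$1"
}
```

For example, insecure_ldap_uris /etc/nslcd.conf prints nothing when every uri entry uses ldaps://.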
Search Function
When an LDAP client requests information about a resource, it must connect and bind to the server. Then, it performs one or more resource queries depending on the lookup. All search queries to the LDAP server use the configured search base, filter, and the desired entry (uid=myuser). If the LDAP directory is large, this search takes a long time. Define a more specific search base for the common maps (passwd and group).
# The search base that will be used for all queries.
base dc=example,dc=com
# Mapped search bases to speed up common queries.
base passwd ou=people,dc=example,dc=com
base group ou=groups,dc=example,dc=com
Search Filters
To limit the search scope when authenticating users, use search filters to specify criteria when searching for objects within the directory. The default filters applied are:
filter passwd (objectClass=posixAccount)
filter group (objectClass=posixGroup)
Attribute Mapping
The map configuration allows you to override the attributes pushed from LDAP. To override an attribute for a given map, specify the attribute name and the new value. This is useful to ensure that the shell is bash and the home directory is /home/cumulus:
In LDAP, the map refers to one of the supported maps specified in the manpage for nslcd.conf (such as passwd or group).
Create Home Directory on Login
If you want to use unique home directories, run the sudo pam-auth-update command and select Create home directory on login in the PAM configuration dialog (press the space bar to select the option). Select OK, then press Enter to save the update and close the dialog.
cumulus@switch:~$ sudo pam-auth-update
The home directory for any user that logs in (using LDAP or not) populates with the standard dotfiles from /etc/skel.
When nslcd starts, an error message similar to the following (where 5816 is the nslcd PID) sometimes appears:
nslcd[5816]: unable to dlopen /usr/lib/x86_64-linux-gnu/sasl2/libsasldb.so: libdb-5.3.so: cannot open
shared object file: No such file or directory
You can ignore this message. The libdb package and resulting log messages from nslcd do not cause any issues when you use LDAP as a client for login and authentication.
Example Configuration
Here is an example configuration using Cumulus Linux.
# /etc/nslcd.conf
# nslcd configuration file. See nslcd.conf(5)
# for details.
# The user and group nslcd should run as.
uid nslcd
gid nslcd
# The location at which the LDAP server(s) should be reachable.
uri ldaps://myadserver.rtp.example.test
# The search base that will be used for all queries.
base ou=support,dc=rtp,dc=example,dc=test
# The LDAP protocol version to use.
#ldap_version 3
# The DN to bind with for normal lookups.
# debconf-set-selections doesn't seem to set this. so have to manually set this.
binddn CN=cumulus admin,CN=Users,DC=rtp,DC=example,DC=test
bindpw 1Q2w3e4r!
# The DN used for password modifications by root.
#rootpwmoddn cn=admin,dc=example,dc=com
# SSL options
#ssl off (default)
# Not good does not prevent man in the middle attacks
#tls_reqcert demand(default)
tls_cacertfile /etc/ssl/certs/rtp-example-ca.crt
# The search scope.
#scope sub
# Add nested group support
# Supported in nslcd 0.9 and higher.
# default wheezy install of nslcd supports on 0.8. wheezy-backports has 0.9
nss_nested_groups yes
# Mappings for Active Directory
# (replace the SIDs in the objectSid mappings with the value for your domain)
# "dsquery * -filter (samaccountname=testuser1) -attr ObjectSID" where cn == 'testuser1'
pagesize 1000
referrals off
idle_timelimit 1000
# Do not allow uids lower than 100 to login (aka Administrator)
# not needed as pam already has this support
# nss_min_uid 1000
# This filter says to get all users who are part of the cumuluslnxadm group. Supports nested groups.
# Example, mary is part of the snrnetworkadm group which is part of cumuluslnxadm group
# Ref: http://msdn.microsoft.com/en-us/library/aa746475%28VS.85%29.aspx (LDAP_MATCHING_RULE_IN_CHAIN)
filter passwd (&(Objectclass=user)(!(objectClass=computer))(memberOf:1.2.840.113556.1.4.1941:=cn=cumuluslnxadm,ou=groups,ou=support,dc=rtp,dc=example,dc=test))
map passwd uid sAMAccountName
map passwd uidNumber objectSid:S-1-5-21-1391733952-3059161487-1245441232
map passwd gidNumber objectSid:S-1-5-21-1391733952-3059161487-1245441232
map passwd homeDirectory "/home/$sAMAccountName"
map passwd gecos displayName
map passwd loginShell "/bin/bash"
# Filter for any AD group or user in the baseDN. the reason for filtering for the
# user to make sure group listing for user files don't say '<user> <gid>'. instead will say '<user> <user>'
# So for cosmetic reasons..nothing more.
filter group (&(|(objectClass=group)(Objectclass=user))(!(objectClass=computer)))
map group gidNumber objectSid:S-1-5-21-1391733952-3059161487-1245441232
map group cn sAMAccountName
Configure LDAP Authorization
Linux uses the sudo command to allow non-administrator users (such as the default cumulus user account) to perform privileged operations. To control the users that can use sudo, define a series of rules in the /etc/sudoers file and files in the /etc/sudoers.d/ directory. The rules apply to groups but you can also define specific users. You can add sudo rules using the group names from LDAP. For example, if a group of users are in the group netadmin, you can add a rule to give those users sudo privileges. Refer to the sudoers manual (man sudoers) for a complete usage description. The following shows an example in the /etc/sudoers file:
# The basic structure of a user specification is "who where = (as_whom) what ".
%sudo ALL=(ALL:ALL) ALL
%netadmin ALL=(ALL:ALL) ALL
Active Directory Configuration
Active Directory (AD) is a fully featured LDAP-based NIS server created by Microsoft. It offers unique features that classic OpenLDAP servers do not have. AD can be more complicated to configure on the client and each version works a little differently with Linux-based LDAP clients. Some more advanced configuration examples, from testing LDAP clients on Cumulus Linux with Active Directory (AD/LDAP), are available in the knowledge base.
LDAP Verification Tools
The LDAP client daemon retrieves and caches password and group information from LDAP. To verify the LDAP interaction, use these command-line tools to trigger an LDAP query from the device.
Identify a User with the id Command
The id command performs a username lookup by following the lookup information sources in NSS for the passwd service. This returns the user ID, group ID and the group list retrieved from the information source. In the following example, the user cumulus is locally defined in /etc/passwd, and myuser is on LDAP. The NSS configuration has the passwd map configured with the sources compat ldap:
cumulus@switch:~$ id cumulus
uid=1000(cumulus) gid=1000(cumulus) groups=1000(cumulus),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev)
cumulus@switch:~$ id myuser
uid=1230(myuser) gid=3000(Development) groups=3000(Development),500(Employees),27(sudo)
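When you verify many accounts, it can help to test the id output for one group in a script. The sketch below does that by parsing the groups= list; the function name id_has_group is hypothetical, and the parsing assumes the output format shown above with nothing after the group list.

```shell
# id_has_group ID_OUTPUT GROUP   (hypothetical helper)
# Succeeds if GROUP appears in the groups= list of one line of id output.
id_has_group() {
    case $1 in
        *"groups="*) ihg_list=${1##*groups=} ;;
        *) return 1 ;;
    esac
    # Surround the list with commas so that every entry ends in "),".
    case ",$ihg_list," in
        *"($2),"*) return 0 ;;
    esac
    return 1
}

sample='uid=1230(myuser) gid=3000(Development) groups=3000(Development),500(Employees),27(sudo)'
if id_has_group "$sample" sudo; then
    echo "myuser is in the sudo group"
fi
```

On a switch, you would pass live output instead of the sample string, for example id_has_group "$(id myuser)" sudo.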
getent
The getent command retrieves all records found with NSS for a given map. It can also retrieve a specific entry under that map. You can perform tests with the passwd, group, shadow, or any other map in the /etc/nsswitch.conf file. The output from this command formats according to the map requested. For the passwd service, the structure of the output is the same as the entries in /etc/passwd. The group map outputs the same structure as /etc/group.
In this example, looking up a specific user in the passwd map, the user cumulus is locally defined in /etc/passwd, and myuser is only in LDAP.
In the next example, looking up a specific group in the group service, the group cumulus is locally defined in /etc/group, and netadmin is on LDAP.
cumulus@switch:~$ getent group cumulus
cumulus:x:1000:
cumulus@switch:~$ getent group netadmin
netadmin:*:502:larry,moe,curly,shemp
Running the command getent passwd or getent group without a specific request returns all local and LDAP entries for the passwd and group maps.
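Because getent group output follows the /etc/group layout (name, password, GID, then a comma-separated member list), the members are easy to extract in a script. The following is a sketch; the function name group_members is hypothetical.

```shell
# group_members GROUP_ENTRY   (hypothetical helper)
# Prints one member per line from an entry in /etc/group format:
#   name:password:GID:member1,member2,...
group_members() {
    printf '%s\n' "$1" | awk -F: '{
        n = split($4, members, ",")
        for (i = 1; i <= n; i++) print members[i]
    }'
}

group_members 'netadmin:*:502:larry,moe,curly,shemp'
```

On a switch, you would pass the live entry, for example group_members "$(getent group netadmin)". An entry with an empty member list, such as the local cumulus group above, prints nothing.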
LDAP search
The ldapsearch command performs LDAP operations directly on the LDAP server. This does not interact with NSS. This command displays the information that the LDAP daemon process receives back from the server. The command has several options. The simplest option uses anonymous bind to the host and specifies the search DN and the attribute to look up.
When setting up LDAP authentication for the first time, turn off the nslcd service using the systemctl stop nslcd.service command (or the systemctl stop nslcd@mgmt.service command if you are running the service in a management VRF) and run it in debug mode. Debug mode works whether you are using LDAP over SSL (port 636) or an unencrypted LDAP connection (port 389).
The FQDN of the LDAP server URI does not match the FQDN in the CA-signed server certificate.
nslcd cannot read the SSL certificate and reports a Permission denied error in the debug during server connection negotiation. Check the permission on each directory in the path of the root SSL certificate. Ensure that it is readable by the nslcd user.
NSCD
If the nscd cache daemon is also enabled and you make some changes to the user from LDAP, you can clear the cache using the following commands:
nscd --invalidate=passwd
nscd --invalidate=group
The nscd package works with nslcd to cache name entries returned from the LDAP server. This sometimes causes authentication failures. To work around these issues, disable nscd, restart the nslcd service, then retry authentication:
If you are running the nslcd service in a management VRF, you need to run the systemctl restart nslcd@mgmt.service command instead of the systemctl restart nslcd.service command. For example:
Cumulus Linux implements TACACS+ client AAA in a transparent way with minimal configuration. The client implements the TACACS+ protocol as described in this IETF document. There is no need to create accounts or directories on the switch. Accounting records go to all configured TACACS+ servers by default. Using per-command authorization requires additional setup on the switch.
TACACS+ in Cumulus Linux:
Uses PAM authentication and includes login, ssh, sudo and su.
Allows users with privilege level 15 to run any command with sudo.
Allows users with privilege level 15 to run NVUE nv set, nv unset, and nv apply commands in addition to nv show commands. TACACS+ users with a lower privilege level can only execute nv show commands.
Supports up to seven TACACS+ servers. Be sure to configure your TACACS+ servers in addition to the TACACS+ client. Refer to your TACACS+ server documentation.
Install the TACACS+ Client Packages
You must install the TACACS+ client packages to use TACACS+. If you do not install the TACACS+ packages, you see the following message when you try to enable TACACS+ with the NVUE nv set system aaa tacacs enable on command:
'tacplus-client' package needs to be installed to enable tacacs
You can install the TACACS+ packages even if the switch is not connected to the internet; the packages are in the cumulus-local-apt-archive repository in the Cumulus Linux image.
To install all required packages, run these commands:
After you install the required TACACS+ packages, configure the following required settings on the switch (the TACACS+ client).
Set the IP address or hostname of at least one TACACS+ server.
Set the secret (key) shared between the TACACS+ server and client.
Set the VRF you want to use to communicate with the TACACS+ server. This is typically the management VRF (mgmt), which is the default VRF on the switch.
If you use NVUE commands to configure TACACS+, you must also set the priority for the authentication order for local and TACACS+ users, and enable TACACS+.
After you change TACACS+ settings, you must restart both nvued.service and nginx.service:
NVUE commands require you to specify the priority for each TACACS+ server. You must set a priority even if you only specify one server.
The following example commands set:
The TACACS+ server priority to 5.
The IP address of the server to 192.168.0.30.
The secret to mytacac$key.
If you include special characters in the password (such as $), you must enclose the password in single quotes (').
The VRF to mgmt.
The authentication order so that TACACS+ authentication has priority over local (the lower number has priority).
TACACS+ to enabled.
cumulus@switch:~$ nv set system aaa tacacs server 5 host 192.168.0.30
cumulus@switch:~$ nv set system aaa tacacs server 5 secret 'mytacac$key'
cumulus@switch:~$ nv set system aaa tacacs vrf mgmt
cumulus@switch:~$ nv set system aaa authentication-order 5 tacacs
cumulus@switch:~$ nv set system aaa authentication-order 10 local
cumulus@switch:~$ nv set system aaa tacacs enable on
cumulus@switch:~$ nv config apply
If you want the server to use IPv6, you must add the nv set system aaa tacacs server <priority> prefer-ip-version 6 command:
cumulus@switch:~$ nv set system aaa tacacs server 5 host server5
cumulus@switch:~$ nv set system aaa tacacs server 5 prefer-ip-version 6
...
If you configure more than one TACACS+ server, you need to set the priority for each server. If the switch cannot establish a connection with the server that has the highest priority, it tries to establish a connection with the next highest priority server. The server with the lower number has the higher priority. In the example below, server 192.168.0.30 with a priority value of 5 has a higher priority than server 192.168.1.30, which has a priority value of 10.
cumulus@switch:~$ nv set system aaa tacacs server 5 host 192.168.0.30
cumulus@switch:~$ nv set system aaa tacacs server 5 secret 'mytacac$key'
cumulus@switch:~$ nv set system aaa tacacs server 10 host 192.168.1.30
cumulus@switch:~$ nv set system aaa tacacs server 10 secret 'mytacac$key2'
cumulus@switch:~$ nv config apply
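The ordering rule (lower number, higher priority) can be sketched as a small shell function. The function name order_by_priority is hypothetical and is not an NVUE or Cumulus Linux command; it only illustrates the order in which the switch would try the servers.

```shell
# Hypothetical helper (not an NVUE or Cumulus Linux command): given
# "priority host" pairs on stdin, print the hosts in the order the
# switch would try them (lowest priority number first).
order_by_priority() {
    sort -n -k1,1 | awk '{ print $2 }'
}

# The two servers from the example above, listed out of order on purpose.
printf '%s\n' '10 192.168.1.30' '5 192.168.0.30' | order_by_priority
```

With the two example servers, 192.168.0.30 prints first because 5 is lower than 10.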
Edit the /etc/tacplus_servers file to add at least one server and one shared secret (key). You can specify the server and secret parameters in any order anywhere in the file. Whitespace (spaces or tabs) is not allowed. For example, if your TACACS+ server IP address is 192.168.0.30 and your shared secret is tacacskey, add these parameters to the /etc/tacplus_servers file:
Cumulus Linux supports a maximum of seven TACACS+ servers. To specify multiple servers, add one per line to the /etc/tacplus_servers file. The switch establishes connections in the order in which the servers appear in the file.
# If the management network is in a vrf, set this variable to the vrf name.
# This would usually be "mgmt"
# When this variable is set, the connection to the TACACS+ accounting servers
# will be made through the named vrf.
vrf=mgmt
Restart auditd:
cumulus@switch:~$ sudo systemctl restart auditd
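Before you restart auditd, it can help to check the file against the two rules stated above (no whitespace in parameter lines, at most seven servers). The following is a minimal sketch; lint_tacplus_servers is a hypothetical helper, not a tool that ships with Cumulus Linux, and the demo runs against a scratch file instead of /etc/tacplus_servers.

```shell
# Hypothetical lint for a tacplus_servers-style file (not a Cumulus Linux tool).
# Rules come from the text above: parameter lines contain no spaces or tabs,
# and the file lists at most seven server= entries.
lint_tacplus_servers() {
    awk '
        BEGIN { bad = 0; servers = 0 }
        /^[ \t]*#/ || /^[ \t]*$/ { next }      # skip comments and blank lines
        /[ \t]/ { printf "line %d: whitespace not allowed\n", NR; bad = 1 }
        /^server=/ { servers++ }
        END {
            if (servers > 7) { print "more than seven servers"; bad = 1 }
            exit bad
        }
    ' "$1"
}

# Demo on a scratch file; the second line is deliberately wrong.
demo=$(mktemp)
printf '%s\n' 'secret=tacacskey' 'server = 192.168.0.30' > "$demo"
lint_tacplus_servers "$demo" || echo "fix the file before restarting auditd"
rm -f "$demo"
```

To check the real file, you would run the function as root, for example sudo sh -c with the function defined, because the file is normally readable only by root.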
Optional TACACS+ Configuration
You can configure the following optional TACACS+ settings:
The port to use for communication between the TACACS+ server and client. By default, Cumulus Linux uses TCP port 49.
The TACACS timeout value, which is the number of seconds to wait for a response from the TACACS+ server before trying the next TACACS+ server. You can specify a value between 0 and 60. The default is 5 seconds.
The source IP address to use when communicating with the TACACS+ server so that the server can identify the client switch. You must specify an IPv4 address, which must be valid for the interface you use. This source IP address is typically the loopback address on the switch.
The TACACS+ authentication type. You can specify PAP to send clear text between the user and the server, CHAP to establish a PPP connection between the user and the server, or login. The default is PAP.
The users you do not want to send to the TACACS+ server for authentication; for example, local user accounts that exist on the switch, such as the cumulus user.
A separate home directory for each TACACS+ user when the TACACS+ user first logs in. By default, the switch uses the home directory in the mapping accounts in /etc/passwd. If the home directory does not exist, the mkhomedir_helper program creates it. This option does not apply to accounts with restricted shells (users mapped to a TACACS privilege level that has enforced per-command authorization).
The following example commands set the timeout to 10 seconds and the TACACS+ server port to 32:
cumulus@switch:~$ nv set system aaa tacacs timeout 10
cumulus@switch:~$ nv set system aaa tacacs server 5 port 32
cumulus@switch:~$ nv config apply
The following example commands set the source IP address to 10.10.10.1 and the authentication type to CHAP:
cumulus@switch:~$ nv set system aaa tacacs source-ip 10.10.10.1
cumulus@switch:~$ nv set system aaa tacacs authentication mode chap
cumulus@switch:~$ nv config apply
The following example commands exclude the user USER1 from going to the TACACS+ server for authentication and enable Cumulus Linux to create a separate home directory for each TACACS+ user when the TACACS+ user first logs in:
cumulus@switch:~$ nv set system aaa tacacs exclude-user USER1
cumulus@switch:~$ nv set system aaa tacacs authentication per-user-homedir on
cumulus@switch:~$ nv config apply
To set the server port (use the format server:port), source IP address, authentication type, and enable Cumulus Linux to create a separate home directory for each TACACS+ user, edit the /etc/tacplus_servers file, then restart auditd.
To set the timeout and the usernames to exclude from TACACS+ authentication, edit the /etc/tacplus_nss.conf file (you do not need to restart auditd).
The following example sets the server port to 32, the authentication type to CHAP, the source IP address to 10.10.10.1, and enables Cumulus Linux to create a separate home directory for each TACACS+ user when the TACACS+ user first logs in:
cumulus@switch:~$ sudo nano /etc/tacplus_servers
...
secret=mytacac$key
server=192.168.0.30:32
...
# Sets the IPv4 address used as the source IP address when communicating with
# the TACACS+ server. IPv6 addresses are not supported, nor are hostnames.
# The address must work when passed to the bind() system call, that is, it must
# be valid for the interface being used.
source_ip=10.10.10.1
...
# If user_homedir=1, then tacacs users will be set to have a home directory
# based on their login name, rather than the mapped tacacsN home directory.
# mkhomedir_helper is used to create the directory if it does not exist (similar
# to use of pam_mkhomedir.so). This flag is ignored for users with restricted
# shells, e.g., users mapped to a tacacs privilege level that has enforced
# per-command authorization (see the tacplus-restrict man page).
user_homedir=1
...
login=chap
cumulus@switch:~$ sudo systemctl restart auditd
The following example sets the timeout to 10 seconds and excludes the user USER1 from going to the TACACS+ server for authentication:
cumulus@switch:~$ sudo nano /etc/tacplus_nss.conf
...
# The connection timeout for an NSS library should be short, since it is
# invoked for many programs and daemons, and a failure is usually not
# catastrophic. Not set or set to a negative value disables use of poll().
# This follows the include of tacplus_servers, so it can override any
# timeout value set in that file.
# It's important to have this set in this file, even if the same value
# as in tacplus_servers, since tacplus_servers should not be readable
# by users other than root.
timeout=10
...
# This is a comma separated list of usernames that are never sent to
# a tacacs server, they cause an early not found return.
#
# "*" is not a wild card. While it's not a legal username, it turns out
# that during pathname completion, bash can do an NSS lookup on "*"
# To avoid server round trip delays, or worse, unreachable server delays
# on filename completion, we include "*" in the exclusion list.
exclude_users=root,daemon,nobody,cron,radius_user,radius_priv_user,sshd,cumulus,quagga,frr,snmp,www-data,ntp,man,_lldpd,USER1,*
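Editing the exclude_users list by hand is error-prone because the list is long. The sketch below adds one username and keeps "*" as the last entry, as in the example above. The helper name add_excluded_user is hypothetical, and keeping "*" last is an assumption taken from the sample file, not a documented requirement.

```shell
# Hypothetical helper: print a copy of a tacplus_nss.conf-style file with
# one username added to the exclude_users= list. An existing "*" entry is
# moved to the end; a username that is already listed is not added twice.
add_excluded_user() {
    user=$1
    file=$2
    awk -v u="$user" '
        function append(list, item) { return (list == "") ? item : list "," item }
        /^exclude_users=/ {
            n = split(substr($0, index($0, "=") + 1), names, ",")
            out = ""; star = 0; present = 0
            for (i = 1; i <= n; i++) {
                if (names[i] == "*") { star = 1; continue }
                if (names[i] == u) present = 1
                out = append(out, names[i])
            }
            if (!present) out = append(out, u)
            if (star) out = append(out, "*")
            print "exclude_users=" out
            next
        }
        { print }
    ' "$file"
}

# Demo on a scratch file with a shortened list.
demo=$(mktemp)
printf '%s\n' 'timeout=10' 'exclude_users=root,cumulus,*' > "$demo"
add_excluded_user USER1 "$demo"
rm -f "$demo"
```

The function writes to standard output only, so you can review the result before replacing the real file.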
Cumulus Linux supports the following additional Linux parameters in the /etc/tacplus_nss.conf file. Currently, there are no equivalent NVUE commands.
Linux Parameter | Description
include | Configures a supplemental configuration file to avoid duplicating configuration information. You can include up to eight additional configuration files. For example: include=/myfile/myname.
min_uid | Configures the minimum user ID that the NSS plugin can look up. 0 specifies that the plugin never looks up uid 0 (root). Do not specify a value greater than the local TACACS+ user IDs (0 through 15).
TACACS+ Accounting
When you install the TACACS+ packages and configure the basic TACACS+ settings (set the server and shared secret), accounting is on and there is no additional configuration required.
TACACS+ accounting uses the audisp module, with an additional plugin for auditd and audisp. The plugin maps the auid in the accounting record to a TACACS login, which it bases on the auid and sessionid. The audisp module requires libnss_tacplus and uses the libtacplus_map.so library interfaces as part of the modified libpam_tacplus package.
Communication with the TACACS+ servers occurs with the libsimple-tacacct1 library, through dlopen(). The accounting record carries a maximum of 240 bytes of the command name and arguments, due to the TACACS+ field length limitation of 255 bytes.
All sudo commands run by TACACS+ users generate accounting records against the original TACACS+ login name.
All Linux and NVUE commands result in an accounting record, including login commands and sub-processes of other commands. This can generate a lot of accounting records.
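To see whether a given command line stays within the 240-byte limit mentioned above, you can count its bytes. This is a minimal sketch; the helper names acct_cmd_bytes and fits_in_acct_record are hypothetical, and the sketch makes no claim about what the plugin does with longer command lines.

```shell
ACCT_MAX_BYTES=240   # limit quoted in the text above

# Hypothetical helpers: count the bytes in a command line (arguments joined
# by single spaces) and report whether the count stays within the limit.
acct_cmd_bytes() {
    printf '%s' "$*" | wc -c | tr -d ' '
}

fits_in_acct_record() {
    [ "$(acct_cmd_bytes "$@")" -le "$ACCT_MAX_BYTES" ]
}

acct_cmd_bytes ip link show swp1
fits_in_acct_record ip link show swp1 && echo "within the limit"
```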
By default, Cumulus Linux sends accounting records to all servers. You can change this setting to send accounting records to the server that is first to respond:
cumulus@switch:~$ nv set system aaa tacacs accounting send-records first-response
cumulus@switch:~$ nv config apply
To reset to the default configuration (send accounting records to all servers), run the nv set system aaa tacacs accounting send-records all command.
Edit the /etc/audisp/audisp-tac_plus.conf file and change the acct_all parameter to 0:
To reset to the default configuration (send accounting records to all servers), change the value of acct_all to 1 (acct_all=1).
To disable TACACS+ accounting:
cumulus@switch:~$ nv set system aaa tacacs accounting enable off
cumulus@switch:~$ nv config apply
Edit the /etc/audisp/plugins.d/audisp-tacplus.conf file and change the active parameter to no:
cumulus@switch:~$ sudo nano /etc/audisp/plugins.d/audisp-tacplus.conf
...
# default to enabling tacacs accounting; change to no to disable
active = no
Restart auditd:
cumulus@switch:~$ sudo systemctl restart auditd
Local Fallback Authentication
You can configure the switch to allow local fallback authentication for a user when the TACACS servers are unreachable, do not include the user for authentication, or have the user in the exclude user list.
To allow local fallback authentication for a user, add a local privileged user account on the switch with the same username as a TACACS user. A local user is always active even when the TACACS service is not running.
NVUE does not provide commands to configure local fallback authentication.
To configure local fallback authentication:
Edit the /etc/nsswitch.conf file to remove the keyword tacplus from the line starting with passwd. (You need to add the keyword back in step 3.)
The following example shows the /etc/nsswitch.conf file with no tacplus keyword in the line starting with passwd.
cumulus@switch:~$ sudo nano /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.
passwd: files
group: tacplus files
shadow: files
gshadow: files
...
To enable the local privileged user to run sudo and NVUE commands, run the adduser commands shown below. In the example commands, the TACACS account name is tacadmin.
The first adduser command prompts for information and a password. You can skip most of the requested information by pressing ENTER.
Edit the /etc/nsswitch.conf file to add the keyword tacplus back to the line starting with passwd (the keyword you removed in the first step).
cumulus@switch:~$ sudo nano /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.
passwd: tacplus files
group: tacplus files
shadow: files
gshadow: files
...
Restart the nvued service with the following command:
cumulus@switch:~$ sudo systemctl restart nvued
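Because the procedure removes and then restores the tacplus keyword, it is worth confirming the final state of the passwd line. The following is a minimal sketch; tacplus_is_first is a hypothetical helper, not part of Cumulus Linux.

```shell
# Hypothetical check (not part of Cumulus Linux): succeed only when the
# passwd line of an nsswitch.conf-style file lists tacplus first.
tacplus_is_first() {
    awk '
        BEGIN { rc = 1 }
        $1 == "passwd:" { if ($2 == "tacplus") rc = 0; exit }
        END { exit rc }
    ' "$1"
}

if tacplus_is_first /etc/nsswitch.conf; then
    echo "passwd lookups try tacplus first"
else
    echo "tacplus is missing from, or not first on, the passwd line"
fi
```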
TACACS+ Per-command Authorization
TACACS+ per-command authorization lets you configure the commands that TACACS+ users at different privilege levels can run.
To reach the TACACS+ server through the default VRF, you must specify the egress interface you use in the default VRF. Either run the NVUE nv set system aaa tacacs vrf <interface> command (for example, nv set system aaa tacacs vrf swp51) or set the vrf=<interface> option in the /etc/tacplus_servers file (for example, vrf=swp51).
The following command allows TACACS+ users at privilege level 0 to run the nv and ip commands (if authorized by the TACACS+ server):
cumulus@switch:~$ nv set system aaa tacacs authorization 0 command ip
cumulus@switch:~$ nv set system aaa tacacs authorization 0 command nv
cumulus@switch:~$ nv config apply
To show the per-command authorization settings, run the nv show system aaa tacacs authorization command:
cumulus@switch:~$ nv show system aaa tacacs authorization
Privilege Level role command
--------------- ------------ -------
0 nvue-monitor ip
nv
tacuser0@switch:~$ sudo tacplus-restrict -i -u tacacs0 -a ip nv
The tacplus-auth command handles authorization for each command. To make this an enforced authorization, change the TACACS+ login to use a restricted shell, with a very limited executable search path. Otherwise, the user can bypass the authorization. The tacplus-restrict utility simplifies setting up the restricted environment.
The following table provides the tacplus-restrict command options:
Option | Description
-i | Initializes the environment. You only need to issue this option one time per username.
-a | You can invoke the utility with the -a option as often as you like. For each command in the -a list, the utility creates a symbolic link from tacplus-auth to the relative portion of the command name in the local bin subdirectory. You also need to enable these commands on the TACACS+ server (refer to your TACACS+ server documentation). It is common for the server to allow some options to a command, but not others.
-f | Re-initializes the environment. If you need to restart, run the -f option with -i to force re-initialization; otherwise, the utility ignores repeated use of -i. During initialization, the user shell changes to /bin/rbash and the utility saves any existing dot files.
After running this command, examine the tacacs0 directory:
cumulus@switch:~$ sudo ls -lR ~tacacs0
total 12
lrwxrwxrwx 1 root root 22 Nov 21 22:07 ip -> /usr/sbin/tacplus-auth
lrwxrwxrwx 1 root root 22 Nov 21 22:07 nv -> /usr/sbin/tacplus-auth
Except for shell built-ins, privilege level 0 TACACS users can only run the ip and nv commands.
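The set of permitted commands is simply the set of symbolic links that point at tacplus-auth. The sketch below lists them; list_authorized_cmds is a hypothetical helper, and the demo builds a scratch directory that mimics the bin directory shown in the listing above instead of reading a real account.

```shell
# Hypothetical helper: list the commands that a restricted TACACS+ account
# may run, that is, the symbolic links in its bin directory that point at
# /usr/sbin/tacplus-auth.
list_authorized_cmds() {
    for link in "$1"/*; do
        [ -L "$link" ] || continue
        if [ "$(readlink "$link")" = /usr/sbin/tacplus-auth ]; then
            basename "$link"
        fi
    done
}

# Demo against a scratch directory that mimics ~tacacs0/bin.
demo=$(mktemp -d)
ln -s /usr/sbin/tacplus-auth "$demo/ip"
ln -s /usr/sbin/tacplus-auth "$demo/nv"
list_authorized_cmds "$demo"
rm -rf "$demo"
```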
If you add commands with the -a option by mistake, you can remove them. The example below removes the nv command:
cumulus@switch:~$ sudo rm ~tacacs0/bin/nv
To remove all commands:
cumulus@switch:~$ sudo rm ~tacacs0/bin/*
Remove the TACACS+ Client Packages
To remove all the TACACS+ client packages, use the following commands:
Run the following commands to show TACACS+ configuration:
To show all TACACS+ configuration (NVUE hides server secret keys), run the nv show system aaa tacacs command.
To show TACACS+ authentication configuration, run the nv show system aaa tacacs authentication command.
To show TACACS+ accounting configuration, run the nv show system aaa tacacs accounting command.
To show TACACS+ server configuration, run the nv show system aaa tacacs server command.
To show TACACS+ server priority configuration, run the nv show system aaa tacacs server <priority-id> command.
To show the list of users excluded from TACACS+ server authentication, run the nv show system aaa tacacs exclude-user command.
The following example command shows all TACACS+ configuration:
cumulus@switch:~$ nv show system aaa tacacs
applied
------------------ -------
enable off
debug-level 0
timeout 5
vrf mgmt
accounting
enable off
authentication
mode pap
per-user-homedir off
[server] 5
[server] 10
The following command shows the list of users excluded from TACACS+ server authentication:
cumulus@switch:~$ nv show system aaa tacacs exclude-user
applied
-------- -------
username USER1
Basic Server Connectivity or NSS Issues
You can use the getent command to determine if you configured TACACS+ correctly and if the local password is in the configuration files. In the example commands below, the cumulus user represents the local user, while cumulusTAC represents the TACACS user.
To look up the username within all NSS methods:
cumulus@switch:~$ sudo getent passwd cumulusTAC
cumulusTAC:x:1016:1001:TACACS+ mapped user at privilege level 15,,,:/home/tacacs15:/bin/bash
To look up the user within the local database only:
To look up the user within the TACACS+ database only:
cumulus@switch:~$ sudo getent -s tacplus passwd cumulusTAC
cumulusTAC:x:1016:1001:TACACS+ mapped user at privilege level 15,,,:/home/tacacs15:/bin/bash
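The comment field of the getent output carries the mapped privilege level. A minimal sketch for extracting it follows; priv_level_of is a hypothetical helper, and it assumes the comment text has the exact form shown in the output above.

```shell
# Hypothetical helper: pull the privilege level out of the comment text that
# getent prints for a mapped TACACS+ user. Prints nothing for other users.
priv_level_of() {
    printf '%s\n' "$1" |
        sed -n 's/.*TACACS+ mapped user at privilege level \([0-9][0-9]*\).*/\1/p'
}

priv_level_of 'cumulusTAC:x:1016:1001:TACACS+ mapped user at privilege level 15,,,:/home/tacacs15:/bin/bash'
```

On a switch, you could pass the live lookup instead of the literal line, for example priv_level_of "$(sudo getent passwd cumulusTAC)".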
If TACACS+ is not working correctly, you can use debugging. Add the debug=1 parameter to the /etc/tacplus_servers and /etc/tacplus_nss.conf files; see the Linux Commands under Optional TACACS+ Configuration above. You can also add debug=1 to individual pam_tacplus lines in /etc/pam.d/common*.
All log messages are in /var/log/syslog.
Incorrect Shared Key
The TACACS client on the switch and the TACACS server must have the same shared secret key. If this key is incorrect, the following message prints to syslog:
2017-09-05T19:57:00.356520+00:00 leaf01 sshd[3176]: nss_tacplus: TACACS+ server 192.168.0.254:49 read failed with protocol error (incorrect shared secret?) user cumulus
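To find which server reports the failure, you can filter syslog for this message. The sketch below is a hypothetical helper (servers_with_bad_secret) that assumes the message format shown above; it reads log lines on standard input so that the demo can use the sample line.

```shell
# Hypothetical helper: read syslog-style lines on stdin and print each
# TACACS+ server (host:port) that logged a shared-secret failure.
servers_with_bad_secret() {
    sed -n 's/.*TACACS+ server \([^ ]*\) read failed .*incorrect shared secret.*/\1/p' |
        sort -u
}

printf '%s\n' '2017-09-05T19:57:00.356520+00:00 leaf01 sshd[3176]: nss_tacplus: TACACS+ server 192.168.0.254:49 read failed with protocol error (incorrect shared secret?) user cumulus' |
    servers_with_bad_secret
```

On a switch, you could feed the function from the log file, for example sudo cat /var/log/syslog | servers_with_bad_secret.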
Debug Issues with Per-command Authorization
To debug TACACS user command authorization, have the TACACS+ user enter the following command at a shell prompt, then try the command again:
tacuser0@switch:~$ export TACACSAUTHDEBUG=1
When you enable debugging, the command authorization conversation with the TACACS+ server shows additional information.
To disable debugging:
tacuser0@switch:~$ export -n TACACSAUTHDEBUG
Debug Issues with Accounting Records
If you add or delete TACACS+ servers from the configuration files, make sure you notify the audisp plugin with this command:
If accounting records are not sent, add debug=1 to the /etc/audisp/audisp-tac_plus.conf file, then run the command above to notify the plugin. Ask the TACACS+ user to run a command and examine the end of /var/log/syslog for messages from the plugin. You can also check the auditing log file /var/log/audit/audit.log to be sure the auditing records exist. If the auditing records do not exist, restart the audit daemon with:
Cumulus Linux uses the following packages for TACACS.
Package | Description
audisp-tacplus | Uses auditing data from auditd to send accounting records to the TACACS+ server and starts as part of auditd.
libtac2 | Provides basic TACACS+ server utility and communication routines.
libnss-tacplus | Provides an interface between libc username lookups, the mapping functions, and the TACACS+ server.
tacplus-auth | Includes the tacplus-restrict setup utility, which enables you to perform per-command TACACS+ authorization. Per-command authorization is not the default.
libpam-tacplus | Provides a modified version of the standard Debian package.
libtacplus-map1 | Provides mapping between local and TACACS+ users on the server. The package sets the immutable sessionid and auditing UID to ensure that you can track the original user through multiple processes and privilege changes, sets the auditing loginuid as immutable, and creates and maintains a status database in /run/tacacs_client_map to manage and look up mappings.
libsimple-tacacct1 | Provides an interface for programs to send accounting records to the TACACS+ server. audisp-tacplus uses this package.
libtac2-bin | Provides the tacc testing program and TACACS+ man page.
TACACS+ Client Configuration Files
The following table describes the TACACS+ client configuration files that Cumulus Linux uses.
Filename | Description
/etc/tacplus_servers | The primary file that requires configuration after installation. All packages with include=/etc/tacplus_servers parameters use this file. Typically, this file contains the shared secrets; make sure that the Linux file mode is 600.
/etc/nsswitch.conf | When the libnss_tacplus package installs, this file configures tacplus lookups through libnss_tacplus. If you replace this file by automation, you need to add tacplus as the first lookup method for the passwd database line.
/etc/tacplus_nss.conf | Sets the basic parameters for libnss_tacplus. The file includes a debug variable for debugging NSS lookups separately from other client packages.
/usr/share/pam-configs/tacplus | The configuration file for pam-auth-update to generate the files in the next row. The file uses these configurations at login, by su, and by ssh.
/etc/pam.d/common-* | The /etc/pam.d/common-* files update for tacplus authentication. The files update with pam-auth-update when you install or remove libpam-tacplus.
/etc/sudoers.d/tacplus | Allows TACACS+ privilege level 15 users to run commands with sudo. The file includes an example (commented out) of how to enable privilege level 15 TACACS users to use sudo without a password and provides an example of how to enable all TACACS users to run specific commands with sudo. Only edit this file with the visudo -f /etc/sudoers.d/tacplus command.
/etc/audisp/plugins.d/audisp-tacplus.conf | The audisp plugin configuration file. You do not need to modify this file.
/etc/audisp/audisp-tac_plus.conf | The TACACS+ server configuration file for accounting. You do not need to modify this file. You can use this configuration file when you only want to debug TACACS+ accounting issues, not all TACACS+ users.
/etc/audit/rules.d/audisp-tacplus.rules | The auditd rules for TACACS+ accounting. The augenrules command uses all rule files to generate the rules file.
/etc/audit/audit.rules | The audit rules file that is generated when you install auditd.
Considerations
Multiple TACACS+ Users
If two or more TACACS+ users log in simultaneously with the same privilege level, while the accounting records are correct, a lookup on either name matches both users, while a UID lookup only returns the user that logs in first.
As a result, any processes that either user runs apply to both and all files either user creates apply to the first name matched. This is similar to adding two local users to the password file with the same UID and GID and is an inherent limitation of using the UID for the base user from the password file.
The current algorithm returns the first name matching the UID from the mapping file; either the first or the second user that logs in.
To work around this issue, you can use the switch audit log or the TACACS server accounting logs to determine which processes and files each user creates.
For commands that do not execute other commands (for example, changes to configurations in an editor or actions with tools like clagctl and vtysh), there is no additional accounting.
Per-command authorization is at the most basic level (Cumulus Linux uses standard Linux user permissions for the local TACACS users and only privilege level 15 users can run sudo commands by default).
The Linux auditd system does not always generate audit events for processes when terminated with a signal (with the kill system call or internal errors such as SIGSEGV). As a result, processes that exit on a signal that you do not handle, generate a STOP accounting record.
Issues with the deluser Command
TACACS+ and other non-local users that run the deluser command with the --remove-home option see the following error:
tacuser0@switch:~$ deluser --remove-home USERNAME
userdel: cannot remove entry 'USERNAME' from /etc/passwd
/usr/sbin/deluser: `/usr/sbin/userdel USERNAME' returned error code 1. Exiting
The command does remove the home directory. The user can still log in on that account but does not have a valid home directory. This is a known upstream issue with the deluser command for all non-local users.
Only use the --remove-home option when the user_homedir=1 setting is configured.
Both TACACS+ and RADIUS AAA Clients
When you install both the TACACS+ and the RADIUS AAA client, Cumulus Linux does not attempt RADIUS login. As a workaround, do not install both the TACACS+ and the RADIUS AAA client on the same switch.
TACACS+ and PAM
PAM modules and an updated version of the libpam-tacplus package configure authentication initially. When you install the package, the pam-auth-update command updates the PAM configuration in /etc/pam.d. If you make changes to your PAM configuration, you need to integrate these changes. If you also use LDAP with the libpam-ldap package, you need to edit the PAM configuration with the LDAP and TACACS ordering you prefer. The libpam-tacplus package ignore rules and the values in success=2 require adjustments to ignore LDAP rules.
The TACACS+ privilege attribute priv_lvl determines the privilege level for the user that the TACACS+ server returns during the user authorization exchange. The client accepts the attribute in either the mandatory or optional forms and also accepts priv-lvl as the attribute name. The attribute value must be a numeric string in the range 0 to 15, with 15 the most privileged level.
By default, TACACS+ users at privilege levels other than 15 cannot run sudo commands and can only run commands with standard Linux user permissions.
You can edit the /etc/pam.d/common-* files manually. However, if you run pam-auth-update again after making the changes, the update fails. Only configure /usr/share/pam-configs/tacplus, then run pam-auth-update.
NSS Plugin
With pam_tacplus, TACACS+ authenticated users can log in without a local account on the system using the NSS plugin that comes with the tacplus_nss package. The plugin uses the mapped tacplus information if the user is not in the local password file, provides the getpwnam() and getpwuid() entry points, and uses the TACACS+ authentication functions.
The plugin asks the TACACS+ server if it knows the user, and then for relevant attributes to determine the privilege level of the user. When you install the libnss_tacplus package, nsswitch.conf changes to set tacplus as the first lookup method for passwd. If you change the order, lookups return the local accounts, such as tacacs0.
If the TACACS+ server does not find the user, it uses the libtacplus.so exported functions to do a mapped lookup. The privilege level appends to tacacs and the lookup searches for the name in the local password file. For example, privilege level 15 searches for the tacacs15 user. If the TACACS+ server finds the user, it adds information for the user in the password structure.
If the TACACS+ server does not find the user, it decrements the privilege level and checks again until it reaches privilege level 0 (user tacacs0). This allows you to use only the two local users tacacs0 and tacacs15, for minimal configuration.
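One reading of the decrementing lookup described above can be sketched as follows. The sketch assumes that each step checks for a local tacacsN account in a passwd-style file and stops at the first match. The function name map_tacacs_user and the sample passwd entries are hypothetical and for illustration only.

```shell
# Hypothetical sketch of the mapped lookup: start at the user's privilege
# level and walk down to 0 until a local tacacsN account exists.
map_tacacs_user() {
    level=$1
    passwd_file=$2
    while [ "$level" -ge 0 ]; do
        if grep -q "^tacacs${level}:" "$passwd_file"; then
            echo "tacacs${level}"
            return 0
        fi
        level=$((level - 1))
    done
    return 1
}

# Demo with only the two local accounts tacacs0 and tacacs15 (sample entries).
demo=$(mktemp)
printf '%s\n' 'tacacs0:x:1000:1001::/home/tacacs0:/bin/bash' \
              'tacacs15:x:1016:1001::/home/tacacs15:/bin/bash' > "$demo"
map_tacacs_user 15 "$demo"
map_tacacs_user 7 "$demo"
rm -f "$demo"
```

With only tacacs0 and tacacs15 defined, a level 15 user maps to tacacs15 and a level 7 user falls back to tacacs0.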
TACACS+ Client Sequencing
Cumulus Linux requires the following information at the beginning of the AAA sequence:
Whether the user is a valid TACACS+ user
The user privilege level
For non-local users (users not in the local password file) you need to send a TACACS+ authorization request as the first communication with the TACACS+ server, before authentication and before the user logging in requests a password.
You need to configure certain TACACS+ servers to allow authorization requests before authentication. Contact your TACACS+ server vendor for information.
Multiple Servers with Different User Accounts
If you configure multiple TACACS+ servers that have different user accounts:
TACACS+ authentication allows for fall through; if the first reachable server does not authenticate the user, the client tries the second server, and so on.
TACACS+ authorization does not fall through. If the first reachable server returns an unauthorized result, the command is unauthorized and the client does not try the next server.
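The difference between the two behaviors can be modeled with a small simulation. The function aaa_decision and the result keywords (unreachable, deny, permit) are hypothetical; the sketch takes one result per server in priority order and does not contact any server. It reports deny when no server permits the request, which ignores any local fallback.

```shell
# Hypothetical simulation: the first argument is "authentication" or
# "authorization"; the remaining arguments are the results that each server
# would return, in priority order (unreachable | deny | permit).
aaa_decision() {
    mode=$1
    shift
    for result in "$@"; do
        case $result in
            unreachable) continue ;;          # try the next server
            permit) echo permit; return 0 ;;
            deny)
                # Authorization stops at the first reachable server;
                # authentication falls through to the next one.
                if [ "$mode" = authorization ]; then
                    echo deny
                    return 1
                fi
                ;;
        esac
    done
    echo deny
    return 1
}

aaa_decision authentication deny permit
aaa_decision authorization deny permit
```

The first call prints permit because authentication falls through to the second server; the second call prints deny because authorization stops at the first reachable server.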
RADIUS AAA
Various add-on packages enable RADIUS users to log in to Cumulus Linux switches in a transparent way with minimal configuration. There is no need to create accounts or directories on the switch. Authentication uses PAM and includes login, ssh, sudo and su.
Install the RADIUS Packages
You can install the RADIUS packages even if the switch is not connected to the internet, as they are in the cumulus-local-apt-archive repository, which is embedded in the Cumulus Linux image.
After installation is complete, either reboot the switch or run the sudo systemctl restart nvued command.
The libpam-radius-auth package supplied with the Cumulus Linux RADIUS client is a newer version than the one in Debian Buster. This package contains support for IPv6, the src_ip option described below, as well as bug fixes and minor features. The package also includes VRF support, provides man pages describing the PAM and RADIUS configuration, and sets the SUDO_PROMPT environment variable to the login name for RADIUS mapping support.
The libnss-mapuser package is specific to Cumulus Linux and supports the getgrent, getgrnam and getgrgid library interfaces. These interfaces add logged in RADIUS users to the group member list for groups that contain the mapped_user (radius_user) if the RADIUS account does not have privileges, and add privileged RADIUS users to the group member list for groups that contain the mapped_priv_user (radius_priv_user) during the group lookups.
During package installation:
The PAM configuration updates automatically using pam-auth-update (8), and the NSS configuration file /etc/nsswitch.conf adds the mapuser and mapuid plugins. If you remove or purge the packages, these files remove the configuration for these plugins.
The radius_shell package installs the /sbin/radius_shell and setcap cap_setuid program for the login shell for RADIUS accounts. The package adjusts the UID when needed, then runs the bash shell with the same arguments. When installed, the package changes the shell of the RADIUS accounts to /sbin/radius_shell, and to /bin/shell if you remove the package. You need this package to enable privileged RADIUS users. You do not need this package for regular RADIUS clients.
The nvshow group includes the radius_user account; the nvset, nvapply, and sudo groups include the radius_priv_user account. This change enables all RADIUS logins to run NVUE nv show commands and all privileged RADIUS users to also run nv set, nv unset, and nv apply commands, and to use sudo.
Configure the RADIUS Client
To configure the RADIUS client, edit the /etc/pam_radius_auth.conf file.
After editing the /etc/pam_radius_auth.conf file, you must restart both nvued.service and nginx.service:
Add the hostname or IP address of at least one RADIUS server (such as a freeradius server on Linux), and the shared secret used to authenticate and encrypt communication with each server.
You must be able to resolve the hostname of the switch to an IP address. If for some reason you cannot find the hostname in DNS, you can add the hostname to the /etc/hosts file manually. However, this can cause problems because DHCP assigns the IP address, which can change at any time.
Multiple server configuration lines are verified in the order listed. Other than memory, there is no limit to the number of RADIUS servers you can use.
The server port number or name is optional. The system looks up the port in the /etc/services file. However, you can override the ports in the /etc/pam_radius_auth.conf file.
If the server is slow or latencies are high, change the timeout setting. The setting defaults to 3 seconds.
If you want to use a specific interface to reach the RADIUS server, specify the src_ip option. You can specify the hostname of the interface, an IPv4 address, or an IPv6 address. If you specify the src_ip option, you must also specify the timeout option.
Set the vrf-name field. This is typically set to mgmt if you are using a management VRF. You cannot specify more than one VRF.
The configuration file includes the mapped_priv_user field that sets the account used for privileged RADIUS users and the priv-lvl field that sets the minimum value for the privilege level to be a privileged login (the default value is 15). If you edit these fields, make sure the values match those set in the /etc/nss_mapuser.conf file.
The following example provides a sample /etc/pam_radius_auth.conf file configuration:
mapped_priv_user radius_priv_user
# server[:port] shared_secret timeout (secs) src_ip
192.168.0.254 secretkey
other-server othersecret 3 192.168.1.10
# when mgmt vrf is in use
vrf-name mgmt
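To make the field layout of the server lines concrete, the following Python sketch splits the sample above into its fields. It is an illustration only and does not reflect how the PAM module itself reads the file: the parse_servers function, the KEYWORDS list, and the dictionary keys are hypothetical names, and the 3-second default comes from the timeout description above. The server field is kept whole, so an optional :port suffix is not separated.

```python
DEFAULT_TIMEOUT = 3  # seconds, the documented default

# Settings that are not server lines (assumed list, taken from this section).
KEYWORDS = {"mapped_priv_user", "priv-lvl", "vrf-name", "debug"}

SAMPLE = """\
mapped_priv_user radius_priv_user
# server[:port] shared_secret timeout (secs) src_ip
192.168.0.254 secretkey
other-server othersecret 3 192.168.1.10
# when mgmt vrf is in use
vrf-name mgmt
"""


def parse_servers(text):
    """Return the server lines of a pam_radius_auth.conf-style text, in file order."""
    servers = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, keyword settings, and incomplete lines.
        if len(fields) < 2 or fields[0].startswith("#") or fields[0] in KEYWORDS:
            continue
        servers.append({
            "server": fields[0],
            "secret": fields[1],
            "timeout": int(fields[2]) if len(fields) > 2 else DEFAULT_TIMEOUT,
            "src_ip": fields[3] if len(fields) > 3 else None,
        })
    return servers


servers = parse_servers(SAMPLE)
for entry in servers:
    print(entry)
```

The list keeps the file order because the servers are tried in the order listed.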
If this is the first time you are configuring the RADIUS client, uncomment the debug line for troubleshooting. The debugging messages write to /var/log/syslog. When the RADIUS client is working correctly, comment out the debug line.
As an optional step, you can set PAM configuration keywords by editing the /usr/share/pam-configs/radius file. After you edit the file, you must run the pam-auth-update --package command. The pam_radius_auth (8) man page describes the PAM configuration keywords.
The value of the VSA (Vendor Specific Attribute) shell:priv-lvl determines the privilege level for the user on the switch. If the attribute does not return, the user does not have privileges. The following shows an example using the freeradius server for a fully privileged user.
The VSA vendor name (Cisco-AVPair in the example above) can have any content. The RADIUS client only checks for the string shell:priv-lvl.
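The privilege decision described in this section can be summarized in a few lines of Python. This is a minimal sketch, not the client code: mapped_account is a hypothetical helper, the account names are the defaults named in this section, and the comparison assumes that a shell:priv-lvl value equal to or above the configured priv-lvl counts as privileged.

```python
DEFAULT_PRIV_LVL = 15  # documented default minimum for a privileged login


def mapped_account(priv_lvl, minimum=DEFAULT_PRIV_LVL):
    """Return the fixed account that a RADIUS login maps to.

    priv_lvl is the integer carried by shell:priv-lvl, or None when the
    server does not return the attribute.
    """
    if priv_lvl is not None and priv_lvl >= minimum:
        return "radius_priv_user"
    return "radius_user"


print(mapped_account(15))    # radius_priv_user
print(mapped_account(None))  # radius_user
```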
Enable Login without Local Accounts
Because LDAP is not commonly used with switches and adding accounts locally is cumbersome, Cumulus Linux includes a mapping capability with the libnss-mapuser package.
Mapping uses two NSS (Name Service Switch) plugins, one for account name, and one for UID lookup. The installation process configures these plugins automatically in the /etc/nsswitch.conf file and removes them when you delete the package. See the nss_mapuser (8) man page for the full description of this plugin.
A username is mapped at login to a fixed account specified in the configuration file, with the fields of the fixed account used as a template for the user that is logging in.
For example, if you look up the name dave and the fixed account in the configuration file is radius_user, and that entry in /etc/passwd is:
then the matching line that returns when you run getent passwd dave is:
cumulus@switch:~$ getent passwd dave
dave:x:1017:1002:dave mapped user:/home/dave:/bin/bash
The login process creates the home directory /home/dave if it does not already exist and uses the mkhomedir_helper command to populate it with the standard skeleton files.
The configuration file /etc/nss_mapuser.conf configures the plugins. The file includes the mapped account name, which is radius_user by default. You can change the mapped account name by editing the file. The nss_mapuser (5) man page describes the configuration file.
A flat file mapping derives from the session number assigned during login, which persists across su and sudo. Cumulus Linux removes the mapping at logout.
Local Fallback Authentication
If a site wants to allow local fallback authentication for a user when none of the RADIUS servers are reachable, you can add a privileged user account as a local account on the switch. The local account must have the same unique identifier as the privileged user and the shell must be the same.
To configure local fallback authentication:
Add a local privileged user account. For example, if the radius_priv_user account in the /etc/passwd file is radius_priv_user:x:1002:1001::/home/radius_priv_user:/sbin/radius_shell, run the following command to add a local privileged user account named johnadmin:
The RADIUS fixed account is not removed from the /etc/passwd or /etc/group file and the home directories are not removed. They remain in case there are modifications to the account or files in the home directories.
To remove the home directories of the RADIUS users, first get the list by running:
cumulus@switch:~$ sudo ls -l /home | grep radius
For all users listed, except the radius_user, run this command to remove the home directories:
where USERNAME is the account name (the name of the home directory under /home). This command gives the following warning because the user is not listed in the /etc/passwd file.
userdel: cannot remove entry 'USERNAME' from /etc/passwd
/usr/sbin/deluser: `/usr/sbin/userdel USERNAME' returned error code 1. Exiting.
After you remove all the RADIUS users, run the command to remove the fixed account. If there are changes to the account in the /etc/nss_mapuser.conf file, use that account name instead of radius_user.
If two or more RADIUS users log in simultaneously, a UID lookup only returns the user that logs in first. Any process that either user runs applies to both, and all files that either user creates apply to the first name matched. This process is similar to adding two local users to the password file with the same UID and GID, and is an inherent limitation of using the UID for the fixed user from the password file. The current algorithm returns the first name matching the UID from the mapping file, which is either the first or second user that logs in.
When you install both the TACACS+ and the RADIUS AAA client, Cumulus Linux does not attempt the RADIUS login. As a workaround, do not install both the TACACS+ and the RADIUS AAA client on the same switch.
When the RADIUS server is reachable outside of the management VRF, such as in the default VRF, you might see the following error message when you try to run sudo:
2008-10-31T07:06:36.191359+00:00 SW01 sudo: pam_radius_auth(sudo:auth): Bind for server 10.1.1.25 failed: Cannot assign requested address
2008-10-31T07:06:36.192307+00:00 sw01 sudo: pam_radius_auth(sudo:auth): No valid server found in configuration file /etc/pam_radius_auth.conf
The error occurs because sudo tries to authenticate to the RADIUS server through the management VRF. Before you run sudo, you must set the shell to the correct VRF:
Netfilter is the packet filtering framework in Cumulus Linux and other Linux distributions. You can use several different tools to configure ACLs in Cumulus Linux:
iptables, ip6tables, and ebtables are Linux userspace tools you use to administer filtering rules for IPv4 packets, IPv6 packets, and Ethernet frames (layer 2 using MAC addresses).
cl-acltool is a Cumulus Linux-specific userspace tool you use to administer filtering rules and configure default ACLs. cl-acltool operates on various configuration files and uses iptables, ip6tables, and ebtables to install rules into the kernel. In addition, cl-acltool programs rules in hardware for switch port interfaces, which iptables, ip6tables and ebtables cannot do on their own.
NVUE is a Cumulus Linux-specific userspace tool you can use to configure custom ACLs.
Traffic Rules
Chains
Netfilter describes the way that the Linux kernel classifies and controls packets to, from, and across the switch. Netfilter does not require a separate software daemon to run; it is part of the Linux kernel. Netfilter asserts policies at layer 2, 3 and 4 of the OSI model by inspecting packet and frame headers according to a list of rules. The iptables, ip6tables, and ebtables userspace applications provide syntax you use to define rules.
The rules inspect or operate on packets at several points (chains) in the life of the packet through the system:
PREROUTING touches packets before the switch routes them.
INPUT touches packets after the switch determines that the packets are for the local system but before the control plane software receives them.
FORWARD touches transit traffic as it moves through the switch.
OUTPUT touches packets from the control plane software before they leave the switch.
POSTROUTING touches packets immediately before they leave the switch but after a routing decision.
Tables
When you build rules to affect the flow of traffic, tables can access the individual chains. Linux provides three tables by default:
Filter classifies traffic or filters traffic
NAT applies Network Address Translation rules
Mangle alters packets as they move through the switch
Each table has a set of default chains that modify or inspect packets at different points of the path through the switch. Chains contain the individual rules to influence traffic.
Rules
Rules classify the traffic you want to control. You apply rules to chains, which attach to tables.
Rules have several different components:
Table: The first argument is the table.
Chain: The second argument is the chain. Each table supports several different chains. See Tables above.
Matches: The third argument is the match. You can specify multiple matches in a single rule. However, the more matches you use in a rule, the more memory the rule consumes.
Jump: The jump specifies the target of the rule; what action to take if the packet matches the rule. If you omit this option in a rule, matching the rule has no effect on the packet, but the counters on the rule increment.
Targets: The target is a user-defined chain (other than the one this rule is in), one of the special built-in targets that decides the fate of the packet immediately (like DROP), or an extended target. See Supported Rule Types below for different target examples.
How Rules Parse and Apply
The switch reads all the rules from each chain from iptables, ip6tables, and ebtables and enters them in order into either the filter table or the mangle table. The switch reads the rules from the kernel in the following order:
IPv6 (ip6tables)
IPv4 (iptables)
ebtables
When you combine and put rules into one table, the order determines the relative priority of the rules; iptables and ip6tables have the highest precedence and ebtables has the lowest.
The Linux packet forwarding construct is an overlay for how the silicon underneath processes packets. Be aware of the following:
The switch silicon reorders rules when switchd writes to the ASIC, whereas traditional iptables executes the list of rules in order.
All rules, except for POLICE and SETCLASS rules, are terminating; after a rule matches, the action occurs and no more rules process.
When processing traffic, rules affecting the FORWARD chain that specify an ingress interface process before rules that match on an egress interface. As a workaround, rules that only affect the egress interface can have an ingress interface wildcard (only swp+ and bond+) that matches any interface you apply so that you can maintain order of operations with other input interface rules. For example, with the following rules:
-A FORWARD -i swp1 -j ACCEPT
-A FORWARD -o swp1 -j ACCEPT <-- This rule processes LAST (because of egress interface matching)
-A FORWARD -i swp2 -j DROP
If you modify the rules like this, they process in order:
-A FORWARD -i swp1 -j ACCEPT
-A FORWARD -i swp+ -o swp1 -j ACCEPT <-- These rules are performed in order (because of wildcard match on the ingress interface)
-A FORWARD -i swp2 -j DROP
When using rules that do a mangle and a filter lookup for a packet, Cumulus Linux processes them in parallel and combines the action.
If there is no ingress interface or egress interface match, Cumulus Linux installs FORWARD chain rules in ingress by default.
When using the OUTPUT chain, you must assign rules to the source. For example, if you assign a rule to the switch port in the direction of traffic but the source is a bridge (VLAN), the rule does not affect the traffic and you must apply it to the bridge.
If you need to apply a rule to all transit traffic, use the FORWARD chain, not the OUTPUT chain.
The switch puts ebtables rules into either the IPv4 or IPv6 memory space depending on whether the rule uses IPv4 or IPv6 to make a decision. The switch only puts layer 2 rules that match the MAC address into the IPv4 memory space.
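The terminating behavior in the list above can be modeled with a short Python sketch. It is a simplified software model, not a description of how the ASIC evaluates rules: actions_for and the example rules are hypothetical, rules are checked in list order, and the reordering that the silicon performs is not modeled.

```python
# Targets that do not stop rule processing, per the list above.
NON_TERMINATING = {"POLICE", "SETCLASS"}


def actions_for(packet, rules):
    """Return the targets applied to a packet, stopping at the first terminating match."""
    applied = []
    for matches, target in rules:
        if matches(packet):
            applied.append(target)
            if target not in NON_TERMINATING:
                break  # a terminating rule matched; no more rules process
    return applied


rules = [
    (lambda p: p["dscp"] == "CS1", "SETCLASS"),
    (lambda p: p["proto"] == "icmp", "DROP"),
    (lambda p: True, "ACCEPT"),
]

print(actions_for({"proto": "icmp", "dscp": "CS1"}, rules))  # ['SETCLASS', 'DROP']
print(actions_for({"proto": "tcp", "dscp": "CS0"}, rules))   # ['ACCEPT']
```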
Rule Placement in Memory
INPUT and ingress (FORWARD -i) rules occupy the same memory space. A rule counts as ingress if you set the -i option. If you set both input and output options (-i and -o), the switch considers the rule as ingress and occupies that memory space. For example:
If you remove the -o option and the interface, it is a valid rule.
Nonatomic Update Mode and Atomic Update Mode
Cumulus Linux enables atomic update mode by default. However, this mode limits the number of ACL rules that you can configure.
To increase the number of configurable ACL rules, configure the switch to operate in nonatomic mode.
Instead of reserving 50% of your TCAM space for atomic updates, incremental update uses the available free space to write the new TCAM rules and swap over to the new rules after this is complete. Cumulus Linux then deletes the old rules and frees up the original TCAM space. If there is insufficient free space to complete this task, the original nonatomic update runs, which interrupts traffic.
You can enable nonatomic updates for switchd, which offer better scaling because all TCAM resources actively impact traffic. With atomic updates, half of the hardware resources are on standby and do not actively impact traffic.
Incremental nonatomic updates are table based, so they do not interrupt network traffic when you install new rules. The rules map to the following tables and update in this order:
mirror (ingress only)
ipv4-mac (can be both ingress and egress)
ipv6 (ingress only)
The incremental nonatomic update operation follows this order:
Updates are incremental, one table at a time without stopping traffic.
Cumulus Linux checks if the rules in a table are different from installation time; if a table does not have any changes, it does not reinstall the rules.
If there are changes in a table, the new rules populate in new groups or slices in hardware, then that table switches over to the new groups or slices.
Finally, old resources for that table free up. This process repeats for each of the tables listed above.
If there are insufficient resources to hold both the new rule set and old rule set, Cumulus Linux tries the regular nonatomic mode, which interrupts network traffic.
If the regular nonatomic update fails, Cumulus Linux reverts back to the previous rules.
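The decision flow above can be sketched in Python as follows. This is a toy model under stated assumptions, not the switchd algorithm: each rule is assumed to occupy one entry, the capacity values are invented for the example, and a regular nonatomic update is assumed to fail only when the new rule set exceeds the table capacity. The function names are hypothetical.

```python
# Tables update in this order, per the list above.
TABLE_ORDER = ["mirror", "ipv4-mac", "ipv6"]


def update_table(old_rules, new_rules, capacity):
    """Classify how one table would be updated."""
    if new_rules == old_rules:
        return "unchanged"           # rules are not reinstalled
    free = capacity - len(old_rules)
    if len(new_rules) <= free:
        return "incremental"         # new rules written alongside the old ones, then swapped
    if len(new_rules) <= capacity:
        return "regular-nonatomic"   # old rules removed first; traffic is interrupted
    return "reverted"                # update failed; previous rules stay in place


def plan_update(old, new, capacity):
    """Return (table, outcome) pairs in update order."""
    return [(t, update_table(old.get(t, []), new.get(t, []), capacity[t]))
            for t in TABLE_ORDER]


old = {"mirror": ["m1"],
       "ipv4-mac": ["a", "b", "c", "d"],
       "ipv6": ["r1", "r2", "r3", "r4", "r5", "r6"]}
new = {"mirror": ["m1"],
       "ipv4-mac": ["a", "b", "c", "d", "e"],
       "ipv6": ["n%d" % i for i in range(7)]}
capacity = {"mirror": 4, "ipv4-mac": 10, "ipv6": 10}  # hypothetical entry counts

plan = plan_update(old, new, capacity)
for table, outcome in plan:
    print(table, outcome)
```

In this example the ipv6 table falls back to the regular nonatomic update because seven new rules do not fit in the four free entries left beside the six old rules.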
To always reload switchd with nonatomic updates:
Edit /etc/cumulus/switchd.conf.
Add the following line to the file:
acl.non_atomic_update_mode = TRUE
Reload switchd with the sudo systemctl reload switchd.service command for the changes to take effect. The reload does not interrupt network services.
During regular non-incremental nonatomic updates, traffic stops, then continues after all the new configuration is in the hardware.
Use iptables, ip6tables, and ebtables Directly
Do not use iptables, ip6tables, or ebtables directly; rules that you install this way apply only to the Linux kernel and Cumulus Linux does not hardware accelerate them. When you run cl-acltool -i, Cumulus Linux resets all rules and deletes anything that is not in /etc/cumulus/acl/policy.conf.
For example, the following rule appears to work:
cumulus@switch:~$ sudo iptables -A INPUT -p icmp --icmp-type echo-request -j DROP
The cl-acltool -L command shows the rule:
cumulus@switch:~$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 72 packets, 5236 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP icmp -- any any anywhere anywhere icmp echo-request
However, Cumulus Linux does not synchronize the rule to hardware. Running cl-acltool -i or rebooting the switch removes the rule without replacing it. To ensure that Cumulus Linux hardware accelerates all rules that can be in hardware, add them to /etc/cumulus/acl/policy.conf and install them with the cl-acltool -i command.
Estimate the Number of Rules
To estimate the number of rules you can create from an ACL entry, first determine if that entry is an ingress or an egress rule. Then, determine if it is an IPv4-mac or IPv6 type rule. This determines the slice to which the rule belongs. Use the following to determine how many entries the switch uses for each type.
By default, each entry occupies one double wide entry, except if the entry is one of the following:
An entry with multiple comma-separated input interfaces splits into one rule for each input interface. For example, this entry splits into two rules:
-A FORWARD -i swp1s0,swp1s1 -p icmp -j ACCEPT
An entry with multiple comma-separated output interfaces splits into one rule for each output interface. This entry splits into two rules:
-A FORWARD -i swp+ -o swp1s0,swp1s1 -p icmp -j ACCEPT
An entry with both input and output comma-separated interfaces splits into one rule for each combination of input and output interface. This entry splits into four rules:
-A FORWARD -i swp1s0,swp1s1 -o swp1s2,swp1s3 -p icmp -j ACCEPT
An entry with multiple layer 4 port ranges splits into one rule for each range. For example, this entry splits into two rules:
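The interface cases in this list can be checked with a short calculation. The Python sketch below counts only the split caused by comma-separated -i and -o lists; it does not account for layer 4 port ranges, and it assumes the short option names are used. split_count is a hypothetical helper, not part of cl-acltool.

```python
def split_count(rule):
    """Number of rules an entry expands to from its -i and -o interface lists."""
    tokens = rule.split()
    count = 1
    for flag in ("-i", "-o"):
        if flag in tokens:
            interfaces = tokens[tokens.index(flag) + 1].split(",")
            count *= len(interfaces)
    return count


entries = [
    "-A FORWARD -i swp1s0,swp1s1 -p icmp -j ACCEPT",
    "-A FORWARD -i swp+ -o swp1s0,swp1s1 -p icmp -j ACCEPT",
    "-A FORWARD -i swp1s0,swp1s1 -o swp1s2,swp1s3 -p icmp -j ACCEPT",
]
for entry in entries:
    print(split_count(entry), entry)
```

The three entries are the ones shown in this list and produce counts of 2, 2, and 4.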
You can match on VLAN IDs on layer 2 interfaces for ingress rules. The following example matches on a VLAN and DSCP class, and sets the internal class of the packet. For extended matching on IP fields, combine this rule with ingress iptables rules.
[ebtables]
-A FORWARD -p 802_1Q --vlan-id 100 -j mark --mark-set 102
[iptables]
-A FORWARD -i swp31 -m mark --mark 102 -m dscp --dscp-class CS1 -j SETCLASS --class 2
Cumulus Linux reserves mark values between 0 and 100; for example, if you use --mark-set 10, you see an error. Use mark values between 101 and 4196.
You cannot mark multiple VLANs with the same value.
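The two constraints on mark values can be checked before you write the rules. The following Python sketch is one way to do that; check_marks and the VLAN-to-mark dictionary are hypothetical, and the range limits are the ones stated above.

```python
MARK_MIN, MARK_MAX = 101, 4196  # usable range stated above; 0 to 100 is reserved


def check_marks(vlan_to_mark):
    """Return a list of problems found in a VLAN ID to mark value mapping."""
    errors = []
    seen = {}
    for vlan, mark in sorted(vlan_to_mark.items()):
        if not MARK_MIN <= mark <= MARK_MAX:
            errors.append(f"VLAN {vlan}: mark {mark} is outside {MARK_MIN}-{MARK_MAX}")
        elif mark in seen:
            errors.append(f"VLAN {vlan}: mark {mark} is already used by VLAN {seen[mark]}")
        else:
            seen[mark] = vlan
    return errors


print(check_marks({100: 102}))            # no problems
print(check_marks({100: 10}))             # reserved value
print(check_marks({100: 102, 200: 102}))  # same mark on two VLANs
```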
Install and Manage ACL Rules with NVUE
Instead of crafting a rule by hand, then installing it with cl-acltool, you can use NVUE commands. Cumulus Linux converts the commands to the /etc/cumulus/acl/policy.d/50_nvue.rules file. The rules you create with NVUE are independent of the default files /etc/cumulus/acl/policy.d/00control_plane.rules and 99control_plane_catch_all.rules.
Cumulus Linux 5.0 and later uses the -t mangle -A PREROUTING chain for ingress rules and the -t mangle -A POSTROUTING chain for egress rules instead of the -A FORWARD chain used in previous releases.
To create this rule with NVUE, follow the steps below. NVUE adds all options in the rule automatically.
Set the rule type, the matching protocol, source IP address and port, destination IP address and port, and the action. You must provide a name for the rule (EXAMPLE1 in the commands below):
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip protocol tcp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip source-ip 10.0.14.2/32
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip source-port ANY
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-ip 10.0.15.8/32
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-port ANY
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action permit
Apply the rule to an inbound or outbound interface with the nv set interface <interface> acl command.
For rules affecting the -t mangle -A PREROUTING chain (-A FORWARD in previous releases), apply the rule to an inbound or outbound interface. For example:
To see all installed rules, examine the /etc/cumulus/acl/policy.d/50_nvue.rules file:
cumulus@switch:~$ sudo cat /etc/cumulus/acl/policy.d/50_nvue.rules
[iptables]
## ACL EXAMPLE1 in dir inbound on interface swp1 ##
-t mangle -A PREROUTING -i swp1 -s 10.0.14.2/32 -d 10.0.15.8/32 -p tcp -j ACCEPT
...
To remove this rule, run the nv unset acl <acl-name> and nv unset interface <interface> acl <acl-name> commands. These commands delete the rule from the /etc/cumulus/acl/policy.d/50_nvue.rules file.
To show ACL statistics per interface, such as the total number of bytes that match the ACL rule, run the nv show interface <interface-id> acl <acl-id> statistics or nv show interface <interface-id> acl <acl-id> statistics <rule-id> command.
To see the list of all NVUE ACL commands, run the nv list-commands acl command.
Install and Manage ACL Rules with cl-acltool
You can manage Cumulus Linux ACLs with cl-acltool. Rules write first to the iptables chains, as described above, and then synchronize to hardware through switchd.
To examine the current state of chains and list all installed rules, run:
cumulus@switch:~$ sudo cl-acltool -L all
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 90 packets, 14456 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP all -- swp+ any 240.0.0.0/5 anywhere
0 0 DROP all -- swp+ any loopback/8 anywhere
0 0 DROP all -- swp+ any base-address.mcast.net/8 anywhere
0 0 DROP all -- swp+ any 255.255.255.255 anywhere ...
To list installed rules using native iptables, ip6tables and ebtables, use the -L option with the respective commands:
If the install fails, ACL rules in the kernel and hardware roll back to the previous state. You also see errors from programming rules in the kernel or ASIC.
Install Packet Filtering (ACL) Rules
cl-acltool takes access control list (ACL) rule input in files. Each ACL policy file includes iptables, ip6tables and ebtables categories under the tags [iptables], [ip6tables] and [ebtables]. You must assign each rule in an ACL policy to one of the rule categories.
See man cl-acltool(5) for ACL rule details. For iptables rule syntax, see man iptables(8). For ip6tables rule syntax, see man ip6tables(8). For ebtables rule syntax, see man ebtables(8).
See man cl-acltool(5) and man cl-acltool(8) for more details on using cl-acltool.
By default:
ACL policy files are in /etc/cumulus/acl/policy.d/.
All *.rules files in the /etc/cumulus/acl/policy.d/ directory are included in /etc/cumulus/acl/policy.conf.
All files in the policy.conf file install when the switch boots up.
The policy.conf file expects rule files to have a .rules suffix as part of the file name.
Here is an example ACL policy file:
[iptables]
-A INPUT -i swp1 -p tcp --dport 80 -j ACCEPT
-A FORWARD -i swp1 -p tcp --dport 80 -j ACCEPT
[ip6tables]
-A INPUT -i swp1 -p tcp --dport 80 -j ACCEPT
-A FORWARD -i swp1 -p tcp --dport 80 -j ACCEPT
[ebtables]
-A INPUT -p IPv4 -j ACCEPT
-A FORWARD -p IPv4 -j ACCEPT
You can use wildcards or variables to specify chain and interface lists.
You can only use swp+ and bond+ as wildcard names.
swp+ rules apply as an aggregate, not per port. If you want to apply per port policing, specify a specific port instead of the wildcard.
You can write ACL rules for the system into multiple files under the default /etc/cumulus/acl/policy.d/ directory. The ordering of rules during installation follows the sort order of the files according to their file names.
Use multiple files to stack rules. The example below shows two rule files that separate rules for management and datapath traffic:
cumulus@switch:~$ ls /etc/cumulus/acl/policy.d/
00sample_mgmt.rules 01sample_datapath.rules
cumulus@switch:~$ cat /etc/cumulus/acl/policy.d/00sample_mgmt.rules
INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT
[iptables]
# protect the switch management
-A $INGRESS_CHAIN -i $INGRESS_INTF -s 10.0.14.2 -d 10.0.15.8 -p tcp -j ACCEPT
-A $INGRESS_CHAIN -i $INGRESS_INTF -s 10.0.11.2 -d 10.0.12.8 -p tcp -j ACCEPT
-A $INGRESS_CHAIN -i $INGRESS_INTF -d 10.0.16.8 -p udp -j DROP
cumulus@switch:~$ cat /etc/cumulus/acl/policy.d/01sample_datapath.rules
INGRESS_INTF = swp+
INGRESS_CHAIN = INPUT, FORWARD
[iptables]
-A $INGRESS_CHAIN -i $INGRESS_INTF -s 192.0.2.5 -p icmp -j ACCEPT
-A $INGRESS_CHAIN -i $INGRESS_INTF -s 192.0.2.6 -d 192.0.2.4 -j DROP
-A $INGRESS_CHAIN -i $INGRESS_INTF -s 192.0.2.2 -d 192.0.2.8 -j DROP
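Because installation order follows the file names, it helps to preview how a set of names sorts before choosing a numeric prefix. The Python sketch below assumes a plain lexicographic sort, which may differ in detail from the ordering cl-acltool applies; install_order is a hypothetical helper and notes.txt is an invented file that shows files without the .rules suffix being skipped.

```python
def install_order(names):
    """Return the .rules files in the order a plain lexicographic sort gives."""
    return sorted(name for name in names if name.endswith(".rules"))


files = [
    "99control_plane_catch_all.rules",
    "50_nvue.rules",
    "01sample_datapath.rules",
    "00sample_mgmt.rules",
    "00control_plane.rules",
    "notes.txt",  # hypothetical file without the .rules suffix
]

order = install_order(files)
for name in order:
    print(name)
```

With these names, the default control plane rules sort first, the two sample files follow, the NVUE rules come next, and the catch-all rules sort last.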
Apply all rules and policies included in /etc/cumulus/acl/policy.conf:
cumulus@switch:~$ sudo cl-acltool -i
Specify the Policy Files to Install
By default, Cumulus Linux installs any .rules file you configure in /etc/cumulus/acl/policy.d/. To add other policy files to an ACL, you need to include them in /etc/cumulus/acl/policy.conf. For example, for Cumulus Linux to install a rule in a policy file called 01_new.datapathacl, add include /etc/cumulus/acl/policy.d/01_new.datapathacl to policy.conf:
cumulus@switch:~$ sudo nano /etc/cumulus/acl/policy.conf
#
# This file is a master file for acl policy file inclusion
#
# Note: This is not a file where you list acl rules.
#
# This file can contain:
# - include lines with acl policy files
# example:
# include <filepath>
#
# see manpage cl-acltool(5) and cl-acltool(8) for how to write policy files
#
include /etc/cumulus/acl/policy.d/01_new.datapathacl
Hardware Limitations on Number of Rules
The maximum number of rules that the hardware can process depends on:
The mix of IPv4 and IPv6 rules; Cumulus Linux does not support the maximum number of rules for both IPv4 and IPv6 simultaneously.
The number of default rules that Cumulus Linux provides.
Whether the rules apply on ingress or egress.
Whether the rules are in atomic or nonatomic mode; Cumulus Linux uses nonatomic mode rules when you enable nonatomic updates (see above).
If you exceed the maximum number of rules for a particular table, cl-acltool -i generates the following error:
error: hw sync failed (sync_acl hardware installation failed) Rolling back .. failed.
In the table below, the default rules count toward the limits listed. The raw limits below assume only one ingress and one egress table are present.
The NVIDIA Spectrum ASIC has one common TCAM for both ingress and egress, which you can use for other non-ACL-related resources. However, the number of supported rules varies with the TCAM profile for the switch.
| Profile | Atomic Mode IPv4 Rules | Atomic Mode IPv6 Rules | Nonatomic Mode IPv4 Rules | Nonatomic Mode IPv6 Rules |
| --- | --- | --- | --- | --- |
| default | 500 | 250 | 1000 | 500 |
| ipmc-heavy | 750 | 500 | 1500 | 1000 |
| acl-heavy | 1750 | 1000 | 3500 | 2000 |
| ipmc-max | 1000 | 500 | 2000 | 1000 |
| ip-acl-heavy | 6000 | 0 | 12000 | 0 |
Even though the table above specifies the ip-acl-heavy profile supports no IPv6 rules, Cumulus Linux does not prevent you from configuring IPv6 rules. However, there is no guarantee that IPv6 rules work under the ip-acl-heavy profile.
The ip-acl-heavy profile shows an updated number of supported atomic mode and nonatomic mode IPv4 rules. The previously published numbers were 7500 for atomic mode and 15000 for nonatomic mode IPv4 rules.
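As a planning aid, the published limits can be placed in a small lookup so that a proposed rule count can be compared against them. The Python sketch below uses the numbers from the table above; within_limits is a hypothetical helper. Passing this check is necessary but not sufficient, because the maximum number of IPv4 and IPv6 rules is not supported simultaneously and the default rules count toward the limits.

```python
# (IPv4 rules, IPv6 rules) per TCAM profile and update mode, from the table above.
LIMITS = {
    "default":      {"atomic": (500, 250),   "nonatomic": (1000, 500)},
    "ipmc-heavy":   {"atomic": (750, 500),   "nonatomic": (1500, 1000)},
    "acl-heavy":    {"atomic": (1750, 1000), "nonatomic": (3500, 2000)},
    "ipmc-max":     {"atomic": (1000, 500),  "nonatomic": (2000, 1000)},
    "ip-acl-heavy": {"atomic": (6000, 0),    "nonatomic": (12000, 0)},
}


def within_limits(profile, mode, ipv4_rules, ipv6_rules):
    """Return True if each count is at or below the published per-family limit."""
    max_ipv4, max_ipv6 = LIMITS[profile][mode]
    return ipv4_rules <= max_ipv4 and ipv6_rules <= max_ipv6


print(within_limits("default", "atomic", 500, 0))       # True
print(within_limits("default", "atomic", 501, 0))       # False
print(within_limits("ip-acl-heavy", "atomic", 100, 1))  # False
```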
Supported Rule Types
The iptables/ip6tables/ebtables construct tries to layer the Linux implementation on top of the underlying hardware but they are not always directly compatible. Here are the supported rules for chains in iptables, ip6tables and ebtables.
To learn more about any of the options shown in the tables below, run iptables -h [name of option]. The same help syntax works for options for ip6tables and ebtables.
root@leaf1# ebtables -h tricolorpolice
...
tricolorpolice option:
--set-color-mode STRING setting the mode in blind or aware
--set-cir INT setting committed information rate in kbits per second
--set-cbs INT setting committed burst size in kbyte
--set-pir INT setting peak information rate in kbits per second
--set-ebs INT setting excess burst size in kbyte
--set-conform-action-dscp INT setting dscp value if the action is accept for conforming packets
--set-exceed-action-dscp INT setting dscp value if the action is accept for exceeding packets
--set-violate-action STRING setting the action (accept/drop) for violating packets
--set-violate-action-dscp INT setting dscp value if the action is accept for violating packets
Supported chains for the filter table: INPUT, FORWARD, OUTPUT
Unsupported matches: Rules with input/output Ethernet interfaces do not apply; Inverse matches
Standard Targets
Supported: ACCEPT, DROP
Unsupported: RETURN, QUEUE, STOP, Fall Thru, Jump
Extended Targets
Supported: LOG (IPv4/IPv6); UID is not supported for LOG TCP SEQ, TCP options or IP options; ULOG; SETQOS; DSCP
Unique to Cumulus Linux: SPAN, ERSPAN (IPv4/IPv6), POLICE, TRICOLORPOLICE, SETCLASS
ebtables Rule Support
Rule Element
Supported
Unsupported
Matches
ether type; input interface/wildcard; output interface/wildcard; Src/Dst MAC; IP: src, dest, tos, proto, sport, dport; IPv6: tclass, icmp6: type, icmp6: code range, src/dst addr, sport, dport; 802.1p (CoS); VLAN
Rules that have no matches and accept all packets in a chain are currently ignored.
Chain default rules (that are ACCEPT) are also ignored.
Considerations
Splitting rules across the ingress TCAM and the egress TCAM causes the ingress IPv6 part of the rule to match packets going to all destinations, which can interfere with the regular expected linear rule match in a sequence. For example:
A higher rule can prevent a lower rule from matching:
Rule 1 matches all icmp6 packets to all out interfaces in the ingress TCAM.
This prevents rule 2 from matching, which is more specific but with a different out interface. Make sure to put more specific matches above more general matches even if the output interfaces are different.
When you have two rules with the same output interface, the lower rule might match depending on the presence of the previous rules.
Rule 1: -A FORWARD -o vlan100 -p icmp6 -j ACCEPT
Rule 2: -A FORWARD -o vlan101 -s 00::01 -j DROP
Rule 3: -A FORWARD -o vlan101 -p icmp6 -j ACCEPT
Rule 3 still matches for an icmp6 packet with source IP address 00::01 going out of vlan101. Rule 1 interferes with the normal function of rule 2 and/or rule 3.
When you have two adjacent rules with the same match and different output interfaces, such as:
Rule 1: -A FORWARD -o vlan100 -p icmp6 -j ACCEPT
Rule 2: -A FORWARD -o vlan101 -p icmp6 -j DROP
Rule 2 never matches on ingress. Both rules share the same mark.
Common Examples
Data Plane Policers
You can configure quality of service for traffic on the data plane. By using QoS policers, you can rate limit traffic so incoming packets get dropped if they exceed specified thresholds.
Counters on POLICE ACL rules in iptables do not show dropped packets due to those rules.
The following example rate limits the incoming traffic on swp1 to 400 packets per second with a burst of 200 packets per second:
cumulus@switch:~$ nv set acl example1 type ipv4
cumulus@switch:~$ nv set acl example1 rule 10 action police
cumulus@switch:~$ nv set acl example1 rule 10 action police mode packet
cumulus@switch:~$ nv set acl example1 rule 10 action police burst 200
cumulus@switch:~$ nv set acl example1 rule 10 action police rate 400
cumulus@switch:~$ nv set interface swp1 acl example1 inbound
cumulus@switch:~$ nv config apply
Use the POLICE target with iptables. POLICE takes these arguments:
--set-rate value specifies the maximum rate in kilobytes (KB) or packets.
--set-burst value specifies the number of packets or kilobytes (KB) allowed to arrive sequentially.
--set-mode string sets the mode in KB (kilobytes) or pkt (packets) for rate and burst size.
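To build intuition for how a rate and a burst value interact, the following Python sketch models a policer as a token bucket. The token bucket is a common way to describe rate-plus-burst policing and is used here as an assumption; this guide does not state how the hardware implements POLICE. The police function and the arrival pattern are hypothetical, and the values 400 and 200 mirror the packet-mode example above.

```python
def police(arrivals, rate, burst):
    """Return how many packets of each arrival batch a token bucket accepts.

    arrivals is a list of (time_in_seconds, packet_count) pairs sorted by time.
    rate is tokens added per second; burst is the bucket size.
    """
    tokens = float(burst)  # the bucket starts full
    last_time = 0.0
    accepted = []
    for time, count in arrivals:
        # Refill for the elapsed time, never beyond the bucket size.
        tokens = min(float(burst), tokens + (time - last_time) * rate)
        last_time = time
        passed = min(count, int(tokens))
        tokens -= passed
        accepted.append(passed)
    return accepted


# 300 packets at t=0, 300 more at t=0.25 s, then 100 at t=1 s.
result = police([(0.0, 300), (0.25, 300), (1.0, 100)], rate=400, burst=200)
print(result)  # [200, 100, 100]
```

In this model the first batch is limited by the burst size, the second by the tokens refilled in a quarter of a second, and the third passes in full.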
For example, to rate limit the incoming traffic on swp1 to 400 packets per second with a burst of 200 packets per second and set this rule in your appropriate .rules file:
You can configure quality of service for traffic on the control plane and rate limit traffic so incoming packets drop if they exceed certain thresholds in the following ways:
Run NVUE commands.
Edit the /etc/cumulus/control-plane/policers.conf file.
Cumulus Linux 5.0 and later no longer uses INPUT chain rules to configure control plane policers.
To configure control plane policers:
Set the burst rate for the trap group with the nv set system control-plane policer <trap-group> burst <value> command. The burst rate is the number of packets or kilobytes (KB) allowed to arrive sequentially.
Set the forwarding rate for the trap group with the nv set system control-plane policer <trap-group> rate <value> command. The forwarding rate is the maximum rate in kilobytes (KB) or packets.
The trap group can be: arp, bfd, pim-ospf-rip, bgp, clag, icmp-def, dhcp-ptp, igmp, ssh, icmp6-neigh, icmp6-def-mld, lacp, lldp, rpvst, eapol, ip2me, acl-log, nat, stp, l3-local, span-cpu, catch-all, or NONE.
The following example changes the PIM trap group forwarding rate to 400 packets per second and burst rate to 400 packets, and the IGMP trap group forwarding rate to 400 packets per second and burst rate to 200 packets:
cumulus@switch:~$ nv set system control-plane policer pim-ospf-rip rate 400
cumulus@switch:~$ nv set system control-plane policer pim-ospf-rip burst 400
cumulus@switch:~$ nv set system control-plane policer pim-ospf-rip state on
cumulus@switch:~$ nv set system control-plane policer igmp rate 400
cumulus@switch:~$ nv set system control-plane policer igmp burst 200
cumulus@switch:~$ nv config apply
To rate limit traffic using the /etc/cumulus/control-plane/policers.conf file:
Enable an individual policer for a trap group (set enable to TRUE).
Set the policer rate in packets per second. The forwarding rate is the maximum rate in kilobytes (KB) or packets.
Set the policer burst rate in packets. The burst rate is the number of packets or kilobytes (KB) allowed to arrive sequentially.
After you edit the /etc/cumulus/control-plane/policers.conf file, you must reload the file with the /usr/lib/cumulus/switchdctl --load /etc/cumulus/control-plane/policers.conf command.
When enable is FALSE for a trap group, the trap group and catch-all trap group have a shared policer. When enable is TRUE, Cumulus Linux creates an individual policer for the trap group.
The following example changes the PIM trap group forwarding rate to 400 packets per second and burst rate to 400 packets, and the IGMP trap group forwarding rate to 400 packets per second and burst rate to 200 packets:
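As a sketch, the corresponding entries might look like the following. The key names (copp.<trap_group>.enable, .rate, and .burst, with underscores in place of hyphens) are an assumption about the file layout, so check them against the entries already present in the file on your switch. The script writes to a scratch file, not to the live configuration.

```shell
# Sketch only: the copp.* key names are assumed, not confirmed by this guide.
CONF="$(mktemp)"

cat > "$CONF" <<'EOF'
copp.pim_ospf_rip.enable = TRUE
copp.pim_ospf_rip.rate = 400
copp.pim_ospf_rip.burst = 400
copp.igmp.enable = TRUE
copp.igmp.rate = 400
copp.igmp.burst = 200
EOF

# Print each policer setting as "<key> <value>" for a quick review.
awk -F' = ' '{ print $1, $2 }' "$CONF"
```

After you merge the values into /etc/cumulus/control-plane/policers.conf, reload the file with the switchdctl --load command described above.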
To show the control plane policer configuration and statistics, run the NVUE nv show system control-plane policer --view=statistics command.
Cumulus Linux provides default control plane policer values. You can adjust these values to accommodate higher scale requirements for specific protocols as needed.
Control Plane ACLs
You can configure control plane ACLs to apply a single rule for all packets forwarded to the CPU regardless of the source interface or destination interface on the switch. Control plane ACLs allow you to regulate traffic forwarded to applications on the switch with more granularity than traps and to configure ACLs to block SSH from specific addresses or subnets.
Cumulus Linux applies inbound control plane ACLs in the INPUT chain and outbound control plane ACLs in the OUTPUT chain.
Cumulus Linux does not support a deny all control plane rule. This type of rule blocks traffic for interprocess communication and impacts overall system functionality.
The following example command applies the inbound control plane ACL called ACL1.
cumulus@switch:~$ nv set system control-plane acl ACL1 inbound
cumulus@switch:~$ nv config apply
The following example command applies the outbound control plane ACL called ACL2.
cumulus@switch:~$ nv set system control-plane acl ACL2 outbound
cumulus@switch:~$ nv config apply
To show statistics for all control-plane ACLs, run the nv show system control-plane acl command:
cumulus@switch:~$ nv show system control-plane acl
ACL Name  Rule ID  In Packets  In Bytes  Out Packets  Out Bytes
--------  -------  ----------  --------  -----------  ---------
acl1      1        0           0         0            0
          65535    0           0         0            0
acl2      1        0           0         0            0
          65535    0           0         0            0
To show statistics for a specific control-plane ACL, run the nv show system control-plane acl <acl_name> statistics command:
cumulus@switch:~$ nv show system control-plane acl ACL1 statistics
Rule  In Packet  In Byte  Out Packet  Out Byte  Summary
----  ---------  -------  ----------  --------  ---------------------------
1     0          0 Bytes  0           0 Bytes   match.ip.dest-ip: 9.1.2.3
2     0          0 Bytes  0           0 Bytes   match.ip.source-ip: 7.8.2.3
Set DSCP on Transit Traffic
The examples here use the mangle table to modify the packet as it transits the switch. DSCP is in decimal notation in the examples below.
[iptables]
#Set SSH as high priority traffic.
-t mangle -A PREROUTING -i swp+ -p tcp -m multiport --dports 22 -j SETQOS --set-dscp 46
#Set everything coming in swp1 as AF13
-t mangle -A PREROUTING -i swp1 -j SETQOS --set-dscp 14
#Set packets destined for 10.0.100.27 as best effort
-t mangle -A PREROUTING -i swp+ -d 10.0.100.27/32 -j SETQOS --set-dscp 0
#Example using a range of ports for TCP traffic
-t mangle -A PREROUTING -i swp+ -s 10.0.0.17/32 -d 10.0.100.27/32 -p tcp -m multiport --sports 10000:20000 -m multiport --dports 10000:20000 -j SETQOS --set-dscp 34
Apply the rule:
cumulus@switch:~$ sudo cl-acltool -i
To set SSH as high priority traffic:
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip protocol tcp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-port 22
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action set dscp 46
cumulus@switch:~$ nv set interface swp1-48 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
To set everything coming in swp1 as AF13:
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action set dscp 14
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
To set packets destined for 10.0.100.27 as best effort:
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-ip 10.0.100.27/32
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action set dscp 0
cumulus@switch:~$ nv set interface swp1-48 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
To use a range of ports for TCP traffic:
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip protocol tcp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip source-ip 10.0.0.17/32
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip source-port 10000:20000
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-ip 10.0.100.27/32
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-port 10000:20000
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action set dscp 34
cumulus@switch:~$ nv set interface swp1-48 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
To specify all ports on the switch in NVUE (swp+ in an iptables rule), you must set the range of interfaces on the switch as in the examples above (nv set interface swp1-48). This command creates as many rules in the /etc/cumulus/acl/policy.d/50_nvue.rules file as the number of interfaces in the range you specify.
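To illustrate the scaling effect, the following sketch expands the SSH priority rule across swp1 through swp48 in the iptables syntax used earlier. It only demonstrates the one-rule-per-interface expansion; the exact text that NVUE writes to 50_nvue.rules can differ, and the expand_rule helper is hypothetical.

```shell
# Sketch only: emulate how one ACL rule bound to swp1-48 becomes 48 rules.
expand_rule() {
    first="$1"; last="$2"
    for n in $(seq "$first" "$last"); do
        echo "-t mangle -A PREROUTING -i swp$n -p tcp -m multiport --dports 22 -j SETQOS --set-dscp 46"
    done
}

# Count the generated rules: one per interface in the range.
expand_rule 1 48 | wc -l
```

A narrower interface range produces proportionally fewer rules, which matters on switches with limited ACL table space.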
Filter Specific TCP Flags
The example rule below drops ingress IPv4 TCP packets that have the SYN bit set and the RST, ACK, and FIN bits cleared. The rule applies inbound on interface swp1. After you configure this rule, you cannot establish new TCP sessions that originate from ingress port swp1. You can still establish TCP sessions that originate from any other port.
-t mangle -A PREROUTING -i swp1 -p tcp --tcp-flags ACK,SYN,FIN,RST SYN -j DROP
Apply the rule:
cumulus@switch:~$ sudo cl-acltool -i
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 20 match ip protocol tcp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 20 match ip tcp flags syn
cumulus@switch:~$ nv set acl EXAMPLE1 rule 20 match ip tcp mask rst
cumulus@switch:~$ nv set acl EXAMPLE1 rule 20 match ip tcp mask syn
cumulus@switch:~$ nv set acl EXAMPLE1 rule 20 match ip tcp mask fin
cumulus@switch:~$ nv set acl EXAMPLE1 rule 20 match ip tcp mask ack
cumulus@switch:~$ nv set acl EXAMPLE1 rule 20 action deny
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
Control Who Can SSH into the Switch
Run the following commands to control who can SSH into the switch.
In the following example, 10.10.10.1/32 is the interface IP address (or loopback IP address) of the switch and 10.255.4.0/24 can SSH into the switch.
-A INPUT -i swp+ -s 10.255.4.0/24 -d 10.10.10.1/32 -j ACCEPT
-A INPUT -i swp+ -d 10.10.10.1/32 -j DROP
Apply the rule:
cumulus@switch:~$ sudo cl-acltool -i
cumulus@switch:~$ nv set acl example2 type ipv4
cumulus@switch:~$ nv set acl example2 rule 10 match ip source-ip 10.255.4.0/24
cumulus@switch:~$ nv set acl example2 rule 10 match ip dest-ip 10.10.10.1/32
cumulus@switch:~$ nv set acl example2 rule 10 action permit
cumulus@switch:~$ nv set acl example2 rule 20 match ip source-ip ANY
cumulus@switch:~$ nv set acl example2 rule 20 match ip dest-ip 10.10.10.1/32
cumulus@switch:~$ nv set acl example2 rule 20 action deny
cumulus@switch:~$ nv set system control-plane acl example2 inbound
cumulus@switch:~$ nv config apply
Match on ECN Bits in the TCP IP Header
ECN allows end-to-end notification of network congestion without dropping packets. You can add ECN rules to match on the ECE, CWR, and ECT flags in the TCP IPv4 header.
By default, ECN rules match a packet with the bit set. You can reverse the match by using an exclamation point (!).
Match on the ECE Bit
After an endpoint receives a packet with the CE bit set by a router, it sets the ECE bit in the returning ACK packet to notify the other endpoint that it needs to slow down.
To match on the ECE bit:
Create a rules file in the /etc/cumulus/acl/policy.d directory and add the following rule under [iptables]:
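A minimal sketch of such a rule follows. It uses the --ecn-tcp-ece option of the standard iptables ecn match and assumes ingress on swp1 with an ACCEPT action. The file name is hypothetical, and the script writes to a scratch directory instead of /etc/cumulus/acl/policy.d.

```shell
# Sketch only: match TCP packets with the ECE flag set, inbound on swp1.
RULES_DIR="$(mktemp -d)"
RULES_FILE="$RULES_DIR/60_ecn_ece.rules"

cat > "$RULES_FILE" <<'EOF'
[iptables]
-t mangle -A PREROUTING -i swp1 -p tcp -m ecn --ecn-tcp-ece -j ACCEPT
EOF

cat "$RULES_FILE"
```

To reverse the match, iptables accepts ! before the option (-m ecn ! --ecn-tcp-ece). Install the file with sudo cl-acltool -i after you copy it into place.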
cumulus@switch:~$ nv set acl example2 type ipv4
cumulus@switch:~$ nv set acl example2 rule 10 match ip protocol tcp
cumulus@switch:~$ nv set acl example2 rule 10 match ip ecn flags tcp-ece
cumulus@switch:~$ nv set acl example2 rule 10 action permit
cumulus@switch:~$ nv set interface swp1 acl example2 inbound
cumulus@switch:~$ nv config apply
Match on the ECT Bit
The ECT codepoints negotiate if the connection is ECN capable by setting one of the two bits to 1. Routers also use the ECT bit to indicate that they are experiencing congestion by setting both the ECT codepoints to 1.
To match on the ECT bit:
Create a rules file in the /etc/cumulus/acl/policy.d directory and add the following rule under [iptables]:
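A minimal sketch of such a rule follows. It uses the --ecn-ip-ect option of the standard iptables ecn match with codepoint 1 and assumes ingress on swp1 with an ACCEPT action. The file name is hypothetical, and the script writes to a scratch directory.

```shell
# Sketch only: match packets whose IP ECT codepoint is 1, inbound on swp1.
RULES_DIR="$(mktemp -d)"
RULES_FILE="$RULES_DIR/60_ecn_ect.rules"

cat > "$RULES_FILE" <<'EOF'
[iptables]
-t mangle -A PREROUTING -i swp1 -p tcp -m ecn --ecn-ip-ect 1 -j ACCEPT
EOF

cat "$RULES_FILE"
```

Install the file with sudo cl-acltool -i after you copy it into /etc/cumulus/acl/policy.d/.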
cumulus@switch:~$ nv set acl example2 type ipv4
cumulus@switch:~$ nv set acl example2 rule 10 match ip protocol tcp
cumulus@switch:~$ nv set acl example2 rule 10 match ip ecn ip-ect 1
cumulus@switch:~$ nv set acl example2 rule 10 action permit
cumulus@switch:~$ nv set interface swp1 acl example2 inbound
cumulus@switch:~$ nv config apply
Example Configuration
The following example demonstrates how Cumulus Linux applies several different rules.
Egress Rule
The following rule blocks any TCP traffic with destination port 200 going through leaf01 to server01 (rule 1 in the diagram above).
[iptables]
-t mangle -A POSTROUTING -o swp1 -p tcp -m multiport --dports 200 -j DROP
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip protocol tcp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-port 200
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action deny
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 outbound
cumulus@switch:~$ nv config apply
Ingress Rule
The following rule blocks any UDP traffic with source port 200 going from server01 through leaf01 (rule 2 in the diagram above).
[iptables]
-t mangle -A PREROUTING -i swp1 -p udp -m multiport --sports 200 -j DROP
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip protocol udp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip source-port 200
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action deny
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
Input Rule
The following rule blocks any UDP traffic with destination port 50 going from server02 to the leaf02 control plane (rule 3 in the diagram above).
[iptables]
-A INPUT -i swp2 -p udp -m multiport --dports 50 -j DROP
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip protocol udp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-port 50
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action deny
cumulus@switch:~$ nv set interface swp2 acl EXAMPLE1 inbound control-plane
cumulus@switch:~$ nv config apply
Output Rule
The following rule blocks any TCP traffic with source port 123 and destination port 123 going from leaf02 to server02 (rule 4 in the diagram above).
[iptables]
-A OUTPUT -o swp2 -p tcp -m multiport --sports 123 -m multiport --dports 123 -j DROP
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip protocol tcp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip source-port 123
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 match ip dest-port 123
cumulus@switch:~$ nv set acl EXAMPLE1 rule 10 action deny
cumulus@switch:~$ nv set interface swp2 acl EXAMPLE1 outbound control-plane
cumulus@switch:~$ nv config apply
Layer 2 Rules (ebtables)
The following rule blocks any traffic with source MAC address 00:00:00:00:00:12 and destination MAC address 08:9e:01:ce:e2:04 on any switch port, ingress or egress.
[ebtables]
-A FORWARD -s 00:00:00:00:00:12 -d 08:9e:01:ce:e2:04 -j DROP
cumulus@switch:~$ nv set acl EXAMPLE type mac
cumulus@switch:~$ nv set acl EXAMPLE rule 10 match mac source-mac 00:00:00:00:00:12
cumulus@switch:~$ nv set acl EXAMPLE rule 10 match mac dest-mac 08:9e:01:ce:e2:04
cumulus@switch:~$ nv set acl EXAMPLE rule 10 action deny
cumulus@switch:~$ nv set interface swp1-48 acl EXAMPLE inbound
cumulus@switch:~$ nv config apply
Considerations
Not All Rules Supported
Cumulus Linux does not support all iptables, ip6tables, or ebtables rules. Refer to Supported Rules for specific rule support.
ACL Log Policer Limits Traffic
To protect the CPU from overloading, Cumulus Linux uses an ACL log policer to limit traffic copied to the CPU to 1 packet per second.
Bridge Traffic Limitations
Bridge traffic that matches LOG ACTION rules does not log to syslog; the kernel and hardware identify packets using different information.
You Cannot Forward Log Actions
You cannot forward logged packets. The hardware cannot both forward a packet and send the packet to the control plane (or kernel) for logging. A log action must also have a drop action.
SPAN Sessions that Reference an Outgoing Interface
Because Cumulus Linux is a Linux operating system, you can use the iptables commands. However, consider using cl-acltool instead for the following reasons:
Without using cl-acltool, rules do not install into hardware.
Running cl-acltool -i (the installation command) resets all rules and deletes anything that is not in the /etc/cumulus/acl/policy.conf file.
For example, running the following command works:
cumulus@switch:~$ sudo iptables -A INPUT -p icmp --icmp-type echo-request -j DROP
The rules appear when you run cl-acltool -L:
cumulus@switch:~$ sudo cl-acltool -L ip
-------------------------------
Listing rules of type iptables:
-------------------------------
TABLE filter :
Chain INPUT (policy ACCEPT 72 packets, 5236 bytes)
pkts bytes target prot opt in out source destination
0 0 DROP icmp -- any any anywhere anywhere icmp echo-request
However, running cl-acltool -i or rebooting the switch removes them. To ensure that Cumulus Linux can hardware accelerate all rules that can be in hardware, place them in the /etc/cumulus/acl/policy.conf file, then run cl-acltool -i.
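One way to keep the example rule across a cl-acltool -i run or a reboot is to put it in a rules file instead of running iptables directly. The sketch below assumes that your policy.conf picks up rules files from the /etc/cumulus/acl/policy.d directory and uses a hypothetical file name; it writes to a scratch directory first.

```shell
# Sketch only: the same ICMP echo-request drop rule, expressed as a rules file.
RULES_DIR="$(mktemp -d)"
RULES_FILE="$RULES_DIR/70_drop_icmp_echo.rules"

cat > "$RULES_FILE" <<'EOF'
[iptables]
-A INPUT -p icmp --icmp-type echo-request -j DROP
EOF

cat "$RULES_FILE"
```

After you copy the file into place, run sudo cl-acltool -i to install it, then sudo cl-acltool -L ip to confirm that the rule is listed.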
Where to Assign Rules
If you assign a switch port to a bond, you must assign any egress rules to the bond.
When using the OUTPUT chain, you must assign rules to the source. For example, if you assign a rule to the switch port in the direction of traffic but the source is a bridge (VLAN), the rule does not affect the traffic and you must apply the rule to the bridge.
If you need to apply a rule to all transit traffic, use the FORWARD chain, not the OUTPUT chain.
ACL Rule Installation Failure
After an ACL rule installation failure, you see a generic error message like the following:
cumulus@switch:$ sudo cl-acltool -i -p 00control_plane.rules
Using user provided rule file 00control_plane.rules
Reading rule file 00control_plane.rules ...
Processing rules in file 00control_plane.rules ...
error: hw sync failed (sync_acl hardware installation failed)
Installing acl policy... Rolling back ..
failed.
ACLs Do not Match when the Output Port on the ACL is a Subinterface
The ACL does not match on packets when you configure a subinterface as the output port. The ACL matches on packets only if the primary port is the output port. If a subinterface is an output or egress port, the packets match correctly.
For example:
-A FORWARD -o swp49s1.100 -j ACCEPT
Egress ACL Matching on Bonds
Cumulus Linux does not support ACL rules that match on an outbound bond interface. For example, you cannot create the following rule:
[iptables]
-A FORWARD -o <bond_intf> -j DROP
To work around this issue, duplicate the ACL rule on each physical port of the bond. For example:
[iptables]
-A FORWARD -o <bond-member-port-1> -j DROP
-A FORWARD -o <bond-member-port-2> -j DROP
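The duplication can be scripted when a bond has many members. The following sketch generates one rule per member port; the member names swp1 and swp2 and the bond_egress_rules helper are hypothetical.

```shell
# Sketch only: duplicate an egress rule across the member ports of a bond.
# The member port names are hypothetical; list the members of your own bond.
BOND_MEMBERS="swp1 swp2"

bond_egress_rules() {
    echo "[iptables]"
    for port in $1; do
        echo "-A FORWARD -o $port -j DROP"
    done
}

bond_egress_rules "$BOND_MEMBERS"
```

Redirect the output to a rules file and install it with sudo cl-acltool -i. Remember to regenerate the file when the bond membership changes.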
SSH Traffic to the Management VRF
To allow SSH traffic to the management VRF, use -i mgmt, not -i eth0. For example:
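A rule of this kind might look like the following sketch. The source address 10.0.14.2/32 and the file name are hypothetical, and the script writes to a scratch directory instead of /etc/cumulus/acl/policy.d.

```shell
# Sketch only: permit SSH (TCP port 22) that arrives on the management VRF.
# The source address is a placeholder for your own management host or subnet.
RULES_DIR="$(mktemp -d)"
RULES_FILE="$RULES_DIR/60_mgmt_ssh.rules"

cat > "$RULES_FILE" <<'EOF'
[iptables]
-A INPUT -i mgmt -s 10.0.14.2/32 -p tcp --dport 22 -j ACCEPT
EOF

cat "$RULES_FILE"
```

The rule matches on the VRF device (mgmt), not on the physical eth0 port.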
In INPUT chain rules, the -i swp+ match works only if the destination of the packet is towards a layer 3 swp interface; the match does not work if the packet terminates at an SVI interface (for example, vlan10). To allow traffic towards specific SVIs, use rules without any interface match or rules with individual -i <SVI> matches.
Services (also known as daemons) and processes are at the heart of how a Linux system functions. Most of the time, a service takes care of itself; you just enable and start it, then let it run. However, because a Cumulus Linux switch is a Linux system, you can dig deeper if you like. Services can start multiple processes as they run. Services are important to monitor on a Cumulus Linux switch.
You manage services in Cumulus Linux in the following ways:
Identify all active or stopped services
Identify boot time state of a specific service
Disable or enable a specific service
Identify active listener ports
systemd and the systemctl Command
You manage services using systemd with the systemctl command. You run the systemctl command with any service on the switch to start, stop, restart, reload, enable, disable, reenable, or get the status of the service.
systemctl has commands that perform a specific operation on a given service:
status returns the status of the specified service.
start starts the service.
stop stops the service.
restart stops, then starts the service, all the while maintaining state. If there are dependent services or services that mark the restarted service as Required, the other services also restart. For example, running systemctl restart frr.service restarts any of the routing protocol services that you enable and that are running, such as bgpd or ospfd.
reload reloads the configuration for the service.
enable enables the service to start when the system boots, but does not start it unless you use the systemctl start SERVICENAME.service command or reboot the switch.
disable disables the service, but does not stop it unless you use the systemctl stop SERVICENAME.service command or reboot the switch. You can start or stop a disabled service.
reenable disables, then enables a service. Run this command so that any new Wants or WantedBy lines create the symlinks necessary for ordering. This has no side effects on other services.
You do not need to interact with the services directly using these commands. If a critical service crashes or encounters an error, systemd restarts it automatically. systemd is the caretaker of services in modern Linux systems and responsible for starting all the necessary services at boot time.
Ensure a Service Starts after Multiple Restarts
By default, systemd tries to restart a particular service only a certain number of times within a given interval before the service fails to start. The settings StartLimitInterval (which defaults to 10 seconds) and StartLimitBurst (which defaults to 5 attempts) are in the service script; however, certain services override these defaults, sometimes with much longer times. For example, switchd.service sets StartLimitInterval=10m and StartLimitBurst=3; therefore, if you restart switchd more than three times in ten minutes, it does not start.
When the restart fails for this reason, you see a message similar to the following:
Job for switchd.service failed. See 'systemctl status switchd.service' and 'journalctl -xn' for details.
systemctl status switchd.service shows output similar to:
Active: failed (Result: start-limit) since Thu 2016-04-07 21:55:14 UTC; 15s ago
To clear this error, run systemctl reset-failed switchd.service. If you know you are going to restart frequently (multiple times within the StartLimitInterval), you can run the same command before you issue the restart request. This also applies to stop followed by start.
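The reset-then-restart sequence can be wrapped in a small helper, sketched below. The restart_with_reset function and the SYSTEMCTL override are hypothetical conveniences; setting SYSTEMCTL=echo prints the commands instead of running them.

```shell
# Sketch only: clear the start-limit counter for a unit, then restart it.
# SYSTEMCTL is a hypothetical override so that you can dry-run the helper.
restart_with_reset() {
    unit="$1"
    ${SYSTEMCTL:-sudo systemctl} reset-failed "$unit" &&
    ${SYSTEMCTL:-sudo systemctl} restart "$unit"
}

# Dry run: print the two commands instead of executing them.
SYSTEMCTL=echo restart_with_reset switchd.service
```

Without the override, the helper runs sudo systemctl reset-failed followed by sudo systemctl restart for the unit you name.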
Keep systemd Services from Hanging after Starting
If you start, restart, or reload a systemd service that you can start from another systemd service, you must use the --no-block option with systemctl.
Identify Active Listener Ports for IPv4 and IPv6
You can identify the active listener ports under both IPv4 and IPv6 using the netstat command:
To see active or stopped services, run the cl-service-summary command:
cumulus@switch:~$ cl-service-summary
Service cron               enabled    active
Service ssh                enabled    active
Service syslog             enabled    active
Service asic-monitor       enabled    inactive
Service clagd              enabled    inactive
Service cumulus-poe                   inactive
Service lldpd              enabled    active
Service mstpd              enabled    active
Service neighmgrd          enabled    active
Service nvued              enabled    active
Service netq-agent         enabled    active
Service ntp                enabled    active
Service ptmd               enabled    active
Service pwmd               enabled    active
Service smond              enabled    active
Service switchd            enabled    active
Service sysmonitor         enabled    active
Service rdnbrd             disabled   inactive
Service frr                enabled    inactive
...
You can also run the systemctl list-unit-files --type service command to list all services on the switch and to see their status:
The switchd service enables the switch to communicate with Cumulus Linux and all the applications running on Cumulus Linux.
Configure switchd Settings
You can control certain options associated with the switchd process. For example, you can set polling intervals, optimize ACL hardware resources for better utilization, configure log message levels, set the internal VLAN range, and configure VXLAN encapsulation and decapsulation.
To configure switchd options, you either run NVUE commands or manually edit the /etc/cumulus/switchd.conf file.
NVUE currently only supports a subset of the switchd configuration available in the /etc/cumulus/switchd.conf file.
You can run NVUE commands to set the following switchd options:
The statistic polling interval for physical interfaces and for logical interfaces.
For physical interfaces, you can specify a value between 1 and 10. The default setting is 2 seconds.
For logical interfaces, you can specify a value between 1 and 30. The default setting is 5 seconds.
A low setting, such as 1, might affect system performance.
The log level to debug the data plane programming related code. You can specify debug, info, notice, warning, or error. The default setting is info. NVIDIA recommends that you do not set the log level to debug in a production environment.
The DSCP action and value for encapsulation. You can set the DSCP action to copy (to copy the value from the IP header of the packet), set (to specify a specific value), or derive (to obtain the value from the switch priority). The default action is derive. Only specify a value if the action is set.
The DSCP action for decapsulation in VXLAN outer headers. You can specify copy (to copy the value from the IP header of the packet), preserve (to keep the inner DSCP value), or derive (to obtain the value from the switch priority). The default action is derive.
The preference between a route and neighbor with the same IP address and mask. You can specify route, neighbor, or route-and-neighbour. The default setting is route.
The ACL mode (atomic or non-atomic). The default setting is atomic.
The reserved VLAN range. The default setting is 3725-3999.
Certain switchd settings require a switchd restart or reload. Before applying the settings, NVUE indicates if it requires a switchd restart or reload and prompts you for confirmation.
When the switchd service restarts, in addition to resetting the switch hardware configuration, all network ports reset.
When the switchd service reloads, there is no interruption to network services.
The following command example sets the statistic polling interval for both logical interfaces and physical interfaces to 6 seconds:
cumulus@switch:~$ nv set system counter polling-interval logical-interface 6
cumulus@switch:~$ nv set system counter polling-interval physical-interface 6
cumulus@switch:~$ nv config apply
The following command example sets the log level for debugging the data plane programming related code to warning:
cumulus@switch:~$ nv set system forwarding programming log-level warning
cumulus@switch:~$ nv config apply
The following command example sets the DSCP action for encapsulation in VXLAN outer headers to set and the value to af12:
cumulus@switch:~$ nv set nve vxlan encapsulation dscp action set
cumulus@switch:~$ nv set nve vxlan encapsulation dscp value af12
cumulus@switch:~$ nv config apply
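The following command example sets the DSCP action for decapsulation in VXLAN outer headers to preserve:
cumulus@switch:~$ nv set nve vxlan decapsulation dscp action preserve
cumulus@switch:~$ nv config apply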
The following command example sets the route or neighbour preference to both route and neighbour:
cumulus@switch:~$ nv set system forwarding host-route-preference route-and-neighbour
cumulus@switch:~$ nv config apply
The following command example sets the ACL mode to non-atomic:
cumulus@switch:~$ nv set system acl mode non-atomic
cumulus@switch:~$ nv config apply
The following command example sets the reserved VLAN range between 4064 and 4094:
cumulus@switch:~$ nv set system global reserved vlan internal range 4064-4094
cumulus@switch:~$ nv config apply
To configure the switchd parameters, edit the /etc/cumulus/switchd.conf file. Change the setting and uncomment the line if needed. The switchd.conf file contains comments with a description for each setting.
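The edit can also be scripted. The following sketch uncomments a setting and changes its value in a scratch copy; the sample lines only mimic a commented-out default, so the layout of the real file may differ, and the set_switchd_option helper is hypothetical.

```shell
# Sketch only: uncomment a setting and change its value in a scratch copy.
# The sample lines mimic a commented-out default; the real file may differ.
CONF="$(mktemp)"
printf '%s\n' '# Statistic poll interval (in msec)' '#stats.poll_interval = 2000' > "$CONF"

set_switchd_option() {
    key="$1"; value="$2"; file="$3"
    # Match the key with or without a leading "#", then rewrite the whole line.
    # A backup of the original file is kept next to it with a .bak suffix.
    sed -i.bak -E "s|^#?[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file"
}

set_switchd_option stats.poll_interval 3000 "$CONF"
cat "$CONF"
```

After you change the real /etc/cumulus/switchd.conf file, restart or reload switchd as required for that parameter.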
The following example shows the first few lines of the /etc/cumulus/switchd.conf file.
The following table describes the /etc/cumulus/switchd.conf file parameters and indicates if you need to restart switchd with the sudo systemctl restart switchd.service command or reload switchd with the sudo systemctl reload switchd.service command for changes to take effect when you update the setting.
Restarting the switchd service causes all network ports to reset in addition to resetting the switch hardware configuration.
Parameter
Description
switchd reload or restart
stats.poll_interval
The statistics polling interval in milliseconds. The default setting is 2000.
restart
buf_util.poll_interval
The buffer utilization polling interval in milliseconds. 0 disables buffer utilization polling. The default setting is 0.
restart
buf_util.measure_interval
The buffer utilization measurement interval in minutes. The default setting is 0.
restart
acl.optimize_hw
Optimizes ACL hardware resources for better utilization. The default setting is FALSE.
restart
acl.flow_based_mirroring
Enables flow-based mirroring. The default setting is TRUE.
restart
acl.non_atomic_update_mode
Enables non-atomic ACL updates. The default setting is FALSE.
reload
arp.next_hops
Sends ARPs for next hops. The default setting is TRUE.
restart
route.table
The kernel routing table ID. The range is between 1 and 2^31. The default is 254.
restart
route.host_max_percent
The maximum neighbor table occupancy in hardware (a percentage of the hardware table size). The default setting is 100.
restart
coalescing.reducer
The coalescing reduction factor for accumulating changes to reduce CPU load. The default setting is 1.
restart
coalescing.timeout
The coalescing time limit in seconds. The default setting is 10.
restart
ignore_non_swps
Ignore routes that point to non-swp interfaces. The default setting is TRUE.
restart
disable_internal_parity_restart
Disables restart after a parity error. The default setting is TRUE.
restart
disable_internal_hw_err_restart
Disables restart after an unrecoverable hardware error. The default setting is FALSE.
restart
nat.static_enable
Enables static NAT. The default setting is TRUE.
restart
nat.dynamic_enable
Enables dynamic NAT. The default setting is TRUE.
restart
nat.age_poll_interval
The NAT age polling interval in minutes. The minimum is 1 minute and the maximum is 24 hours. You can configure this setting only when nat.dynamic_enable is set to TRUE. The default setting is 5.
restart
nat.table_size
The NAT table size limit in number of entries. You can configure this setting only when nat.dynamic_enable is set to TRUE. The default setting is 1024.
restart
nat.config_table_size
The NAT configuration table size limit in number of entries. You can configure this setting only when nat.dynamic_enable is set to TRUE. The default setting is 64.
restart
logging
Configures logging in the format BACKEND=LEVEL. Separate multiple BACKEND=LEVEL pairs with a space. The BACKEND value can be stderr, file:filename, syslog, or program:executable. The LEVEL value can be CRIT, ERR, WARN, INFO, or DEBUG. The default value is syslog=INFO.
restart
interface.swp1.storm_control.broadcast
Enables broadcast storm control and sets the number of packets per second (pps). The default setting is 400.
reload
interface.swp1.storm_control.multicast
Enables multicast storm control and sets the number of packets per second (pps). The default setting is 3000.
reload
interface.swp1.storm_control.unknown_unicast
Enables unknown unicast storm control and sets the number of packets per second (pps). The default setting is 2000.
reload
stats.vlan.aggregate
Enables hardware statistics for VLANs and specifies the type of statistics needed. You can specify NONE, BRIEF, or DETAIL. The default setting is BRIEF.
restart
stats.vxlan.aggregate
Enables hardware statistics for VXLANs and specifies the type of statistics needed. You can specify NONE, BRIEF, or DETAIL. The default setting is DETAIL.
restart
stats.vxlan.member
Enables hardware statistics for VXLAN members and specifies the type of statistics needed. You can specify NONE, BRIEF, or DETAIL. The default setting is BRIEF.
restart
stats.vlan.show_internal_vlans
Shows internal VLANs. The default setting is FALSE.
restart
stats.vdev_hw_poll_interval
The polling interval in seconds for virtual device hardware statistics. The default setting is 5.
restart
resv_vlan_range
The internal VLAN range. The default setting is 3725-3999.
restart
netlink.buf_size
The netlink socket buffer size. The default setting is 136314880 bytes (130 MB).
restart
route.delete_dead_routes
Delete routes on interfaces when the carrier is down. The default setting is TRUE.
restart
vxlan.default_ttl
The default TTL to use in VXLAN headers. The default setting is 64.
restart
bridge.broadcast_frame_to_cpu
Enables bridge broadcast frames to the CPU even if the SVI is not enabled. The default setting is FALSE.
restart
bridge.unreg_mcast_init
Initialize the prune module for IGMP snooping unregistered layer 2 multicast flood control. The default setting is FALSE.
restart
bridge.unreg_v4_mcast_prune
Enables unregistered layer 2 multicast prune to mrouter ports (IPv4). The default setting is FALSE (flood unregistered layer 2 multicast traffic).
restart
bridge.unreg_v6_mcast_prune
Enables unregistered layer 2 multicast prune to mrouter ports (IPv6). The default setting is FALSE (flood unregistered layer 2 multicast traffic).
restart
netlink.nl_logger
The netlink libnl logger setting. You can specify a value from 0 through 5. The default setting is 0.
restart
vxlan.def_encap_dscp_action
Sets the default VXLAN router DSCP action during encapsulation. You can specify copy if the inner packet is IP, set to set a specific value, or derive to derive the value from the switch priority. The default setting is derive.
restart
vxlan.def_encap_dscp_value
Sets the default VXLAN encapsulation DSCP value if the action is set.
restart
vxlan.def_decap_dscp_action
Sets the default VXLAN router DSCP action during decapsulation. You can specify copy if the inner packet is IP, preserve to preserve the inner DSCP value, or derive to derive the value from the switch priority. The default setting is derive.
restart
ipmulticast.unknown_ipmc_to_cpu
Enables sending unknown IPMC to the CPU. The default setting is FALSE.
restart
vrf_route_leak_enable_dynamic
Enables dynamic VRF route leaking. The default setting is FALSE.
restart
sync_queue_depth_val
The event queue depth. The default setting is 50000.
restart
route.route_preferred_over_neigh
Sets the preference between a route and neighbor with the same IP address and mask. You can specify TRUE to prefer the route over the neighbor, FALSE to prefer the neighbor over the route, or BOTH to install both the route and neighbor. The default setting is TRUE.
restart
evpn.multihoming.enable
Enables EVPN multihoming. The default setting is TRUE.
restart
evpn.multihoming.shared_l2_groups
Enables sharing for layer 2 next hop groups. The default setting is FALSE.
restart
evpn.multihoming.shared_l3_groups
Enables sharing for layer 3 next hop groups. The default setting is FALSE.
restart
evpn.multihoming.fast_local_protect
Enables fast reroute for egress link protection. The default setting is FALSE.
restart
evpn.multihoming.bum_sph_filter
Sets split-horizon filtering for EVPN multihoming. You can specify TRUE to filter only BUM traffic from the Ethernet segment (ES) peer or FALSE to filter all traffic from the ES peer. The default setting is TRUE.
restart
link_flap_window
The duration in seconds during which a link must flap the number of times set in link_flap_threshold before Cumulus Linux sets the link to protodown and specifies linkflap as the reason. The default setting is 10. A value of 0 disables link flap protection.
restart
link_flap_threshold
The number of times the link must flap within the link flap window before Cumulus Linux sets the link to protodown and specifies linkflap as the reason. The default setting is 5. A value of 0 disables link flap protection.
restart
res_usage_warn_threshold
Sets the percentage of forwarding resource usage (routes, hosts, MAC addresses) above which Cumulus Linux generates a warning. You can set a value between 50 and 95. The default setting is 90.
restart
res_warn_msg_int
The time interval in seconds between resource warning messages. Cumulus Linux generates a warning message only one time per resource type in the specified interval, even if usage crosses the value set in res_usage_warn_threshold multiple times during this interval. You can set a value between 60 and 3600. The default setting is 300.
restart
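The last column of the table above indicates whether a change takes effect after a switchd reload or only after a restart. The following is a minimal Python sketch of how you might work out the strongest action needed for a batch of edits. The APPLY_ACTION mapping copies only a few rows from the table, the helper name required_action is hypothetical, and treating unlisted parameters as needing a restart is an assumption of the sketch.

```python
# Illustrative subset of the table above: parameter -> action needed to apply it.
APPLY_ACTION = {
    "interface.swp1.storm_control.multicast": "reload",
    "interface.swp1.storm_control.unknown_unicast": "reload",
    "stats.vlan.aggregate": "restart",
    "vxlan.default_ttl": "restart",
    "link_flap_window": "restart",
    "link_flap_threshold": "restart",
}


def required_action(changed_params):
    """Return 'restart' if any changed parameter needs one, else 'reload'.

    Parameters missing from APPLY_ACTION are treated conservatively as
    needing a restart (an assumption of this sketch).
    """
    actions = {APPLY_ACTION.get(name, "restart") for name in changed_params}
    if not actions:
        return None  # nothing changed, nothing to apply
    return "restart" if "restart" in actions else "reload"


print(required_action(["interface.swp1.storm_control.multicast"]))
print(required_action(["interface.swp1.storm_control.multicast", "vxlan.default_ttl"]))
```

A single restart covers every pending change, so the sketch never returns both actions.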
Show switchd Settings
You can run the following NVUE commands to show the current switchd configuration settings.
Command
Description
nv show system counter polling-interval
Shows the polling interval for physical and logical interface counters in seconds.
nv show system forwarding programming
Shows the log level for data plane programming logs.
nv show nve vxlan encapsulation dscp
Shows the DSCP action and value (if the action is set) for the outer header in VXLAN encapsulation.
nv show nve vxlan decapsulation dscp
Shows the DSCP action for the outer header in VXLAN decapsulation.
nv show system acl
Shows the ACL mode (atomic or non-atomic).
nv show system global reserved vlan internal
Shows the reserved VLAN range.
The following example command shows that the polling interval setting for logical interface counters is 6 seconds:
cumulus@switch:~$ nv show system counter polling-interval
applied description
----------------- ------- -----------------------------------------------------
logical-interface 0:00:06 Config polling-interval for logical interface(in sec)
The following example command shows that the log level setting for data plane programming logs is warning:
cumulus@switch:~$ nv show system forwarding programming
applied description
--------- ------- -------------------
log-level warning configure Log-level
The following example command shows that the DSCP action setting for the outer header in VXLAN encapsulation is set and the value is af12:
cumulus@switch:~$ nv show nve vxlan encapsulation dscp
operational applied description
------ ----------- ------- --------------------------------------------------
action set set DSCP encapsulation action
value af12 af12 Configured DSCP value to put in outer Vxlan packet
The following command example shows that ACL mode is atomic:
cumulus@switch:~$ nv show system acl
applied description
---- ------- -----------------------------------------
mode atomic configure Atomic or Non-Atomic ACL update
The following command example shows that the reserved VLAN range is between 4064 and 4094:
cumulus@switch:~$ nv show system global reserved vlan internal
operational applied description
----- ----------- --------- -------------------
range 4064-4094 4064-4094 Reserved Vlan range
In addition to restarting switchd when you change certain /etc/cumulus/switchd.conf file parameters manually, you also need to restart switchd whenever you modify a switchd hardware configuration file (any *.conf file that requires making a change to the switching hardware, such as /etc/cumulus/datapath/traffic.conf). You do not have to restart the switchd service when you update a network interface configuration (for example, when you edit the /etc/network/interfaces file).
Configuring a Global Proxy
You configure global HTTP and HTTPS proxies in the /etc/profile.d/ directory of Cumulus Linux. Set the http_proxy and https_proxy variables to configure the switch with the address of the proxy server you want to use to get URLs on the command line. This is useful for programs such as apt, apt-get, curl and wget, which can all use this proxy.
In a terminal, create a new file in the /etc/profile.d/ directory.
Create a file in the /etc/apt/apt.conf.d directory and add the following lines to the file to get the HTTP and HTTPS proxies. The example below uses http_proxy as the file name:
Use ISSU to upgrade and troubleshoot an active switch with minimal disruption to the network.
ISSU includes the following modes:
Restart
Upgrade
Maintenance mode
Maintenance ports
In earlier Cumulus Linux releases, ISSU was Smart System Manager.
Restart Mode
You can configure the switch to restart in one of the following modes.
cold restarts the system and resets all the hardware devices on the switch (including the switching ASIC).
fast restarts the system more efficiently with minimal impact to traffic by reloading the kernel and software stack without a hard reset of the hardware. During a fast restart, the system decouples from the network to the extent possible using existing protocol extensions before recovering to the operational mode of the system. The restart process maintains the forwarding entries of the switching ASIC and the data plane is not affected. Traffic outage is much lower in this mode; there is only a momentary interruption after reboot while the system reinitializes.
warm restarts the system with no interruption to traffic for existing route entries. Warm mode diverts traffic from itself and restarts the system without a hardware reset of the switch ASIC. While this process does not affect the data plane, the control plane is absent during restart and is unable to process routing updates. However, if no alternate paths exist, the switch continues forwarding with the existing entries with no interruptions.
When you restart the switch in warm mode, BGP only performs a graceful restart if the BGP graceful restart option is set to full. To set BGP graceful restart to full, run the nv set router bgp graceful-restart mode full command, then apply the configuration with nv config apply. For more information about BGP graceful restart, refer to Optional BGP Configuration.
Cumulus Linux supports fast mode for all protocols; however, it supports warm mode only for layer 2 forwarding, and for layer 3 forwarding with BGP and static routing.
NVIDIA recommends you use NVUE commands to configure restart mode and reboot the system. If you prefer to use csmgrctl commands, you must stop NVUE from managing the /etc/cumulus/csmgrd.conf file before you set restart mode:
Run the following NVUE commands:
cumulus@switch:~$ nv set system config apply ignore /etc/cumulus/csmgrd.conf
cumulus@switch:~$ nv config apply
Edit the /etc/cumulus/csmgrd.conf file and set the csmgrctl_override option to true:
The following command configures the switch to restart in cold mode:
cumulus@switch:~$ nv set system reboot mode cold
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo csmgrctl -c
The following command configures the switch to restart in fast mode:
cumulus@switch:~$ nv set system reboot mode fast
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo csmgrctl -f
The following command configures the switch to restart in warm mode:
cumulus@switch:~$ nv set system reboot mode warm
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo csmgrctl -w
To reboot the switch in the restart mode you configure above with NVUE:
cumulus@switch:~$ nv action reboot system no-confirm
You must specify no-confirm at the end of the command.
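If you script the restart mode from Python, the NVUE sequence above can be assembled as argument lists. This is a minimal sketch that only builds the commands shown in this section and does not run them; the function name reboot_mode_commands is hypothetical.

```python
VALID_MODES = ("cold", "fast", "warm")


def reboot_mode_commands(mode, reboot=False):
    """Build the NVUE commands that set the restart mode.

    When reboot is True, the reboot action (with the mandatory
    no-confirm keyword) is appended as the last command.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown restart mode: {mode!r}")
    commands = [
        ["nv", "set", "system", "reboot", "mode", mode],
        ["nv", "config", "apply"],
    ]
    if reboot:
        commands.append(["nv", "action", "reboot", "system", "no-confirm"])
    return commands


for command in reboot_mode_commands("warm", reboot=True):
    print(" ".join(command))
```

On a switch, you could pass each list to subprocess.run(command, check=True) so that a failed step stops the sequence before the reboot.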
To show system reboot information, such as the reboot date and time, reason, and reset mode (fast, cold, warm), run the NVUE nv show system reboot command:
cumulus@switch:~$ nv show system reboot
operational applied pending
--------- -------------------------------- ------- -------
reason
gentime 2023-04-26T15:11:23.140569+00:00
reason Unknown
user system/root
mode cold cold
required no
Upgrade Mode
Upgrade mode updates all the components and services on the switch to the latest Cumulus Linux minor release without impacting traffic. After upgrade is complete, you must restart the switch with either a warm, cold, or fast restart.
If the switch is in warm restart mode, restarting the switch after an upgrade does not result in traffic loss (this is a hitless upgrade).
Upgrade mode includes the following options:
all runs apt-get upgrade to upgrade all the system components to the latest release without affecting traffic flow. You must restart the system after the upgrade completes with one of the restart modes.
dry-run provides information on the components you want to upgrade.
The following command upgrades all the system components:
The NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -u
The following command provides information on the components you want to upgrade:
The NVUE command is not supported.
cumulus@switch:~$ sudo csmgrctl -d
Maintenance Mode
Maintenance mode globally manages the BGP and MLAG control plane.
When you enable maintenance mode, BGP and MLAG shut down gracefully.
When you disable maintenance mode, BGP and MLAG are enabled based on the individual parameter settings.
To enable maintenance mode:
cumulus@switch:~$ nv action enable system maintenance mode
Action executing ...
System maintenance mode has been enabled successfully
Current System Mode: Maintenance, cold
Maintenance mode since Thu Jun 13 23:59:47 2024 (Duration: 00:00:00)
Ports shutdown for Maintenance
frr : Maintenance, cold, down, up time: 29:06:27
switchd : Maintenance, cold, down, up time: 29:06:31
System Services : Maintenance, cold, down, up time: 29:07:00
Action succeeded
cumulus@switch:~$ sudo csmgrctl -m1
To disable maintenance mode:
cumulus@switch:~$ nv action disable system maintenance mode
Action executing ...
System maintenance mode has been disabled successfully
Current System Mode: cold
frr : cold, up, up time: 12:57:48 (1 restart)
switchd : cold, up, up time: 13:12:13
System Services : cold, up, up time: 13:12:32
Action succeeded
cumulus@switch:~$ sudo csmgrctl -m0
Before you disable maintenance mode, be sure to bring the ports back up.
To show maintenance mode status, run either the NVUE nv show system maintenance command or the Linux sudo csmgrctl -s command:
cumulus@switch:~$ nv show system maintenance
operational
----- -----------
mode enabled
ports disabled
cumulus@switch:~$ sudo csmgrctl -s
Current System Mode: cold
frr : cold, up, up time: 00:14:51 (2 restarts)
clagd : cold, up, up time: 00:14:47
switchd : cold, up, up time: 01:09:48
System Services : cold, up, up time: 01:10:07
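To check the status from a script, you can parse the csmgrctl -s output. The following Python sketch assumes the output layout shown in the example above; the function name parse_csmgrctl_status is hypothetical, and the real output might contain lines that this sketch ignores.

```python
def parse_csmgrctl_status(text):
    """Parse `csmgrctl -s` style output into the system mode and per-service states."""
    status = {"mode": None, "services": {}}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # lines without a colon carry no key/value pair
        name, rest = name.strip(), rest.strip()
        if name == "Current System Mode":
            status["mode"] = rest
        elif "up time" in rest:
            # Keep the comma-separated states that precede the "up time" field.
            states = rest.split("up time")[0].split(",")
            status["services"][name] = [s.strip() for s in states if s.strip()]
    return status


SAMPLE = """\
Current System Mode: cold
frr : cold, up, up time: 00:14:51 (2 restarts)
clagd : cold, up, up time: 00:14:47
switchd : cold, up, up time: 01:09:48
System Services : cold, up, up time: 01:10:07
"""

print(parse_csmgrctl_status(SAMPLE))
```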
Maintenance Ports
Maintenance ports globally disables or enables all configured ports.
When you enable maintenance ports, swp interfaces follow individual admin states.
When you disable maintenance ports, swp interfaces are globally admin down, overriding the admin state in the configuration.
To enable maintenance ports:
cumulus@switch:~$ nv action enable system maintenance ports
Action executing ...
System maintenance ports has been enabled successfully
Current System Mode: cold
frr : cold, up, up time: 28:54:36
switchd : cold, up, up time: 28:54:40
System Services : cold, up, up time: 28:55:09
Action succeeded
cumulus@switch:~$ sudo csmgrctl -p0
To disable maintenance ports:
cumulus@switch:~$ nv action disable system maintenance ports
Action executing ...
System maintenance ports has been disabled successfully
Current System Mode: cold
Ports shutdown for Maintenance
frr : cold, up, up time: 28:55:49
switchd : cold, up, up time: 28:55:53
System Services : cold, up, up time: 28:56:22
Action succeeded
cumulus@switch:~$ sudo csmgrctl -p1
To see the status of maintenance ports, run the NVUE nv show system maintenance command:
cumulus@switch:~$ nv show system maintenance
operational
----- -----------
mode enabled
ports disabled
Layer 1 and Switch Ports
This section discusses the following layer 1 and switch port configuration:
To configure and bring an interface up administratively, edit the /etc/network/interfaces file to add the interface stanza, then run the ifreload -a command:
cumulus@switch:~$ sudo nano /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.1/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
...
To bring an interface down administratively after you configure it, add link-down yes to the interface stanza in the /etc/network/interfaces file, then run ifreload -a:
auto swp1
iface swp1
link-down yes
If you configure an interface in the /etc/network/interfaces file, you can bring it down administratively with the ifdown swp1 command, then bring the interface back up with the ifup swp1 command. These changes do not persist after a reboot. After a reboot, the configuration present in /etc/network/interfaces takes effect.
By default, the ifdown and ifup commands are quiet. Use the verbose option (-v) to show commands as they execute when you bring an interface down or up.
To remove an interface from the configuration entirely, remove the interface stanza from the /etc/network/interfaces file, then run the ifreload -a command.
For additional information on interface administrative state and physical state, refer to this knowledge base article.
Interface Classes
ifupdown2 enables you to group interfaces into separate classes. A class is a user-defined label that groups interfaces that share a common function (such as uplink, downlink or compute). You specify classes in the /etc/network/interfaces file.
The most common class is auto, which you configure like this:
auto swp1
iface swp1
You can add other classes using the allow prefix. For example, if you have multiple interfaces used for uplinks, you can define a class called uplinks:
auto swp1
allow-uplinks swp1
iface swp1 inet static
address 10.1.1.1/31
auto swp2
allow-uplinks swp2
iface swp2 inet static
address 10.1.1.3/31
This allows you to perform operations on only these interfaces using the --allow=uplinks option. You can still use the -a option because these interfaces are also in the auto class:
cumulus@switch:~$ sudo ifup --allow=uplinks
cumulus@switch:~$ sudo ifreload -a
If you are using Management VRF, you can use the special interface class called mgmt and put the management interface into that class. The management VRF must have an IPv6 address in addition to an IPv4 address to work correctly.
All ifupdown2 commands (ifup, ifdown, ifquery, ifreload) can take a class. Include the --allow=<class> option when you run the command. For example, to reload the configuration for the management interface described above, run:
cumulus@switch:~$ sudo ifreload --allow=mgmt
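To see which interfaces a given --allow=<class> option selects, you can list the class membership declared in the interfaces file. The following Python sketch is a simplified reader, not the ifupdown2 parser: it looks only at auto and allow-<class> lines, and the function name interface_classes and the sample text are illustrative.

```python
from collections import defaultdict


def interface_classes(config_text):
    """Map each class name to the interfaces that an interfaces file puts in it."""
    classes = defaultdict(list)
    for raw in config_text.splitlines():
        words = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(words) < 2:
            continue
        keyword, names = words[0], words[1:]
        if keyword == "auto":
            classes["auto"].extend(names)
        elif keyword.startswith("allow-"):
            classes[keyword[len("allow-"):]].extend(names)
    return dict(classes)


SAMPLE = """\
auto swp1
allow-uplinks swp1
iface swp1 inet static
    address 10.1.1.1/31

auto swp2
allow-uplinks swp2
iface swp2 inet static
    address 10.1.1.3/31
"""

print(interface_classes(SAMPLE))
```

The class name is whatever follows the allow- prefix, so it must match the value you pass to --allow exactly.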
Use the -a option to bring up or down all interfaces with the common auto class in the /etc/network/interfaces file.
To administratively bring up all interfaces marked auto, run:
cumulus@switch:~$ sudo ifup -a
To administratively bring down all interfaces marked auto, run:
cumulus@switch:~$ sudo ifdown -a
To reload all network interfaces marked auto, use the ifreload command. This command is equivalent to running ifdown then ifup; however, ifreload skips unchanged configurations:
cumulus@switch:~$ sudo ifreload -a
Cumulus Linux checks syntax by default. As a precaution, apply configurations only if the syntax check passes. Use the following compound command:
cumulus@switch:~$ sudo bash -c "ifreload -s -a && ifreload -a"
For more information, see the individual man pages for ifup(8), ifdown(8), and ifreload(8).
Loopback Interface
Cumulus Linux has a preconfigured loopback interface. When the switch boots up, the loopback interface called lo is up and assigned an IP address of 127.0.0.1.
The loopback interface lo must always exist on the switch and must always be up.
To configure an IP address for the loopback interface:
cumulus@switch:~$ nv set interface lo ip address 10.10.10.1
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file to add an address line:
auto lo
iface lo inet loopback
address 10.10.10.1
If the IP address has no subnet mask, it automatically becomes a /32 IP address. For example, 10.10.10.1 is 10.10.10.1/32.
You can configure multiple IP addresses for the loopback interface.
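The /32 default described above matches how the Python standard library ipaddress module treats an IPv4 address with no prefix length. The following sketch only illustrates the rule; the helper name with_default_prefix is hypothetical and Cumulus Linux does not use this code.

```python
import ipaddress


def with_default_prefix(address):
    """Return the address with an explicit prefix length.

    An IPv4 address without a mask becomes a /32 host address.
    """
    return str(ipaddress.ip_interface(address))


print(with_default_prefix("10.10.10.1"))     # 10.10.10.1/32
print(with_default_prefix("10.10.10.1/24"))  # 10.10.10.1/24
```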
Subinterfaces
On Linux, an interface is a network device that can be either physical (for example, swp1) or virtual (for example, vlan100). A VLAN subinterface is a VLAN device on an interface, and the VLAN ID appends to the parent interface using dot (.) VLAN notation. For example, a VLAN with ID 100 that is a subinterface of swp1 is swp1.100. The dot VLAN notation for a VLAN device name is a standard way to specify a VLAN device on Linux.
A VLAN subinterface only receives traffic tagged for that VLAN; therefore, swp1.100 only receives packets that have a VLAN 100 tag on switch port swp1. Any packets that transmit from swp1.100 have a VLAN 100 tag.
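The dot notation is easy to take apart in a script. The following Python sketch splits a subinterface name into its parent interface and VLAN ID; the function name split_subinterface is hypothetical, and the 1 to 4094 range check reflects the usable 802.1Q VLAN ID range rather than any Cumulus Linux specific limit.

```python
def split_subinterface(name):
    """Split a dot-notation VLAN device name into (parent, vlan_id)."""
    parent, dot, vlan = name.rpartition(".")
    if not dot or not parent or not vlan.isdigit():
        raise ValueError(f"{name!r} is not in <parent>.<vlan-id> form")
    vlan_id = int(vlan)
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"VLAN ID {vlan_id} is outside 1-4094")
    return parent, vlan_id


print(split_subinterface("swp1.100"))  # ('swp1', 100)
```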
The following example configures a routed subinterface on swp1 in VLAN 100:
cumulus@switch:~$ nv set interface swp1.100 ip address 192.168.100.1/24
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file, then run ifreload -a:
If you are using a VLAN subinterface, do not add that VLAN under the bridge stanza.
You cannot use NVUE commands to create a routed subinterface for VLAN 1.
Interface IP Addresses
You can specify both IPv4 and IPv6 addresses for the same interface.
For IPv6 addresses:
You can create or modify the IP address for an interface using either :: or 0:0:0 notation. For example, both 2620:149:43:c109:0:0:0:5 and 2001:DB8::1/126 are valid.
Cumulus Linux assigns the IPv6 address with all zeroes in the interface identifier (2001:DB8::/126) for each subnet; connected hosts cannot use this address.
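You can check both points with the Python standard library ipaddress module. The following sketch shows that the :: and 0:0:0 notations describe the same address, and computes the all-zeroes address of the example /126 subnet; the variable names are illustrative.

```python
import ipaddress

# The long and compressed notations parse to the same address.
long_form = ipaddress.ip_address("2620:149:43:c109:0:0:0:5")
short_form = ipaddress.ip_address("2620:149:43:c109::5")
print(long_form == short_form)  # True

# The address with an all-zeroes interface identifier is the subnet's
# network address, which connected hosts cannot use.
iface = ipaddress.ip_interface("2001:DB8::1/126")
reserved = iface.network.network_address
print(reserved)  # 2001:db8::

# Remaining addresses in the /126 that interfaces and hosts can use.
usable = [str(addr) for addr in iface.network if addr != reserved]
print(usable)
```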
The following example commands configure three IP addresses for swp1: two IPv4 addresses and one IPv6 address.
cumulus@switch:~$ nv set interface swp1 ip address 10.0.0.1/30
cumulus@switch:~$ nv set interface swp1 ip address 10.0.0.2/30
cumulus@switch:~$ nv set interface swp1 ip address 2001:DB8::1/126
cumulus@switch:~$ nv config apply
In the /etc/network/interfaces file, list all IP addresses under the iface section.
auto swp1
iface swp1
address 10.0.0.1/30
address 10.0.0.2/30
address 2001:DB8::1/126
The address method and address family are not mandatory; they default to inet/inet6 and static. However, you must specify inet/inet6 when you are creating DHCP or loopback interfaces.
auto lo
iface lo inet loopback
To make non-persistent changes to interfaces at runtime, use ip addr add:
cumulus@switch:~$ sudo ip addr add 10.0.0.1/30 dev swp1
cumulus@switch:~$ sudo ip addr add 2001:DB8::1/126 dev swp1
To remove an address from an interface, use ip addr del:
cumulus@switch:~$ sudo ip addr del 10.0.0.1/30 dev swp1
cumulus@switch:~$ sudo ip addr del 2001:DB8::1/126 dev swp1
Interface Descriptions
You can add a description (alias) to an interface.
In the /etc/network/interfaces file, add a description using the alias keyword:
cumulus@switch:~$ sudo nano /etc/network/interfaces
auto swp1
iface swp1
alias hypervisor_port_1
Interface Commands
You can specify user commands for an interface that run at pre-up, up, post-up, pre-down, down, and post-down.
You can add any valid command in the sequence to bring an interface up or down; however, limit the scope to network-related commands associated with the particular interface. For example, it does not make sense to install a Debian package on ifup of swp1, even though it is technically possible. See man interfaces for more details.
The following example adds a command to an interface to enable proxy ARP:
If your post-up command also starts, restarts, or reloads any systemd service, you must use the --no-block option with systemctl. Otherwise, that service or even the switch itself might hang after starting or restarting. For example, to restart the dhcrelay service after bringing up a VLAN, the /etc/network/interfaces configuration looks like this:
auto bridge.100
iface bridge.100
post-up systemctl --no-block restart dhcrelay.service
Source Interface File Snippets
Sourcing interface files helps organize and manage the /etc/network/interfaces file. For example:
cumulus@switch:~$ sudo cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet dhcp
source /etc/network/interfaces.d/bond0
Use the glob keyword to specify bridge ports and bond slaves:
auto br0
iface br0
bridge-ports glob swp1-6.100
auto br1
iface br1
bridge-ports glob swp7-9.100 swp11.100 glob swp15-18.100
Fast Linkup
Cumulus Linux supports fast linkup on interfaces on NVIDIA Spectrum1 switches. Fast linkup enables you to bring up ports with cards that require links to come up fast, such as certain 100G optical network interface cards.
You must configure both sides of the connection with the same speed and FEC settings.
cumulus@switch:~$ nv set interface swp1 link fast-linkup on
cumulus@switch:~$ nv config apply
Edit the /etc/cumulus/switchd.conf file and add the interface.<interface>.enable_media_depended_linkup_flow=TRUE and interface.<interface>.enable_port_short_tuning=TRUE settings for the interfaces on which you want to enable fast linkup. The following example enables fast linkup on swp1:
Reload switchd with the sudo systemctl reload switchd.service command.
Link Flap Protection
Cumulus Linux enables link flap detection by default. Link flap detection triggers when there are five link flaps within ten seconds, at which point the interface goes into a protodown state and shows linkflap as the reason. The switchd service also shows a log message similar to the following:
2023-02-10T17:53:21.264621+00:00 cumulus switchd[10109]: sync_port.c:2263 ERR swp2 link flapped more than 3 times in the last 60 seconds, setting protodown
To show interfaces with the protodown flag, run the Linux ip link command:
cumulus@switch:~$ ip link
37: swp2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9178 qdisc pfifo_fast master bond131 state DOWN mode DEFAULT group default qlen 1000
link/ether 1c:34:da:ba:bb:2a brd ff:ff:ff:ff:ff:ff protodown on protodown_reason <linkflap>
Clear the Interface Protodown State and Reason
The ifdown and ifup commands do not clear the protodown state. You must clear the protodown state and the reason manually using the sudo ip link set <interface> protodown_reason linkflap off and sudo ip link set <interface> protodown off commands.
cumulus@switch:~$ sudo ip link set swp2 protodown_reason linkflap off
cumulus@switch:~$ sudo ip link set swp2 protodown off
After a few seconds the port state returns to UP. Run the ip link show <interface> command to verify that the interface is no longer in a protodown state and that the reason is cleared:
cumulus@switch:~$ ip link show swp2
37: swp2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9178 qdisc pfifo_fast master bond131 state UP mode DEFAULT group default qlen 1000
link/ether 1c:34:da:ba:bb:2a brd ff:ff:ff:ff:ff:ff
Change Link Flap Protection Settings
You can change link flap protection settings in the /etc/cumulus/switchd.conf file:
To change the duration during which a link must flap the number of times set in the link flap threshold before link flap protection triggers, change the link_flap_window setting.
To change the number of times the link must flap within the link flap window before link flap protection triggers, change the link_flap_threshold setting.
To disable link flap protection, set the link_flap_window and link_flap_threshold parameters to 0 (zero).
After you change the link flap settings, you must restart switchd with the sudo systemctl restart switchd.service command.
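The following Python sketch models how the window and threshold interact, using the documented defaults (5 flaps within 10 seconds). It is not the switchd implementation: the class name LinkFlapDetector is hypothetical, and details such as whether a flap exactly on the window boundary counts, or whether setting only one parameter to 0 disables protection, are assumptions.

```python
from collections import deque


class LinkFlapDetector:
    """Sliding-window model of link flap protection for one interface."""

    def __init__(self, window=10, threshold=5):
        self.window = window        # link_flap_window, in seconds
        self.threshold = threshold  # link_flap_threshold, in flaps
        self._flaps = deque()       # timestamps of recent flaps

    def record_flap(self, timestamp):
        """Record a flap; return True when the link should go protodown."""
        if self.window == 0 or self.threshold == 0:
            return False  # protection disabled
        self._flaps.append(timestamp)
        # Forget flaps that are older than the window.
        while timestamp - self._flaps[0] > self.window:
            self._flaps.popleft()
        return len(self._flaps) >= self.threshold


detector = LinkFlapDetector()
print([detector.record_flap(t) for t in (0, 1, 2, 3, 4)])
```

With the defaults, the fifth flap within ten seconds is the one that triggers protection, while flaps spaced further apart than the window never accumulate.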
Mako Templates
ifupdown2 supports Mako-style templates. The Mako template engine processes the interfaces file before parsing.
Use the template to declare cookie-cutter bridges and to declare addresses in the interfaces file:
%for i in [1,12]:
auto swp${i}
iface swp${i}
address 10.20.${i}.3/24
%endfor
In Mako syntax, use square brackets ([1,12]) to specify a list of individual numbers. Use range(1,12) to specify a range of interfaces; as in Python, the end value is exclusive, so range(1,12) covers 1 through 11.
To test your template and confirm it evaluates correctly, run mako-render /etc/network/interfaces.
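Mako loop expressions are Python expressions, so you can preview the difference between a list and a range in plain Python. The following sketch mimics the template loop without using the Mako library; the function name render_stanzas is hypothetical.

```python
def render_stanzas(indexes):
    """Emit one interface stanza per index, as the template loop would."""
    lines = []
    for i in indexes:
        lines += [
            f"auto swp{i}",
            f"iface swp{i}",
            f"    address 10.20.{i}.3/24",
            "",
        ]
    return "\n".join(lines)


# A list names individual ports: only swp1 and swp12.
print(render_stanzas([1, 12]))

# A range covers consecutive ports and excludes the end value.
print(list(range(1, 12)))
```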
To comment out content in Mako templates, use double hash marks (##). For example:
## % for i in range(1, 4):
## auto swp${i}
## iface swp${i}
## % endfor
##
Unlike the traditional ifupdown system, ifupdown2 does not run scripts installed in /etc/network/*/ automatically to configure network interfaces.
To enable or disable ifupdown2 scripting, edit the addon_scripts_support line in the /etc/network/ifupdown2/ifupdown2.conf file. 1 enables scripting and 0 disables scripting. For example:
cumulus@switch:~$ sudo nano /etc/network/ifupdown2/ifupdown2.conf
# Support executing of ifupdown style scripts.
# Note that by default python addon modules override scripts with the same name
addon_scripts_support=1
ifupdown2 sets the following environment variables when executing commands:
$IFACE represents the physical name of the interface; for example, br0 or vxlan42. The name comes from the /etc/network/interfaces file.
$LOGICAL represents the logical name (configuration name) of the interface.
$METHOD represents the address method; for example, loopback, DHCP, DHCP6, manual, static, and so on.
$ADDRFAM represents the address families associated with the interface in a comma-separated list; for example, "inet,inet6".
Troubleshooting
To see the link and administrative state of an interface:
cumulus@switch:~$ nv show interface swp1 link state
In the following example, swp1 is administratively UP and the physical link is UP (LOWER_UP).
cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
To show the assigned IP address on an interface:
cumulus@switch:~$ nv show interface swp1 ip address
cumulus@switch:~$ ip addr show swp1
3: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
inet 192.0.2.1/30 scope global swp1
inet 192.0.2.2/30 scope global swp1
inet6 2001:DB8::1/126 scope global tentative
valid_lft forever preferred_lft forever
To show the description (alias) for an interface:
cumulus@switch:~$ nv show interface swp1
cumulus@switch:~$ ip link show swp1
3: swp1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT qlen 500
link/ether aa:aa:aa:aa:aa:bc brd ff:ff:ff:ff:ff:ff
alias hypervisor_port_1
Considerations
Even though ifupdown2 supports the inclusion of multiple iface stanzas for the same interface, use a single iface stanza for each interface. If you must specify more than one iface stanza (for example, if the configuration for a single interface comes from many places, like a template or a sourced file), make sure the stanzas do not specify the same interface attributes. Otherwise, you see unexpected behavior.
In the following example, swp1 is in two files: /etc/network/interfaces and /etc/network/interfaces.d/speed_settings. ifupdown2 parses this configuration because the same attributes are not in multiple iface stanzas.
cumulus@switch:~$ sudo cat /etc/network/interfaces
source /etc/network/interfaces.d/speed_settings
auto swp1
iface swp1
address 10.0.14.2/24
cumulus@switch:~$ cat /etc/network/interfaces.d/speed_settings
auto swp1
iface swp1
link-speed 1000
link-duplex full
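Before you split an interface across files, you can check for overlapping attributes. The following Python sketch is a simplified reader, not the ifupdown2 parser: it ignores source expansion and templates, and the function names are hypothetical. An attribute repeated inside one stanza (such as several address lines) is not reported; only attributes that appear in more than one stanza for the same interface are.

```python
from collections import defaultdict


def parse_stanzas(text):
    """Return a list of (interface, set of attribute names), one per iface stanza."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        words = raw.split("#", 1)[0].split()
        if not words:
            continue
        keyword = words[0]
        if keyword == "iface" and len(words) > 1:
            current = (words[1], set())
            stanzas.append(current)
        elif keyword in ("auto", "source") or keyword.startswith("allow-"):
            current = None  # these lines end the previous stanza
        elif current is not None:
            current[1].add(keyword)
    return stanzas


def duplicate_attributes(*config_texts):
    """Map each interface to the attributes set in more than one of its stanzas."""
    counts = defaultdict(lambda: defaultdict(int))
    for text in config_texts:
        for iface, attrs in parse_stanzas(text):
            for attr in attrs:
                counts[iface][attr] += 1
    return {
        iface: sorted(a for a, n in attrs.items() if n > 1)
        for iface, attrs in counts.items()
        if any(n > 1 for n in attrs.values())
    }


MAIN = "auto swp1\niface swp1\n    address 10.0.14.2/24\n"
SPEED = "auto swp1\niface swp1\n    link-speed 1000\n    link-duplex full\n"

print(duplicate_attributes(MAIN, SPEED))  # {}
```

An empty result for the two example files matches the statement above that no attribute appears in more than one stanza.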
ifupdown2 and sysctl
For sysctl commands in the pre-up, up, post-up, pre-down, down, and post-down lines that use the $IFACE variable, if the interface name contains a dot (.), ifupdown2 does not change the name to work with sysctl. For example, the interface name bridge.1 does not convert to bridge/1.
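If you generate such commands from a script, you can do the conversion yourself. The following Python sketch builds a dotted sysctl key in which the dot inside the interface name becomes a slash, as in the bridge/1 example above. The function name sysctl_key and the proxy_arp setting are illustrative; verify the resulting key with sysctl on your switch before relying on it.

```python
def sysctl_key(family, iface, setting):
    """Build a per-interface sysctl key, replacing dots in the interface name."""
    return f"net.{family}.conf.{iface.replace('.', '/')}.{setting}"


print(sysctl_key("ipv4", "bridge.1", "proxy_arp"))  # net.ipv4.conf.bridge/1.proxy_arp
print(sysctl_key("ipv4", "swp1", "proxy_arp"))      # net.ipv4.conf.swp1.proxy_arp
```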
ifupdown2 and the gateway Parameter
The default route that the gateway parameter creates in ifupdown2 does not install in FRR and therefore does not redistribute into other routing protocols. Define a static default route instead, which installs in FRR and redistributes, if needed.
The following shows an example of the /etc/network/interfaces file when you use a static route instead of a gateway parameter:
auto swp2
iface swp2
address 172.16.3.3/24
up ip route add default via 172.16.3.2
Interface Name Limitations
Interface names can be a maximum of 15 characters. You cannot use a number for the first character and you cannot include a dash (-) in the name. In addition, you cannot use any name that matches with the regular expression .{0,13}\-v.*.
If you encounter issues, remove the interface name from the /etc/network/interfaces file, then restart the networking.service.
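To catch a problematic name before you add it to the configuration, you can encode the rules above in a small check. The following Python sketch is an illustration only; the function name interface_name_problems is hypothetical, and the sketch assumes the regular expression applies to the whole name.

```python
import re

# Pattern quoted from the naming rules above.
FORBIDDEN = re.compile(r".{0,13}\-v.*")


def interface_name_problems(name):
    """Return a list of naming rules that the interface name breaks."""
    problems = []
    if len(name) > 15:
        problems.append("longer than 15 characters")
    if name[:1].isdigit():
        problems.append("starts with a number")
    if "-" in name:
        problems.append("contains a dash")
    if FORBIDDEN.fullmatch(name):
        problems.append("matches the forbidden pattern")
    return problems


print(interface_name_problems("swp1"))       # []
print(interface_name_problems("uplink-v0"))
```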
ifupdown2 does not honor the configured IP address scope setting in the /etc/network/interfaces file and treats all addresses as global. It does not report an error. Consider this example configuration:
auto swp2
iface swp2
address 35.21.30.5/30
address 3101:21:20::31/80
scope link
When you run ifreload -a on this configuration, ifupdown2 considers all IP addresses as global.
cumulus@switch:~$ ip addr show swp2
5: swp2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 74:e6:e2:f5:62:82 brd ff:ff:ff:ff:ff:ff
inet 35.21.30.5/30 scope global swp2
valid_lft forever preferred_lft forever
inet6 3101:21:20::31/80 scope global
valid_lft forever preferred_lft forever
inet6 fe80::76e6:e2ff:fef5:6282/64 scope link
valid_lft forever preferred_lft forever
To work around this issue, configure the IP address scope:
The NVUE command is not supported.
In the /etc/network/interfaces file, configure the IP address scope using post-up ip address add <address> dev <interface> scope <scope>. For example:
auto swp6
iface swp6
post-up ip address add 71.21.21.20/32 dev swp6 scope site
Then run the ifreload -a command on this configuration.
The following configuration shows the correct scope:
cumulus@switch:~$ ip addr show swp6
9: swp6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
link/ether 74:e6:e2:f5:62:86 brd ff:ff:ff:ff:ff:ff
inet 71.21.21.20/32 scope site swp6
valid_lft forever preferred_lft forever
inet6 fe80::76e6:e2ff:fef5:6286/64 scope link
valid_lft forever preferred_lft forever
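To confirm the scope of every address on an interface without reading the output by eye, you can parse the ip addr show text. The following Python sketch is a minimal example that assumes output in the layout shown above; the helper name address_scopes is hypothetical.

```python
import re

# Matches "inet <addr> ... scope <scope>" and "inet6 <addr> ... scope <scope>".
ADDRESS_LINE = re.compile(r"^\s*(inet6?)\s+(\S+).*?\bscope\s+(\S+)")


def address_scopes(ip_addr_output):
    """Return (family, address, scope) tuples found in `ip addr show` text."""
    found = []
    for line in ip_addr_output.splitlines():
        match = ADDRESS_LINE.match(line)
        if match:
            found.append(match.groups())
    return found


# Sample text copied from the swp6 output in this section.
SAMPLE = """\
9: swp6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 74:e6:e2:f5:62:86 brd ff:ff:ff:ff:ff:ff
    inet 71.21.21.20/32 scope site swp6
       valid_lft forever preferred_lft forever
    inet6 fe80::76e6:e2ff:fef5:6286/64 scope link
       valid_lft forever preferred_lft forever
"""

if __name__ == "__main__":
    for family, address, scope in address_scopes(SAMPLE):
        print(family, address, scope)
```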
For NVIDIA Spectrum ASICs, the firmware configures FEC, link speed, duplex mode and auto-negotiation automatically, following a predefined list of parameter settings until the link comes up. You can disable FEC if necessary, which forces the firmware to not try any FEC options.
MTU
Interface MTU applies to traffic traversing the management port, front panel or switch ports, bridge, VLAN subinterfaces, and bonds (both physical and logical interfaces). MTU is the only interface setting that you must set manually.
In Cumulus Linux, ifupdown2 assigns 9216 as the default MTU setting. The initial MTU value set by the driver is 9238. After you configure the interface, the default MTU setting is 9216.
To change the MTU setting, run the following commands. The example command sets the MTU to 1500 for the swp1 interface.
cumulus@switch:~$ nv set interface swp1 link mtu 1500
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file, then run the ifreload -a command.
cumulus@switch:~$ sudo nano /etc/network/interfaces
auto swp1
iface swp1
mtu 1500
cumulus@switch:~$ sudo ifreload -a
Runtime Configuration (Advanced)
Run the ip link set command. The following example command sets the swp1 interface MTU to 1500.
cumulus@switch:~$ sudo ip link set dev swp1 mtu 1500
A runtime configuration is non-persistent; the configuration you create does not persist after you reboot the switch.
Set a Global Policy
To set a global MTU policy, create a policy document (called mtu.json). For example:
The policies and attributes in any file in /etc/network/ifupdown2/policy.d/ override the default policies and attributes in /var/lib/ifupdown2/policy.d/.
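The following Python sketch shows one way to generate a global MTU policy document. The JSON layout (an address module with a defaults section and an mtu attribute) is an assumption based on common ifupdown2 policy files; verify the key names against the policy files on your switch before you use it. The function names are hypothetical.

```python
import json


def build_mtu_policy(mtu):
    """Build a policy dictionary that sets a default MTU for all interfaces.

    Assumption: ifupdown2 reads the default MTU from
    address -> defaults -> mtu, with the value stored as a string.
    """
    return {"address": {"defaults": {"mtu": str(mtu)}}}


def render_policy(policy):
    """Serialize the policy as the JSON text you would save as mtu.json."""
    return json.dumps(policy, indent=4) + "\n"


if __name__ == "__main__":
    # Print the document; saving it under /etc/network/ifupdown2/policy.d/
    # is left to the operator.
    print(render_policy(build_mtu_policy(9216)), end="")
```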
Bridge MTU
The MTU setting is the lowest MTU of any interface that is a member of the bridge (every interface specified in bridge-ports in the bridge configuration of the /etc/network/interfaces file). You are not required to specify an MTU on the bridge. Consider this bridge configuration:
For a bridge to have an MTU of 9000, set the MTU for each of the member interfaces (bond1 to bond4, and peer5) to 9000 at minimum.
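The lowest-member rule can be expressed in a few lines. The following Python sketch uses the member names from this section with hypothetical MTU values; the function names are illustrative only.

```python
def bridge_mtu(member_mtus):
    """Return the effective bridge MTU: the lowest MTU among its members."""
    if not member_mtus:
        raise ValueError("a bridge needs at least one member interface")
    return min(member_mtus.values())


def members_below(member_mtus, target):
    """List the members whose MTU keeps the bridge below the target MTU."""
    return sorted(name for name, mtu in member_mtus.items() if mtu < target)


if __name__ == "__main__":
    # Hypothetical values: peer5 was left at 1500 by mistake.
    members = {"bond1": 9000, "bond2": 9000, "bond3": 9216, "bond4": 9000, "peer5": 1500}
    print("bridge MTU:", bridge_mtu(members))
    print("raise these to reach 9000:", members_below(members, 9000))
```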
When configuring MTU for a bond, configure the MTU value directly under the bond interface; the member links (slave interfaces) inherit the configured value. You do not have to specify an MTU on the slave interfaces.
VLAN interfaces inherit their MTU settings from their physical devices or their lower interface; for example, swp1.100 inherits its MTU setting from swp1. Therefore, specifying an MTU on swp1 ensures that swp1.100 inherits the MTU setting for swp1.
If you are working with VXLANs, the MTU for a virtual network interface (VNI) must be 50 bytes smaller than the MTU of the physical interfaces on the switch, as various headers and other data require those 50 bytes. Also, consider setting the MTU much higher than 1500.
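The 50 byte figure corresponds to the standard VXLAN encapsulation over IPv4 without an outer VLAN tag. The following Python sketch breaks the overhead down and computes the largest VNI MTU for a given physical MTU; the function name is hypothetical, and an IPv6 underlay or an outer VLAN tag would need more headroom.

```python
# Encapsulation overhead for VXLAN over IPv4 without an outer VLAN tag.
VXLAN_OVERHEAD = {
    "outer Ethernet header": 14,
    "outer IPv4 header": 20,
    "UDP header": 8,
    "VXLAN header": 8,
}


def max_vni_mtu(physical_mtu):
    """Return the largest VNI MTU that fits inside the physical MTU."""
    return physical_mtu - sum(VXLAN_OVERHEAD.values())


if __name__ == "__main__":
    print("overhead bytes:", sum(VXLAN_OVERHEAD.values()))
    print("VNI MTU for a 9216 byte physical MTU:", max_vni_mtu(9216))
```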
To show the MTU setting for an interface:
cumulus@switch:~$ nv show interface swp1
...
link
auto-negotiate off on
duplex full full
speed 1G auto
fec auto
mtu 9216 9216
cumulus@switch:~$ ip link show dev swp1
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc pfifo_fast state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
Drop Packets that Exceed the Egress Layer 3 MTU
The switch forwards all packets that are within the MTU value set for the egress layer 3 interface. However, when packets are larger in size than the MTU value, the switch fragments the packets that do not have the DF bit set and drops the packets that do have the DF bit set.
Run the following command to drop all IP packets that are larger in size than the MTU value for the egress layer 3 interface instead of fragmenting packets:
cumulus@switch:~$ nv set system control-plane trap l3-mtu-err state off
cumulus@switch:~$ nv config apply
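The forwarding decision described above can be modeled as follows. This Python sketch is a simplified illustration, not the ASIC logic; the drop_oversized flag is a hypothetical stand-in for the state after you run the command shown above.

```python
def egress_action(packet_size, egress_mtu, df_bit_set, drop_oversized=False):
    """Return "forward", "fragment", or "drop" for a routed IP packet.

    drop_oversized models the switch after you configure it to drop,
    rather than fragment, packets larger than the egress layer 3 MTU.
    """
    if packet_size <= egress_mtu:
        return "forward"
    if drop_oversized or df_bit_set:
        return "drop"
    return "fragment"


if __name__ == "__main__":
    print(egress_action(1400, 1500, df_bit_set=False))
    print(egress_action(9000, 1500, df_bit_set=False))
    print(egress_action(9000, 1500, df_bit_set=True))
    print(egress_action(9000, 1500, df_bit_set=False, drop_oversized=True))
```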
FEC is an encoding and decoding layer that enables the switch to detect and correct bit errors introduced over the cable between two interfaces. The target IEEE bit error rate (BER) on high speed Ethernet links is 10^-12. Because 25G transmission speeds can introduce a higher than acceptable BER on a link, FEC is often required to correct errors to achieve the target BER at 25G, 4x25G, 100G, and higher link speeds. The type and grade of a cable or module and the medium of transmission determine which FEC setting is necessary.
For the link to come up, the two interfaces on each end must use the same FEC setting.
FEC adds a small latency overhead. For most applications, this small amount of latency is preferable to error packet retransmission latency.
The two FEC types are:
Reed Solomon (RS), IEEE 802.3 Clause 108 (CL108) on individual 25G channels and Clause 91 on 100G (4 channels). This is the strongest FEC algorithm, providing the best bit-error correction.
Base-R (BaseR), Fire Code (FC), IEEE 802.3 Clause 74 (CL74). Base-R provides less protection from bit errors than RS FEC but adds less latency.
Cumulus Linux includes additional FEC options:
Auto FEC instructs the hardware to select the best FEC. For copper DAC, the remote end can negotiate FEC. However, optical modules do not have auto-negotiation capability; if the device chooses a preferred mode, it might not match the remote end. This is the current default on the NVIDIA Spectrum switch.
No FEC (no error correction).
While Auto FEC is the default setting on the NVIDIA Spectrum switch, do not explicitly configure the fec auto option on the switch as this leads to a link flap whenever you run net commit or ifreload -a.
For 25G DAC, 4x25G Breakouts DAC and 100G DAC cables, the IEEE 802.3by specification creates 3 classes:
CA-25G-L (Long cable) - Requires RS FEC - Achievable cable length of at least 5m. dB loss less than or equal to 22.48. Expected BER of 10^-5 or better without RS FEC enabled.
CA-25G-S (Short cable) - Requires Base-R FEC - Achievable cable length of at least 3m. dB loss less than or equal to 16.48. Expected BER of 10^-8 or better without Base-R FEC enabled.
CA-25G-N (No FEC) - Does not require FEC - Achievable cable length of at least 3m. dB loss less than or equal to 12.98. Expected BER of 10^-12 or better with no FEC enabled.
The IEEE classification specifies various dB loss measurements and minimum achievable cable length. You can build longer and shorter cables if they comply with the dB loss and BER requirements.
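The three classes can be treated as a lookup from measured dB loss to the minimum FEC setting. The following Python sketch uses the loss limits listed above; the function name classify_cable is hypothetical, and a real cable also has to meet the BER requirement for its class.

```python
# (class name, maximum dB loss, minimum FEC) taken from the three classes.
CABLE_CLASSES = [
    ("CA-25G-N", 12.98, "none"),
    ("CA-25G-S", 16.48, "Base-R"),
    ("CA-25G-L", 22.48, "RS"),
]


def classify_cable(db_loss):
    """Return (class, minimum FEC) for a dB loss, or None if out of range."""
    for name, max_loss, fec in CABLE_CLASSES:
        if db_loss <= max_loss:
            return name, fec
    return None


if __name__ == "__main__":
    for loss in (10.0, 16.48, 20.0, 25.0):
        print(loss, classify_cable(loss))
```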
If a cable has a CA-25G-S classification and FEC is not on, the BER might be unacceptable in a production network. It is important to set the FEC according to the cable class (or better) to have acceptable bit error rates. See Determining Cable Class below.
You can check bit errors using cl-netstat (RX_ERR column) or ethtool -S (HwIfInErrors counter) after a large amount of traffic passes through the link. A non-zero value indicates bit errors.
Expect error packets to be zero or extremely low compared to good packets. If a cable has an unacceptable rate of errors with FEC enabled, replace the cable.
For 25G, 4x25G Breakout, and 100G Fiber modules and AOCs, there is no classification of 25G cable types for dB loss, BER or length. FEC is recommended but might not be required if the BER is low enough.
Cable Class of 100G and 25G DACs
You can determine the cable class for 100G and 25G DACs from the Extended Specification Compliance Code field (SFP28: 0Ah, byte 35, QSFP28: Page 0, byte 192) in the cable EEPROM programming.
For 100G DACs, most manufacturers use the 0x0Bh 100GBASE-CR4 or 25GBASE-CR CA-L value (the 100G DAC specification predates the IEEE 802.3by 25G DAC specification). Use RS FEC for 100G DAC; shorter or better cables might not need this setting.
A manufacturer’s EEPROM setting might not match the dB loss on a cable or the actual bit error rates that a particular cable introduces. Use the designation as a guide, but set FEC according to the bit error rate tolerance in the design criteria for the network. For most applications, the highest mutual FEC ability of both end devices is the best choice.
You can determine for which grade the manufacturer has designated the cable as follows.
In each example below, the Compliance field comes from the method described above; the ethtool -m output does not show it.
3 meter cable that does not require FEC (CA-N)
Cost: More expensive
Cable size: 26AWG (Note that AWG does not necessarily correspond to overall dB loss or BER performance)
Compliance Code: 25GBASE-CR CA-N
3 meter cable that requires Base-R FEC (CA-S)
Cost: Less expensive
Cable size: 26AWG
Compliance Code: 25GBASE-CR CA-S
When in doubt, consult the manufacturer directly to determine the cable classification.
Spectrum ASIC FEC Behavior
The firmware in a Spectrum ASIC applies FEC configuration to 25G and 100G cables based on the cable type and whether the peer switch also has a Spectrum ASIC.
When the link is between two switches with Spectrum ASICs:
For 25G optical modules, the Spectrum ASIC firmware chooses Base-R/FC-FEC.
For 25G DAC cables with attenuation less than or equal to 16dB, the firmware chooses Base-R/FC-FEC.
For 25G DAC cables with attenuation higher than 16dB, the firmware chooses RS-FEC.
For 100G cables/modules, the firmware chooses RS-FEC.
Cable Type | FEC Mode
25G optical cables | Base-R/FC-FEC
25G 1,2 meters: CA-N, loss <13dB | Base-R/FC-FEC
25G 2.5,3 meters: CA-S, loss <16dB | Base-R/FC-FEC
25G 2.5,3,4,5 meters: CA-L, loss > 16dB | RS-FEC
100G DAC or optical | RS-FEC
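The selection rules for a link between two Spectrum switches can be summarized in code. The following Python sketch restates the rules from this section; it is not the firmware implementation, and the function name spectrum_fec and its argument values are hypothetical.

```python
def spectrum_fec(speed, media, attenuation_db=None):
    """Return the FEC mode expected between two Spectrum switches.

    speed is "25G" or "100G"; media is "optical" or "dac".
    attenuation_db is required only for 25G DAC cables.
    """
    if speed == "100G":
        return "RS-FEC"
    if speed == "25G" and media == "optical":
        return "Base-R/FC-FEC"
    if speed == "25G" and media == "dac":
        if attenuation_db is None:
            raise ValueError("a 25G DAC needs an attenuation value in dB")
        return "Base-R/FC-FEC" if attenuation_db <= 16 else "RS-FEC"
    raise ValueError(f"unsupported combination: {speed} {media}")


if __name__ == "__main__":
    print(spectrum_fec("25G", "optical"))
    print(spectrum_fec("25G", "dac", attenuation_db=12.5))
    print(spectrum_fec("25G", "dac", attenuation_db=20))
    print(spectrum_fec("100G", "dac"))
```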
When linking to a non-Spectrum peer, the firmware lets the peer decide. The Spectrum ASIC supports RS-FEC (for both 100G and 25G), Base-R/FC-FEC (25G only), or no-FEC (for both 100G and 25G).
Cable Type | FEC Mode
25G optical cables | Let peer decide
25G 1,2 meters: CA-N, loss <13dB | Let peer decide
25G 2.5,3 meters: CA-S, loss <16dB | Let peer decide
25G 2.5,3,4,5 meters: CA-L, loss > 16dB | Let peer decide
100G | Let peer decide: RS-FEC or No FEC
How Does Cumulus Linux use FEC?
A Spectrum switch enables FEC automatically when it powers up. The port firmware tests and determines the correct FEC mode to bring the link up with the neighbor. It is possible to get a link up to a switch without enabling FEC on the remote device as the switch eventually finds a working combination to the neighbor without FEC.
The following sections describe how to show the current FEC mode, and how to enable and disable FEC.
Show the Current FEC Mode
To show the FEC mode on a switch port, run the NVUE nv show interface <interface> link command.
cumulus@switch:~$ nv show interface swp1 link
operational applied pending description
---------------- ------------ ------- ------- ----------------------------------------------------------------------
auto-negotiate off on on Link speed and characteristic auto negotiation
breakout 1x 1x sub-divide or disable ports (only valid on plug interfaces)
duplex full full full Link duplex
fec auto auto Link forward error correction mechanism
...
Enable or Disable FEC
To enable Reed Solomon (RS) FEC on a link:
cumulus@switch:~$ nv set interface swp1 link fec rs
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file, then run the ifreload -a command. The following example enables RS FEC for the swp1 interface (link-fec rs):
cumulus@switch:~$ sudo nano /etc/network/interfaces
auto swp1
iface swp1
link-autoneg off
link-speed 100000
link-fec rs
cumulus@switch:~$ sudo ifreload -a
Runtime Configuration (Advanced)
Run the ethtool --set-fec <interface> encoding RS command. For example:
cumulus@switch:~$ sudo ethtool --set-fec swp1 encoding RS
A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.
To enable Base-R/FireCode FEC on a link:
cumulus@switch:~$ nv set interface swp1 link fec baser
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file, then run the ifreload -a command. The following example enables Base-R FEC for the swp1 interface (link-fec baser):
cumulus@switch:~$ sudo nano /etc/network/interfaces
auto swp1
iface swp1
link-autoneg off
link-speed 100000
link-fec baser
cumulus@switch:~$ sudo ifreload -a
Runtime Configuration (Advanced)
Run the ethtool --set-fec <interface> encoding baser command. For example:
cumulus@switch:~$ sudo ethtool --set-fec swp1 encoding baser
A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.
To enable FEC with Auto-negotiation:
You can use FEC with auto-negotiation on DACs only.
cumulus@switch:~$ nv set interface swp1 link auto-negotiate on
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file to set auto-negotiation to on, then run the ifreload -a command:
cumulus@switch:~$ sudo nano /etc/network/interfaces
auto swp1
iface swp1
link-autoneg on
cumulus@switch:~$ sudo ifreload -a
Runtime Configuration (Advanced)
You can use ethtool to enable FEC with auto-negotiation. For example:
cumulus@switch:~$ sudo ethtool -s swp1 speed 10000 duplex full autoneg on
A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.
To show the FEC and auto-negotiation settings for an interface, run either the NVUE nv show interface <interface> link command or the Linux sudo ethtool swp1 | egrep 'FEC|auto' command.
To disable FEC on a link:
cumulus@switch:~$ nv set interface swp1 link fec off
cumulus@switch:~$ nv config apply
To reset FEC to the default value, run the nv unset interface swp1 link fec command.
Edit the /etc/network/interfaces file, then run the ifreload -a command. The following example disables FEC for the swp1 interface (link-fec off):
cumulus@switch:~$ sudo nano /etc/network/interfaces
auto swp1
iface swp1
link-fec off
cumulus@switch:~$ sudo ifreload -a
Runtime Configuration (Advanced)
Run the ethtool --set-fec <interface> encoding off command. For example:
cumulus@switch:~$ sudo ethtool --set-fec swp1 encoding off
A runtime configuration is non-persistent. The configuration you create does not persist after you reboot the switch.
DR1 and DR4 Modules
100GBASE-DR1 modules, such as NVIDIA MMS1V70-CM, include internal RS FEC processing, which the software does not control. When using these optics, you must either set the FEC setting to off or leave it unset for the link to function.
400GBASE-DR4 modules, such as NVIDIA MMS1V00-WM, require RS FEC. The switch automatically enables FEC if it is set to off.
You typically use these optics to interconnect 4x SN2700 uplinks to a single SN4700 breakout downlink. The following configuration shows an explicit FEC example. You can leave the FEC settings unset for autodetection.
SN4700 (400GBASE-DR4 in swp1):
cumulus@SN4700:mgmt:~$ nv set interface swp1 link breakout 4x lanes-per-port 2
cumulus@SN4700:mgmt:~$ nv set interface swp1s0 link fec rs
cumulus@SN4700:mgmt:~$ nv set interface swp1s0 link speed 100G
cumulus@SN4700:mgmt:~$ nv set interface swp1s1 link fec rs
cumulus@SN4700:mgmt:~$ nv set interface swp1s1 link speed 100G
cumulus@SN4700:mgmt:~$ nv set interface swp1s2 link fec rs
cumulus@SN4700:mgmt:~$ nv set interface swp1s2 link speed 100G
cumulus@SN4700:mgmt:~$ nv set interface swp1s3 link fec rs
cumulus@SN4700:mgmt:~$ nv set interface swp1s3 link speed 100G
cumulus@SN4700:mgmt:~$ nv config apply
SN2700 (100GBASE-DR1 in swp11-14):
cumulus@SN2700:mgmt:~$ nv set interface swp11 link fec off
cumulus@SN2700:mgmt:~$ nv set interface swp11 link speed 100G
cumulus@SN2700:mgmt:~$ nv set interface swp12 link fec off
cumulus@SN2700:mgmt:~$ nv set interface swp12 link speed 100G
cumulus@SN2700:mgmt:~$ nv set interface swp13 link fec off
cumulus@SN2700:mgmt:~$ nv set interface swp13 link speed 100G
cumulus@SN2700:mgmt:~$ nv set interface swp14 link fec off
cumulus@SN2700:mgmt:~$ nv set interface swp14 link speed 100G
cumulus@SN2700:mgmt:~$ nv config apply
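Because the per-interface commands above repeat the same pattern, you might prefer to generate them. The following Python sketch builds the same NVUE command strings as the two examples; the helper name fec_speed_commands is hypothetical and the script only prints the commands, it does not run them.

```python
def fec_speed_commands(interfaces, fec, speed):
    """Build the nv set commands that fix FEC and speed on each interface."""
    commands = []
    for name in interfaces:
        commands.append(f"nv set interface {name} link fec {fec}")
        commands.append(f"nv set interface {name} link speed {speed}")
    return commands


# SN4700 side: one 400GBASE-DR4 port broken out into four RS FEC links.
sn4700 = ["nv set interface swp1 link breakout 4x lanes-per-port 2"]
sn4700 += fec_speed_commands([f"swp1s{i}" for i in range(4)], "rs", "100G")
sn4700.append("nv config apply")

# SN2700 side: four 100GBASE-DR1 ports with FEC off.
sn2700 = fec_speed_commands([f"swp{n}" for n in range(11, 15)], "off", "100G")
sn2700.append("nv config apply")

if __name__ == "__main__":
    print("\n".join(sn4700))
    print()
    print("\n".join(sn2700))
```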
The FEC operational view of this configuration appears incorrect because FEC is operationally enabled only on the SN4700 400G breakout side. This is because the 100G DR1 module side handles FEC internally, which is not visible to Cumulus Linux.
cumulus@SN2700:mgmt:~$ nv show int swp11 link
operational applied
--------------------- ----------------- -------
auto-negotiate on on
duplex full full
speed 100G auto
fec off off
mtu 9216 9216
fast-linkup off
[breakout]
state up up
...
cumulus@SN4700:mgmt:~$ nv show int swp1s1 link
operational applied
--------------------- ----------------- -------
auto-negotiate on on
duplex full full
speed 100G auto
fec rs off
mtu 9216 9216
fast-linkup off
[breakout]
state up up
...
Default Policies for Interface Settings
Instead of configuring settings for each individual interface, you can specify a policy for all interfaces on a switch or tailor custom settings for each interface. Create a file in /etc/network/ifupdown2/policy.d/ and populate the settings accordingly. The following example shows a file called address.json.
Setting the default MTU also applies to the management interface. Be sure to add the iface_defaults to override the MTU for eth0, to remain at 9216.
Breakout Ports
Cumulus Linux supports the following port breakout options:
18x SFP28 25G and 4x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.
All 4x QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.
18x 1G - 18x SFP28 set to 1G
16x 1G - 4x QSFP28 configured as 4x breakouts and set to 1G
Max 1G ports: 34
18x 10G - 18x SFP28 set to 10G
16x 10G - 4x QSFP28 configured as 4x breakouts and set to 10G
Maximum 10G ports: 34
18x 25G - 18x SFP28 (native speed)
16x 25G - 4x QSFP28 breakouts to 4x and set to 25G
Maximum 25G ports: 34
4x 40G - 4x QSFP28 set to 40G
Maximum 40G ports: 4
8x 50G - 4x QSFP28 break out into 2x and set to 50G
Maximum 50G ports: 8
4x 100G - 4x QSFP28 (native speed)
Maximum 100G ports: 4
16x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.
All QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.
64x 1G - 16x QSFP28 break out into 4x and set to 1G
Max 1G ports: 64
64x 10G - 16x QSFP28 break out into 4x and set to 10G
Maximum 10G ports: 64
64x 25G - 16x QSFP28 break out into 4x and set to 25G
Maximum 25G ports: 64
16x 40G - 16x QSFP28 set to 40G
Maximum 40G ports: 16
32x 50G - 16x QSFP28 break out into 2x and set to 50G
Maximum 50G ports: 32
16x 100G - 16x QSFP28 (native speed)
Maximum 100G ports: 16
48x 1GBase-T ports (RJ45 up to 100m CAT5E/6) and 4x QSFP28 100G interfaces (only support NRZ encoding). You can set all speeds down to 1G.
All 4x QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.
48x 1GBase-T - 48x Base-T set to 1G. You can also set them to 10/100Mb.
16x 1G - 4x QSFP28 configured as 4x breakouts and set to 1G
Maximum 10/100MBase-T ports: 48
Maximum 1GBase-T ports: 48
Maximum 1G ports: 16
16x 10G - 4x QSFP28 configured as 4x breakouts and set to 10G
Maximum 10G ports: 16
16x 25G - 4x QSFP28 breakouts to 4x and set to 25G
Maximum 25G ports: 16
4x 40G - 4x QSFP28 set to 40G
Maximum 40G ports: 4
8x 50G - 4x QSFP28 break out into 2x
Maximum 50G ports: 8
4x 100G - 4x QSFP28 (native speed)
Maximum 100G ports: 4
48x SFP28 25G and 8x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.
The top 4x QSFP28 ports can break out into 4x SFP28. You cannot use the lower 4x QSFP28 disabled ports.
All 8x QSFP28 ports can break out into 2x QSFP28 without disabling ports.
48x 1G - 48x SFP28 set to 1G
16x 1G - 4x QSFP28 break out into 4x and set to 1G
Max 1G ports: 64
48x 10G - 48x SFP28 set to 10G
16x 10G - 4x QSFP28 break out into 4x and set to 10G
Maximum 10G ports: 64
48x 25G - 48x SFP28 (native speed)
16x 25G - Top 4x QSFP28 break out into 4x (bottom 4x QSFP28 disabled)
Maximum 25G ports: 64
8x 40G - 8x QSFP28 set to 40G
Maximum 40G ports: 8
16x 50G - 8x QSFP28 break out into 2x
Maximum 50G ports: 16
8x 100G - 8x QSFP28 (native speed)
Maximum 100G ports: 8
32x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.
The top 16x QSFP28 ports can break out into 4x SFP28. You cannot use the lower 16x QSFP28 disabled ports.
All 32x QSFP28 ports can break out into 2x QSFP28 without disabling ports.
64x 1G - Top 16x QSFP28 break out into 4x and set to 1G (bottom 16x QSFP28 disabled)
Maximum 1G ports: 64
64x 10G - Top 16x QSFP28 break out into 4x and set to 10G (bottom 16x QSFP28 disabled)
Maximum 10G ports: 64
64x 25G - Top 16x QSFP28 break out into 4x (bottom 16x QSFP28 disabled)
Maximum 25G ports: 64
32x 40G - 32x QSFP28 set to 40G
Maximum 40G ports: 32
64x 50G - 32x QSFP28 break out into 2x
Maximum 50G ports: 64
32x 100G - 32x QSFP28 (native speed)
Maximum 100G ports: 32
48x SFP28 25G and 12x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.
All 12x QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.
48x 1G - 48x SFP28 set to 1G
48x 1G - 12x QSFP28 break out into 4x and set to 1G
Maximum 1G ports: 96
48x 10G - 48x SFP28 set to 10G
48x 10G - 12x QSFP28 break out into 4x and set to 10G
Maximum 10G ports: 96
48x 25G - 48x SFP28 (native speed)
48x 25G - 12x QSFP28 break out into 4x
Maximum 25G ports: 96
12x 40G - 12x QSFP28 set to 40G
Maximum 40G ports: 12
24x 50G - 12x QSFP28 break out into 2x
Maximum 50G ports: 24
12x 100G - 12x QSFP28 (native speed)
Maximum 100G ports: 12
32x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.
All 32x QSFP28 ports can break out into 4x SFP28 or 2x QSFP28.
128x 1G - 32x QSFP28 break out into 4x and set to 1G
Maximum 1G ports: 128
128x 10G - 32x QSFP28 break out into 4x and set to 10G
Maximum 10G ports: 128
128x 25G - 32x QSFP28 break out into 4x
Maximum 25G ports: 128
32x 40G - 32x QSFP28 set to 40G
Maximum 40G ports: 32
64x 50G - 32x QSFP28 break out into 2x
Maximum 50G ports: 64
32x 100G - 32x QSFP28 (native speed)
Maximum 100G ports: 32
32x QSFP56 200G interfaces support both PAM4 and NRZ encodings. You can set all speeds down to 1G.
For lower speed interface configurations, PAM4 is automatically converted to NRZ encoding.
All 32x QSFP56 ports can break out into 4xSFP56 or 2x QSFP56.
128x 1G - 32x QSFP56 break out into 4x and set to 1G
Maximum 1G ports: 128
128x 10G - 32x QSFP56 break out into 4x and set to 10G
Maximum 10G ports: 128
128x 25G - 32x QSFP56 break out into 4x and set to 25G
Maximum 25G ports: 128
32x 40G - 32x QSFP56 set to 40G
Maximum 40G ports: 32
128x 50G - 32x QSFP56 break out into 4x
Maximum 50G ports: 128
64x 100G - 32x QSFP56 break out into 2x
Maximum 100G ports: 64
32x 200G - 32x QSFP56 (native speed)
Maximum 200G ports: 32
SN4410 24xQSFP28-DD (100GbE) interfaces [ports 1-24] only support NRZ encoding with all speeds down to 1G.
The 8xQSFP-DD (400GbE) interfaces [ports 25-32] support both PAM4 and NRZ encodings with all speeds down to 1G.
For lower speeds, PAM4 is automatically converted to NRZ encoding.
The 24xQSFP28-DD ports can break out into 2xQSFP28 (2x100GbE) using special 2x100GbE breakout cable, or 4xSFP28 (4x25GbE).
The top 4xQSFP-DD ports can break out into 8xSFP56 (8x50GbE). But, in this case, the adjacent 4xQSFP-DD ports are blocked.
All the 8xQSFP-DD ports can break out into 4xQSFP56 (4x100GbE), or 2xQSFP56 (2x200GbE) without blocking ports.
96x 1G - 24xQSFP28-DD break out into 4x and set to 1G
32x 1G - Top 4xQSFP-DD break out into 8x and set to 1G (bottom 4xQSFP-DD blocked*)
Maximum 1G ports: 128
96x10G - 24xQSFP28-DD break out into 4x and set to 10G
32x10G - 4 top QSFP-DD break out into 8x and set to 10G (bottom 4xQSFP-DD blocked*)
Maximum 10G ports: 128
*Other QSFP-DD breakout combinations are available up to maximum of 128x ports.
96x25G - 24xQSFP28-DD break out into 4x
32x25G - 4 top QSFP-DD break out into 8x and set to 25G (bottom 4xQSFP-DD blocked*)
Maximum 25G ports: 128
*Other QSFP-DD breakout combinations are available up to maximum of 128x ports.
32x40G - 24xQSFP28-DD and 8xQSFP-DD set to 40G
Maximum 40G ports: 32
48x50G - 24xQSFP28-DD break out into 2x
32x50G - 4 top QSFP-DD break out into 8x (bottom 4xQSFP-DD blocked*)
Maximum 50G ports: 80
*Other QSFP-DD breakout combinations are available up to maximum of 80x ports.
48x100G - 24xQSFP28-DD break out into 2x (using special 2xQSFP28-DD breakout cable)
32x100G - 8xQSFP-DD break out into 4x
Maximum 100G ports: 80
16x200G - 8xQSFP-DD break out into 2x
Maximum 200G ports: 16
8x400G - 8xQSFP-DD (native speed)
Maximum 400G ports: 8
64x QSFP28 100G interfaces only support NRZ encoding. You can set all speeds down to 1G.
Only 32x QSFP28 ports can break out into 4x SFP28. You must disable the adjacent QSFP28 port. Only the first and third or second and fourth rows can break out into 4x SFP28.
All 64x QSFP28 ports can break out into 2x QSFP28 without disabling ports.
128x 1G - 32x QSFP28 break out into 4x and set to 1G
Maximum 1G ports: 128
128x 10G - 32x QSFP28 break out into 4x and set to 10G
Maximum 10G ports: 128
128x 25G - 32x QSFP28 break out into 4x
Maximum 25G ports: 128
64x 40G - 64x QSFP28 set to 40G
Maximum 40G ports: 64
128x 50G - 64x QSFP28 break out into 2x
Maximum 50G ports: 128
64x 100G - 64x QSFP28 (native speed)
Maximum 100G ports: 64
SN4600 64xQSFP56 (200GbE) interfaces support both PAM4 and NRZ encodings with all speeds down to 1G.
For lower speeds, PAM4 is automatically converted to NRZ encoding.
Only 32xQSFP56 ports can break out into 4xSFP56 (4x50GbE). In this case, the adjacent QSFP56 ports are blocked (only the first and third or second and fourth rows can break out into 4xSFP56).
All 64xQSFP56 ports can break out into 2xQSFP56 (2x100GbE) without blocking ports.
128x 1G - 32xQSFP56 break out into 4x and set to 1G
Maximum 1G ports: 128
128x10G - 32xQSFP56 break out into 4x and set to 10G
Maximum 10G ports: 128
128x25G - 32xQSFP56 break out into 4x and set to 25G
Maximum 25G ports: 128
64x40G - 64xQSFP56 set to 40G
Maximum 40G ports: 64
128x50G - 32xQSFP56 break out into 4x
Maximum 50G ports: 128
128x100G - 64xQSFP56 break out into 2x
64x100G - 64xQSFP28 set to 100G
Maximum 100G ports: 128
64x200G - 64xQSFP56 (native speed)
Maximum 200G ports: 64
SN4700 32x QSFP-DD 400GbE interfaces support both PAM4 and NRZ encodings. You can set all speeds down to 1G.
For lower speed interface configurations, PAM4 is automatically converted to NRZ encoding.
Only the top 16x QSFP-DD ports can break out into 8x SFP56. You must disable the adjacent QSFP-DD port.
All 32x QSFP-DD ports can break out into 2x QSFP56 at 2x200G or 4x QSFP56 at 4x 100G without disabling ports.
128x 1G - Top 16x QSFP-DD break out into 8x and set to 1G
Maximum 1G ports: 128
128x 10G - 16x QSFP-DD break out into 8x and set to 10G
Maximum 10G ports: 128
*Cumulus Linux supports other QSFP-DD breakout combinations up to maximum of 128x ports.
128x 25G - 16x QSFP-DD break out into 8x and set to 25G
Maximum 25G ports: 128
*Cumulus Linux supports other QSFP-DD breakout combinations up to maximum of 128x ports.
32x 40G - 32x QSFP-DD set to 40G
Maximum 40G ports: 32
128x 50G - 16x QSFP-DD break out into 8x
Maximum 50G ports: 128
*Cumulus Linux supports other QSFP-DD breakout combinations up to maximum of 128x ports.
128x 100G - 32x QSFP-DD break out into 4x
Maximum 100G ports: 128
64x 200G - 32x QSFP-DD break out into 2x
Maximum 200G ports: 64
32x 400G - 32x QSFP-DD (native speed)
Maximum 400G ports: 32
You can use a single SFP (10/25/50G) transceiver in a QSFP (100/200/400G) port with QSFP-to-SFP Adapter (QSA). Set the port speed to the SFP speed with the nv set interface <interface> link speed <speed> command. Do not configure this port as a breakout port.
If you break out a port, then reload the switchd service on a switch running in nonatomic ACL mode, temporary disruption to traffic occurs while the ACLs reinstall.
Cumulus Linux does not support port ganging.
Configure a Breakout Port
You can break out (split) a port using the following options:
1x does not split the port. This is the default port setting.
2x splits the port into two interfaces.
4x splits the port into four interfaces.
8x splits the port into eight interfaces.
If you split a 100G port into four interfaces and auto-negotiation is on (the default setting), Cumulus Linux advertises the speed for each interface up to the maximum speed possible for a 100G port (100/4=25G). You can override this configuration and set specific speeds for the split ports if necessary.
Cumulus Linux 5.4 and later uses a new format for port splitting; instead of 1=100G or 1=4x10G, you specify 1=1x or 1=4x. The new format does not support specifying a speed for breakout ports in the /etc/cumulus/ports.conf file. To set a speed, either set the link-speed parameter for each split port in the /etc/network/interfaces file or run the NVUE nv set interface <interface> link speed <speed> command.
The following example breaks out a 100G port on swp1 into four interfaces. Cumulus Linux advertises the speed for each interface up to a maximum of 25G:
cumulus@switch:~$ nv set interface swp1 link breakout 4x
cumulus@switch:~$ nv set interface swp1s0-3 link state up
cumulus@switch:~$ nv config apply
The following example splits the port into four interfaces and forces the link speed to be 10G. Cumulus disables auto-negotiation when you force set the speed.
cumulus@switch:~$ nv set interface swp1 link breakout 4x
cumulus@switch:~$ nv set interface swp1s0-3 link state up
cumulus@switch:~$ nv set interface swp1s0-3 link speed 10G
cumulus@switch:~$ nv config apply
Certain switches, such as the SN2700, SN4600, and SN4600c, require that you disable the subsequent even-numbered port when you configure a breakout port for 4x or 8x. NVUE automatically disables the subsequent even-numbered port on any switch with this requirement.
To split a port into multiple interfaces, edit the /etc/cumulus/ports.conf file. The following example command breaks out swp1 into four interfaces.
When you configure a breakout port to 4x or 8x on certain switches such as the SN2700, SN4600, and SN4600c, you must set the subsequent even-numbered port to disabled in the /etc/cumulus/ports.conf file. The SN3700, SN3700c, SN2201, SN2010, and SN2100 switches do not have this requirement.
Reload switchd with the sudo systemctl reload switchd.service command. The reload does not interrupt network services.
To configure specific speeds for the split ports, edit the /etc/network/interfaces file, then run the ifreload -a command. The following example configures the speed for each swp1 breakout port (swp1s0, swp1s1, swp1s2, and swp1s3) to 10G with auto-negotiation off.
cumulus@switch:~$ sudo cat /etc/network/interfaces
...
auto swp1s0
iface swp1s0
link-speed 10000
link-duplex full
link-autoneg off
auto swp1s1
iface swp1s1
link-speed 10000
link-duplex full
link-autoneg off
auto swp1s2
iface swp1s2
link-speed 10000
link-duplex full
link-autoneg off
auto swp1s3
iface swp1s3
link-speed 10000
link-duplex full
link-autoneg off
...
cumulus@switch:~$ sudo ifreload -a
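The per-port stanzas above follow a fixed pattern, so you can generate them for any breakout. The following Python sketch produces the same stanzas as the example; the function name breakout_stanzas is hypothetical, and the four-space indentation is a formatting choice.

```python
def breakout_stanzas(port, breakout, speed_mbps):
    """Return /etc/network/interfaces stanzas for each split port.

    Split ports are named <port>s0, <port>s1, and so on, as in the
    swp1s0 to swp1s3 example.
    """
    blocks = []
    for index in range(breakout):
        name = f"{port}s{index}"
        blocks.append("\n".join([
            f"auto {name}",
            f"iface {name}",
            f"    link-speed {speed_mbps}",
            "    link-duplex full",
            "    link-autoneg off",
        ]))
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Four 10G interfaces on swp1 with auto-negotiation off.
    print(breakout_stanzas("swp1", 4, 10000), end="")
```

Review the generated text before you paste it into the /etc/network/interfaces file, then run ifreload -a as described above.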
The SN4700 and SN4410 switches do not support auto-negotiation on QSFP-DD 400G transceiver modules. You need to force set the speed.
Set the Number of Lanes per Split Port
By default, to calculate the split port width, Cumulus Linux uses the formula split port width = full port width / breakout. For example, a port split into two interfaces (2x breakout) => 8 lanes width / 2x breakout = 4 lanes per split port.
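The default width formula is a simple integer division. The following Python sketch applies it with basic validation; the function name lanes_per_split_port is hypothetical, and the accepted breakout values are the 1x, 2x, 4x, and 8x options listed in this section.

```python
def lanes_per_split_port(full_port_lanes, breakout):
    """Apply split port width = full port width / breakout."""
    if breakout not in (1, 2, 4, 8):
        raise ValueError("breakout must be 1, 2, 4, or 8")
    if full_port_lanes % breakout != 0:
        raise ValueError("port width does not divide evenly by the breakout")
    return full_port_lanes // breakout


if __name__ == "__main__":
    # The example from the text: 8 lanes with a 2x breakout.
    print(lanes_per_split_port(8, 2))
```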
If you need to use a different port width than the default, you can set the number of lanes per port.
QSFP56-DD transceiver ports split into four interfaces (4x) default to one lane per interface for backwards compatibility. You can change the lane setting to two lanes per interface.
The following example command splits swp1 into two interfaces (2x) and sets the number of lanes per split port to 2.
cumulus@switch:~$ nv set interface swp1 link breakout 2x lanes-per-port 2
cumulus@switch:~$ nv config apply
You must configure the lanes-per-port at the same time as you configure the breakout. If you want to change the number of lanes per port after you configure a breakout, you must first unset the breakout with the nv unset interface <interface> link breakout and nv config apply commands, then reconfigure the breakout and the lanes with the nv set interface <interface> link breakout <breakout> lanes-per-port <lanes> command. For example:
cumulus@switch:~$ nv unset interface swp1 link breakout
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set interface swp1 link breakout 2x lanes-per-port 2
cumulus@switch:~$ nv config apply
Edit the /etc/cumulus/ports_width.conf file and add the number of lanes per split port you want to use, then reload switchd:
You must configure the lanes per port in the /etc/cumulus/ports_width.conf file before you configure the breakout in the /etc/cumulus/ports.conf file. If the ports.conf file already contains breakout configuration for a port, you must set the breakout back to 1x, then reload switchd. You can then set the desired lanes per port and reconfigure the breakout.
Remove the breakout interface configuration from the /etc/network/interfaces file, then run the ifreload -a command.
Configure Port Lanes
You can override the default behavior for supported speeds and platforms and specify the number of lanes for a port. For example, for the NVIDIA SN4700 switch, the default port speed is 50G (2 lanes, NRZ signaling mode) and 100G (4 lanes, NRZ signaling mode). You can override this setting to 50G (1 lane, PAM4 signaling mode) and 100G (2 lanes, PAM4 signaling mode).
This setting does not apply when auto-negotiation is on because Cumulus Linux advertises all supported speed options, including PAM4 and NRZ during auto-negotiation.
cumulus@switch:~$ nv set interface swp1 link speed 50G
cumulus@switch:~$ nv set interface swp1 link lanes 1
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set interface swp2 link speed 100G
cumulus@switch:~$ nv set interface swp2 link lanes 2
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file, then run the ifreload -a command.
Cumulus Linux includes a ports.conf validator that switchd runs automatically before the switch starts up to confirm that the file syntax is correct. You can run the validator manually to verify the syntax of the file whenever you make changes. The validator is useful if you want to copy a new ports.conf file to the switch with automation tools, then validate that it has the correct syntax.
To run the validator manually, run the /usr/cumulus/bin/validate-ports -f <file> command. For example:
To verify SFP settings, run the NVUE nv show interface <interface> pluggable command or the ethtool -m command. The following example shows the vendor, type and power output for swp1.
cumulus@switch:~$ sudo ethtool -m swp1 | egrep 'Vendor|type|power\s+:'
Transceiver type : 10G Ethernet: 10G Base-LR
Vendor name : FINISAR CORP.
Vendor OUI : 00:90:65
Vendor PN : FTLX2071D327
Vendor rev : A
Vendor SN : UY30DTX
Laser output power : 0.5230 mW / -2.81 dBm
Receiver signal average optical power : 0.7285 mW / -1.38 dBm
Considerations
Auto-negotiation and FEC
If auto-negotiation is off on 100G and 25G interfaces, you must set FEC to OFF, RS, or BaseR to match the neighbor. The FEC default setting of auto does not link up when auto-negotiation is off.
Auto-negotiation and Link Speed
If auto-negotiation is on and you set the link speed for a port, Cumulus Linux disables auto-negotiation and uses the port speed setting you configure.
1000BASE-T SFP Modules Supported Only on Certain 25G Platforms
The following 25G switches support 1000BASE-T SFP modules:
NVIDIA SN2410
NVIDIA SN2010
100G or faster switches do not support 1000BASE-T SFP modules.
NVIDIA SN2100 Switch and eth0 Link Speed
After rebooting the NVIDIA SN2100 switch, eth0 always has a speed of 100Mbps. If you bring the interface down and then back up again, the interface negotiates 1000Mbps. This only occurs the first time the interface comes up.
To work around this issue, add the following commands to the /etc/rc.local file to flap the interface automatically when the switch boots:
modprobe -r igb
sleep 20
modprobe igb
Delay in Reporting Interface as Operational Down
When you remove two transceivers simultaneously from a switch, both interfaces show the carrier down status immediately. However, it takes one second for the second interface to show the operational down status. In addition, the services on this interface also take an extra second to come down.
NVIDIA Spectrum-2 Switches and FEC Mode
The NVIDIA Spectrum-2 (25G) switch only supports RS FEC.
ifplugd is an Ethernet link-state monitoring daemon that executes scripts to configure an Ethernet device when you plug in or remove a cable. Follow the steps below to install and configure the ifplugd daemon.
Install ifplugd
You can install this package even if the switch does not connect to the internet. The package is in the cumulus-local-apt-archive repository on the Cumulus Linux image.
To install ifplugd:
Update the switch before installing the daemon:
cumulus@switch:~$ sudo -E apt-get update
Install the ifplugd package:
cumulus@switch:~$ sudo -E apt-get install ifplugd
Configure ifplugd
After you install ifplugd, you must edit two configuration files:
/etc/default/ifplugd
/etc/ifplugd/action.d/ifupdown
The example configuration below configures ifplugd to bring down all uplinks when the peer bond goes down in an MLAG environment.
Open /etc/default/ifplugd in a text editor and configure the file as appropriate. Add the peerbond name before you save the file.
Open the /etc/ifplugd/action.d/ifupdown file in a text editor. Configure the script, then save the file.
#!/bin/sh
set -e
case "$2" in
up)
        clagrole=$(clagctl | grep "Our Priority" | awk '{print $8}')
        if [ "$clagrole" = "secondary" ]
        then
            # List all the interfaces below to bring up when clag peerbond comes up.
            for interface in swp1 bond1 bond3 bond4
            do
                echo "bringing up : $interface"
                ip link set "$interface" up
            done
        fi
    ;;
down)
        clagrole=$(clagctl | grep "Our Priority" | awk '{print $8}')
        if [ "$clagrole" = "secondary" ]
        then
            # List all the interfaces below to bring down when clag peerbond goes down.
            for interface in swp1 bond1 bond3 bond4
            do
                echo "bringing down : $interface"
                ip link set "$interface" down
            done
        fi
    ;;
esac
Restart the ifplugd daemon to implement the changes:
The default shell for ifplugd is dash (/bin/sh) instead of bash, as it provides a faster and more nimble shell. However, dash contains fewer features than bash (for example, dash is unable to handle multiple uplinks).
Quality of Service
This section refers to frames for all internal QoS functionality. Unless explicitly stated, the actions are independent of layer 2 frames or layer 3 packets.
Cumulus Linux supports several different QoS features and standards including:
Cumulus Linux uses two configuration files for QoS:
/etc/cumulus/datapath/qos/qos_features.conf includes all standard QoS configuration, such as marking, shaping and flow control.
/etc/mlx/datapath/qos/qos_infra.conf includes all platform specific configurations, such as buffer allocations and Alpha values.
Cumulus Linux 5.0 and later does not use the traffic.conf and datapath.conf files but uses the qos_features.conf and qos_infra.conf files instead. Before upgrading Cumulus Linux, review your existing QoS configuration to determine the changes you need to make.
switchd and QoS
When you run Linux commands to configure QoS, you must apply QoS changes to the ASIC with the following command:
Unlike the restart command, the reload switchd.service command does not impact traffic forwarding except when the qos_infra.conf file changes, or when you change pause frame or priority flow control settings. These changes require modifications to the ASIC buffer and might result in momentary packet loss.
NVUE reloads the switchd service automatically. You do not have to run the reload switchd.service command to apply changes when configuring QoS with NVUE commands.
Classification
When a frame or packet arrives on the switch, Cumulus Linux maps it to an internal COS (switch priority) value. Cumulus Linux never writes this value to the frame or packet; it uses the value only to classify and schedule traffic internally through the switch.
You can define which values are trusted: 802.1p, DSCP, or both.
The following table describes the default classifications for various frame and switch priority configurations:
Setting                VLAN Tagged?  IP or Non-IP  Result
---------------------  ------------  ------------  ------------------------------------------------------------------
PCP (802.1p)           Yes           IP            Accept incoming 802.1p marking.
PCP (802.1p)           Yes           Non-IP        Accept incoming 802.1p marking.
PCP (802.1p)           No            IP            Use the default priority setting.
PCP (802.1p)           No            Non-IP        Use the default priority setting.
DSCP                   Yes           IP            Accept incoming DSCP IP header marking.
DSCP                   Yes           Non-IP        Use the default priority setting.
DSCP                   No            IP            Accept incoming DSCP IP header marking.
DSCP                   No            Non-IP        Use the default priority setting.
PCP (802.1p) and DSCP  Yes           IP            Accept incoming DSCP IP header marking.
PCP (802.1p) and DSCP  Yes           Non-IP        Accept incoming 802.1p marking.
PCP (802.1p) and DSCP  No            IP            Accept incoming DSCP IP header marking.
PCP (802.1p) and DSCP  No            Non-IP        Use the default priority setting.
port                   Either        Either        Ignore any existing markings and use the default priority setting.
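The default classification rules above reduce to a short decision procedure. The following Python sketch is one way to express them, not the switch implementation; the function name priority_source and the trust labels "pcp", "dscp", "both", and "port" are hypothetical names chosen for illustration.

```python
def priority_source(trust: str, vlan_tagged: bool, is_ip: bool) -> str:
    """Return the marking used for classification: 'dscp', '802.1p', or 'default'."""
    # Hypothetical trust labels mapped to the set of trusted markings.
    trusted = {
        "pcp": {"802.1p"},
        "dscp": {"dscp"},
        "both": {"802.1p", "dscp"},
        "port": set(),
    }[trust]
    # DSCP only exists in IP packets; when trusted, it wins over 802.1p.
    if "dscp" in trusted and is_ip:
        return "dscp"
    # 802.1p only exists in VLAN tagged frames.
    if "802.1p" in trusted and vlan_tagged:
        return "802.1p"
    # Otherwise, the default priority setting applies.
    return "default"


print(priority_source("both", vlan_tagged=True, is_ip=False))  # prints 802.1p
```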
If you use NVUE to configure QoS, you define which values are trusted with the nv set qos mapping <profile> trust l2 command (802.1p) or the nv set qos mapping <profile> trust l3 command (DSCP).
If you use Linux commands to configure QoS, you define which values are trusted in the /etc/cumulus/datapath/qos/qos_features.conf file by configuring the traffic.packet_priority_source_set setting to 802.1p or dscp.
Trust 802.1p Marking
To trust 802.1p marking:
When 802.1p (l2) is trusted, Cumulus Linux classifies these ingress 802.1p values to switch priority values:
Switch Priority  802.1p (PCP)
---------------  ------------
0                0
1                1
2                2
3                3
4                4
5                5
6                6
7                7
The PCP number is the incoming 802.1p marking; for example PCP 0 maps to switch priority 0.
To change the default profile to map PCP 0 to switch priority 4:
If you configure the trust to be l2 but do not specify any PCP to switch priority mappings, Cumulus Linux uses the default values.
To show the ingress 802.1p mapping for the default profile, run the nv show qos mapping default-global pcp command. To show the switch priority mapping for a specific PCP value in the default profile, run the nv show qos mapping default-global pcp <value> command. The following example shows that PCP 0 maps to switch priority 4:
You can map multiple ingress DSCP values to the same switch priority value. For example, to change the default profile to map ingress DSCP values 10, 21, and 36 to switch priority 0:
If you configure the trust to be l3 but do not specify any DSCP to switch priority mappings, Cumulus Linux uses the default values.
To show the DSCP mapping in the default profile, run the nv show qos mapping default-global dscp command. To show the switch priority mapping for a specific DSCP value in the default profile, run the nv show qos mapping default-global dscp <value> command. The following example shows that DSCP 22 maps to switch priority 4:
The # in the configuration file is a comment. By default, the file comments out the traffic.cos_*.priority_source.dscp lines. You must uncomment them for them to take effect.
The traffic.cos_ number is the switch priority value; for example DSCP values 0 through 7 map to switch priority 0. To map ingress DSCP 22 to switch priority 4, configure the traffic.cos_4.priority_source.dscp setting.
traffic.cos_4.priority_source.dscp = [22]
You can map multiple ingress DSCP values to the same switch priority value. For example, to map ingress DSCP values 10, 21, and 36 to switch priority 0:
traffic.cos_0.priority_source.dscp = [10,21,36]
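If you generate these lines with automation, a small helper can catch out-of-range or duplicated values before you write the file. The following Python sketch is illustrative only; render_dscp_mapping is a hypothetical name, the sketch assumes the 6-bit DSCP range of 0 to 63 and switch priorities 0 through 7, and rejecting a DSCP value that appears under two switch priorities is an assumption.

```python
def render_dscp_mapping(mapping: dict) -> list:
    """Render {switch_priority: [dscp, ...]} as qos_features.conf style lines."""
    seen = {}
    lines = []
    for priority in sorted(mapping):
        if not 0 <= priority <= 7:
            raise ValueError(f"switch priority {priority} is outside 0-7")
        for dscp in mapping[priority]:
            if not 0 <= dscp <= 63:
                raise ValueError(f"DSCP {dscp} is outside 0-63")
            if dscp in seen:
                # Assumption: one DSCP value maps to exactly one switch priority.
                raise ValueError(f"DSCP {dscp} is already mapped to {seen[dscp]}")
            seen[dscp] = priority
        values = ",".join(str(dscp) for dscp in mapping[priority])
        lines.append(f"traffic.cos_{priority}.priority_source.dscp = [{values}]")
    return lines


for line in render_dscp_mapping({4: [22], 0: [10, 21, 36]}):
    print(line)
```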
You can also choose not to use a switch priority value. This example does not use switch priority values 3 and 4:
To apply a custom DSCP profile to specific interfaces, see Port Groups.
Trust Port
You can assign all traffic to a switch priority regardless of the ingress marking.
The following commands assign all traffic to switch priority 3 regardless of the ingress marking.
cumulus@switch:~$ nv set qos mapping default-global trust port
cumulus@switch:~$ nv set qos mapping default-global port-default-sp 3
cumulus@switch:~$ nv config apply
To show the switch priority setting in the default profile for all traffic regardless of the ingress marking, run the nv show qos mapping default-global command:
cumulus@switch:~$ nv show qos mapping default-global
operational applied description
--------------- ----------- ------- ----------------------------
port-default-sp 3 3 Port Default Switch Priority
trust port port Port Trust configuration
In the /etc/cumulus/datapath/qos/qos_features.conf file, configure traffic.packet_priority_source_set = [port].
The traffic.port_default_priority setting defines the switch priority that all traffic uses.
To apply a custom profile to specific interfaces, see Port Groups.
Mark and Remark Traffic
You can mark or remark traffic in two ways:
Use ingress COS or DSCP to remark an existing 802.1p COS or DSCP value to a new value.
Use iptables to match packets and set 802.1p COS or DSCP values (policy-based marking).
802.1p or DSCP for Marking
To enable global remarking of 802.1p, DSCP or both 802.1p and DSCP values:
In the /etc/cumulus/datapath/qos/qos_features.conf file, modify the traffic.packet_priority_remark_set value to [802.1p], [dscp] or [802.1p,dscp]. For example, to enable the remarking of only 802.1p values:
traffic.packet_priority_remark_set = [802.1p]
You remark 802.1p or DSCP with the priority_remark.8021p or priority_remark.dscp setting. The switch priority (internal cos_) value determines the egress 802.1p or DSCP remarking. For example, to remark switch priority 0 to egress 802.1p 4:
traffic.cos_0.priority_remark.8021p = [4]
To remark switch priority 0 to egress DSCP 22:
traffic.cos_0.priority_remark.dscp = [22]
The # in the configuration file is a comment. The file comments out the traffic.cos_*.priority_remark.8021p and the traffic.cos_*.priority_remark.dscp lines by default. You must uncomment them to set the configuration.
You can remap multiple switch priority values to the same external 802.1p or DSCP value. For example, to map switch priority 1 and 2 to 802.1p 3:
To apply a custom profile to specific interfaces, see Port Groups.
Policy-based Marking
Cumulus Linux supports ACLs through ebtables, iptables or ip6tables for egress packet marking and remarking.
Cumulus Linux uses ebtables to mark layer 2, 802.1p COS values.
Cumulus Linux uses iptables to match IPv4 traffic and ip6tables to match IPv6 traffic for DSCP marking.
For more information on configuring and applying ACLs, refer to Netfilter - ACLs.
Mark Layer 2 COS
You must use ebtables to match and mark layer 2 bridged traffic. You can match traffic with any supported ebtables rule.
To set the new 802.1p COS value when traffic matches, use -A FORWARD -o <interface> -j setqos --set-cos <value>.
You can only set COS on a per-egress interface basis. Cumulus Linux does not support ebtables based matching on ingress.
The configured action always has the following conditions:
The rule is always part of the FORWARD chain.
The interface (<interface>) is a physical swp port.
The jump action is always setqos (lowercase).
The --set-cos value is a 802.1p COS value between 0 and 7.
For example, to set traffic leaving interface swp5 to 802.1p COS value 4:
-A FORWARD -o swp5 -j setqos --set-cos 4
Mark Layer 3 DSCP
You must use iptables (for IPv4 traffic) or ip6tables (for IPv6 traffic) to match and mark layer 3 traffic.
You can match traffic with any supported iptables or ip6tables rule.
To set the new COS or DSCP value when traffic matches, use -A FORWARD -o <interface> -j SETQOS [--set-dscp <value> | --set-cos <value> | --set-dscp-class <name>].
The configured action always has the following conditions:
The rule is always configured as part of the FORWARD chain.
The interface (<interface>) is a physical swp port.
The jump action is always SETQOS (uppercase).
You can configure COS markings with --set-cos and a value between 0 and 7 (inclusive).
--set-dscp supports decimal or hex DSCP values between 0 and 63 (0x3f).
--set-dscp-class supports standard DSCP naming, described in RFC 3260, including EF, BE, CS, and AF classes.
You can specify either --set-dscp or --set-dscp-class, but not both.
For example, to set traffic leaving interface swp5 to DSCP value 32:
-A FORWARD -o swp5 -j SETQOS --set-dscp 32
To set traffic leaving interface swp11 to DSCP class value CS6:
-A FORWARD -o swp11 -j SETQOS --set-dscp-class cs6
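When you template these rules, it helps to validate the values before writing them to a policy file. The following Python sketch builds a rule string in the form shown above. The function name setqos_rule is hypothetical, the sketch assumes the 6-bit DSCP range of 0 to 63, and it assumes that each rule carries exactly one of the three options because the syntax lists them as alternatives.

```python
def setqos_rule(interface, dscp=None, dscp_class=None, cos=None):
    """Build a SETQOS rule string for the FORWARD chain."""
    given = [value for value in (dscp, dscp_class, cos) if value is not None]
    # Assumption: exactly one marking option per rule.
    if len(given) != 1:
        raise ValueError("specify exactly one of dscp, dscp_class, or cos")
    if dscp is not None:
        if not 0 <= dscp <= 63:
            raise ValueError("DSCP must be between 0 and 63")
        option = f"--set-dscp {dscp}"
    elif dscp_class is not None:
        option = f"--set-dscp-class {dscp_class}"
    else:
        if not 0 <= cos <= 7:
            raise ValueError("COS must be between 0 and 7")
        option = f"--set-cos {cos}"
    return f"-A FORWARD -o {interface} -j SETQOS {option}"


print(setqos_rule("swp5", dscp=32))
print(setqos_rule("swp11", dscp_class="cs6"))
```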
Flow Control
Flow control influences data transmission to manage congestion along a network path.
Cumulus Linux supports the following flow control mechanisms:
Pause Frames (IEEE 802.3x), which send specialized Ethernet frames to an adjacent layer 2 switch to stop or pause all traffic on the link during times of congestion.
Priority Flow Control (PFC), an upgrade of Pause Frames that IEEE 802.1Qbb defines, which extends the pause frame concept to act on a per-COS value basis instead of on an entire link. A PFC pause frame indicates to the peer which specific COS value to pause, while other COS values or queues continue transmitting.
You cannot configure link pause and PFC on the same port.
Flow Control Buffers
Before configuring pause frames or PFC, configure the buffer pool memory allocated for lossless and lossy flows. The following example sets each to fifty percent:
cumulus@switch:~$ nv set qos traffic-pool default-lossless memory-percent 50
cumulus@switch:~$ nv set qos traffic-pool default-lossy memory-percent 50
cumulus@switch:~$ nv config apply
Cumulus Linux allocates 100% of the buffer memory to the default-lossy traffic pool by default. The total memory allocation across pools must not exceed 100%.
Edit the following lines in the /etc/mlx/datapath/qos/qos_infra.conf file:
Modify the existing ingress_service_pool.0.percent and egress_service_pool.0.percent buffer allocation. Change the existing ingress setting to ingress_service_pool.0.percent = 50. Change the existing egress setting to egress_service_pool.0.percent = 50.
Add the following lines to create a new service_pool, set flow_control to the service pool, and define buffer reservations:
Pause frames are an older flow control mechanism that causes all traffic on a link between two switches, or between a host and switch, to stop transmitting during times of congestion. Pause frames start and stop depending on buffer congestion. You configure pause frames on a per-direction, per-interface basis. You can receive pause frames to stop the switch from transmitting when requested, send pause frames to request neighboring devices to stop transmitting, or both.
NVIDIA recommends that you use Priority Flow Control (PFC) instead of pause frames.
Before configuring pause frames, you must first modify the switch buffer allocation. Refer to Flow Control Buffers.
Pause frame buffer calculation is a complex topic that IEEE 802.1Q-2012 defines. The calculation attempts to account for the delay between signaling congestion and the neighboring device receiving the signal. It includes the delay that the PHY and MAC layers introduce (interface delay) as well as the distance between end points (cable length).
Incorrect cable length settings can cause wasted buffer space (triggering congestion too early) or packet drops (congestion occurs before flow control activates).
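To see why cable length matters, consider how much data is still in flight on the wire while a pause request travels to the neighbor and takes effect. The following Python sketch is a rough, back-of-the-envelope estimate only. It is not the IEEE 802.1Q-2012 calculation or the formula that Cumulus Linux uses; the function name in_flight_bytes is hypothetical, the 5 ns per meter propagation delay is an assumed approximation, and the sketch ignores the interface delay.

```python
def in_flight_bytes(cable_length_m: int, link_speed_gbps: int, ns_per_meter: int = 5) -> int:
    """Estimate the bytes on the wire during one cable round trip."""
    # Assumption: about 5 ns of propagation delay per meter of cable.
    round_trip_ns = 2 * cable_length_m * ns_per_meter
    # Nanoseconds multiplied by Gbit/s gives bits.
    bits = round_trip_ns * link_speed_gbps
    # Round up to whole bytes.
    return -(-bits // 8)


# A 50 meter cable on a 100G link.
print(in_flight_bytes(50, 100))  # prints 6250
```

A longer configured cable length reserves more headroom for this in-flight data, which is why an overestimate wastes buffer space and an underestimate risks drops.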
The following example configuration:
Creates a profile (port group) called my_pause_ports.
Enables sending pause frames and disables receiving pause frames.
Sets the cable length to 50 meters.
Sets link pause on swp1 through swp4, and swp6.
Cumulus Linux also includes frame transmission start and stop threshold, and port buffer settings. NVIDIA recommends that you do not change these settings but, instead, let Cumulus Linux configure the settings dynamically. Only change the threshold and buffer settings if you are an advanced user who understands the buffer configuration requirements for lossless traffic to work seamlessly.
cumulus@switch:~$ nv set qos link-pause my_pause_ports tx enable
cumulus@switch:~$ nv set qos link-pause my_pause_ports rx disable
cumulus@switch:~$ nv set qos link-pause my_pause_ports cable-length 50
cumulus@switch:~$ nv set interface swp1-swp4,swp6 qos link-pause profile my_pause_ports
cumulus@switch:~$ nv config apply
To show the pause frame settings for a profile, run the nv show qos link-pause <profile> command.
Uncomment and edit the link_pause section of the /etc/cumulus/datapath/qos/qos_features.conf file.
To process pause frames, you must enable link pause on the specific interfaces.
Priority Flow Control (PFC)
Priority flow control extends the capabilities of pause frames by pausing the frames for a specific 802.1p value instead of stopping all traffic on a link. If a switch supports PFC and receives a PFC pause frame for a given 802.1p value, the switch stops transmitting frames from that queue, but continues transmitting frames for other queues.
You use PFC with RDMA over Converged Ethernet (RoCE). The RoCE section provides information on deploying PFC and ECN specifically for RoCE environments.
Before configuring PFC, first modify the switch buffer allocation according to Flow Control Buffers.
PFC buffer calculation is a complex topic defined in IEEE 802.1Q-2012, which attempts to incorporate the delay between signaling congestion and receiving the signal by the neighboring device. This calculation includes the delay that the PHY and MAC layers (called the interface delay) introduce as well as the distance between end points (cable length).
Incorrect cable length settings cause wasted buffer space (triggering congestion too early) or packet drops (congestion occurs before flow control activates).
To apply PFC settings on all ports, modify the default PFC profile (default-global).
The following example modifies the default profile and:
Enables PFC on switch priority 0.
Enables sending pause frames and disables receiving pause frames.
Sets the cable length to 50 meters.
Cumulus Linux also includes frame transmission start and stop threshold, and port buffer settings. NVIDIA recommends that you do not change these settings but, instead, let Cumulus Linux configure the settings dynamically. Only change the threshold and buffer settings if you are an advanced user who understands the buffer configuration requirements for lossless traffic to work seamlessly.
cumulus@switch:~$ nv set qos pfc default-global switch-priority 0
cumulus@switch:~$ nv set qos pfc default-global tx enable
cumulus@switch:~$ nv set qos pfc default-global rx disable
cumulus@switch:~$ nv set qos pfc default-global cable-length 50
cumulus@switch:~$ nv config apply
To show the PFC settings for the default profile, run the nv show qos pfc default-global command:
cumulus@switch:~$ nv show qos pfc default-global
operational applied description
----------------- ----------- ------- --------------------------------
cable-length 50 50 Cable Length (in meters)
port-buffer 25000 B 25000 B Port Buffer (in bytes)
rx disable disable PFC Rx State
tx enable enable PFC Tx State
xoff-threshold 10000 B 10000 B Xoff Threshold (in bytes)
xon-threshold 2000 B 2000 B Xon Threshold (in bytes)
[switch-priority] 0 0 Collection of switch priorities.
Edit the priority flow control section of the /etc/cumulus/datapath/qos/qos_features.conf file.
To apply a custom profile to specific interfaces, see Port Groups.
Congestion Control (ECN)
Explicit Congestion Notification (ECN) is an end-to-end layer 3 congestion control protocol. Defined by RFC 3168, ECN relies on two bits in the IPv4 Type of Service (ToS) field or the IPv6 Traffic Class field to signal congestion conditions. ECN requires one or both server endpoints to support ECN to be effective.
ECN operates by having a transit switch that marks packets between two end hosts.
The transmitting host indicates it is ECN-capable by setting the ECN bits in the outgoing IP header to 01 or 10.
If the buffer occupancy of a transit switch exceeds the configured minimum threshold of the buffer, the switch remarks the ECN bits to 11, indicating Congestion Encountered (CE).
The receiving host marks any reply packets, like a TCP-ACK, as CE (11).
The original transmitting host reduces its transmission rate.
When the switch buffer congestion falls below the configured minimum threshold of the buffer, the switch stops remarking ECN bits, leaving them at the 01 or 10 value that the transmitting host set.
A receiving host reflects this new ECN marking in the next reply so that the transmitting host resumes sending at normal speeds.
The default profile (default-global) enables ECN by default on egress queue 0 for all ports with the following settings:
A minimum buffer threshold of 150000 bytes. Random ECN marking starts when buffer congestion crosses this threshold and ECN marking probability ramps up as the queue depth increases towards the maximum threshold value.
A maximum buffer threshold of 1500000 bytes. Cumulus Linux marks all ECN-capable packets when buffer congestion crosses this threshold.
Random Early Detection (RED) disabled. ECN prevents packet drops in the network due to congestion by signaling hosts to transmit less. However, if congestion continues after ECN marking, packets drop after the switch buffer is full. By default, Cumulus Linux tail-drops packets when the buffer is full. You can enable RED to drop packets that are in the queue randomly instead of always dropping the last arriving packet. This might improve overall performance of TCP based flows.
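The relationship between the two thresholds and the marking probability can be sketched as follows. This Python example is a simplified model under stated assumptions, not the ASIC behavior: it assumes the probability ramps linearly between the minimum and maximum thresholds, and the function name ecn_mark_probability is hypothetical. The default arguments use the default-global threshold values listed above.

```python
def ecn_mark_probability(queue_bytes, min_threshold=150000,
                         max_threshold=1500000, max_probability=100):
    """Return the ECN marking probability (percent) for a given queue depth."""
    if queue_bytes <= min_threshold:
        # Below the minimum threshold, no packets are marked.
        return 0.0
    if queue_bytes >= max_threshold:
        # Above the maximum threshold, all ECN-capable packets are marked.
        return 100.0
    # Assumption: a linear ramp between the two thresholds.
    span = max_threshold - min_threshold
    return max_probability * (queue_bytes - min_threshold) / span


for depth in (100000, 825000, 2000000):
    print(depth, ecn_mark_probability(depth))
```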
The following example commands change the default ECN profile that applies to all ports. The commands enable ECN on egress queues 4, 5, and 7, set the minimum buffer threshold to 40000 and the maximum buffer threshold to 200000, and enable RED.
cumulus@switch:~$ nv set qos congestion-control default-global traffic-class 4,5,7 min-threshold 40000
cumulus@switch:~$ nv set qos congestion-control default-global traffic-class 4,5,7 max-threshold 200000
cumulus@switch:~$ nv set qos congestion-control default-global traffic-class 4,5,7 red enable
cumulus@switch:~$ nv config apply
The following example disables ECN bit marking in the default profile for all ports.
To show the ECN settings for the default profile, run the nv show qos congestion-control default-global command:
cumulus@switch:~$ nv show qos congestion-control default-global
operational applied description
-- ----------- ------- -----------
ECN Configurations
=====================
traffic-class ECN RED Min Th Max Th Probability
------------- ------ ------ ------- -------- -----------
4 enable enable 40000 B 200000 B 100
5 enable enable 40000 B 200000 B 100
7 enable enable 40000 B 200000 B 100
To show the ECN settings in the default profile for a specific egress queue, run the nv show qos congestion-control default-global traffic-class <value> command:
cumulus@switch:~$ nv show qos congestion-control default-global traffic-class 4
operational applied description
------------- ----------- -------- -----------------------------------
ecn enable enable Early Congestion Notification State
max-threshold 200000 B 200000 B Maximum Threshold (in bytes)
min-threshold 40000 B 40000 B Minimum Threshold (in bytes)
probability 100 100 Probability
red enable enable Random Early Detection State
Edit the Explicit Congestion Notification section of the /etc/cumulus/datapath/qos/qos_features.conf file.
To disable ECN bit marking, set ecn_enable to false. The following example disables ECN bit marking in the default profile for all ports.
...
default_ecn_red_conf.ecn_enable = false
...
To apply a custom ECN profile to specific interfaces, see Port Groups.
Egress Queues
Cumulus Linux supports eight egress queues to provide different classes of service. By default, switch priority values map directly to the matching egress queue. For example, switch priority value 0 maps to egress queue 0.
You can remap queues by changing the switch priority value to the corresponding queue value. You can map multiple switch priority values to a single egress queue.
You do not have to assign all egress queues.
The following command examples assign switch priority 2 to egress queue 7:
To show the egress queue mapping for a specific switch priority in the default profile, run the nv show qos egress-queue-mapping default-global switch-priority <value> command. The following example command shows that switch priority 2 maps to egress queue 7.
cumulus@switch:~$ nv show qos egress-queue-mapping default-global switch-priority 2
operational applied description
------------- ----------- ------- -------------
traffic-class 7 7 Traffic Class
You configure egress queues in the qos_infra.conf file.
Cumulus Linux supports 802.1Qaz, Enhanced Transmission Selection, which allows the switch to assign bandwidth to egress queues and then schedule the transmission of traffic from each queue. 802.1Qaz supports Priority Queuing.
Cumulus Linux provides a default egress scheduler that applies to all ports, where the bandwidth allocated to egress queues 0,2,4,6 is 12 percent and the bandwidth allocated to egress queues 1,3,5,7 is 13 percent. You can also apply a custom egress scheduler for specific ports; see Port Groups.
The following example modifies the default profile. The commands change the bandwidth allocation for egress queues 0, 1, 5, and 7 to strict, bandwidth allocation for egress queues 2 and 6 to 30 percent and bandwidth allocation for egress queues 3 and 4 to 20 percent.
The traffic-class value defines the egress queue where you want to assign bandwidth. For example, traffic-class 2 defines the bandwidth allocation for egress queue 2.
For each egress queue, you can define the mode as either dwrr or strict. In dwrr mode, you must define a bandwidth percent value between 1 and 100. If you do not specify a value for an egress queue, Cumulus Linux uses a DWRR value of 0 (no egress scheduling). The combined total of values you assign to bw_percent must be less than or equal to 100.
You configure the egress scheduling policy in the egress scheduling section of the /etc/cumulus/datapath/qos/qos_features.conf file.
The egr_queue_ value defines the egress queue where you want to assign bandwidth. For example, egr_queue_0 defines the bandwidth allocation for egress queue 0.
The bw_percent value defines the bandwidth allocation you want to assign to an egress queue. If you do not specify a value for an egress queue, there is no egress scheduling. If you specify a value of 0 for an egress queue, Cumulus Linux assigns strict priority mode to the egress queue and always processes it ahead of other queues. The combined total of values you assign to bw_percent must be less than or equal to 100.
strict mode does not define a maximum bandwidth allocation. This can lead to starvation of other queues.
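The bw_percent rules above lend themselves to a quick sanity check before you reload switchd. The following Python sketch is illustrative only; the function name check_egress_scheduler is hypothetical, and the input is a plain dictionary of egress queue number to bw_percent value rather than the real file format.

```python
def check_egress_scheduler(bw_percent: dict) -> dict:
    """Validate bw_percent per egress queue and describe the resulting mode."""
    for queue, percent in bw_percent.items():
        if not 0 <= queue <= 7:
            raise ValueError(f"egress queue {queue} is outside 0-7")
        if not 0 <= percent <= 100:
            raise ValueError(f"bw_percent {percent} is outside 0-100")
    if sum(bw_percent.values()) > 100:
        raise ValueError("combined bw_percent values exceed 100")
    # A value of 0 means strict priority; any other value is a DWRR weight.
    return {
        queue: "strict" if percent == 0 else f"dwrr {percent}%"
        for queue, percent in sorted(bw_percent.items())
    }


# Queues 0, 1, 5, and 7 strict; queues 2 and 6 at 30 percent; queues 3 and 4 at 20 percent.
example = {0: 0, 1: 0, 2: 30, 3: 20, 4: 20, 5: 0, 6: 30, 7: 0}
print(check_egress_scheduler(example))
```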
To apply a custom egress scheduler for specific ports, see Port Groups.
Policing and Shaping
Traffic shaping and policing control the rate at which the switch sends or receives traffic on a network to prevent congestion.
Traffic shaping typically occurs at egress and traffic policing at ingress.
Shaping
Traffic shaping allows a switch to send traffic at an average bitrate lower than the physical interface. Traffic shaping prevents a receiving device from dropping bursty traffic if the device is either not capable of that rate of traffic or has a policer that limits what it accepts.
Traffic shaping works by holding packets in the buffer and releasing them at specific time intervals.
Cumulus Linux supports two levels of hierarchical traffic shaping: one at the egress queue level and one at the port level. This allows for minimum and maximum bandwidth guarantees for each egress queue and a defined port traffic shaping rate.
The following example configuration:
Sets the profile name (port group) to use with the traffic shaping settings to shaper1.
Sets the minimum bandwidth for egress queue 2 to 100 kbps. The default minimum bandwidth is 0 kbps.
Sets the maximum bandwidth for egress queue 2 to 500 kbps. The default maximum bandwidth is 2147483647 kbps.
Sets the maximum packet shaper rate for the port group to 200000 kbps. The default maximum packet shaper rate is 2147483647 kbps.
Applies the traffic shaping configuration to swp1, swp2, swp3, and swp5.
When the minimum bandwidth for an egress queue is 0, there is no bandwidth guarantee for this queue.
The maximum bandwidth for an egress queue must not exceed the maximum packet shaper rate for the port group.
The maximum packet shaper rate for the port group must not exceed the physical interface speed.
Cumulus Linux only shapes traffic for the traffic classes in a profile that include shaper configuration.
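The two rate limits above (queue maximum within the port shaper rate, port shaper rate within the interface speed) can be expressed as a simple check. This is a minimal Python sketch, not a Cumulus Linux tool; the function name check_shaper and its parameters are hypothetical, and the 1 Gbps interface speed in the usage lines is an assumption.

```python
def check_shaper(queue_max_kbps, port_max_kbps, link_speed_kbps):
    """Return a list of violations of the shaping rules described above.

    queue_max_kbps maps an egress queue number to its maximum bandwidth.
    """
    problems = []
    for queue, max_kbps in sorted(queue_max_kbps.items()):
        # A queue maximum must not exceed the port group shaper rate.
        if max_kbps > port_max_kbps:
            problems.append(
                f"queue {queue}: maximum {max_kbps} kbps exceeds "
                f"port shaper rate {port_max_kbps} kbps"
            )
    # The port group shaper rate must not exceed the interface speed.
    if port_max_kbps > link_speed_kbps:
        problems.append(
            f"port shaper rate {port_max_kbps} kbps exceeds "
            f"interface speed {link_speed_kbps} kbps"
        )
    return problems


# Values from the shaper1 example; the 1 Gbps link speed is an assumption.
ok = check_shaper({2: 500}, port_max_kbps=200000, link_speed_kbps=1_000_000)
bad = check_shaper({2: 500000}, port_max_kbps=200000, link_speed_kbps=1_000_000)
print(ok)        # prints: []
print(len(bad))  # prints: 1
```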
Policing
Traffic policing prevents an interface from receiving more traffic than intended. You use policing to enforce a maximum transmission rate on an interface. The switch drops any traffic above the policing level.
Cumulus Linux supports both a single-rate policer and a dual-rate policer (tricolor policer).
You configure traffic policing using ebtables, iptables, or ip6tables rules.
For more information on configuring and applying ACLs, refer to Netfilter - ACLs.
Single-rate Policer
To configure a single-rate policer, use the iptables JUMP action -j POLICE.
Cumulus Linux supports the following iptables flags with a single-rate policer:
--set-mode [pkt | KB] - Define the policer to count packets or kilobytes.
--set-rate [<kbytes> | <packets>] - The maximum rate of traffic in kilobytes or packets per second.
--set-burst <kilobytes> - The allowed burst size in kilobytes.
For example, to create a policer that allows 400 packets per second with a 100-packet burst: -j POLICE --set-mode pkt --set-rate 400 --set-burst 100
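To see how a rate and a burst interact, a single-rate policer can be modeled as a token bucket. The Python sketch below is a conceptual model only: this guide does not describe the algorithm the switch hardware uses, and the class name PacketPolicer is hypothetical. It uses the 400 packets per second rate and 100 packet burst from the example above.

```python
class PacketPolicer:
    """Token bucket that counts packets, similar in spirit to --set-mode pkt."""

    def __init__(self, rate_pps, burst_pkts):
        self.rate = rate_pps              # tokens added per second
        self.burst = burst_pkts           # bucket depth
        self.tokens = float(burst_pkts)   # the bucket starts full
        self.last = 0.0                   # time of the previous packet

    def allow(self, now):
        """Return True if a packet arriving at time `now` (seconds) conforms."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


policer = PacketPolicer(rate_pps=400, burst_pkts=100)
# 150 packets arrive in the same instant: only the burst allowance passes.
passed = sum(policer.allow(now=0.0) for _ in range(150))
print(passed)  # prints: 100
```

In this model, sustained traffic above 400 packets per second is dropped once the 100 packet allowance is used up, and the allowance refills while the flow stays below the rate.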
Dual-rate Policer
To configure a dual-rate policer, use the iptables JUMP action -j TRICOLORPOLICE.
Cumulus Linux supports the following iptables flags with a dual-rate policer:
--set-color-mode [blind | aware] - The policing mode: single-rate (blind) or dual-rate (aware). The default is aware.
--set-cir <kbps> - The committed information rate (CIR) in kilobits per second.
--set-cbs <kbytes> - The committed burst size (CBS) in kilobytes.
--set-pir <kbps> - The peak information rate (PIR) in kilobits per second.
--set-ebs <kbytes> - The excess burst size (EBS) in kilobytes.
--set-conform-action-dscp <dscp value> - The numerical DSCP value to mark for traffic that conforms to the policer rate.
--set-exceed-action-dscp <dscp value> - The numerical DSCP value to mark for traffic that exceeds the policer rate.
--set-violate-action-dscp <dscp value> - The numerical DSCP value to mark for traffic that violates the policer rate.
--set-violate-action [accept | drop] - Cumulus Linux either accepts and remarks, or drops packets that violate the policer rate.
For example, to configure a dual-rate, three-color policer with a 3 Mbps CIR, 500 KB CBS, 10 Mbps PIR, and 1 MB EBS that drops packets that violate the policer:
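One way to assemble the policer portion of such a rule is sketched below in Python. The helper name tricolor_police_args is hypothetical; it only combines flags listed in the table above. Treating 1 MB as 1000 KB is an assumption, and the check that the PIR is not lower than the CIR comes from general two-rate policer practice rather than from this guide.

```python
def tricolor_police_args(cir_kbps, cbs_kbytes, pir_kbps, ebs_kbytes,
                         violate_action="drop"):
    """Build the '-j TRICOLORPOLICE ...' portion of an iptables rule."""
    # Assumption: a two-rate policer needs PIR >= CIR to be meaningful.
    if pir_kbps < cir_kbps:
        raise ValueError("PIR must not be lower than CIR")
    if violate_action not in ("accept", "drop"):
        raise ValueError("violate_action must be 'accept' or 'drop'")
    return (
        "-j TRICOLORPOLICE"
        f" --set-cir {cir_kbps} --set-cbs {cbs_kbytes}"
        f" --set-pir {pir_kbps} --set-ebs {ebs_kbytes}"
        f" --set-violate-action {violate_action}"
    )


# 3 Mbps CIR, 500 KB CBS, 10 Mbps PIR, 1 MB EBS (taken here as 1000 KB).
print(tricolor_police_args(3000, 500, 10000, 1000))
```

The returned string is only the target and its options; the match criteria and chain for the rule are left out because they depend on your ACL design.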
Port Groups
Cumulus Linux supports profiles (port groups) for all features, including ECN and RED. Profiles apply similar QoS configurations to a set of ports.
Configurations with a profile override the global settings for the ingress ports in the port group.
Ports not in a profile use the global settings.
To apply a profile to all ports, use the global profile.
Trust and Marking
You can use port groups to assign different profiles to different ports. A profile is a label for a group of configuration settings.
The following example configures two profiles. customer1 applies to swp1, swp4, and swp6. customer2 applies to swp5 and swp7.
cumulus@switch:~$ nv set qos mapping customer1 trust l3
cumulus@switch:~$ nv set qos mapping customer1 dscp 0 switch-priority 1-7
cumulus@switch:~$ nv set interface swp1,swp4,swp6 qos mapping profile customer1
cumulus@switch:~$ nv set qos mapping customer2 trust l2
cumulus@switch:~$ nv set qos mapping customer2 pcp 1 switch-priority 4
cumulus@switch:~$ nv set interface swp5,swp7 qos mapping profile customer2
cumulus@switch:~$ nv config apply
The following example configures the profile customports, which assigns traffic on swp1, swp2, and swp3 to switch priority 4 regardless of the ingress marking.
cumulus@switch:~$ nv set qos mapping customports trust port
cumulus@switch:~$ nv set qos mapping customports port-default-sp 4
cumulus@switch:~$ nv set interface swp1,swp2,swp3 qos mapping profile customports
cumulus@switch:~$ nv config apply
You define profiles with the source.port_group_list configuration in the qos_features.conf file. A source.port_group_list is one or more names used for a group of settings.
The following example configures two profiles. customer1 applies to swp1, swp4, and swp6. customer2 applies to swp5 and swp7.
source.port_group_list
The names of the port groups (profiles) you want to use. The following example defines customer1 and customer2: source.port_group_list = [customer1,customer2]
source.customer1.packet_priority_source_set
The ingress marking trust. In the following example, ingress DSCP values are trusted for group customer1: source.customer1.packet_priority_source_set = [dscp]
source.customer1.port_set
The set of ports on which to apply the ingress marking trust policy. In the following example, ports swp1, swp2, swp3, swp4, and swp6 are for customer1: source.customer1.port_set = swp1-swp4,swp6
source.customer1.port_default_priority
The default switch priority marking for unmarked or untrusted traffic. In the following example, Cumulus Linux marks unmarked traffic or layer 2 traffic for customer1 ports with switch priority 0: source.customer1.port_default_priority = 0
source.customer1.cos_0.priority_source
The ingress DSCP values to a switch priority value mapping for customer1. In the following example, the set of DSCP values from 0 through 7 map to switch priority 0: source.customer1.cos_0.priority_source.dscp = [0,1,2,3,4,5,6,7]
source.customer2.packet_priority_source_set
The ingress marking trust for customer2. In the following example, 802.1p is trusted: source.customer2.packet_priority_source_set = [802.1p]
source.customer2.port_set
The set of ports on which to apply the ingress marking trust policy. In the following example, swp5 and swp7 apply for customer2: source.customer2.port_set = swp5,swp7
source.customer2.port_default_priority
The default switch priority marking for unmarked or untrusted traffic. In the following example, Cumulus Linux marks unmarked tagged layer 2 traffic or unmarked VLAN tagged traffic for customer2 ports with switch priority 0: source.customer2.port_default_priority = 0
source.customer2.cos_1.priority_source
The mapping of ingress 802.1p values to a switch priority value for customer2. The following example maps ingress 802.1p value 4 to switch priority 1: source.customer2.cos_1.priority_source.8021p = [4]
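The port_set values above mix single ports and ranges, such as swp1-swp4,swp6. The Python sketch below shows one way to expand such a value into individual port names when you review a configuration. It is an illustration only: the function name expand_port_set is hypothetical, and the parser in Cumulus Linux might accept forms that this sketch does not handle.

```python
import re


def expand_port_set(port_set):
    """Expand a port_set value such as 'swp1-swp4,swp6' into port names."""
    ports = []
    # Optional surrounding brackets are removed before splitting on commas.
    for item in port_set.strip("[] ").split(","):
        item = item.strip()
        # A range looks like <prefix><first>-<prefix><last>, e.g. swp1-swp4.
        match = re.fullmatch(r"([a-z]+)(\d+)-\1(\d+)", item)
        if match:
            prefix = match.group(1)
            first, last = int(match.group(2)), int(match.group(3))
            ports.extend(f"{prefix}{n}" for n in range(first, last + 1))
        elif item:
            ports.append(item)
    return ports


print(expand_port_set("swp1-swp4,swp6"))
# prints: ['swp1', 'swp2', 'swp3', 'swp4', 'swp6']
```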
The following example configures the profile customports, which assigns traffic on swp1, swp2, and swp3 to switch priority 4 regardless of the ingress marking.
You can use profiles to remark 802.1p or DSCP on egress according to the switch priority (internal COS) value.
To change the marked value on a packet, the switch ASIC reads the enable or disable rewrite flag on the ingress port and refers to the mapping configuration on the egress port to change the marked value. To remark 802.1p or DSCP values, you have to enable the rewrite on the ingress port and configure the mapping on the egress port.
In the following example configuration, the switch changes the marked value only for packets that ingress on swp1 and egress on swp2. Packets that ingress on other ports and egress on swp2 keep their marked value. The commands map switch priority 0 and 1 to egress DSCP 37.
cumulus@switch:~$ nv set qos remark remark_port_group1 rewrite l3
cumulus@switch:~$ nv set interface swp1 qos remark profile remark_port_group1
cumulus@switch:~$ nv set qos remark remark_port_group2 switch-priority 0 dscp 37
cumulus@switch:~$ nv set qos remark remark_port_group2 switch-priority 1 dscp 37
cumulus@switch:~$ nv set interface swp2 qos remark profile remark_port_group2
cumulus@switch:~$ nv config apply
You define these profiles with remark.port_group_list in the /etc/cumulus/datapath/qos/qos_features.conf file. The name is a label for configuration settings.
You can use port groups with egress scheduling weights to assign different profiles to different egress ports.
In the following example, the profile list2 applies to swp1, swp3, and swp18. list2 only assigns weights to queues 2, 5, and 6, and schedules the other queues on a best-effort basis when there is no congestion in queues 2, 5, or 6. list1 applies to swp2 and assigns weights to all queues.
You define port groups with egress_sched.port_group_list in the /etc/cumulus/datapath/qos/qos_features.conf file. An egress_sched.port_group_list includes the names for the group settings. The name is a label (profile) for the configuration settings.
egress_sched.port_group_list
The names of the port groups (labels) to use. The following example defines port groups list1 and list2: egress_sched.port_group_list = [list1,list2]
egress_sched.list1.port_set
The interfaces on which you want to apply the port group. egress_sched.list1.port_set = swp2
egress_sched.list1.egr_queue_0.bw_percent
The percentage of bandwidth for egress queue 0. egress_sched.list1.egr_queue_0.bw_percent = 10
egress_sched.list1.egr_queue_1.bw_percent
The percentage of bandwidth for egress queue 1. egress_sched.list1.egr_queue_1.bw_percent = 20
egress_sched.list1.egr_queue_2.bw_percent
The percentage of bandwidth for egress queue 2. egress_sched.list1.egr_queue_2.bw_percent = 30
egress_sched.list1.egr_queue_3.bw_percent
The percentage of bandwidth for egress queue 3. egress_sched.list1.egr_queue_3.bw_percent = 10
egress_sched.list1.egr_queue_4.bw_percent
The percentage of bandwidth for egress queue 4. egress_sched.list1.egr_queue_4.bw_percent = 10
egress_sched.list1.egr_queue_5.bw_percent
The percentage of bandwidth for egress queue 5. egress_sched.list1.egr_queue_5.bw_percent = 10
egress_sched.list1.egr_queue_6.bw_percent
The percentage of bandwidth for egress queue 6. egress_sched.list1.egr_queue_6.bw_percent = 10
egress_sched.list1.egr_queue_7.bw_percent
The percentage of bandwidth for egress queue 7. 0 indicates a strict priority queue: egress_sched.list1.egr_queue_7.bw_percent = 0
egress_sched.list2.port_set
The interfaces you want to apply to the port group. The following example applies swp1, swp3 and swp18 to port group list2: egress_sched.list2.port_set = [swp1,swp3,swp18]
egress_sched.list2.egr_queue_2.bw_percent
The percentage of bandwidth for egress queue 2. egress_sched.list2.egr_queue_2.bw_percent = 50
egress_sched.list2.egr_queue_5.bw_percent
The percentage of bandwidth for egress queue 5. egress_sched.list2.egr_queue_5.bw_percent = 50
egress_sched.list2.egr_queue_6.bw_percent
The percentage of bandwidth for egress queue 6. 0 indicates a strict priority queue: egress_sched.list2.egr_queue_6.bw_percent = 0
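The settings above follow a regular naming pattern, so the lines for a port group can be generated from a small data structure. The following Python sketch is a hedged illustration, not a Cumulus Linux utility; the function name render_egress_sched and the dictionary layout are assumptions. It emits only the keys described above and rejects a group whose weights total more than 100.

```python
def render_egress_sched(groups):
    """Render egress_sched lines for qos_features.conf.

    groups maps a port group name to a dict with a 'port_set' string and a
    'bw_percent' dict of egress queue number -> percentage (0 = strict).
    """
    lines = [f"egress_sched.port_group_list = [{','.join(groups)}]"]
    for name, cfg in groups.items():
        if sum(cfg["bw_percent"].values()) > 100:
            raise ValueError(f"{name}: bw_percent values exceed 100")
        lines.append(f"egress_sched.{name}.port_set = {cfg['port_set']}")
        for queue, pct in sorted(cfg["bw_percent"].items()):
            lines.append(
                f"egress_sched.{name}.egr_queue_{queue}.bw_percent = {pct}"
            )
    return "\n".join(lines)


# The list2 port group from the example above.
text = render_egress_sched(
    {"list2": {"port_set": "[swp1,swp3,swp18]", "bw_percent": {2: 50, 5: 50, 6: 0}}}
)
print(text)
```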
PFC
To set priority flow control on a group of ports, you create a profile to define the egress queues that support sending PFC pause frames and define the set of interfaces to which you want to apply PFC pause frame configuration. Cumulus Linux automatically enables PFC frame transmit and PFC frame receive, and derives all other PFC settings, such as the buffer limits that trigger PFC frame transmission to start and stop, the amount of reserved buffer space, and the cable length.
The following example applies a PFC profile called my_pfc_ports for egress queues 3 and 5 on swp1, swp2, swp3, swp4, and swp6.
The following example applies a PFC profile called my_pfc_ports2 for egress queue 0 on swp1. The commands disable PFC frame receive, and set the buffer limit that triggers PFC frame transmission to stop to 1500 bytes and to start to 1000 bytes. The commands also set the amount of reserved buffer space to 2000 bytes, and the cable length to 50 meters:
cumulus@switch:~$ nv set qos pfc my_pfc_ports2 switch-priority 0
cumulus@switch:~$ nv set qos pfc my_pfc_ports2 xoff-threshold 1500
cumulus@switch:~$ nv set qos pfc my_pfc_ports2 xon-threshold 1000
cumulus@switch:~$ nv set qos pfc my_pfc_ports2 tx enable
cumulus@switch:~$ nv set qos pfc my_pfc_ports2 rx disable
cumulus@switch:~$ nv set qos pfc my_pfc_ports2 port-buffer 2000
cumulus@switch:~$ nv set qos pfc my_pfc_ports2 cable-length 50
cumulus@switch:~$ nv set interface swp1 qos pfc profile my_pfc_ports2
cumulus@switch:~$ nv config apply
All PFC commands
Command
Description
nv set qos pfc <profile> port-buffer <value>
The amount of reserved buffer space (from the global shared buffer) for the interfaces defined in the port group list. The following example sets the amount of reserved buffer space to 25000 bytes: nv set qos pfc my_pfc_ports port-buffer 25000
nv set qos pfc <profile> xoff-threshold <value>
The amount of reserved buffer that the switch must consume before sending a PFC pause frame out of the set of interfaces in the port group list. The following example sends PFC pause frames after consuming 20000 bytes of reserved buffer: nv set qos pfc my_pfc_ports xoff-threshold 20000
nv set qos pfc <profile> xon-threshold <value>
The number of bytes below the xoff threshold that the buffer consumption must drop below before sending PFC pause frames stops. In the following example, with the xoff threshold of 20000 bytes set above, the buffer consumption must reduce by 1000 bytes (to 19000 bytes) before PFC pause frames stop: nv set qos pfc my_pfc_ports xon-threshold 1000
nv set qos pfc <profile> tx enable nv set qos pfc <profile> tx disable
Enables or disables sending PFC pause frames. The default value is enable. The following example disables sending PFC pause frames: nv set qos pfc my_pfc_ports tx disable
nv set qos pfc <profile> rx enable nv set qos pfc <profile> rx disable
Enables or disables receiving PFC pause frames. You do not need to define the COS values for rx enable. The switch receives any COS value. The default value is enable. The following example disables receiving PFC pause frames: nv set qos pfc my_pfc_ports rx disable
nv set qos pfc <profile> cable-length <value>
The length, in meters, of the cable that attaches to the ports. Cumulus Linux uses this value internally to determine the latency between generating a PFC pause frame and receiving the PFC pause frame. The default is 10 meters. The following example sets the cable length to 5 meters: nv set qos pfc my_pfc_ports cable-length 5
Edit the priority flow control section of the /etc/cumulus/datapath/qos/qos_features.conf file.
The following example applies a PFC profile called my_pfc_ports for egress queues 3 and 5 on swp1, swp2, swp3, swp4, and swp6.
The following example applies a PFC profile called my_pfc_ports2 for egress queue 0 on swp1. The commands also disable PFC frame receive, and set the xoff-size to 1500 bytes, the xon-size to 1000 bytes, the headroom to 2000 bytes, and the cable length to 10 meters:
pfc.my_pfc_ports.port_buffer_bytes
The amount of reserved buffer space (from the global shared buffer) for the interfaces defined in the port group list. The following example sets the amount of reserved buffer space to 25000 bytes: pfc.my_pfc_ports.port_buffer_bytes = 25000
pfc.my_pfc_ports.xoff_size
The amount of reserved buffer that the switch must consume before sending a PFC pause frame out the set of interfaces in the port group list. The following example sends PFC pause frames after consuming 10000 bytes of reserved buffer: pfc.my_pfc_ports.xoff_size = 10000
pfc.my_pfc_ports.xon_delta
The number of bytes below the xoff threshold that the buffer consumption must drop below before sending PFC pause frames stops. In the following example, the buffer consumption must reduce by 2000 bytes (to 8000 bytes) before PFC pause frames stop: pfc.my_pfc_ports.xon_delta = 2000
pfc.my_pfc_ports.tx_enable
Enables (true) or disables (false) sending PFC pause frames. The default value is true. The following example enables sending PFC pause frames: pfc.my_pfc_ports.tx_enable = true
pfc.my_pfc_ports.rx_enable
Enables (true) or disables (false) receiving PFC pause frames. You do not need to define the COS values for rx_enable. The switch receives any COS value. The default value is true. The following example enables receiving PFC pause frames: pfc.my_pfc_ports.rx_enable = true
pfc.my_pfc_ports.cable_length
The length, in meters, of the cable that attaches to the port in the port group list. Cumulus Linux uses this value internally to determine the latency between generating a PFC pause frame and receiving the PFC pause frame. The default is 10 meters. In this example, the cable is 5 meters: pfc.my_pfc_ports.cable_length = 5
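The xoff_size and xon_delta settings together form a hysteresis band: pause frames start at the xoff threshold and stop only after consumption falls below the xoff threshold minus the delta. The Python sketch below models that behavior with the example values above (10000 and 2000 bytes). It is a conceptual model only; the function name pause_states is hypothetical and the exact comparison the ASIC performs at the boundary is an assumption.

```python
def pause_states(usage_samples, xoff_bytes, xon_delta_bytes):
    """Return, for each buffer usage sample, whether PFC pausing is active."""
    resume_below = xoff_bytes - xon_delta_bytes  # 10000 - 2000 = 8000
    pausing = False
    states = []
    for used in usage_samples:
        if not pausing and used >= xoff_bytes:
            pausing = True       # xoff threshold reached: start pausing
        elif pausing and used < resume_below:
            pausing = False      # dropped below xoff - xon_delta: stop pausing
        states.append(pausing)
    return states


samples = [5000, 10000, 9000, 8000, 7999, 9500]
print(pause_states(samples, xoff_bytes=10000, xon_delta_bytes=2000))
# prints: [False, True, True, True, False, False]
```

The gap between the two thresholds keeps the port from toggling between pause and resume when buffer usage hovers near a single value.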
ECN
You can create ECN profiles and assign them to different ports.
The following example creates a custom ECN profile called my-red-profile for egress queue (traffic-class) 1 and 2. The commands set the minimum buffer threshold to 40000 bytes, maximum buffer threshold to 200000 bytes, and the probability to 10. The commands also enable RED and apply the ECN profile to swp1 and swp2.
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 min-threshold-bytes 40000
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 max-threshold-bytes 200000
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 probability 10
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 red enable
cumulus@switch:~$ nv set interface swp1,swp2 qos congestion-control my-red-profile
cumulus@switch:~$ nv config apply
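To build intuition for how the minimum threshold, maximum threshold, and probability relate, the following Python sketch models the classic RED curve, where the marking probability rises linearly between the two thresholds. This is an assumption about the general RED algorithm, not a description of the switch ASIC, and the function name mark_probability is hypothetical.

```python
def mark_probability(queue_bytes, min_threshold, max_threshold, probability_pct):
    """Return the chance (0.0 to 1.0) that a packet is ECN marked.

    Assumes the classic RED ramp: no marking below the minimum threshold,
    a linear rise up to probability_pct at the maximum threshold, and
    marking of every packet at or above the maximum threshold.
    """
    if queue_bytes < min_threshold:
        return 0.0
    if queue_bytes >= max_threshold:
        return 1.0
    ramp = (queue_bytes - min_threshold) / (max_threshold - min_threshold)
    return ramp * probability_pct / 100.0


# Thresholds from the my-red-profile example: 40000 and 200000 bytes, probability 10.
print(mark_probability(120000, 40000, 200000, 10))
```

With these values, a queue depth halfway between the thresholds gives a marking probability of half the configured value.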
You can configure different thresholds and probability values for different traffic classes in a custom profile:
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 min-threshold-bytes 40000
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 max-threshold-bytes 200000
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 probability 10
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 1,2 red enable
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 4 min-threshold-bytes 30000
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 4 max-threshold-bytes 150000
cumulus@switch:~$ nv set qos congestion-control my-red-profile traffic-class 4 probability 80
cumulus@switch:~$ nv set interface swp1,swp2 qos congestion-control my-red-profile
cumulus@switch:~$ nv config apply
You can disable ECN bit marking for an ECN profile. The following example disables ECN bit marking in the my-red-profile profile:
Edit the Explicit Congestion Notification section of the /etc/cumulus/datapath/qos/qos_features.conf file.
The following example creates a custom ECN profile called my-red-profile for egress queue 1 and 2, with a minimum buffer threshold of 40000 bytes, maximum buffer threshold of 200000 bytes, and a probability of 10. The commands also enable RED and apply the ECN profile to swp1 and swp2.
You can only have a single lossless pool configured on the switch at a time. Configure the roce-lossless pool when you are using RoCE, otherwise configure the default-lossless pool.
You can configure multiple lossy pools concurrently.
You configure a traffic pool by associating switch priorities and defining the buffer memory percentages allocated to the pools. The following example associates switch priority 2 and allocates a memory percentage of 30 for the mc-lossy pool:
cumulus@switch:~$ nv set qos traffic-pool default-lossy switch-priority 0,1,3,4,5,6,7
cumulus@switch:~$ nv set qos traffic-pool default-lossy memory-percent 70
cumulus@switch:~$ nv set qos traffic-pool mc-lossy switch-priority 2
cumulus@switch:~$ nv set qos traffic-pool mc-lossy memory-percent 30
cumulus@switch:~$ nv config apply
Configure the following settings in the /etc/mlx/datapath/qos/qos_infra.conf file:
For additional default-lossless and RoCE pool examples, see Flow Control Buffers and RoCE. You can view traffic-pool configuration with the nv show qos traffic-pool <pool name> command:
You can use NVUE commands to tune advanced buffer properties in addition to the supported traffic pool configurations. Advanced buffer configuration can override the base traffic-pool profiles configured on the system.
You can only configure advanced buffer settings for the default-global profile.
Buffer Regions
You can adjust advanced buffer settings with the following NVUE command:
nv set qos advance-buffer-config default-global <buffer> <priority-group | property> <value>
You can adjust settings for the following supported buffer regions and properties:
Buffers
Supported Property Values
ingress-lossy-buffer
Cumulus Linux supports the following properties for the bulk, control, and service[1-6] priority groups:
name - The priority group alias name.
reserved - The reserved buffer allocation in bytes.
service-pool - Service pool mapping.
shared-alpha - The dynamic shared buffer alpha allocation.
shared-bytes - The static shared buffer allocation in bytes.
switch-priority - Switch priority values.
egress-lossless-buffer
reserved - The reserved buffer allocation in bytes.
service-pool - Service pool mapping.
shared-alpha - The dynamic shared buffer alpha allocation.
shared-bytes - The static shared buffer allocation in bytes.
ingress-lossless-buffer
service-pool - Service pool mapping.
shared-alpha - The dynamic shared buffer alpha allocation.
shared-bytes - The static shared buffer allocation in bytes.
egress-lossy-buffer
multicast-port - Multicast port reserved or shared-bytes allocation in bytes.
multicast-switch-priority [0-7] - Set the reserved, service-pool, shared-alpha, or shared-bytes properties for each multicast switch priority.
traffic-class [0-15] - Set the reserved, service-pool, shared-alpha, or shared-bytes properties for each traffic class.
Configure shared-bytes for buffer regions mapped to static service pools, and shared-alpha for buffer regions mapped to dynamic service pools.
The shared buffer alpha value determines the proportion of available shared memory allocated across buffer regions. Regions with higher alpha values receive a higher proportion of available shared buffer memory. The following example changes the ingress-lossless-buffer shared alpha value to alpha_2 when using RoCE lossless mode:
You can configure ingress and egress service pool profile properties with the following NVUE commands:
nv set qos advance-buffer-config default-global ingress-pool <pool-id> <property> <value>
nv set qos advance-buffer-config default-global egress-pool <pool-id> <property> <value>
You can adjust the following properties for each pool:
infinite - The pool infinite flag.
memory-percent - The pool memory percent allocation.
mode - The pool mode: static or dynamic.
reserved - The reserved buffer allocation in bytes.
shared-alpha - The dynamic shared buffer alpha allocation.
shared-bytes - The static shared buffer allocation in bytes.
A relationship exists between the default traffic pools and the advanced buffer configuration settings.
Use caution when configuring advanced buffer settings. NVUE presents a warning if you attempt to apply incompatible traffic pool and advanced buffer configurations. NVUE performs the following validation checks before applying advanced buffer configurations:
You must map all switch priorities (0-7) to a priority group. You can map more than one switch priority to the same priority group.
The sum of memory-percent values across all ingress pools must be less than or equal to 100 percent.
The sum of memory-percent values across all egress pools must be less than or equal to 100 percent.
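The three checks above can be approximated offline before you apply a change. The Python sketch below is illustrative and does not reproduce the NVUE validation code; the function name validate_buffer_config and the data layout are assumptions.

```python
def validate_buffer_config(priority_groups, ingress_pools, egress_pools):
    """Return a list of problems found by the three checks described above.

    priority_groups maps a priority group name to its switch priorities.
    ingress_pools and egress_pools map a pool ID to its memory-percent.
    """
    errors = []
    # Every switch priority (0-7) must belong to some priority group.
    mapped = {sp for sps in priority_groups.values() for sp in sps}
    missing = sorted(set(range(8)) - mapped)
    if missing:
        errors.append(f"switch priorities not mapped to a priority group: {missing}")
    # Memory percentages must not exceed 100 in either direction.
    for direction, pools in (("ingress", ingress_pools), ("egress", egress_pools)):
        total = sum(pools.values())
        if total > 100:
            errors.append(f"{direction} pool memory-percent totals {total}, above 100")
    return errors


groups = {"bulk": [0, 1, 3, 4, 5, 6, 7], "service2": [2]}
print(validate_buffer_config(groups, {0: 80, 3: 20}, {0: 100}))  # prints: []
print(len(validate_buffer_config({"bulk": [0, 1]}, {0: 80, 3: 30}, {0: 100})))  # prints: 2
```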
Reference the table below to view the mappings between the default traffic pool and advanced buffer properties:
Default Traffic Pool | Default Traffic Pool Properties | Advanced Buffer Region or Service Pool | Advanced Buffer Properties
default-lossy | memory-percent | ingress-pool 0, egress-pool 0 | memory-percent
default-lossy | switch-priority | ingress-lossy-buffer | priority-group bulk switch-priority
default-lossless | memory-percent | ingress-pool 1, egress-pool 1 | memory-percent
roce-lossless | memory-percent | ingress-pool 1, egress-pool 1 | memory-percent
mc-lossy | memory-percent | ingress-pool 2, egress-pool 2 | memory-percent
mc-lossy | switch-priority | ingress-lossy-buffer | priority-group service2 switch-priority
For example, to assign 20 percent of memory to a new static service pool, you must allow 20 percent of memory to be available from the default traffic pools. The following commands reduce the default-lossy traffic pool to 80 percent memory, allowing you to allocate the memory to ingress-pool 3:
Cumulus Linux provides a syntax checker for the qos_features.conf and qos_infra.conf files to check for errors, such as missing parameters or invalid parameter labels and values.
The syntax checker runs automatically with every switchd reload.
You can run the syntax checker manually from the command line with the cl-consistency-check --datapath-syntax-check command. If errors exist, they write to stderr by default. If you run the command with -q, errors write to the /var/log/switchd.log file.
The cl-consistency-check --datapath-syntax-check command takes the following options:
-h - Displays this list of command options.
-q - Runs the command in quiet mode. Errors write to the /var/log/switchd.log file instead of stderr.
-qi - Runs the syntax checker against a specified qos_infra.conf file.
-qf - Runs the syntax checker against a specified qos_features.conf file.
By default the syntax checker assumes:
qos_infra.conf is in /etc/mlx/datapath/qos/qos_infra.conf
qos_features.conf is in /etc/cumulus/datapath/qos/qos_features.conf
You can run the syntax checker when switchd is either running or stopped.
Show QoS Counters
NVUE provides the following commands to show QoS statistics for an interface:
nv show interface <interface> counters qos - Shows all QoS statistics for a specific interface.
nv show interface <interface> counters qos egress-queue-stats - Shows QoS egress queue statistics for a specific interface.
nv show interface <interface> counters qos ingress-buffer-stats - Shows QoS ingress buffer statistics for a specific interface.
nv show interface <interface> counters qos pfc-stats - Shows QoS PFC statistics for a specific interface.
nv show interface <interface> counters qos port-stats - Shows QoS port statistics for a specific interface.
The following example shows all QoS statistics for swp1:
If you configure both breakout ports and QoS settings for breakout interfaces at the same time, errors might occur.
You must apply breakout port configuration before QoS configuration on the breakout ports. If you are using NVUE, configure breakout ports and perform an nv config apply first, then configure QoS settings on the breakout ports followed by another nv config apply. If you are using Linux file configuration, modify ports.conf first, reload switchd, then modify qos_features.conf and reload switchd a second time.
QoS Settings on Bond Member Interfaces
If you use Linux commands to apply QoS settings on bond member interfaces instead of the logical bond interface, the members must share identical QoS configuration. If the configuration is not identical across the bond members, the bond inherits the settings of the last interface you apply to the bond.
If QoS settings do not match, switchd reload fails; however, switchd restart does not fail.
NVUE rejects QoS configurations on bond member interfaces and shows an error when you try to apply the configurations; you must apply all QoS configuration on logical bond interfaces.
Cut-through Switching
You cannot disable cut-through switching on Spectrum ASICs. Cumulus Linux ignores the cut_through_enable = false setting in the qos_features.conf file.
RDMA over Converged Ethernet - RoCE
RoCE enables you to write to compute or storage elements using RDMA over an Ethernet network instead of using host CPUs. RoCE relies on ECN and PFC to operate. Cumulus Linux supports features that can enable lossless Ethernet for RoCE environments.
While Cumulus Linux can support RoCE environments, the end hosts must support the RoCE protocol.
RoCE helps you obtain a converged network, where all services run over the Ethernet infrastructure, including InfiniBand applications.
Default RoCE Mode Configuration
The following table shows the default RoCE configuration for lossy and lossless mode.
Configuration | Lossy Mode | Lossless Mode
Port trust mode | YES | YES
Port switch priority to traffic class mapping: switch priority 3 to traffic class 3 (RoCE), switch priority 6 to traffic class 6 (CNP), other switch priority to traffic class 0 | YES | YES
Port ETS: traffic class 6 (CNP) - Strict, traffic class 3 (RoCE) - WRR 50%, traffic class 0 (other traffic) - WRR 50% | YES | YES
Port ECN absolute threshold is 1501500 bytes for traffic class 3 (RoCE) | YES | YES
LLDP and Application TLV (RoCE) (UDP, Protocol:4791, Priority: 3) | YES | YES
Enable PFC on switch priority 3 (RoCE) | NO | YES
Switch priority 3 allocated to RoCE lossless traffic pool | NO | YES
Enable RDMA over Converged Ethernet lossless (with PFC and ECN)
RoCE uses the InfiniBand (IB) Protocol over converged Ethernet. The IB global route header rides directly on top of the Ethernet header. The lossless Ethernet layer handles congestion hop by hop.
To configure RoCE with PFC and ECN:
cumulus@switch:~$ nv set qos roce
cumulus@switch:~$ nv config apply
NVUE defaults to roce mode lossless. The commands nv set qos roce and nv set qos roce mode lossless are equivalent.
If you enable mode lossy, configuring nv set qos roce without a mode does not change the RoCE mode. To change to lossless, you must configure mode lossless.
Link pause is another way to provide lossless Ethernet; however, PFC is the preferred method. PFC allows more granular control by pausing the traffic flow for a given CoS group instead of the entire link.
Enable RDMA over Converged Ethernet lossy (with ECN)
RoCEv2 requires flow control for lossless Ethernet. RoCEv2 uses the InfiniBand (IB) Transport Protocol over UDP. The IB transport protocol includes an end-to-end reliable delivery mechanism and has its own sender notification mechanism.
RoCEv2 congestion management uses RFC 3168 to signal congestion experienced to the receiver. The receiver generates an RoCEv2 congestion notification packet directed to the source of the packet.
To configure RoCE with ECN:
cumulus@switch:~$ nv set qos roce mode lossy
cumulus@switch:~$ nv config apply
Remove RoCE Configuration
To remove RoCE configurations:
cumulus@switch:~$ nv unset qos roce
cumulus@switch:~$ nv config apply
Verify RoCE Configuration
You can verify RoCE configuration with NVUE nv show commands.
To show detailed information about the configured buffers, utilization and DSCP markings, run the nv show qos roce command:
cumulus@switch:mgmt:~$ nv show qos roce
operational applied description
------------------ ----------- -------- ------------------------------------------------------
enable on Turn the feature 'on' or 'off'. The default is 'off'.
mode lossless lossless Roce Mode
cable-length 100 100 Cable Length(in meters) for Roce Lossless Config
congestion-control
congestion-mode ECN Congestion config mode
enabled-tc 0,3 Congestion config enabled Traffic Class
max-threshold 1.43 MB Congestion config max-threshold
min-threshold 146.48 KB Congestion config min-threshold
pfc
pfc-priority 3 switch-prio on which PFC is enabled
rx-enabled enabled PFC Rx Enabled status
tx-enabled enabled PFC Tx Enabled status
trust
trust-mode pcp,dscp Trust Setting on the port for packet classification
RoCE PCP/DSCP->SP mapping configurations
===========================================
pcp dscp switch-prio
-- --- ----------------------- -----------
0 0 0,1,2,3,4,5,6,7 0
1 1 8,9,10,11,12,13,14,15 1
2 2 16,17,18,19,20,21,22,23 2
3 3 24,25,26,27,28,29,30,31 3
4 4 32,33,34,35,36,37,38,39 4
5 5 40,41,42,43,44,45,46,47 5
6 6 48,49,50,51,52,53,54,55 6
7 7 56,57,58,59,60,61,62,63 7
RoCE SP->TC mapping and ETS configurations
=============================================
switch-prio traffic-class scheduler-weight
-- ----------- ------------- ----------------
0 0 0 DWRR-50%
1 1 0 DWRR-50%
2 2 0 DWRR-50%
3 3 3 DWRR-50%
4 4 0 DWRR-50%
5 5 0 DWRR-50%
6 6 6 strict-priority
7 7 0 DWRR-50%
RoCE pool config
===================
name mode size switch-priorities traffic-class
-- --------------------- ------- ----- ----------------- -------------
0 lossy-default-ingress Dynamic 50.0% 0,1,2,4,5,6,7 -
1 roce-reserved-ingress Dynamic 50.0% 3 -
2 lossy-default-egress Dynamic 50.0% - 0,6
3 roce-reserved-egress Dynamic inf - 3
Exception List
=================
description
-- -----------
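The PCP/DSCP->SP and SP->TC tables in the output above follow a regular pattern: each switch priority covers eight consecutive DSCP values, and only switch priorities 3 and 6 map to their own traffic class. The following Python sketch reproduces that pattern for quick lookups. It only mirrors the sample output shown here; the function names are illustrative and are not part of NVUE, and a switch with a modified configuration can use different mappings.

```python
def dscp_to_switch_priority(dscp: int) -> int:
    """Return the switch priority for a DSCP value, following the sample table."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be in the range 0-63, got {dscp}")
    # Each switch priority covers a block of eight consecutive DSCP values.
    return dscp // 8


def switch_priority_to_traffic_class(switch_priority: int) -> int:
    """Return the traffic class for a switch priority, following the sample table."""
    if not 0 <= switch_priority <= 7:
        raise ValueError(f"switch priority must be in the range 0-7, got {switch_priority}")
    # In the sample output, only switch priorities 3 and 6 have a dedicated
    # traffic class; every other switch priority shares traffic class 0.
    return switch_priority if switch_priority in (3, 6) else 0


if __name__ == "__main__":
    for dscp in (0, 26, 48):
        sp = dscp_to_switch_priority(dscp)
        tc = switch_priority_to_traffic_class(sp)
        print(f"dscp={dscp} switch-prio={sp} traffic-class={tc}")
```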
To show detailed RoCE information about a single interface, run the nv show interface <interface> qos roce status command.
cumulus@switch:mgmt:~$ nv show interface swp16 qos roce status
operational applied description
------------------ ------------- ------- ---------------------------------------------------
congestion-control
congestion-mode ecn, absolute Congestion config mode
enabled-tc 0,3 Congestion config enabled Traffic Class
max-threshold 1.43 MB Congestion config max-threshold
min-threshold 153.00 KB Congestion config min-threshold
pfc
pfc-priority 3 switch-prio on which PFC is enabled
rx-enabled yes PFC Rx Enabled status
tx-enabled yes PFC Tx Enabled status
trust
trust-mode pcp,dscp Trust Setting on the port for packet classification
mode lossless Roce Mode
RoCE PCP/DSCP->SP mapping configurations
===========================================
pcp dscp switch-prio
---- --- ---- -----------
cnp 6 48 6
roce 3 26 3
RoCE SP->TC mapping and ETS configurations
=============================================
switch-prio traffic-class scheduler-weight
---- ----------- ------------- ----------------
cnp 6 6 strict priority
roce 3 3 dwrr-50%
RoCE Pool Status
===================
name mode pool-id switch-priorities traffic-class size current-usage max-usage
-- --------------------- ------- ------- ----------------- ------------- -------- ------------- ---------
0 lossy-default-ingress DYNAMIC 2 0,1,2,4,5,6,7 - 15.16 MB 0 Bytes 16.00 MB
1 roce-reserved-ingress DYNAMIC 3 3 - 15.16 MB 7.30 MB 7.90 MB
2 lossy-default-egress DYNAMIC 13 - 0,6 15.16 MB 0 Bytes 16.01 MB
3 roce-reserved-egress DYNAMIC 14 - 3 inf 7.29 MB 13.47 MB
To show detailed information about current buffer utilization as well as historic RoCE byte and packet counts, run the nv show interface <interface> qos roce counters command:
cumulus@switch:mgmt:~$ nv show interface swp16 qos roce counters
operational applied description
----------------------------- ------------ ------- ------------------------------------------------------
rx-stats
rx-non-roce-stats
buffer-max-usage 144 Bytes Max Ingress Pool-buffer usage for non-RoCE traffic
buffer-usage 0 Bytes Current Ingress Pool-buffer usage for non-RoCE traffic
no-buffer-discard 55 Rx buffer discards for non-RoCE traffic
non-roce-bytes 56.52 MB non-roce rx bytes
non-roce-packets 462975 non-roce rx packets
pg-max-usage 144 Bytes Max PG-buffer usage for non-RoCE traffic
pg-usage 0 Bytes Current PG-buffer usage for non-RoCE traffic
rx-pfc-stats
pause-duration 0 Rx PFC pause duration for RoCE traffic
pause-packets 0 Rx PFC pause packets for RoCE traffic
rx-roce-stats
buffer-max-usage 0 Bytes Max Ingress Pool-buffer usage for RoCE traffic
buffer-usage 0 Bytes Current Ingress Pool-buffer usage for RoCE traffic
no-buffer-discard 0 Rx buffer discards for RoCE traffic
pg-max-usage 0 Bytes Max PG-buffer usage for RoCE traffic
pg-usage 0 Bytes Current PG-buffer usage for RoCE traffic
roce-bytes 0 Bytes Rx RoCE Bytes
roce-packets 0 Rx RoCE Packets
tx-stats
tx-cnp-stats
buffer-max-usage 16.02 MB Max Egress Pool-buffer usage for CNP traffic
buffer-usage 0 Bytes Current Egress Pool-buffer usage for CNP traffic
cnp-bytes 0 Bytes Tx CNP Packet Bytes
cnp-packets 0 Tx CNP Packets
tc-max-usage 0 Bytes Max TC-buffer usage for CNP traffic
tc-usage 0 Bytes Current TC-buffer usage for CNP traffic
unicast-no-buffer-discard 0 Tx buffer discards for CNP traffic
tx-ecn-stats
ecn-marked-packets 693777677344 Tx ECN marked packets
tx-pfc-stats
pause-duration 0 Tx PFC pause duration for RoCE traffic
pause-packets 0 Tx PFC pause packets for RoCE traffic
tx-roce-stats
buffer-max-usage 13.47 MB Max Egress Pool-buffer usage for RoCE traffic
buffer-usage 7.29 MB Current Egress Pool-buffer usage for RoCE traffic
roce-bytes 92824.38 GB Tx RoCE Packet bytes
roce-packets 803785675319 Tx RoCE Packets
tc-max-usage 16.02 MB Max TC-buffer usage for RoCE traffic
tc-usage 7.29 MB Current TC-buffer usage for RoCE traffic
unicast-no-buffer-discard 663060754115 Tx buffer discards for RoCE traffic
To reset the counters that the nv show interface <interface> qos roce counters command displays, run the nv action clear interface <interface> qos roce counters command.
Change RoCE Configuration
You can adjust RoCE settings using NVUE after you enable RoCE. To change the memory allocation for RoCE lossless mode to 60 percent:
cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy memory-percent 40
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossless memory-percent 60
cumulus@switch:mgmt:~$ nv config apply
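When you script this kind of change, it can help to generate the commands from one place and check the percentages first. The following Python sketch builds the same nv set qos traffic-pool commands shown above from a dictionary. The helper name is illustrative, and the check that the percentages must not add up to more than 100 is an assumption made for this sketch, not a documented NVUE rule.

```python
def traffic_pool_commands(pools: dict[str, int]) -> list[str]:
    """Build NVUE commands that set the memory percentage of each traffic pool."""
    for name, percent in pools.items():
        if not 0 <= percent <= 100:
            raise ValueError(f"{name}: memory-percent must be 0-100, got {percent}")
    # Assumption for this sketch: the pools share one buffer, so the total
    # must not exceed 100 percent.
    total = sum(pools.values())
    if total > 100:
        raise ValueError(f"pool percentages add up to {total}, which exceeds 100")
    commands = [
        f"nv set qos traffic-pool {name} memory-percent {percent}"
        for name, percent in pools.items()
    ]
    commands.append("nv config apply")
    return commands


if __name__ == "__main__":
    for command in traffic_pool_commands({"default-lossy": 40, "roce-lossless": 60}):
        print(command)
```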
To change the memory allocation of the RoCE lossy traffic pool to 60 percent and remap switch priority 4 to RoCE lossy traffic:
cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy switch-priority 0-3,5-7
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossy memory-percent 60
cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy memory-percent 40
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossy switch-priority 4
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 4 traffic-class 3
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 3 traffic-class 0
cumulus@switch:mgmt:~$ nv set qos mapping default-global trust both
cumulus@switch:mgmt:~$ nv set qos mapping default-global dscp 26 switch-priority 4
cumulus@switch:mgmt:~$ nv config apply
To change the RoCE lossless switch priority from switch priority 3 to switch priority 2:
cumulus@switch:mgmt:~$ nv set qos pfc default-global switch-priority 2
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 2 traffic-class 3
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 3 traffic-class 0
cumulus@switch:mgmt:~$ nv set qos mapping default-global trust both
cumulus@switch:mgmt:~$ nv set qos mapping default-global dscp 26 switch-priority 2
cumulus@switch:mgmt:~$ nv config apply
DHCP Relays
DHCP is a client-server protocol that automatically provides IP hosts with IP addresses and other related configuration information. A DHCP relay (agent) is a host that forwards DHCP packets between clients and servers that are not on the same physical subnet.
This topic describes how to configure DHCP relays for IPv4 and IPv6 using the following topology:
Basic Configuration
To set up DHCP relay, you need to provide the IP address of the DHCP server and the interfaces participating in DHCP relay (facing the server and facing the client). In an MLAG configuration, you must also specify the peerlink interface in case the local uplink interfaces fail.
In the example commands below:
The DHCP server IPv4 address is 172.16.1.102
The DHCP server IPv6 address is 2001:db8:100::2
vlan10 is the SVI for VLAN 10 and the uplinks are swp51 and swp52
peerlink.4094 is the MLAG interface
cumulus@leaf01:~$ nv set service dhcp-relay default interface swp51
cumulus@leaf01:~$ nv set service dhcp-relay default interface swp52
cumulus@leaf01:~$ nv set service dhcp-relay default interface vlan10
cumulus@leaf01:~$ nv set service dhcp-relay default interface peerlink.4094
cumulus@leaf01:~$ nv set service dhcp-relay default server 172.16.1.102
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ nv set service dhcp-relay6 default interface upstream swp51 server-address 2001:db8:100::2
cumulus@leaf01:~$ nv set service dhcp-relay6 default interface upstream swp52 server-address 2001:db8:100::2
cumulus@leaf01:~$ nv set service dhcp-relay6 default interface downstream vlan10
cumulus@leaf01:~$ nv set service dhcp-relay6 default interface downstream peerlink.4094
cumulus@leaf01:~$ nv config apply
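If you configure many leaf switches, you can generate the IPv4 relay commands from a list of interfaces and servers. The following Python sketch produces the same IPv4 commands as the example above. The helper name is illustrative, and treating the default keyword in the commands as a VRF name is an assumption of this sketch.

```python
import ipaddress


def dhcp_relay_commands(interfaces: list[str], servers: list[str], vrf: str = "default") -> list[str]:
    """Build NVUE commands for an IPv4 DHCP relay."""
    if not interfaces or not servers:
        raise ValueError("at least one interface and one server are required")
    for server in servers:
        # Raises ValueError if the string is not a valid IPv4 address.
        ipaddress.IPv4Address(server)
    commands = [f"nv set service dhcp-relay {vrf} interface {name}" for name in interfaces]
    commands += [f"nv set service dhcp-relay {vrf} server {server}" for server in servers]
    commands.append("nv config apply")
    return commands


if __name__ == "__main__":
    for command in dhcp_relay_commands(
        ["swp51", "swp52", "vlan10", "peerlink.4094"], ["172.16.1.102"]
    ):
        print(command)
```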
Edit the /etc/default/isc-dhcp-relay-default file to add the IP address of the DHCP server and the interfaces participating in DHCP relay.
You configure a DHCP relay on a per-VLAN basis, specifying the SVI, not the parent bridge. In the example above, you specify vlan10 as the SVI for VLAN 10 but you do not specify the bridge named bridge.
When you configure DHCP relay with VRR, the DHCP relay client must run on the SVI, not on the -v0 interface.
For every instance of a DHCP relay in a non-default VRF, you need to create a separate default file in the /etc/default directory. See DHCP with VRF.
Optional Configuration
This section describes optional DHCP relay configurations. The steps provided in this section assume that you have already configured basic DHCP relay, as described above.
DHCP Agent Information Option (Option 82)
Cumulus Linux supports DHCP Agent Information Option 82, which allows a DHCP relay to insert circuit or relay specific information into a request that the switch forwards to a DHCP server. You can use the following options:
Circuit ID includes information about the circuit on which the request comes in, such as the SVI or physical port. By default, this is the printable name of the interface that receives the client request.
Remote ID includes information that identifies the relay agent, such as the MAC address. By default, this is the system MAC address of the device on which DHCP relay is running.
To configure DHCP Agent Information Option 82:
Edit the /etc/default/isc-dhcp-relay-default file and add one of the following options:
To inject the ingress SVI interface against which DHCP processes the relayed DHCP discover packet, add -a to the OPTIONS line:
cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay-default
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-a"
To inject the physical switch port on which the relayed DHCP discover packet arrives instead of the SVI, add -a --use-pif-circuit-id to the OPTIONS line:
cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay-default
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-a --use-pif-circuit-id"
To customize the Remote ID sub-option, add -a -r to the OPTIONS line followed by a custom string (up to 255 characters):
cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay-default
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-a -r CUSTOMVALUE"
Restart the dhcrelay service to apply the new configuration:
cumulus@leaf01:~$ sudo systemctl restart dhcrelay@default.service
When you need DHCP relay in an environment that relies on an anycast gateway (such as EVPN), each device needs a unique IP address for return traffic. By default, in a BGP unnumbered environment with DHCP relay, the source IP address is the loopback IP address and the gateway IP address is the SVI IP address. However, with an anycast gateway, the SVI IP address is not unique to each rack; it is typically shared between racks. Most EVPN ToR deployments use only a single unique IP address, which is the loopback IP address.
RFC 3527 enables the DHCP server to handle these environments by introducing a new parameter called the link selection sub-option (carried in the relay agent information option), which the DHCP relay agent builds. The link selection sub-option takes over the usual role of the gateway address: it tells the DHCP server which subnet the DHCP request belongs to. When you use this sub-option, the gateway address is still present but only provides the return IP address that the DHCP server uses; the gateway address becomes the unique loopback IP address.
When enabling RFC 3527 support, you can specify an interface, such as the loopback interface or a switch port interface to use as the gateway address. The relay picks the first IP address on that interface. If the interface has multiple IP addresses, you can specify a specific IP address for the interface.
RFC 3527 supports IPv4 DHCP relays only.
To enable RFC 3527 support and control the gateway address:
Run the nv set service dhcp-relay default gateway-interface command with the interface or IP address you want to use. The following example uses the first IP address on the loopback interface as the gateway IP address:
cumulus@leaf01:~$ nv set service dhcp-relay default gateway-interface lo
The first IP address on the loopback interface is typically the 127.0.0.1 address. This example uses IP address 10.10.10.1 on the loopback interface as the gateway address:
cumulus@leaf01:~$ nv set service dhcp-relay default gateway-interface lo address 10.10.10.1
This example uses the first IP address on swp2 as the gateway address:
cumulus@leaf01:~$ nv set service dhcp-relay default gateway-interface swp2
This example uses IP address 10.0.0.4 on swp2 as the gateway address:
cumulus@leaf01:~$ nv set service dhcp-relay default gateway-interface swp2 address 10.0.0.4
Edit the /etc/default/isc-dhcp-relay-default file and provide the -U option with the interface or IP address you want to use as the gateway address.
This example uses the first IP address on the loopback interface as the gateway address:
cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay-default
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U lo"
The first IP address on the loopback interface is typically the 127.0.0.1 address. This example uses IP address 10.10.10.1 on the loopback interface as the gateway address:
cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay-default
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U 10.10.10.1%lo"
This example uses the first IP address on swp2 as the gateway address:
cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay-default
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U swp2"
This example uses IP address 10.0.0.4 on swp2 as the gateway address:
cumulus@leaf01:~$ sudo nano /etc/default/isc-dhcp-relay-default
...
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-U 10.0.0.4%swp2"
Restart the dhcrelay service to apply the configuration change:
cumulus@leaf01:~$ sudo systemctl restart dhcrelay@default.service
When enabling RFC 3527 support, you can specify an interface such as the loopback interface or swp interface for the gateway address. The interface you use must be reachable in the tenant VRF that it is servicing and must be unique to the switch. In EVPN symmetric routing, fabrics running an anycast gateway that use the same SVI IP address on multiple leaf switches need a unique IP address for the VRF interface and must include the layer 3 VNI for this VRF in the DHCP Relay configuration. For example:
Gateway IP Address as Source IP for Relayed DHCP Packets (Advanced)
You can configure the dhcrelay service to forward DHCP packets (IPv4 only) to a DHCP server and ensure that the source IP address of the relayed packet is the same as the gateway IP address.
This option impacts all relayed IPv4 packets globally.
To use the gateway IP address as the source IP address:
cumulus@leaf01:~$ nv set service dhcp-relay default source-ip giaddress
cumulus@leaf01:~$ nv config apply
Edit the /etc/default/isc-dhcp-relay-default file to add --giaddr-src to the OPTIONS line.
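The OPTIONS values described in the sections above (-a, --use-pif-circuit-id, -r, -U, and --giaddr-src) all go on the same line of the defaults file. The following Python sketch assembles that line from keyword arguments. The helper and its parameter names are hypothetical, and the sketch assumes that the flags can be combined on one line; the guide only shows each flag group on its own.

```python
from typing import Optional


def build_relay_options(
    option82: bool = False,
    use_pif_circuit_id: bool = False,
    remote_id: Optional[str] = None,
    gateway: Optional[str] = None,
    giaddr_src: bool = False,
) -> str:
    """Return an OPTIONS="..." line for /etc/default/isc-dhcp-relay-default."""
    parts = []
    # The guide always pairs --use-pif-circuit-id and -r with -a.
    if option82 or use_pif_circuit_id or remote_id is not None:
        parts.append("-a")
    if use_pif_circuit_id:
        parts.append("--use-pif-circuit-id")
    if remote_id is not None:
        if not remote_id or len(remote_id) > 255 or any(c.isspace() for c in remote_id):
            raise ValueError("remote ID must be 1-255 characters with no whitespace")
        parts += ["-r", remote_id]
    if gateway is not None:
        # Accepts an interface ("lo") or an address with interface ("10.10.10.1%lo").
        if not gateway or any(c.isspace() for c in gateway):
            raise ValueError("gateway must not be empty or contain whitespace")
        parts += ["-U", gateway]
    if giaddr_src:
        parts.append("--giaddr-src")
    return 'OPTIONS="' + " ".join(parts) + '"'


if __name__ == "__main__":
    print(build_relay_options(use_pif_circuit_id=True))
    print(build_relay_options(gateway="10.10.10.1%lo"))
```

Rejecting whitespace in the gateway value keeps the address and the %interface suffix together as a single argument.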
Cumulus Linux supports multiple DHCP relay daemons on a switch to enable relaying of packets from different bridges to different upstream interfaces.
To configure multiple DHCP relay daemons on a switch:
In the /etc/default directory, create a configuration file for each DHCP relay daemon. Use the naming scheme isc-dhcp-relay-<dhcp-name> for IPv4 or isc-dhcp-relay6-<dhcp-name> for IPv6. This is an example configuration file for IPv4:
# Defaults for isc-dhcp-relay initscript
# sourced by /etc/init.d/isc-dhcp-relay
# installed at /etc/default/isc-dhcp-relay by the maintainer scripts
#
# This is a POSIX shell fragment
#
# What servers should the DHCP relay forward requests to?
SERVERS="102.0.0.2"
# On what interfaces should the DHCP relay (dhrelay) serve DHCP requests?
# Always include the interface towards the DHCP server.
# This variable requires a -i for each interface configured above.
# This will be used in the actual dhcrelay command
# For example, "-i eth0 -i eth1"
INTF_CMD="-i swp2s2 -i swp2s3"
# Additional options that are passed to the DHCP relay daemon?
OPTIONS=""
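The variable lines in the example file above are simple enough to generate. The following Python sketch renders the SERVERS, INTF_CMD, and OPTIONS lines and derives the file path from the naming scheme described earlier. The function names are illustrative, and separating multiple servers with spaces is an assumption (the example shows a single server).

```python
def defaults_path(dhcp_name: str, ipv6: bool = False) -> str:
    """Return the defaults file path for a named DHCP relay daemon."""
    prefix = "isc-dhcp-relay6-" if ipv6 else "isc-dhcp-relay-"
    return f"/etc/default/{prefix}{dhcp_name}"


def render_relay_defaults(servers: list[str], interfaces: list[str], options: str = "") -> str:
    """Render the variable lines of a DHCP relay defaults file."""
    if not servers or not interfaces:
        raise ValueError("at least one server and one interface are required")
    # Assumption: multiple servers are separated by spaces.
    server_list = " ".join(servers)
    # INTF_CMD needs a -i flag in front of every interface name.
    intf_cmd = " ".join(f"-i {name}" for name in interfaces)
    lines = [
        f'SERVERS="{server_list}"',
        f'INTF_CMD="{intf_cmd}"',
        f'OPTIONS="{options}"',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(defaults_path("blue"))
    print(render_relay_defaults(["102.0.0.2"], ["swp2s2", "swp2s3"]), end="")
```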
Run the following command to start a dhcrelay instance, where <dhcp-name> is the instance name or number:
cumulus@leaf01:~$ sudo systemctl start dhcrelay@<dhcp-name>.service
To see how DHCP relay is working on your switch, run the journalctl command:
cumulus@leaf01:~$ sudo journalctl -l -n 20 | grep dhcrelay
Dec 05 20:58:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 20:58:55 leaf01 dhcrelay[6152]: sending upstream swp51
Dec 05 20:58:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 20:58:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Renew from fe80::4638:39ff:fe00:3 port 546 going up.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 21:03:55 leaf01 dhcrelay[6152]: sending upstream swp51
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
Dec 05 21:03:55 leaf01 dhcrelay[6152]: Relaying Reply to fe80::4638:39ff:fe00:3 port 546 down.
To specify a time period with the journalctl command, use the --since flag:
cumulus@leaf01:~$ sudo journalctl -l --since "2 minutes ago" | grep dhcrelay
Dec 05 21:08:55 leaf01 dhcrelay[6152]: Relaying Renew from fe80::4638:39ff:fe00:3 port 546 going up.
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp51
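To summarize longer logs, you can post-process the journalctl output. The following Python sketch counts how many times dhcrelay reports sending a packet upstream on each interface. The regular expression is based only on the sample lines shown above, so other log formats might need adjustments; the function name is illustrative.

```python
import re
from collections import Counter

# Matches lines such as:
# Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp52
LINE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) dhcrelay\[(?P<pid>\d+)\]: (?P<message>.*)$"
)


def upstream_counts(lines):
    """Count 'sending upstream <interface>' messages per interface."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # Skip lines that are not dhcrelay log entries.
        message = match.group("message")
        if message.startswith("sending upstream "):
            counts[message.split()[-1]] += 1
    return counts


SAMPLE = """\
Dec 05 21:08:55 leaf01 dhcrelay[6152]: Relaying Renew from fe80::4638:39ff:fe00:3 port 546 going up.
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp52
Dec 05 21:08:55 leaf01 dhcrelay[6152]: sending upstream swp51
"""

if __name__ == "__main__":
    for interface, count in sorted(upstream_counts(SAMPLE.splitlines()).items()):
        print(interface, count)
```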
Configuration Errors
If you configure DHCP relays by editing the /etc/default/isc-dhcp-relay-default file manually, you can introduce configuration errors that cause the switch to crash.
For example, if you see an error similar to the following, check that there is no space between the DHCP server address and the interface you use as the uplink.
Core was generated by /usr/sbin/dhcrelay --nl -d -i vx-40 -i vlan10 10.0.0.4 -U 10.0.1.2 %vlan20.
Program terminated with signal SIGSEGV, Segmentation fault.
To resolve the issue, manually edit the /etc/default/isc-dhcp-relay-default file to remove the space, then run the systemctl restart dhcrelay@default.service command to restart the dhcrelay service and apply the configuration change.
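You can check for this mistake before restarting the service. The following Python sketch flags an options string in which whitespace separates the -U address from its %interface suffix, as in the core dump message above. The function name is illustrative, and the check covers only this one error pattern.

```python
import re

# "-U <address> %<interface>": whitespace before the % separator splits
# what should be a single argument into two.
SPLIT_GATEWAY_RE = re.compile(r"-U\s+\S+\s+%\S+")


def has_split_gateway(options: str) -> bool:
    """Return True if the -U value has a space before its %interface suffix."""
    return bool(SPLIT_GATEWAY_RE.search(options))


if __name__ == "__main__":
    for options in ("-U 10.0.1.2 %vlan20", "-U 10.0.1.2%vlan20"):
        state = "invalid" if has_split_gateway(options) else "ok"
        print(f"{options!r}: {state}")
```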
Considerations
The dhcrelay command does not bind to an interface if the interface name is longer than 14 characters. This is a known limitation in dhcrelay.
DHCP packets received on bridge ports and sent to the CPU for processing cause the RX_DROP counter to increment on the interface.
DHCP Servers
A DHCP server automatically provides and assigns IP addresses and other network parameters to client devices. It relies on DHCP to respond to broadcast requests from clients.
This section shows you how to configure a DHCP server using the following topology, where the DHCP server is a switch running Cumulus Linux.
To configure the DHCP server on a Cumulus Linux switch:
Create a DHCP pool by providing a pool ID. The ID is an IPv4 or IPv6 prefix.
Provide a name for the pool (optional).
Provide the IP address of the DNS Server you want to use in this pool. You can assign multiple DNS servers.
Provide the domain name you want to use for this pool for name resolution (optional).
Define the range of IP addresses available for assignment.
Provide the default gateway IP address (optional).
In addition, you can configure a static IP address for a resource, such as a server or printer:
Create an ID for the static assignment. This is typically the name of the resource.
Provide the static IP address you want to assign to this resource.
Provide the MAC address of the resource to which you want to assign the IP address.
To configure static IP address assignments, you must first configure a pool.
You can set the DNS server IP address and domain name globally or specify different DNS server IP addresses and domain names for different pools. The following example commands configure a DNS server IP address and domain name for a pool.
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 pool-name storage-servers
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 domain-name example.com
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 domain-name-server 192.168.200.53
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 range 10.1.10.100 to 10.1.10.199
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 gateway 10.1.10.1
cumulus@switch:~$ nv set service dhcp-server default static server1
cumulus@switch:~$ nv set service dhcp-server default static server1 ip-address 10.0.0.2
cumulus@switch:~$ nv set service dhcp-server default static server1 mac-address 44:38:39:00:01:7e
cumulus@switch:~$ nv config apply
To allocate DHCP addresses from the configured pool, you must configure an interface with an IP address from the pool subnet. For example:
cumulus@switch:~$ nv set interface vlan10 ip address 10.1.10.1/24
cumulus@switch:~$ nv config apply
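Because the range, gateway, and interface address must all belong to the pool subnet, a quick consistency check can catch typos before you apply the configuration. The following Python sketch uses the standard ipaddress module for that check. The function name and the wording of the messages are illustrative; the DHCP server performs its own validation.

```python
import ipaddress


def check_pool(pool, range_start, range_end, gateway=None, interface_address=None):
    """Return a list of problems found in a DHCP pool definition (empty if none)."""
    network = ipaddress.ip_network(pool)
    start = ipaddress.ip_address(range_start)
    end = ipaddress.ip_address(range_end)
    problems = []
    for label, address in (("range start", start), ("range end", end)):
        if address not in network:
            problems.append(f"{label} {address} is outside pool {network}")
    # Only compare addresses of the same IP version.
    if start.version == end.version and start > end:
        problems.append(f"range start {start} is after range end {end}")
    if gateway is not None and ipaddress.ip_address(gateway) not in network:
        problems.append(f"gateway {gateway} is outside pool {network}")
    if interface_address is not None:
        interface = ipaddress.ip_interface(interface_address)
        if interface.ip not in network:
            problems.append(f"interface address {interface.ip} is outside pool {network}")
    return problems


if __name__ == "__main__":
    print(check_pool("10.1.10.0/24", "10.1.10.100", "10.1.10.199",
                     gateway="10.1.10.1", interface_address="10.1.10.1/24"))
```

The same function accepts IPv6 values, such as pool 2001:db8::/64 with interface address 2001:db8::10/64.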
To set the DNS server IP address and domain name globally, use the nv set service dhcp-server <vrf> domain-name-server <address> and nv set service dhcp-server <vrf> domain-name <domain> commands.
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::/64
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::/64 pool-name storage-servers
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::/64 domain-name-server 2001:db8:100::64
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::/64 domain-name example.com
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::/64 range 2001:db8::100 to 2001:db8::199
cumulus@switch:~$ nv set service dhcp-server6 default static server1
cumulus@switch:~$ nv set service dhcp-server6 default static server1 ip-address 2001:db8::100
cumulus@switch:~$ nv set service dhcp-server6 default static server1 mac-address 44:38:39:00:01:7e
cumulus@switch:~$ nv config apply
To allocate DHCP addresses from the configured pool, you must configure an interface with an IP address from the pool subnet. For example:
cumulus@switch:~$ nv set interface vlan10 ip address 2001:db8::10/64
cumulus@switch:~$ nv config apply
To set the DNS server IP address and domain name globally, use the nv set service dhcp-server6 <vrf> domain-name-server <address> and nv set service dhcp-server6 <vrf> domain-name <domain> commands.
In a text editor, edit the /etc/dhcp/dhcpd.conf file. Use the following configuration as an example:
To set the DNS server IP address and domain name globally, add the DNS server IP address and domain name before the pool information in the /etc/dhcp/dhcpd.conf file. For example:
To set the DNS server IP address and domain name globally, add the DNS server IP address and domain name before the pool information in the /etc/dhcp/dhcpd6.conf file. For example:
You can set the network address lease time assigned to DHCP clients. You can specify a value between 180 and 31536000 seconds. The default lease time is 600 seconds.
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 lease-time 200000
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::/64 lease-time 200000
cumulus@switch:~$ nv config apply
Edit the /etc/dhcp/dhcpd.conf file to set the lease time (in seconds):
Configure the DHCP server to ping the address you want to assign to a client before issuing the IP address. If there is no response, DHCP delivers the IP address; otherwise, it attempts the next available address in the range.
cumulus@switch:~$ nv set service dhcp-server default pool 10.1.10.0/24 ping-check on
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set service dhcp-server6 default pool 2001:db8::/64 ping-check on
cumulus@switch:~$ nv config apply
Edit the /etc/dhcp/dhcpd.conf file to add the ping-check true; statement:
You can assign an IP address and other DHCP options based on physical location or port regardless of MAC address to clients that attach directly to the Cumulus Linux switch through a switch port. This is helpful when swapping out switches and servers; you can avoid the inconvenience of collecting the MAC address and sending it to the network administrator to modify the DHCP server configuration.
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/dhcp/dhcpd.conf file to add the interface and IP address:
To show the current DHCP server settings, run the nv show service dhcp-server command:
cumulus@leaf01:mgmt:~$ nv show service dhcp-server
Summary
--------- ------------------
+ default interface: swp1
default pool: 10.1.10.0/24
default static: server1
The DHCP server determines if a DHCP request is a relay or a non-relay DHCP request. Run the following command to see the DHCP request:
cumulus@server02:~$ sudo tail /var/log/syslog | grep dhcpd
2016-12-05T19:03:35.379633+00:00 server02 dhcpd: Relay-forward message from 2001:db8:101::1 port 547, link address 2001:db8:101::1, peer address fe80::4638:39ff:fe00:3
2016-12-05T19:03:35.380081+00:00 server02 dhcpd: Advertise NA: address 2001:db8::110 to client with duid 00:01:00:01:1f:d8:75:3a:44:38:39:00:00:03 iaid = 956301315 valid for 600 seconds
2016-12-05T19:03:35.380470+00:00 server02 dhcpd: Sending Relay-reply to 2001:db8:101::1 port 547
Considerations
DHCP packets received on bridge ports and sent to the CPU for processing cause the RX_DROP counter to increment on the interface.
DHCP Snooping
DHCP snooping enables Cumulus Linux to act as a middle layer between the DHCP infrastructure and DHCP clients by scanning DHCP control packets and building an IP-MAC database. Cumulus Linux accepts DHCP offers only from trusted interfaces and can rate limit packets.
DHCP option 82 processing is not supported.
Configure DHCP Snooping
To configure DHCP snooping, you need to:
Enable DHCP snooping on a VLAN.
Add a trusted interface. Cumulus Linux allows DHCP offers only from trusted interfaces to prevent malicious DHCP servers from assigning IP addresses inside the network. The interface must be a member of the bridge specified.
Set the rate limit for DHCP requests to avoid DoS attacks. The default value is 100 packets per second.
The following example shows you how to configure DHCP snooping for IPv4 and IPv6.
NVUE does not provide commands to configure DHCP Snooping.
Create the /etc/dhcpsnoop/dhcp_snoop.json file and add DHCP snooping configuration under the bridge.
The following example enables DHCP snooping for IPv4 on VLAN 10, sets the rate limit to 50 and the trusted interface to swp3. swp3 is a member of the bridge br_default:
The following example enables DHCP snooping for IPv6 on VLAN 10, sets the rate limit to 50 and the trusted interface to swp6. swp6 is a member of the bridge br_default:
When DHCP snooping detects a violation, the packet is dropped and a message is logged to the /var/log/dhcpsnoop.log file.
Show the DHCP Binding Table
To show the DHCP binding table, run the net show dhcp-snoop table command for IPv4 or the net show dhcp-snoop6 table command for IPv6. The following example command shows the DHCP binding table for IPv4:
cumulus@leaf01:~$ net show dhcp-snoop table
Port VLAN IP MAC Lease State Bridge
---- ---- --------- ----------------- ----- ----- ------
swp5 1002 10.0.0.3 00:02:00:00:00:04 7200 ACK br0
swp5 1000 10.0.1.3 00:02:00:00:00:04 7200 ACK br0
Prescriptive Topology Manager - PTM
In data center topologies, correct cabling is time-consuming and error-prone. PTM is a dynamic cabling verification tool that can detect and eliminate errors. PTM uses a Graphviz-DOT specified network cabling plan in a topology.dot file and couples it with runtime information from LLDP to verify that the cabling matches the specification. The check occurs on every link transition on each node in the network.
You can customize the topology.dot file to control ptmd at both the global/network level and the node/port level.
PTM runs as a daemon, named ptmd.
Supported Features
Topology verification using LLDP. ptmd creates a client connection to the LLDP daemon, lldpd, and retrieves the neighbor relationship between the nodes/ports in the network and compares them against the prescribed topology specified in the topology.dot file.
PTM only supports physical interfaces, such as swp1 or eth0. You cannot specify virtual interfaces, such as bonds or subinterfaces in the topology file.
Client management: ptmd creates an abstract named socket /var/run/ptmd.socket on startup. Other applications can connect to this socket to receive notifications and send commands.
Event notifications: see Scripts below.
User configuration through a topology.dot file; see below.
Configure PTM
ptmd verifies the physical network topology against a DOT-specified network graph file, /etc/ptm.d/topology.dot.
At startup, ptmd connects to lldpd (the LLDP daemon) over a Unix socket and retrieves the neighbor name and port information. It then compares the retrieved port information with the configuration information that it reads from the topology file. If there is a match, it is a PASS, otherwise it is a FAIL.
PTM performs its LLDP neighbor check using the PortID ifname TLV information.
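The comparison that ptmd performs can be pictured as a per-port lookup. The following Python sketch is a simplified illustration, not the ptmd implementation: it assumes the prescribed topology and the LLDP neighbor data are already available as dictionaries keyed by local port, and it does not parse topology.dot or talk to lldpd. The dictionary layout and the host and port names are hypothetical.

```python
def verify_topology(expected: dict, observed: dict) -> dict:
    """Compare prescribed neighbors with observed LLDP neighbors, port by port.

    Both dictionaries map a local port name to a (neighbor host, neighbor port)
    tuple. The result maps each prescribed port to "pass" or "fail".
    """
    results = {}
    for port, neighbor in expected.items():
        # A port passes only if LLDP reports exactly the prescribed neighbor.
        results[port] = "pass" if observed.get(port) == neighbor else "fail"
    return results


if __name__ == "__main__":
    expected = {
        "swp51": ("spine01", "swp1"),
        "swp52": ("spine02", "swp1"),
    }
    observed = {
        "swp51": ("spine01", "swp1"),
        "swp52": ("spine01", "swp2"),  # Cabled to the wrong switch and port.
    }
    for port, status in verify_topology(expected, observed).items():
        print(port, status)
```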
ptmd Scripts
ptmd executes scripts at /etc/ptm.d/if-topo-pass and /etc/ptm.d/if-topo-fail for each interface that goes through a change. It runs if-topo-pass when an LLDP or BFD check passes and if-topo-fail when the check fails. The scripts receive an argument string that is the result of the ptmctl command; see ptmd commands below.
You can modify these default scripts.
Configuration Parameters
You can configure ptmd parameters in the topology file. The parameters are host-only, global, per-port/node and templates.
Host-only Parameters
Host-only parameters apply to the entire host on which PTM is running. You can include the hostnametype host-only parameter, which specifies if PTM uses only the hostname (hostname) or the fully qualified domain name (fqdn) while looking for the self-node in the graph file. For example, in the graph file below PTM ignores the FQDN and only looks for switch04 because that is the hostname of the switch on which it is running:
Always wrap the hostname in double quotes (for example, "www.example.com") to prevent ptmd from failing.
To avoid errors when starting the ptmd process, make sure that /etc/hosts and /etc/hostname both reflect the hostname you are using in the topology.dot file.
Global parameters apply to every port in the topology file. There are two global parameters: LLDP and BFD. LLDP is on by default; if no keyword is present, PTM uses the default values for all ports. However, BFD is off if no keyword is present unless a per-port override exists. For example:
Templates provide flexibility in choosing different parameter combinations and applying them to a given port. A template instructs ptmd to reference a named parameter string instead of a default one. There are two parameter strings ptmd supports:
bfdtmpl specifies a custom parameter tuple for BFD.
lldptmpl specifies a custom parameter tuple for LLDP.
match_type, which defaults to the interface name (ifname), but can accept a port description (portdescr) instead if you want lldpd to compare the topology against the port description instead of the interface name. You can set this parameter globally or at the per-port level.
match_hostname, which defaults to the hostname (hostname), but enables PTM to match the topology using the fully qualified domain name (fqdn) supplied by LLDP.
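The difference between the two match_hostname settings can be sketched in a few lines of Python. This is an illustration of the rule as described here, not the ptmd implementation; the function name and the example names are hypothetical, and the sketch assumes that the hostname is the part of the name before the first dot.

```python
def neighbor_name_matches(expected: str, advertised: str, match_hostname: str = "hostname") -> bool:
    """Compare the name in the topology file with the name that LLDP advertises."""
    if match_hostname == "fqdn":
        # The entire fully qualified domain name must match.
        return expected == advertised
    if match_hostname == "hostname":
        # Compare only the part before the first dot.
        return expected.split(".")[0] == advertised.split(".")[0]
    raise ValueError(f"unsupported match_hostname value: {match_hostname}")


if __name__ == "__main__":
    print(neighbor_name_matches("sw01", "sw01.example.com"))
    print(neighbor_name_matches("sw01", "sw01.example.com", match_hostname="fqdn"))
```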
The following is an example of a topology with LLDP at the port level:
When you specify match_hostname=fqdn, ptmd matches the entire FQDN (cumulus-2.domain.com in the example below). If you do not specify anything for match_hostname, ptmd matches based on hostname only (cumulus-3 below) and ignores the rest of the domain name:
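The difference between the two match_hostname settings can be sketched in a few lines of Python. This is an illustration of the documented behavior only, not ptmd code; the function name lldp_name_matches and the choice to compare the first DNS label in hostname mode are assumptions.

```python
def lldp_name_matches(topology_node: str, lldp_sysname: str,
                      match_hostname: str = "hostname") -> bool:
    """Return True if the LLDP system name matches the topology node name.

    Hypothetical helper that mirrors the documented behavior, not ptmd itself.
    """
    if match_hostname == "fqdn":
        # Compare the entire fully qualified domain name.
        return topology_node == lldp_sysname
    if match_hostname == "hostname":
        # Compare only the first label and ignore the domain part (assumption).
        return topology_node.split(".")[0] == lldp_sysname.split(".")[0]
    raise ValueError(f"unknown match_hostname value: {match_hostname!r}")
```

With match_hostname=fqdn, a topology node named cumulus-2 does not match a neighbor that advertises cumulus-2.domain.com; in the default hostname mode, cumulus-3 matches cumulus-3.domain.com.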
BFD provides low overhead and rapid detection of failures in the paths between two network devices. It provides a unified mechanism for link detection over all media and protocol layers. Use BFD to detect failures for IPv4 and IPv6 single or multihop paths between any two network devices, including unidirectional path failure detection. For information about configuring BFD using PTM, see BFD.
Check Link State
You can enable PTM to perform additional checks to ensure that routing adjacencies form only on links that have connectivity and that conform to the specification that ptmd defines.
You only need to enable PTM to check link state. You do not need to enable PTM to determine BFD status.
cumulus@switch:~$ nv set router ptm enable
cumulus@switch:~$ nv config apply
To disable link state checking, run the no ptm-enable command in vtysh:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# no ptm-enable
switch(config)# end
switch# write memory
switch# exit
cumulus@switch:~$
To check PTM status on an interface, run the net show interface <interface> command or the vtysh show interface <interface> command.
cumulus@switch:~$ net show interface swp4
Name MAC Speed MTU Mode
----- ---- ----------------- ----- ---- -------------
ADMDN swp4 48:b0:2d:59:0a:de N/A 1500 NotConfigured
Routing
-------
Interface swp4 is up, line protocol is up
Link ups: 0 last: (never)
Link downs: 0 last: (never)
PTM status: disabled
vrf: default
index 3 metric 0 mtu 1550 speed 4294967295
flags: <UP,BROADCAST,RUNNING,MULTICAST>
Type: Ethernet
HWaddr: c4:54:44:bd:01:41
...
ptmd Service Commands
PTM sends client notifications in CSV format.
To start or restart the ptmd service, run the following command. The topology.dot file must be present for the service to start.
cumulus@switch:~$ sudo systemctl restart ptmd.service
ptmctl Commands
ptmctl is a client of ptmd that retrieves the operational state of the ports configured on the switch and information about BFD sessions from ptmd. ptmctl parses the CSV notifications sent by ptmd. See man ptmctl for more information.
ptmctl Examples
The examples below contain the following keywords in the output of the cbl status column:
cbl status Keyword
Definition
pass
The topology file defines the interface, the interface receives LLDP information, and the LLDP information for the interface matches the information in the topology file.
fail
The topology file defines the interface, the interface receives LLDP information, and the LLDP information for the interface does not match the information in the topology file.
N/A
The topology file defines the interface but the interface does not receive LLDP information. The interface might be down or disconnected, or the neighbor is not sending LLDP packets. The N/A and fail status might indicate a wiring problem to investigate. The N/A status does not show when you use the -l option with ptmctl; the output shows only interfaces that are receiving LLDP information.
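The three cbl status keywords reduce to a small decision rule for any interface that the topology file defines. The following Python sketch restates the table as code; the function name cbl_status and its two boolean inputs are hypothetical and are not part of ptmctl.

```python
def cbl_status(lldp_received: bool, lldp_matches_topology: bool) -> str:
    """Classify an interface that the topology file defines.

    Hypothetical helper that restates the keyword table as code.
    """
    if not lldp_received:
        # No LLDP information: the link might be down or the neighbor silent.
        return "N/A"
    # LLDP information is present; compare it with the topology file.
    return "pass" if lldp_matches_topology else "fail"
```

Both N/A and fail are worth investigating as possible wiring problems.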
For basic output, use ptmctl without any options:
cumulus@switch:~$ sudo ptmctl
-------------------------------------------------------------
port cbl BFD BFD BFD BFD
status status peer local type
-------------------------------------------------------------
swp1 pass pass 11.0.0.2 N/A singlehop
swp2 pass N/A N/A N/A N/A
swp3 pass N/A N/A N/A N/A
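If you want to act on the basic output in a script, a minimal Python sketch such as the following can pull out the cbl status for each port. It assumes the whitespace-separated six-column layout shown above; the helper name cable_status_by_port is hypothetical, and the last row of the sample is changed to fail for illustration.

```python
SAMPLE = """\
-------------------------------------------------------------
port cbl BFD BFD BFD BFD
status status peer local type
-------------------------------------------------------------
swp1 pass pass 11.0.0.2 N/A singlehop
swp2 pass N/A N/A N/A N/A
swp3 fail N/A N/A N/A N/A
"""


def cable_status_by_port(output: str) -> dict[str, str]:
    """Map each port to its cbl status from basic ptmctl-style output."""
    statuses = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have six columns and a recognized cbl status keyword;
        # header and separator lines do not, so they are skipped.
        if len(fields) == 6 and fields[1] in {"pass", "fail", "N/A"}:
            statuses[fields[0]] = fields[1]
    return statuses


# Ports that need attention are those whose status is not "pass".
suspect_ports = [port for port, status in cable_status_by_port(SAMPLE).items()
                 if status != "pass"]
```

Because the parsing depends on column count, treat this as a sketch to adapt rather than a stable interface.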
For more detailed output, use the -d option:
cumulus@switch:~$ sudo ptmctl -d
--------------------------------------------------------------------------------------
port cbl exp act sysname portID portDescr match last BFD BFD
status nbr nbr on upd Type state
--------------------------------------------------------------------------------------
swp45 pass h1:swp1 h1:swp1 h1 swp1 swp1 IfName 5m: 5s N/A N/A
swp46 fail h2:swp1 h2:swp1 h2 swp1 swp1 IfName 5m: 5s N/A N/A
#continuation of the output
-------------------------------------------------------------------------------------------------
BFD BFD det_mult tx_timeout rx_timeout echo_tx_timeout echo_rx_timeout max_hop_cnt
peer DownDiag
-------------------------------------------------------------------------------------------------
N/A N/A N/A N/A N/A N/A N/A N/A
N/A N/A N/A N/A N/A N/A N/A N/A
To return information on active BFD sessions ptmd is tracking, use the -b option:
cumulus@switch:~$ sudo ptmctl -b
----------------------------------------------------------
port peer state local type diag
----------------------------------------------------------
swp1 11.0.0.2 Up N/A singlehop N/A
N/A 12.12.12.1 Up 12.12.12.4 multihop N/A
To return LLDP information, use the -l option. The output returns only the active neighbors that ptmd is tracking.
cumulus@switch:~$ sudo ptmctl -l
---------------------------------------------
port sysname portID port match last
descr on upd
---------------------------------------------
swp45 h1 swp1 swp1 IfName 5m:59s
swp46 h2 swp1 swp1 IfName 5m:59s
To return detailed information on active BFD sessions ptmd is tracking, use the -b and -d options (results are for an IPv6-connected peer):
cumulus@switch:~$ sudo ptmctl -b -d
----------------------------------------------------------------------------------------
port peer state local type diag det tx_timeout rx_timeout
mult
----------------------------------------------------------------------------------------
swp1 fe80::202:ff:fe00:1 Up N/A singlehop N/A 3 300 900
swp1 3101:abc:bcad::2 Up N/A singlehop N/A 3 300 900
#continuation of output
---------------------------------------------------------------------
echo echo max rx_ctrl tx_ctrl rx_echo tx_echo
tx_timeout rx_timeout hop_cnt
---------------------------------------------------------------------
0 0 N/A 187172 185986 0 0
0 0 N/A 501 533 0 0
ptmctl Error Outputs
If there are errors in the topology file or there is no session, PTM returns appropriate outputs. Typical error strings are:
Topology file error [/etc/ptm.d/topology.dot] [cannot find node cumulus] -
please check /var/log/ptmd.log for more info
Topology file error [/etc/ptm.d/topology.dot] [cannot open file (errno 2)] -
please check /var/log/ptmd.log for more info
No Hostname/MgmtIP found [Check LLDPD daemon status] -
please check /var/log/ptmd.log for more info
No BFD sessions . Check connections
No LLDP ports detected. Check connections
Unsupported command
For example:
cumulus@switch:~$ sudo ptmctl
-------------------------------------------------------------------------
cmd error
-------------------------------------------------------------------------
get-status Topology file error [/etc/ptm.d/topology.dot]
[cannot open file (errno 2)] - please check /var/log/ptmd.log
for more info
If you encounter errors with the topology.dot file, you can use dot (included in the Graphviz package) to validate the syntax of the topology file.
Open the topology file with Graphviz to ensure that it is readable and that the file format is correct.
If you edit the topology.dot file from a Windows system, be sure to double check the file formatting; there might be extra characters that keep the graph from working correctly.
Basic Topology Example
The following example shows a basic DOT file and its corresponding topology diagram. Use the same topology.dot file on all switches instead of splitting the file for each device; using the exact same file on each device makes automation easier.
When ptmd is in an incorrect failure state and you enable the Zebra interface, PIF BGP sessions do not establish the route but the subinterface does establish routes.
If the subinterface is on the physical interface and PTM marks the physical interface in a PTM FAIL state, FRR does not process routes on the physical interface, but the subinterface is working.
Commas in Port Descriptions
If an LLDP neighbor advertises a PortDescr that contains commas, ptmctl -d splits the string on the commas and misplaces its components in other columns. Do not use commas in your port descriptions.
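The following Python sketch shows the general failure mode: when a record is split on commas without quoting, a comma inside one field produces an extra column and shifts everything after it. The field values are made up for illustration, and the code does not reproduce how ptmctl parses its input.

```python
import csv
import io

# Hypothetical LLDP fields; the port description contains a comma.
fields = ["swp45", "h1", "swp1", "uplink, rack 4", "IfName"]

# Naive approach: join with commas, then split on commas.
naive_row = ",".join(fields)
naive_split = naive_row.split(",")

# Quoting-aware approach, shown only for contrast.
buffer = io.StringIO()
csv.writer(buffer).writerow(fields)
quoted_split = next(csv.reader(io.StringIO(buffer.getvalue())))

print(len(fields), len(naive_split), len(quoted_split))
```

The naive split yields six columns from five fields, so the match-on value lands one column to the right of where it belongs.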
Port security is a layer 2 traffic control feature that enables you to manage network access from end-users. Use port security to:
Limit port access to specific MAC addresses so that the port does not forward ingress traffic from source addresses that are not defined.
Limit port access to only the first learned MAC address on the port (sticky MAC) so that the device with that MAC address has full bandwidth. You can provide a timeout so that the MAC address on that port no longer has access after a specified time.
Limit port access to a specific number of MAC addresses.
You can specify what action to take when there is a port security violation (drop packets or put the port into ADMIN down state) and add a timeout for the action to take effect.
Layer 2 interfaces in trunk or access mode are currently supported. However, interfaces in a bond are not supported.
NVUE commands are not available for port security configuration.
Configure Port Security
To configure port security, add the configuration settings you want to use to the /etc/cumulus/switchd.d/port_security.conf file, then restart switchd to apply the changes.
Setting
Description
interface.<port>.port_security.enable
1 enables security on the port. 0 disables security on the port.
interface.<port>.port_security.mac_limit
The maximum number of MAC addresses allowed to access the port. You can specify a number between 0 and 512. The default is 32.
interface.<port>.port_security.static_mac
The specific MAC addresses allowed to access the port. You can specify multiple MAC addresses. Separate each MAC address with a space.
interface.<port>.port_security.sticky_mac
1 enables sticky MAC, where the first learned MAC address on the port is the only MAC address allowed. 0 disables sticky MAC.
interface.<port>.port_security.sticky_timeout
The time period after which the first learned MAC address ages out and no longer has access to the port. The default aging timeout value is 30 minutes. You can specify a value between 0 and 60 minutes.
interface.<port>.port_security.sticky_aging
1 enables sticky MAC aging. 0 disables sticky MAC aging.
interface.<port>.port_security.violation_mode
The violation mode: 0 (shutdown) puts a port into ADMIN down state. 1 (restrict) drops packets.
interface.<port>.port_security.violation_timeout
The number of seconds after which the violation mode times out. You can specify a value between 0 and 3600 seconds. The default value is 1800 seconds.
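Before you restart switchd, you might want to check the values against the ranges in the table above. The following Python sketch does that, assuming each setting is written on its own line as key = value; that line syntax, the helper name check_port_security, and the sample values are assumptions for illustration.

```python
# Allowed ranges, taken from the settings table above.
RANGES = {
    "enable": (0, 1),
    "mac_limit": (0, 512),
    "sticky_mac": (0, 1),
    "sticky_timeout": (0, 60),      # minutes
    "sticky_aging": (0, 1),
    "violation_mode": (0, 1),
    "violation_timeout": (0, 3600),  # seconds
}


def check_port_security(text: str) -> list[str]:
    """Return a list of problems found in port security settings."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        parts = key.strip().split(".")
        if (not sep or len(parts) != 4 or parts[0] != "interface"
                or parts[2] != "port_security"):
            problems.append(f"line {number}: unrecognized entry")
            continue
        setting = parts[3]
        if setting == "static_mac":
            continue  # space-separated MAC list; not range-checked here
        if setting not in RANGES:
            problems.append(f"line {number}: unknown setting {setting}")
            continue
        low, high = RANGES[setting]
        try:
            number_value = int(value.strip())
        except ValueError:
            problems.append(f"line {number}: {setting} is not an integer")
            continue
        if not low <= number_value <= high:
            problems.append(
                f"line {number}: {setting} must be between {low} and {high}")
    return problems


# Hypothetical settings; the last value is deliberately out of range.
SAMPLE = """\
interface.swp1.port_security.enable = 1
interface.swp1.port_security.mac_limit = 32
interface.swp1.port_security.static_mac = 00:02:00:00:00:05 00:02:00:00:00:06
interface.swp1.port_security.violation_timeout = 7200
"""
```

Calling check_port_security(SAMPLE) reports only the violation_timeout line, because 7200 exceeds the 3600 second maximum.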
The following example shows an /etc/cumulus/switchd.d/port_security.conf configuration file:
The lldpd daemon implements the IEEE802.1AB LLDP standard and starts at system boot. All lldpd command line arguments are in the /etc/default/lldpd file.
lldpd supports CDP (Cisco Discovery Protocol, v1 and v2) and logs by default into /var/log/daemon.log with an lldpd prefix.
Configure LLDP Timers
You can configure the frequency of LLDP updates (between 5 and 32768 seconds) and the amount of time (between 1 and 8192 seconds) to hold the information before discarding it. The hold time interval is a multiple of the tx-interval.
The following example commands set the LLDP update frequency to 100 seconds and the hold time multiplier to 3.
cumulus@switch:~$ nv set service lldp tx-interval 100
cumulus@switch:~$ nv set service lldp tx-hold-multiplier 3
cumulus@switch:~$ nv config apply
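Because the hold time is a multiple of the tx-interval, you can work out how long a neighbor keeps the information with simple arithmetic. The Python sketch below shows the calculation; the function name lldp_hold_time is hypothetical, and applying the 1 to 8192 bound to the multiplier rather than to the resulting time is an assumption.

```python
def lldp_hold_time(tx_interval: int, tx_hold_multiplier: int) -> int:
    """Return the hold time in seconds for the given LLDP timers."""
    if not 5 <= tx_interval <= 32768:
        raise ValueError("tx-interval must be between 5 and 32768 seconds")
    # Assumption: the documented 1-8192 bound applies to the multiplier.
    if not 1 <= tx_hold_multiplier <= 8192:
        raise ValueError("tx-hold-multiplier must be between 1 and 8192")
    return tx_interval * tx_hold_multiplier


print(lldp_hold_time(100, 3))
```

With the values in the example above, the hold time is 100 x 3 = 300 seconds.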
Create the /etc/lldpd.conf file or create a file in the /etc/lldpd.d/ directory with a .conf suffix and add the timers:
Restart the lldpd service for the changes to take effect:
cumulus@switch:~$ sudo systemctl restart lldpd
Disable LLDP on an Interface
To disable LLDP on a single interface, edit the /etc/default/lldpd file. This file specifies the default options to present to the lldpd service when it starts. The following example uses the -I option to disable LLDP on swp43:
cumulus@switch:~$ sudo nano /etc/default/lldpd
# Add "-x" to DAEMON_ARGS to start SNMP subagent
# Enable CDP by default
DAEMON_ARGS="-c -I *,!swp43"
Restart the lldpd service for the changes to take effect:
cumulus@switch:~$ sudo systemctl restart lldpd
Runtime Configuration (Advanced)
A runtime configuration does not persist when you reboot the switch; you lose all changes.
To configure active interfaces:
cumulus@switch:~$ sudo lldpcli configure system interface pattern "swp*"
To configure inactive interfaces:
cumulus@switch:~$ sudo lldpcli configure system interface pattern '*,!eth0,swp*'
The active interface list always overrides the inactive interface list.
To reset any interface list to none:
cumulus@switch:~$ sudo lldpcli configure system interface pattern ""
Enable the SNMP Subagent
LLDP does not enable the SNMP subagent by default. To enable the SNMP subagent, edit the /etc/default/lldpd file and add the -x option:
cumulus@switch:~$ sudo nano /etc/default/lldpd
# Add "-x" to DAEMON_ARGS to start SNMP subagent
# Enable CDP by default
DAEMON_ARGS="-c -x -M 4"
Restart the lldpd service for the changes to take effect:
cumulus@switch:~$ sudo systemctl restart lldpd
The -c option enables backwards compatibility with CDP. See Change CDP Settings below.
The -M 4 option sends a field in discovery packets to indicate that the switch is a network device.
Change CDP Settings
Cumulus Linux provides support for CDP so that the switch can advertise information about itself to Cisco routers that do not support LLDP. By default, the Cumulus Linux switch sends CDP packets only if the peer sends CDP packets. You can change this setting by replacing -c in the /etc/default/lldpd file with one of the following options:
Option
Description
-cc
The Cumulus Linux switch sends CDPv1 packets even when there is no detected CDP peer.
-ccc
The Cumulus Linux switch sends CDPv2 packets even when there is no detected CDP peer.
-cccc
The Cumulus Linux switch disables CDPv1 and enables CDPv2.
-ccccc
The Cumulus Linux switch disables CDPv1 and forces CDPv2.
The following example changes the CDP setting to -ccc so that the switch sends CDPv2 packets even when there is no detected CDP peer:
You must restart the lldpd service for the changes to take effect.
cumulus@switch:~$ sudo systemctl restart lldpd
Set LLDP Mode
By default, the lldpd service sends LLDP frames unless it detects a CDP peer, then it sends CDP frames. You can change this behavior and configure the lldpd service to send only CDP frames or only LLDP frames.
You configure the lldpd service to send only CDP or only LLDP frames globally for all interfaces; you cannot configure these settings for specific interfaces.
You must restart the lldpd service for the changes to take effect.
cumulus@switch:~$ sudo systemctl restart lldpd
To show the current LLDP mode, run the nv show service lldp command. The following example shows that the lldpd service sends CDPv2 frames only.
cumulus@leaf02:mgmt:~$ nv show service lldp
operational applied
------------------ ---------------- ----------------
dot1-tlv off off
mode force-send-cdpv2 force-send-cdpv2
tx-hold-multiplier 4 4
tx-interval 30 30
LLDP DCBX TLVs
DCBX is an extension of LLDP. Cumulus Linux supports DCBX TLVs to provide additional information in LLDP packets to peers, such as VLAN information and QoS. Adding QoS configuration as part of the DCBX TLVs allows automated configuration on hosts and switches that connect to the switch.
Cumulus Linux can send a maximum of 250 VLANs per switch port in one LLDP frame.
Cumulus Linux does not support CEE DCBX TLVs.
Cumulus Linux limits DCBX support to enabling DCBX TLVs (either with RoCE global configuration or per interface) as documented in the IEEE 802.1Q standard.
Cumulus Linux supports the following TLVs:
IEEE 802.1 TLVs
Name
Subtype
Description
Port VLAN ID
1
The port VLAN identifier.
VLAN Name
3
The name of any VLAN to which the port belongs.
Link Aggregation
7
Indicates if the port supports link aggregation and if it is on.
Cumulus Linux transmits the following 802.3 TLVs by default. You do not need to enable them.
Name
Subtype
Description
Link Aggregation
3
Indicates if the port supports link aggregation and if it is on.
Maximum Frame Size
4
The MTU configuration on the port. The MTU on the port is the MFS.
Transmit IEEE 802.1 TLVs
You can transmit the 802.1 TLV types (VLAN name, Port VLAN ID, and IEEE 802.1 Link Aggregation) when exchanging LLDP messages. By default, 802.1 TLV transmission is off and the switch sends all LLDP frames without 802.1 TLVs.
To enable 802.1 TLV transmission, run the nv set service lldp dot1-tlv on command:
cumulus@switch:~$ nv set service lldp dot1-tlv on
cumulus@switch:~$ nv config apply
Transmit QoS TLVs
You can enable QoS TLV transmission (ETS Configuration, ETS Recommendation, PFC Configuration) on an interface. By default, all QoS TLV transmission is off on all interfaces.
Adding the QoS TLVs to LLDP packets on an interface relies on PFC and ETS configuration from switchd. Refer to Quality of Service for information on configuring PFC and ETS.
When you enable DCBX TLVs with the RoCE global configuration, QoS TLV transmission (PFC Configuration, ETS Configuration, and ETS Recommendation) is on globally for all ports, which overrides any QoS TLV transmission setting on a switch port interface.
LLDP frames for all switch port interfaces carry PFC configuration, ETS configuration, ETS recommendation, and APP Priority TLVs. The ETS configuration and PFC configuration TLV payloads are the same for all interfaces.
To enable PFC Configuration TLV transmission, run the nv set interface <interface> lldp dcbx-pfc-tlv on command:
cumulus@switch:~$ nv set interface swp1 lldp dcbx-pfc-tlv on
cumulus@switch:~$ nv config apply
To enable ETS Configuration TLV transmission, run the nv set interface <interface> lldp dcbx-ets-config-tlv on command:
cumulus@switch:~$ nv set interface swp1 lldp dcbx-ets-config-tlv on
cumulus@switch:~$ nv config apply
To enable ETS Recommendation TLV transmission, run the nv set interface <interface> lldp dcbx-ets-recomm-tlv on command:
cumulus@switch:~$ nv set interface swp1 lldp dcbx-ets-recomm-tlv on
cumulus@switch:~$ nv config apply
The interface must be a physical interface; you cannot enable TLVs on bonds.
Show DCBX TLV Settings
To show if IEEE 802.1 TLV transmission is on, run the NVUE nv show service lldp command:
cumulus@leaf01:mgmt:~$ nv show service lldp
operational applied description
------------------ ----------- ------- ----------------------------------------------------------------------
dot1-tlv on on Enable dot1 TLV advertisements on enabled ports
tx-hold-multiplier 4 4 < TTL of transmitted packets is calculated by multiplying the tx-in...
tx-interval 30 30 change transmit delay
To show if QoS TLV transmission is on for an interface, run the NVUE nv show interface <interface> command:
cumulus@leaf01:mgmt:~$ nv show interface swp1
operational applied description
------------------------ ----------------- ----------- ---------------------------------------------------
...
lldp
dcbx-ets-config-tlv on DCBX ETS config TLV flag
dcbx-ets-recomm-tlv off DCBX ETS recommendation TLV flag
dcbx-pfc-tlv on DCBX PFC TLV flag
...
Troubleshooting
You can use the lldpcli tool to query the lldpd daemon for neighbors, statistics, and other running configuration information. See man lldpcli(8) for details.
To show all neighbors on all ports and interfaces:
cumulus@switch:~$ sudo lldpcli show neighbors
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: eth0, via: LLDP, RID: 1, Time: 0 day, 17:38:08
Chassis:
ChassisID: mac 08:9e:01:e9:66:5a
SysName: PIONEERMS22
SysDescr: Cumulus Linux version 4.1.0 running on quanta lb9
MgmtIP: 192.168.0.22
Capability: Bridge, on
Capability: Router, on
Port:
PortID: ifname swp47
PortDescr: swp47
-------------------------------------------------------------------------------
Interface: swp1, via: LLDP, RID: 10, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:09:00
SysName: MSP-1
SysDescr: Cumulus Linux version 4.1.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.9
MgmtIP: fe80::201:ff:fe00:900
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp1
PortDescr: swp1
-------------------------------------------------------------------------------
Interface: swp2, via: LLDP, RID: 10, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:09:00
SysName: MSP-1
SysDescr: Cumulus Linux version 4.1.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.9
MgmtIP: fe80::201:ff:fe00:900
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp2
PortDescr: swp2
-------------------------------------------------------------------------------
Interface: swp3, via: LLDP, RID: 11, Time: 0 day, 17:08:27
Chassis:
ChassisID: mac 00:01:00:00:0a:00
SysName: MSP-2
SysDescr: Cumulus Linux version 4.1.0 running on QEMU Standard PC (i440FX + PIIX, 1996)
MgmtIP: 192.0.2.10
MgmtIP: fe80::201:ff:fe00:a00
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp1
PortDescr: swp1
...
To show a summary of lldpd statistics for all ports:
cumulus@switch:~$ sudo lldpcli show statistics summary
---------------------------------------------------------------------
LLDP Global statistics:
---------------------------------------------------------------------
Summary of stats:
Transmitted: 648186
Received: 437557
Discarded: 0
Unrecognized: 0
Ageout: 10
Inserted: 38
Deleted: 10
To show the running LLDP configuration:
cumulus@switch:~$ sudo lldpcli show running-configuration
--------------------------------------------------------------------
Global configuration:
--------------------------------------------------------------------
Configuration:
Transmit delay: 30
Transmit hold: 4
Receive mode: no
Pattern for management addresses: (none)
Interface pattern: (none)
Interface pattern blacklist: (none)
Interface pattern for chassis ID: (none)
Override description with: (none)
Override platform with: Linux
Override system name with: (none)
Advertise version: yes
Update interface descriptions: no
Promiscuous mode on managed interfaces: no
Disable LLDP-MED inventory: yes
LLDP-MED fast start mechanism: yes
LLDP-MED fast start interval: 1
Source MAC for LLDP frames on bond slaves: local
Portid TLV Subtype for lldp frames: ifname
--------------------------------------------------------------------
Considerations
Cumulus Linux does not support LLDP Annex E (and Annex D).
If you configure both an eth0 IP address and a loopback IP address on the switch, LLDP advertises the loopback IP address as the management IP address. In this case, the Cumulus Linux switch behaves more like a typical Linux host than a networking appliance.
Ethernet bridges enable hosts to communicate through layer 2 by connecting the physical and logical interfaces in the system into a single layer 2 domain. The bridge is a logical interface with a MAC address and an MTU. The bridge MTU is the minimum MTU among all its members.
When you configure a bridge with NVUE, Cumulus Linux automatically assigns a hardware address to the bridge. When you configure a bridge by editing the /etc/network/interfaces file, the bridge MAC address is the MAC address of the first port in the bridge-ports list in the /etc/network/interfaces file.
Bridge members can be individual physical interfaces, bonds, or logical interfaces that traverse an 802.1Q VLAN trunk.
Cumulus Linux does not put all ports into a bridge by default.
Ethernet Bridge Types
The Cumulus Linux bridge driver supports two configuration modes: one that is VLAN-aware and one that follows a more traditional Linux bridge model.
NVIDIA recommends that you use VLAN-aware mode bridges instead of traditional mode bridges. The Cumulus Linux bridge driver is capable of VLAN filtering, which allows for configurations that are similar to incumbent network devices. For a comparison of traditional and VLAN-aware modes, see this knowledge base article.
You can configure both VLAN-aware and traditional mode bridges on the same network in Cumulus Linux.
The switch learns the MAC address for a frame when the frame enters the bridge through an interface and records the MAC address in the bridge table. The bridge forwards the frame to its intended destination by looking up the destination MAC address. Cumulus Linux maintains the MAC entry for 1800 seconds (30 minutes). If the switch sees the frame with the same source MAC address before the MAC entry age expires, it refreshes the MAC entry age; if the MAC entry age expires, the switch deletes the MAC address from the bridge table.
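The learn, refresh, and age-out cycle can be modeled with a small Python class. This is a simplified sketch of the behavior described above, not the bridge driver's implementation; the class name MacTable and the explicit now timestamps (in seconds) are assumptions that keep the example deterministic.

```python
class MacTable:
    """Toy MAC table with a fixed aging time (default 1800 seconds)."""

    def __init__(self, ageing_seconds: float = 1800.0):
        self.ageing_seconds = ageing_seconds
        self._entries = {}  # MAC address -> (port, time last seen)

    def learn(self, mac: str, port: str, now: float) -> None:
        # Learning an address that is already known refreshes its age.
        self._entries[mac] = (port, now)

    def lookup(self, mac: str, now: float):
        entry = self._entries.get(mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen >= self.ageing_seconds:
            # The entry aged out; delete it from the table.
            del self._entries[mac]
            return None
        return port


table = MacTable()
table.learn("00:02:00:00:00:01", "swp1", now=0)
table.learn("00:02:00:00:00:01", "swp1", now=1799)  # refresh before expiry
```

After the refresh at 1799 seconds, the entry stays in the table until 3599 seconds instead of expiring at 1800.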
The following example NVUE command output shows a MAC address table for the bridge.
The Linux bridge fdb command interacts with the FDB, which the bridge uses to store the MAC addresses it learns and the ports on which it learns those MAC addresses. The bridge fdb show command output contains some specific keywords:
Keyword
Description
self
The FDB entry belongs to the FDB on the device referenced by the device. For example, this FDB entry belongs to the VXLAN device: vx-1000: 00:02:00:00:00:08 dev vx-1000 dst 27.0.0.10 self
master
The FDB entry belongs to the FDB on the device’s master and the FDB entry is pointing to a master’s port. For example, this FDB entry is from the master device named bridge and is pointing to the VXLAN bridge port: vx-1001: 02:02:00:00:00:08 dev vx-1001 vlan 1001 master bridge
extern_learn
An external control plane, such as the BGP control plane for EVPN, manages (offloads) the FDB entry.
The following example shows the bridge fdb show command output:
cumulus@switch:~$ bridge fdb show | grep 02:02:00:00:00:08
02:02:00:00:00:08 dev vx-1001 vlan 1001 extern_learn master bridge
02:02:00:00:00:08 dev vx-1001 dst 27.0.0.10 self extern_learn
02:02:00:00:00:08 is the MAC address learned with BGP EVPN.
The first FDB entry points to a Linux bridge entry that points to the VXLAN device vx-1001.
The second FDB entry points to the same entry on the VXLAN device and includes additional remote destination information.
The VXLAN FDB augments the bridge FDB with additional remote destination information.
All FDB entries that point to a VXLAN port appear as two entries. The second entry augments the remote destination information.
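To make the keywords easier to spot, the following Python sketch splits a bridge fdb show line into its fields and flags. It handles only the tokens that appear in the example output above (dev, vlan, dst, master, self, extern_learn); the function name parse_fdb_line is hypothetical and the parser is not a complete grammar for the command's output.

```python
def parse_fdb_line(line: str) -> dict:
    """Split one `bridge fdb show` line into key/value fields and flags."""
    tokens = line.split()
    entry = {"mac": tokens[0], "flags": set()}
    index = 1
    while index < len(tokens):
        token = tokens[index]
        # These keywords are followed by a value on the same line.
        if token in {"dev", "vlan", "dst", "master"} and index + 1 < len(tokens):
            entry[token] = tokens[index + 1]
            index += 2
        else:
            # Standalone keywords such as self and extern_learn.
            entry["flags"].add(token)
            index += 1
    return entry


first = parse_fdb_line(
    "02:02:00:00:00:08 dev vx-1001 vlan 1001 extern_learn master bridge")
second = parse_fdb_line(
    "02:02:00:00:00:08 dev vx-1001 dst 27.0.0.10 self extern_learn")
```

The first entry carries the master bridge field and the VLAN, and the second carries the self flag and the remote destination, which matches the two-entry pattern described above.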
Considerations
A bridge cannot contain multiple subinterfaces of the same port. Attempting this configuration results in an error.
If you use both VLAN-aware and traditional bridges and a traditional bridge includes a bond subinterface that is a normal interface in a VLAN-aware bridge, the bridge flaps when you bring down the bond subinterface in the traditional bridge.
You cannot enslave a VLAN raw device to a different master interface (you cannot edit the vlan-raw-device setting in the /etc/network/interfaces file). You need to delete the VLAN and recreate it.
Cumulus Linux enables MAC learning by default on traditional and VLAN-aware bridge interfaces. Do not disable MAC learning unless you are using EVPN. See Ethernet Virtual Private Network - EVPN.
VLAN-aware bridge mode in Cumulus Linux implements a configuration model for large-scale layer 2 environments, with one single instance of spanning tree protocol. Each physical bridge member port includes the list of allowed VLANs as well as the port VLAN ID, either the primary VLAN Identifier (PVID) or native VLAN. MAC address learning, filtering and forwarding are VLAN-aware. This reduces the configuration size, and eliminates the large overhead of managing the port and VLAN instances as subinterfaces, replacing them with lightweight VLAN bitmaps and state updates.
Cumulus Linux supports multiple VLAN-aware bridges but with the following limitations:
You cannot use MLAG with multiple VLAN-aware bridges.
You cannot use the same port with multiple VLAN-aware bridges.
You cannot use the same VNIs in multiple VLAN-aware bridges.
You cannot use VLAN translation with multiple VLAN-aware bridges.
You cannot use double tagged VLAN interfaces with multiple VLAN-aware bridges.
You cannot associate multiple single VXLAN devices (SVDs) with a single VLAN-aware bridge.
Configure a VLAN-aware Bridge
The example commands below create a VLAN-aware bridge for STP that contains two switch ports and includes three VLANs: tagged VLANs 10 and 20, and untagged (native) VLAN 1.
With NVUE, there is a default bridge called br_default, which has no ports assigned. The example below configures this default bridge.
cumulus@switch:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:~$ nv set bridge domain br_default vlan 10,20
cumulus@switch:~$ nv set bridge domain br_default untagged 1
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file and add the bridge:
Run the ifreload -a command to load the new configuration:
cumulus@switch:~$ ifreload -a
The Primary VLAN Identifier (PVID) of the bridge defaults to 1. You do not have to specify bridge-pvid for a bridge or a port; however, specifying it explicitly makes the configuration easier for other users to read and does not change its behavior. The following configurations are identical to each other and the configuration above:
If you specify bridge-vids or bridge-pvid at the bridge level, all ports in the bridge inherit these configurations. However, specifying any of these settings for a specific port overrides the setting in the bridge.
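The inheritance rule can be expressed as a short lookup: use the port setting if it exists, otherwise fall back to the bridge setting, and fall back to a PVID of 1 if neither specifies one. The Python sketch below models the rule with plain dictionaries; the function name effective_port_settings and the data layout are hypothetical and do not reflect how ifupdown2 stores its configuration.

```python
DEFAULT_PVID = 1  # the bridge PVID defaults to 1


def effective_port_settings(bridge: dict, port: dict) -> dict:
    """Resolve bridge-vids and bridge-pvid for one port.

    A value set on the port overrides the value inherited from the bridge.
    """
    return {
        "bridge-vids": port.get("bridge-vids", bridge.get("bridge-vids", [])),
        "bridge-pvid": port.get("bridge-pvid",
                                bridge.get("bridge-pvid", DEFAULT_PVID)),
    }


bridge = {"bridge-vids": [10, 20]}
inherited = effective_port_settings(bridge, {})
overridden = effective_port_settings(bridge, {"bridge-pvid": 100})
```

Here the first port inherits VLANs 10 and 20 with PVID 1, and the second port keeps the inherited VLANs but uses its own PVID of 100.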
Do not bridge the management port eth0 with any switch ports. For example, if you create a bridge with eth0 and swp1, the bridge does not work correctly and disrupts access to the management interface.
Configure Multiple VLAN-aware Bridges
This example shows the commands required to create two VLAN-aware bridges on the switch.
bridge1 bridges swp1 and swp2, and includes two VLANs: VLAN 10 and VLAN 20.
bridge2 bridges swp3 and contains one VLAN: VLAN 10.
Bridges are independent so you can reuse VLANs between bridges. Each VLAN-aware bridge maintains its own MAC address and VLAN tag table; MAC and VLAN tags in one bridge are not visible to the other table.
cumulus@switch:~$ nv set interface swp1-2 bridge domain bridge1
cumulus@switch:~$ nv set bridge domain bridge1 vlan 10,20
cumulus@switch:~$ nv set bridge domain bridge1 untagged 1
cumulus@switch:~$ nv set interface swp3 bridge domain bridge2
cumulus@switch:~$ nv set bridge domain bridge2 vlan 10
cumulus@switch:~$ nv set bridge domain bridge2 untagged 1
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file and add the bridge:
Run the ifreload -a command to load the new configuration:
cumulus@switch:~$ ifreload -a
NVIDIA Spectrum 1 switches support a maximum of 10000 VLAN elements. NVIDIA Spectrum-2 switches and later support a maximum of 15996 VLAN elements when warm restart mode is off or 7934 VLAN elements when warm restart mode is on.
Cumulus Linux calculates the total number of VLAN elements as the number of VLANs times the number of configured bridges. For example, 6 bridges, each containing 2600 VLANs totals 15600 VLAN elements.
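To check a planned design against these limits, you can reproduce the calculation in a few lines of Python. The limits come from the text above; the function names vlan_elements and max_vlan_elements are hypothetical, and the sketch assumes every bridge contains the same number of VLANs.

```python
def vlan_elements(vlans_per_bridge: int, bridges: int) -> int:
    """Number of VLANs times the number of configured bridges."""
    return vlans_per_bridge * bridges


def max_vlan_elements(spectrum1: bool, warm_restart: bool = False) -> int:
    """Maximum VLAN elements for the platform, as listed above."""
    if spectrum1:
        return 10000
    return 7934 if warm_restart else 15996


total = vlan_elements(2600, 6)
fits = total <= max_vlan_elements(spectrum1=False, warm_restart=False)
print(total, fits)
```

Six bridges with 2600 VLANs each total 15600 VLAN elements, which fits on Spectrum-2 and later with warm restart mode off but exceeds the limit with warm restart mode on and the Spectrum 1 limit.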
On NVIDIA Spectrum-2 switches and later, if you enable multiple VLAN-aware bridges and want to use more VLAN elements than the default, you must update the number of VLAN elements in the /etc/mlx/datapath/broadcast_domains.conf file.
To specify the total number of bridge domains you want to use, uncomment and edit the broadcast_domain.max_vlans parameter. The default value is 6143 when warm restart mode is off or 4096 when warm restart mode is on.
To specify the total number of subinterfaces you want to use, uncomment and edit the broadcast_domain.max_subinterfaces parameter. The default value is 3872 when warm restart mode is off or 1872 when warm restart mode is on.
You must restart switchd with the systemctl restart switchd command to apply the configuration.
The number of broadcast_domain.max_vlans plus broadcast_domain.max_subinterfaces cannot exceed 15996.
Increasing the broadcast_domain.max_vlans parameter can affect layer 2 multicast scale support.
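Before editing the file, it can help to confirm that the two values you plan to use stay within the combined ceiling. The following Python sketch only checks the sum described above; the function name is hypothetical and the defaults are the ones stated in the text.

```python
MAX_TOTAL = 15996  # combined ceiling stated in the text

# Defaults stated in the text, keyed by whether warm restart mode is on.
DEFAULTS = {
    False: {"max_vlans": 6143, "max_subinterfaces": 3872},
    True: {"max_vlans": 4096, "max_subinterfaces": 1872},
}


def check_broadcast_domains(max_vlans: int, max_subinterfaces: int) -> int:
    """Return the combined total, or raise ValueError if it is not allowed."""
    if max_vlans < 0 or max_subinterfaces < 0:
        raise ValueError("values must not be negative")
    total = max_vlans + max_subinterfaces
    if total > MAX_TOTAL:
        raise ValueError(f"{total} exceeds the combined maximum of {MAX_TOTAL}")
    return total


if __name__ == "__main__":
    for warm_restart, values in DEFAULTS.items():
        print(warm_restart, check_broadcast_domains(**values))
```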
Reserved VLAN Range
For hardware data plane internal operations, the switching silicon requires VLANs for every physical port, Linux bridge, and layer 3 subinterface. Cumulus Linux reserves a range of VLANs by default; the reserved range is 3725-3999.
If the reserved VLAN range conflicts with any user-defined VLANs, you can modify the range. The new range must be a contiguous set of VLANs with IDs between 2 and 4094. For a single VLAN-aware bridge, the minimum size of the range is 2 VLANs. For multiple VLAN-aware bridges, the minimum size of the range is the number of VLAN-aware bridges on the system plus one.
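The constraints above can be expressed as a small validation routine. This Python sketch is illustrative only; the function name and the start-end string format it parses are assumptions for the example, and it checks nothing beyond the rules stated in the paragraph.

```python
from typing import Tuple


def validate_reserved_range(spec: str, vlan_aware_bridges: int = 1) -> Tuple[int, int]:
    """Validate a reserved VLAN range given as 'start-end'.

    Rules taken from the text: IDs between 2 and 4094, a contiguous range,
    at least 2 VLANs for a single VLAN-aware bridge, and at least
    (number of VLAN-aware bridges + 1) VLANs for multiple bridges.
    """
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        raise ValueError("range must look like 'start-end'")
    start, end = int(start_text), int(end_text)
    if not 2 <= start <= end <= 4094:
        raise ValueError("VLAN IDs must be between 2 and 4094, start <= end")
    size = end - start + 1
    minimum = 2 if vlan_aware_bridges <= 1 else vlan_aware_bridges + 1
    if size < minimum:
        raise ValueError(f"range holds {size} VLANs, needs at least {minimum}")
    return start, end


if __name__ == "__main__":
    print(validate_reserved_range("3725-3999"))     # default range from the text
    print(validate_reserved_range("4064-4094", 4))  # range with four bridges
```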
The following example changes the reserved VLAN range to be between 4064 and 4094:
cumulus@switch:~$ nv set system global reserved vlan internal range 4064-4094
cumulus@switch:~$ nv config apply
Edit the /etc/cumulus/switchd.conf file to uncomment the resv_vlan_range line and specify a new range.
cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
...
# global reserved vlan internal range
resv_vlan_range = 4064-4094
After you save the file, you must restart switchd:
cumulus@switch:~$ sudo systemctl restart switchd
With VLAN-aware bridge mode, you can configure a switch port to drop any untagged frames. To do this, add bridge-allow-untagged no to the switch port (not to the bridge). The bridge port then has no PVID and drops untagged packets.
The following example command configures swp2 to drop untagged frames:
Edit the /etc/network/interfaces file to add the bridge-allow-untagged no line under the switch port interface stanza, then run the ifreload -a command.
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1

auto swp2
iface swp2
    bridge-allow-untagged no

auto br_default
iface br_default
    bridge-ports swp1 swp2
    bridge-pvid 1
    bridge-vids 10 20
    bridge-vlan-aware yes
...
cumulus@switch:~$ sudo ifreload -a
When you check VLAN membership for that port, it shows that there is no untagged VLAN.
When configuring the VLAN attributes for the bridge, specify the attributes for each VLAN interface. If you are configuring the switch virtual interface (SVI) for the native VLAN, you must declare the native VLAN and specify its IP address. Specifying the IP address in the bridge stanza itself returns an error.
The following example commands declare native VLAN 10 with IPv4 address 10.1.10.2/24 and IPv6 address 2001:db8::1/32.
The NVUE and Linux commands also show an example with multiple VLAN-aware bridges.
cumulus@switch:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@switch:~$ nv set interface vlan10 ip address 2001:db8::1/32
cumulus@switch:~$ nv config apply
cumulus@switch:~$ nv set interface bridge2_vlan10 type svi
cumulus@switch:~$ nv set interface bridge2_vlan10 vlan 10
cumulus@switch:~$ nv set interface bridge2_vlan10 base-interface bridge2
cumulus@switch:~$ nv set interface bridge2_vlan10 ip address 10.1.10.2/24
cumulus@switch:~$ nv set interface bridge1_vlan10 type svi
cumulus@switch:~$ nv set interface bridge1_vlan10 vlan 10
cumulus@switch:~$ nv set interface bridge1_vlan10 base-interface bridge1
cumulus@switch:~$ nv set interface bridge1_vlan10 ip address 12.1.10.2/24
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file, then run the ifreload -a command.
The first time you configure a switch, all southbound bridge ports are down; therefore, by default, the SVI is also down. To force the SVI to stay in the UP state even when all member ports are down, disable interface state tracking. Other implementations describe this feature as no autostate. This is beneficial if you want to perform connectivity testing.
To keep the SVI perpetually UP, create a dummy interface, then make the dummy interface a member of the bridge.
Example Configuration
Consider the following configuration, without a dummy interface in the bridge:
With this configuration, when swp3 is down, the SVI is also down:
cumulus@switch:~$ ip link show swp3
5: swp3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br_default state DOWN mode DEFAULT group default qlen 1000
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show br_default
35: br_default: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
Now add the dummy interface to your network configuration:
Edit the /etc/network/interfaces file and add the dummy interface stanza before the bridge stanza:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto dummy
iface dummy
    link-type dummy

auto br_default
iface br_default
...
Add the dummy interface to the bridge-ports line in the bridge configuration:
Save and exit the file, then reload the configuration:
cumulus@switch:~$ sudo ifreload -a
Now, even when swp3 is down, both the dummy interface and the bridge remain up:
cumulus@switch:~$ ip link show swp3
5: swp3: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br_default state DOWN mode DEFAULT group default qlen 1000
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show dummy
37: dummy: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue master br_default state UNKNOWN mode DEFAULT group default
link/ether 66:dc:92:d4:f3:68 brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show br_default
35: br_default: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 2c:60:0c:66:b1:7f brd ff:ff:ff:ff:ff:ff
IPv6 Link-local Address Generation
By default, Cumulus Linux automatically generates IPv6 link-local addresses on VLAN interfaces. If you want to use a different mechanism to assign link-local addresses, you can disable this feature. You can disable link-local automatic address generation for both regular IPv6 addresses and address-virtual (macvlan) addresses.
The following example disables automatic address generation for a regular IPv6 address on VLAN 10.
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/network/interfaces file to add the line ipv6-addrgen off to the VLAN stanza, then run the ifreload -a command.
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan10
    ipv6-addrgen off
    vlan-id 10
    vlan-raw-device br_default
...
cumulus@switch:~$ sudo ifreload -a
To reenable automatic link-local address generation for a VLAN:
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/network/interfaces file to remove the line ipv6-addrgen off from the VLAN stanza, then run the ifreload -a command.
MAC Address for a Bridge
To configure a MAC address for a bridge, run the nv set bridge domain <bridge> mac-address <mac-address> command.
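The following example configures the bridge br_default with MAC address 00:00:5E:00:53:00:
cumulus@switch:~$ nv set bridge domain br_default mac-address 00:00:5E:00:53:00
cumulus@switch:~$ nv config apply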
To unset the MAC address for a bridge, remove the MAC address from the bridge stanza and run the sudo ifreload -a command.
MAC Address Ageing
By default, Cumulus Linux stores MAC addresses in the Ethernet switching table for 1800 seconds (30 minutes). You can change this setting to a value between 0 and 65535 seconds. A value of 0 disables MAC learning, so frames flood out of all ports in a VLAN.
The following command example changes the MAC ageing setting to 600 seconds:
To show the bridge ageing configuration setting, run the nv show bridge domain <domain> command or the Linux sudo ip -d link show <bridge-domain> command.
cumulus@switch:~$ nv show bridge domain br_default
operational applied
--------------- ----------- ----------
ageing 600
encap 802.1Q
mac-address auto
type vlan-aware
untagged 1
vlan-vni-offset 0
...
To reset bridge ageing to the default value (1800 seconds), run the nv unset bridge domain <domain> ageing command.
Static MAC Address Entries
You can add a static MAC address entry to the layer 2 table for an interface within the VLAN-aware bridge by running a command similar to the following:
cumulus@switch:~$ sudo bridge fdb add 12:34:56:12:34:56 dev swp1 vlan 150 master static sticky
cumulus@switch:~$ sudo bridge fdb show
44:38:39:00:00:7c dev swp1 master bridge permanent
12:34:56:12:34:56 dev swp1 vlan 150 sticky master bridge static
44:38:39:00:00:7c dev swp1 self permanent
12:12:12:12:12:12 dev swp1 self permanent
12:34:12:34:12:34 dev swp1 self permanent
12:34:56:12:34:56 dev swp1 self permanent
12:34:12:34:12:34 dev bridge master bridge permanent
44:38:39:00:00:7c dev bridge vlan 500 master bridge permanent
12:12:12:12:12:12 dev bridge master bridge permanent
Example Configuration
The following example configuration contains an access port (swp1), a trunk carrying all VLANs (swp3 through swp48), and a trunk pruning some VLANs from a switch port (swp2).
cumulus@switch:mgmt:~$ nv set interface swp3-48 bridge domain br_default
cumulus@switch:mgmt:~$ nv set bridge domain br_default vlan 310,700,707,712,850,910
cumulus@switch:mgmt:~$ nv set interface swp1 bridge domain br_default access 310
cumulus@switch:mgmt:~$ nv set interface swp1 bridge domain br_default stp bpdu-guard on
cumulus@switch:mgmt:~$ nv set interface swp1 bridge domain br_default stp admin-edge on
cumulus@switch:mgmt:~$ nv set interface swp2 bridge domain br_default vlan 707,712,850
cumulus@switch:mgmt:~$ nv set interface swp2 bridge domain br_default stp admin-edge on
cumulus@switch:mgmt:~$ nv set interface swp2 bridge domain br_default stp bpdu-guard on
cumulus@switch:mgmt:~$ nv set interface swp49 bridge domain br_default stp network on
cumulus@switch:mgmt:~$ nv set interface swp50 bridge domain br_default stp network on
cumulus@switch:mgmt:~$ nv config apply
cumulus@switch:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback

auto mgmt
iface mgmt
    address 127.0.0.1/8
    address ::1/128
    vrf-table auto

auto eth0
iface eth0 inet dhcp
    ip-forward off
    ip6-forward off
    vrf mgmt

# the following is an access port
auto swp1
iface swp1
    bridge-access 310
    mstpctl-bpduguard yes
    mstpctl-portadminedge yes

# the following is a trunk port that is pruned
# only .1q tags of 707, 712, 850 are sent and received
auto swp2
iface swp2
    bridge-vids 707 712 850
    mstpctl-bpduguard yes
    mstpctl-portadminedge yes
...
# the following port is the trunk uplink and inherits all vlans
# from br_default; bridge assurance is enabled using portnetwork
auto swp49
iface swp49
    mstpctl-portnetwork yes

# the following port is the trunk uplink and inherits all vlans
# from br_default; bridge assurance is enabled using portnetwork
auto swp50
iface swp50
    mstpctl-portnetwork yes

# ports swp3-swp48 are trunk ports that inherit vlans
# 310,700,707,712,850,910 from the bridge br_default
auto br_default
iface br_default
    bridge-ports swp1 swp2 swp3... swp49 swp50
    hwaddress 44:38:39:22:01:af
    bridge-vlan-aware yes
    bridge-vids 310 700 707 712 850 910
    bridge-pvid 1
You cannot enable VLAN translation on a bridge in VLAN-aware mode. Only traditional mode bridges support VLAN translation.
Bridge Conversion
You cannot convert traditional mode bridges automatically to and from a VLAN-aware bridge. You must delete the original configuration and bring down all member switch ports before creating a new bridge.
Traditional Bridge Mode
For traditional Linux bridges, the kernel supports VLANs in the form of VLAN subinterfaces. When you enable bridging on multiple VLANs, you configure a bridge for each VLAN and create one or more VLAN subinterfaces for each member port on the bridge. This mode can pose scalability challenges with configuration size as well as boot time and run time state management when the number of ports times the number of VLANs becomes large.
Use VLAN-aware mode bridges instead of traditional mode bridges.
Use traditional mode bridges if you need to use PVSTP+.
Configure a Traditional Mode Bridge
The following example commands configure a traditional mode bridge called my_bridge, where swp1, swp2, swp3, and swp4 are members of the bridge. The example also configures the bridge with IP address 10.10.10.10/24 to provide IP access to the bridge interface.
Cumulus Linux does not provide NVUE commands for traditional bridge mode.
Edit the /etc/network/interfaces file, then run the ifreload -a command.
...
auto swp1
iface swp1

auto swp2
iface swp2

auto swp3
iface swp3

auto swp4
iface swp4

auto my_bridge
iface my_bridge
    address 10.10.10.10/24
    bridge-ports swp1 swp2 swp3 swp4
    bridge-vlan-aware no
...
cumulus@switch:~$ sudo ifreload -a
Do not bridge the management port, eth0, with any switch ports (swp0, swp1, and so on). For example, if you create a bridge with eth0 and swp1, it does not work.
The name of the bridge must be compliant with Linux interface naming conventions and unique within the switch.
You can configure multiple bridges to divide a switch into multiple layer 2 domains. This enables hosts to communicate with other hosts in the same domain, while separating them from hosts in other domains.
The example below shows a multiple bridge configuration, where host-1 and host-2 connect to bridge-A, and host-3 and host-4 connect to bridge-B:
host-1 and host-2 can communicate with each other
host-3 and host-4 can communicate with each other
host-1 and host-2 cannot communicate with host-3 and host-4
This example configuration looks like this in the /etc/network/interfaces file:
...
auto bridge-A
iface bridge-A
    bridge-ports swp1 swp2
    bridge-vlan-aware no

auto bridge-B
iface bridge-B
    bridge-ports swp3 swp4
    bridge-vlan-aware no
...
Trunks in Traditional Bridge Mode
The standard for trunking is 802.1Q. The 802.1Q specification adds a four-byte header within the Ethernet frame that identifies the VLAN of which the frame is a member.
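To make the four-byte header concrete, the following Python sketch packs and unpacks one. It relies on the standard 802.1Q layout (a 16-bit tag protocol identifier of 0x8100 followed by a 3-bit priority, a 1-bit drop eligible indicator, and a 12-bit VLAN ID); the function names are hypothetical and the code is not part of Cumulus Linux.

```python
import struct

TPID_8021Q = 0x8100  # tag protocol identifier that marks an 802.1Q tag


def build_tag(vid: int, pcp: int = 0, dei: int = 0) -> bytes:
    """Pack the four-byte 802.1Q tag: TPID, then PCP/DEI/VID in 16 bits."""
    if not 0 <= vid <= 4095:
        raise ValueError("VLAN ID must fit in 12 bits")
    if not 0 <= pcp <= 7:
        raise ValueError("priority must fit in 3 bits")
    if dei not in (0, 1):
        raise ValueError("DEI is a single bit")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)


def parse_tag(tag: bytes) -> dict:
    """Unpack a four-byte 802.1Q tag into its fields."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return {"pcp": tci >> 13, "dei": (tci >> 12) & 1, "vid": tci & 0x0FFF}


if __name__ == "__main__":
    tag = build_tag(100)
    print(tag.hex())       # 81000064
    print(parse_tag(tag))  # {'pcp': 0, 'dei': 0, 'vid': 100}
```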
802.1Q also identifies an untagged frame as belonging to the native VLAN (most network devices default their native VLAN to 1). In Cumulus Linux:
A trunk port is a switch port configured to send and receive 802.1Q tagged frames.
A switch sending an untagged (bare Ethernet) frame on a trunk port is sending from the native VLAN defined on the trunk port.
A switch sending a tagged frame on a trunk port is sending to the VLAN identified by the 802.1Q tag.
A switch receiving an untagged (bare Ethernet) frame on a trunk port places that frame in the native VLAN defined on the trunk port.
A switch receiving a tagged frame on a trunk port places that frame in the VLAN identified by the 802.1Q tag.
A bridge in traditional mode has no concept of trunks, just tagged or untagged frames. With a trunk of 200 VLANs, there need to be 199 bridges, each containing a tagged physical interface, and one bridge containing the native untagged VLAN.
The interaction of tagged and untagged frames on the same trunk often leads to undesired and unexpected behavior. A switch that uses VLAN 1 for the native VLAN can send frames to a switch that uses VLAN 2 for the native VLAN, merging those two VLANs and their spanning tree state.
To create the above example:
Cumulus Linux does not provide NVUE commands for traditional bridge mode.
Add the following configuration to the /etc/network/interfaces file:
...
auto br-VLAN10
iface br-VLAN10
    bridge-ports swp1.10 swp2.10

auto br-VLAN20
iface br-VLAN20
    bridge-ports swp1.20 swp2.20
...
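Because traditional mode needs one bridge per tagged VLAN, the stanzas above quickly become repetitive. As a rough sketch, assuming you follow the same br-VLAN naming pattern as the example, a short Python helper (hypothetical, not shipped with Cumulus Linux) can generate them for review before you paste them into the /etc/network/interfaces file:

```python
def traditional_bridge_stanzas(ports, vlans, indent="    "):
    """Build one traditional-mode bridge stanza per tagged VLAN.

    Each bridge contains the VLAN subinterface (port.vlan) of every port,
    following the br-VLAN<id> naming used in the example above.
    """
    stanzas = []
    for vlan in vlans:
        name = f"br-VLAN{vlan}"
        members = " ".join(f"{port}.{vlan}" for port in ports)
        stanzas.append(
            f"auto {name}\niface {name}\n{indent}bridge-ports {members}\n"
        )
    return "\n".join(stanzas)


if __name__ == "__main__":
    # Reproduces the two-bridge example from the text.
    print(traditional_bridge_stanzas(["swp1", "swp2"], [10, 20]))
```

For a trunk of 200 VLANs, passing the 199 tagged VLAN IDs produces 199 stanzas; the bridge for the native untagged VLAN still has to be written separately.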
Advanced VLAN Tagging Example
The following advanced VLAN tagging configuration shows three hosts and two switches, with several bridges and a bond that connects them all.
host1 connects to bridge br-untagged with bare Ethernet frames and to bridge br-tag100 with 802.1Q frames tagged for vlan100.
host2 connects to bridge br-tag100 with 802.1Q frames tagged for vlan100 and to bridge br-vlan120 with 802.1Q frames tagged for vlan120.
host3 connects to bridge br-vlan120 with 802.1Q frames tagged for vlan120 and to bridge v130 with 802.1Q frames tagged for vlan130.
bond2 carries tagged and untagged frames in this example.
The bridge member ports function as 802.1Q access ports and trunk ports. To compare Cumulus Linux with a traditional Cisco device:
swp1 is equivalent to a trunk port with untagged and vlan100.
swp2 is equivalent to a trunk port with vlan100 and vlan120.
swp3 is equivalent to a trunk port with vlan120 and vlan130.
bond2 is equivalent to an EtherChannel in trunk mode with untagged, vlan100, vlan120, and vlan130.
Bridges br-untagged, br-tag100, br-vlan120, and v130 are equivalent to SVIs (switch virtual interfaces).
To create the above configuration, edit the /etc/network/interfaces file and add a configuration like the following:
# Config for host1
# swp1 does not need an iface section unless it has a specific setting
# it will be picked up as a dependent of swp1.100
# swp1 must exist in the system to create the .1q subinterfaces
# but it is not applied to any bridge or assigned an address
auto swp1.100
iface swp1.100

# Config for host2
# swp2 does not need an iface section unless it has a specific setting
# it will be picked up as a dependent of swp2.100 and swp2.120
# swp2 must exist in the system to create the .1q subinterfaces
# but it is not applied to any bridge or assigned an address
auto swp2.100
iface swp2.100

auto swp2.120
iface swp2.120

# Config for host3
# swp3 does not need an iface section unless it has a specific setting
# it will be picked up as a dependent of swp3.120 and swp3.130
# swp3 must exist in the system to create the .1q subinterfaces
# but it is not applied to any bridge or assigned an address
auto swp3.120
iface swp3.120

auto swp3.130
iface swp3.130

# configure the bond
auto bond2
iface bond2
    bond-slaves glob swp4-7

# configure the bridges
auto br-untagged
iface br-untagged
    address 10.0.0.1/24
    bridge-ports swp1 bond2
    bridge-stp on

auto br-tag100
iface br-tag100
    address 10.0.100.1/24
    bridge-ports swp1.100 swp2.100 bond2.100
    bridge-stp on

auto br-vlan120
iface br-vlan120
    address 10.0.120.1/24
    bridge-ports swp2.120 swp3.120 bond2.120
    bridge-stp on

auto v130
iface v130
    address 10.0.130.1/24
    bridge-ports swp3.130 bond2.130
    bridge-stp on
To verify the configuration:
cumulus@switch:~$ sudo mstpctl showbridge br-tag100
br-tag100 CIST info
enabled yes
bridge id 8.000.44:38:39:00:32:8B
designated root 8.000.44:38:39:00:32:8B
regional root 8.000.44:38:39:00:32:8B
root port none
path cost 0 internal path cost 0
max age 20 bridge max age 20
forward delay 15 bridge forward delay 15
tx hold count 6 max hops 20
hello time 2 ageing time 300
force protocol version rstp
time since topology change 333040s
topology change count 1
topology change no
topology change port swp2.100
last topology change port None
cumulus@switch:~$ sudo mstpctl showportdetail br-tag100 | grep -B 2 state
br-tag100:bond2.100 CIST info
enabled yes role Designated
port id 8.003 state forwarding
--
br-tag100:swp1.100 CIST info
enabled yes role Designated
port id 8.001 state forwarding
--
br-tag100:swp2.100 CIST info
enabled yes role Designated
port id 8.002 state forwarding
A single bridge cannot contain multiple subinterfaces of the same port. If you try to apply this configuration, you see an error:
cumulus@switch:~$ sudo brctl addbr another_bridge
cumulus@switch:~$ sudo brctl addif another_bridge swp9 swp9.100
bridge cannot contain multiple subinterfaces of the same port: swp9, swp9.100
VLAN Translation
By default, Cumulus Linux does not allow VLAN subinterfaces associated with different VLAN IDs to be part of the same bridge. Base interfaces do not associate with any VLAN IDs and are exempt from this restriction.
In some cases, it is useful to relax this restriction; for example, when two servers connect to the switch using VLAN trunks but the VLAN numbering on the two servers is not consistent. You can bridge two VLAN subinterfaces of different VLAN IDs from the servers by enabling the sysctl net.bridge.bridge-allow-multiple-vlans option. Packets that enter a bridge from a member VLAN subinterface egress another member VLAN subinterface with the VLAN ID translated.
The following example enables the VLAN translation sysctl:
After you enable the sysctl option, you can add ports with different VLAN IDs to the same bridge. In the following example, the switch bridges packets that enter the bridge br_mix from swp10.100 to swp11.200. Cumulus Linux translates the VLAN ID from 100 to 200:
cumulus@switch:~$ sudo brctl addif br_mix swp10.100 swp11.200
cumulus@switch:~$ sudo brctl show br_mix
bridge name bridge id STP enabled interfaces
br_mix 8000.4438390032bd yes swp10.100
swp11.200
Spanning Tree and Rapid Spanning Tree - STP
STP identifies links in the network and shuts down redundant links, preventing possible network loops and broadcast radiation on a bridged network. STP also provides redundant links for automatic failover when an active link fails. Cumulus Linux enables STP by default for both VLAN-aware and traditional bridges.
PVST creates a spanning tree instance for a bridge. PVRST supports RSTP enhancements for each spanning tree instance. To use PVRST with a traditional bridge, you must create a bridge corresponding to the untagged native VLAN and all the physical switch ports must be part of the same VLAN.
For maximum interoperability, when connected to a switch that has a native VLAN configuration, you must configure the native VLAN to VLAN 1.
STP for a VLAN-aware Bridge
VLAN-aware bridges operate in RSTP mode only. RSTP on VLAN-aware bridges works with other modes in the following ways:
RSTP and STP
If a bridge running RSTP (802.1w) receives a common STP (802.1D) BPDU, it falls back to 802.1D automatically.
RSTP and PVST
The RSTP domain sends BPDUs on the native VLAN, whereas PVST sends BPDUs on a per VLAN basis. For both protocols to work together, you need to enable the native VLAN on the link between the RSTP and PVST domains; the spanning tree builds according to the native VLAN parameters.
The RSTP protocol does not send or parse BPDUs on other VLANs, but floods BPDUs across the network, enabling the PVST domain to maintain its spanning-tree topology and provide a loop-free network.
To enable proper BPDU exchange across the network, be sure to allow all VLANs participating in the PVST domain on the link between the RSTP and PVST domains.
When using RSTP together with an existing PVST network, you need to define the root bridge on the PVST domain. Either lower the priority on the PVST domain or change the priority of the RSTP switches to a higher number.
When connecting a VLAN-aware bridge to a proprietary PVST+ switch using STP, you must allow VLAN 1 on all 802.1Q trunks that interconnect them, regardless of the configured native VLAN. Only VLAN 1 enables the switches to address the BPDU frames to the IEEE multicast MAC address.
RSTP and MST
RSTP works with MST seamlessly, creating a single instance of spanning tree that transmits BPDUs on the native VLAN.
RSTP treats the MST domain as one giant switch, whereas MST treats the RSTP domain as a different region. To ensure proper communication between the regions, MST creates a CST that connects all the boundary switches and forms the overall view of the MST domain. Because changes in the CST must reflect in all regions, the RSTP tree is in the CST to ensure that changes on the RSTP domain are in the CST domain. Topology changes on the RSTP domain impact the rest of the network but inform the MST domain of every change occurring in the RSTP domain, ensuring a loop-free network.
Configure the root bridge within the MST domain by changing the priority on the relevant MST switch. When MST detects an RSTP link, it falls back into RSTP mode. The MST domain chooses the switch with the lowest cost to the CST root bridge as the CIST root bridge.
RSTP with MLAG
More than one spanning tree instance enables switches to load balance and use different links for different VLANs. With RSTP, there is only one instance of spanning tree. To better utilize the links, you can configure MLAG on the switches connected to the MST or PVST domain and set up these interfaces as MLAG ports. The PVST or MST domain thinks it connects to a single switch and utilizes all the links connected to it. Load balancing depends on the port channel hashing mechanism instead of different spanning tree instances and uses all the links between the RSTP domain and the PVST or MST domain. For information about configuring MLAG, see Multi-Chassis Link Aggregation - MLAG.
Configure STP
There are several ways to customize STP in Cumulus Linux. Exercise caution when changing the settings below to prevent malfunctions in STP loop avoidance.
Spanning Tree Priority
If you have a multiple spanning tree instance (MSTI 0, also known as a common spanning tree, or CST), you can set the tree priority for a bridge. The bridge with the lowest priority is the root bridge. The priority must be a number between 0 and 61440, and must be a multiple of 4096. The default is 32768.
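The priority rule above (0 through 61440, in steps of 4096) leaves 16 valid values. A minimal Python sketch of the check follows; the function names are hypothetical and only illustrate the rule as stated.

```python
PRIORITY_STEP = 4096
PRIORITY_MAX = 61440
PRIORITY_DEFAULT = 32768


def is_valid_tree_priority(priority: int) -> bool:
    """True if the value is between 0 and 61440 and a multiple of 4096."""
    return 0 <= priority <= PRIORITY_MAX and priority % PRIORITY_STEP == 0


def allowed_tree_priorities() -> list:
    """List every value that satisfies the rule."""
    return list(range(0, PRIORITY_MAX + 1, PRIORITY_STEP))


if __name__ == "__main__":
    print(allowed_tree_priorities())
    print(is_valid_tree_priority(8192))   # True
    print(is_valid_tree_priority(10000))  # False
```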
If you are running MLAG and have multiple bridges, the STP priority must be the same on all bridges on both peer switches.
The following example command sets the tree priority to 8192:
Configure the tree priority (mstpctl-treeprio) under the bridge stanza in the /etc/network/interfaces file, then run the ifreload -a command.
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bridge
iface bridge
    # bridge-ports includes all ports related to VxLAN and CLAG.
    # does not include the Peerlink.4094 subinterface
    bridge-ports bond01 bond02 peerlink vni13 vni24 vxlan4001
    bridge-pvid 1
    bridge-vids 13 24
    bridge-vlan-aware yes
    mstpctl-treeprio 8192
...
cumulus@switch:~$ sudo ifreload -a
Cumulus Linux supports MSTI 0 only. It does not support MSTI 1 through 15.
PortAdminEdge (PortFast Mode)
PortAdminEdge is equivalent to the PortFast feature offered by other vendors. It enables or disables the initial edge state of a port in a bridge. All ports with PortAdminEdge bypass the listening and learning states and go straight to forwarding.
PortAdminEdge mode can cause loops if you do not use it with BPDU guard.
You typically configure edge ports as access ports for a simple end host. In the data center, edge ports connect to servers, which pass both tagged and untagged traffic.
The following example commands configure PortAdminEdge and BPDU guard for swp5:
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp admin-edge on
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp bpdu-guard on
cumulus@switch:~$ nv config apply
Configure PortAdminEdge and BPDU guard under the switch port interface stanza in the /etc/network/interfaces file, then run the ifreload -a command.
PortAutoEdge is an enhancement to the standard PortAdminEdge (PortFast) mode, which allows for the automatic detection of edge ports. PortAutoEdge enables and disables the auto transition to and from the edge state of a port in a bridge.
Edge ports and access ports are not the same. Edge ports transition directly to the forwarding state and skip the listening and learning stages. Upstream topology change notifications are not generated when an edge port link changes state. Access ports only forward untagged traffic; however, there is no such restriction on edge ports, which can forward both tagged and untagged traffic.
When a port with PortAutoEdge receives a BPDU, the port stops being in the edge port state and transitions into a normal STP port. When the interface no longer receives BPDUs, the port becomes an edge port, and transitions through the discarding and learning states before it resumes forwarding.
Cumulus Linux enables PortAutoEdge by default.
The following example commands disable PortAutoEdge on swp1:
cumulus@switch:~$ nv set interface swp1 bridge domain br_default stp auto-edge off
cumulus@switch:~$ nv config apply
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portautoedge no line, then run the ifreload -a command.
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1
    alias to Server01
    # Port to Server02
    mstpctl-portautoedge no
...
cumulus@switch:~$ sudo ifreload -a
The following example commands reenable PortAutoEdge on swp1:
cumulus@switch:~$ nv set interface swp1 bridge domain br_default stp auto-edge on
cumulus@switch:~$ nv config apply
Edit the switch port interface stanza in the /etc/network/interfaces file to remove mstpctl-portautoedge no, then run the ifreload -a command.
BPDU Guard
You can configure BPDU guard to protect the spanning tree topology from an unauthorized device affecting the forwarding path. For example, if you add a new host to an access port off a leaf switch and the host sends an STP BPDU, BPDU guard protects against undesirable topology changes in the environment.
The following example commands set BPDU guard for swp5:
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp bpdu-guard on
cumulus@switch:~$ nv config apply
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-bpduguard yes line, then run the ifreload -a command.
To see if a port has BPDU guard on or if the port receives a BPDU:
cumulus@switch:~$ nv show bridge domain br_default stp
cumulus@switch:~$ mstpctl showportdetail br_default
bridge:swp5 CIST info
enabled no role Disabled
port id 8.001 state discarding
external port cost 305 admin external cost 0
internal port cost 305 admin internal cost 0
designated root 8.000.6C:64:1A:00:4F:9C dsgn external cost 0
dsgn regional root 8.000.6C:64:1A:00:4F:9C dsgn internal cost 0
designated bridge 8.000.6C:64:1A:00:4F:9C designated port 8.001
admin edge port no auto edge port yes
oper edge port no topology change ack no
point-to-point yes admin point-to-point auto
restricted role no restricted TCN no
port hello time 10 disputed no
bpdu guard port yes bpdu guard error yes
network port no BA inconsistent no
Num TX BPDU 3 Num TX TCN 2
Num RX BPDU 488 Num RX TCN 2
Num Transition FWD 1 Num Transition BLK 2
bpdufilter port no
clag ISL no clag ISL Oper UP no
clag role unknown clag dual conn mac 0:0:0:0:0:0
clag remote portID F.FFF clag system mac 0:0:0:0:0:0
If a port receives a BPDU, it goes into a protodown state, which results in a local OPER DOWN (carrier down) on the interface. Cumulus Linux also sets the protodown reason as bpduguard and records a log message in /var/log/syslog.
To show the reason for the port protodown, run the ip -p -j link show <interface> command.
cumulus@switch:~$ ip -p -j link show swp5
To recover from the protodown state, remove the protodown reason and protodown from the interface with the mstpctl clearbpduguardviolation <bridge> <interface> command.
Bringing up the disabled port does not correct the problem if you do not resolve the configuration on the connected end station.
Bridge Assurance
On a point-to-point link where RSTP is running, if you want to detect unidirectional links and put the port in a discarding state, you can enable bridge assurance on the port by enabling a port type network. The port is then in a bridge assurance inconsistent state until it receives a BPDU from the peer. You need to configure the port type network on both ends of the link for bridge assurance.
Cumulus Linux disables bridge assurance by default.
The following example commands enable bridge assurance on swp5:
cumulus@switch:~$ nv set interface swp5 bridge domain br_default stp network on
cumulus@switch:~$ nv config apply
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portnetwork yes line, then run the ifreload -a command.
You can enable bpdufilter on a switch port, which filters BPDUs in both directions. This disables STP on the port as no BPDUs are transiting.
Using BPDU filter sometimes causes layer 2 loops. Use this feature with caution.
The following example commands configure the BPDU filter on swp6:
cumulus@switch:~$ nv set interface swp6 bridge domain br_default stp bpdu-filter on
cumulus@switch:~$ nv config apply
Edit the switch port interface stanza in the /etc/network/interfaces file to add the mstpctl-portbpdufilter yes line, then run the ifreload -a command.
The table below describes additional STP configuration parameters available in Cumulus Linux. You can set these optional parameters manually by editing the /etc/network/interfaces file. Cumulus Linux does not provide NVUE commands for these parameters.
The IEEE 802.1D and 802.1Q specifications describe STP parameters. For a comparison of STP parameter configuration between mstpctl and other vendors, read this knowledge base article.
Parameter
Description
mstpctl-maxage
Sets the maximum age of the bridge in seconds. The default is 20. The maximum age must meet the condition 2 * (Bridge Forward Delay - 1 second) >= Bridge Max Age. Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-ageing
Sets the MAC address ageing time for the bridge in seconds when the running version is STP, but not RSTP or MSTP. The default is 1800. Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-fdelay
Sets the bridge forward delay time in seconds. The default value is 15. The bridge forward delay must meet the condition 2 * (Bridge Forward Delay - 1 second) >= Bridge Max Age. Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-maxhops
Sets the maximum hops for the bridge. The default is 20. Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-txholdcount
Sets the bridge transmit hold count. The default value is 6. Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-forcevers
Sets the force STP version of the bridge to either RSTP or STP. The default is RSTP. Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-hello
Sets the bridge hello time in seconds. The default is 2. Add this parameter to the bridge stanza of the /etc/network/interfaces file.
mstpctl-portpathcost
Sets the port cost of the interface in the bridge. The default is 0. mstpd supports only long mode; 32 bits for the path cost. Add this parameter to the interface stanza of the /etc/network/interfaces file.
mstpctl-portp2p
Enables or disables point-to-point detection mode of the interface in the bridge. Add this parameter to the interface stanza of the /etc/network/interfaces file.
mstpctl-portrestrtcn
Enables or disables the interface in the bridge to propagate received topology change notifications. The default is no. Add this parameter to the interface stanza of the /etc/network/interfaces file.
mstpctl-treeportcost
Sets the spanning tree port cost to a value from 0 to 255. The default is 0. Add this parameter to the interface stanza of the /etc/network/interfaces file.
Be sure to run the sudo ifreload -a command after you set the STP parameter in the /etc/network/interfaces file.
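The maximum age and forward delay parameters share one condition, so it can help to check a pair of values before you edit the file. The following is a minimal sketch of that check; the function name stp_timers_valid is hypothetical and is not part of Cumulus Linux or mstpd.

```python
def stp_timers_valid(max_age: int, forward_delay: int) -> bool:
    """Return True when 2 * (forward delay - 1 second) >= max age."""
    return 2 * (forward_delay - 1) >= max_age


# Defaults from the table: mstpctl-maxage 20 and mstpctl-fdelay 15.
print(stp_timers_valid(max_age=20, forward_delay=15))  # True: 2 * 14 = 28 >= 20
print(stp_timers_valid(max_age=20, forward_delay=10))  # False: 2 * 9 = 18 < 20
```

The check covers only this one relationship between the two timers; it does not validate the allowed range of each parameter.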
Troubleshooting
To check STP status for a bridge:
cumulus@switch:~$ nv show bridge domain br_default stp
operational applied description
-------- ----------- ------- ---------------------------------------------------------------------
priority 32768 32768 stp priority. The priority value must be a number between 4096 and...
state up up The state of STP on the bridge
The mstpctl utility provided by the mstpd service configures STP. The mstpd daemon is an open source project used by Cumulus Linux to implement IEEE 802.1D-2004 and IEEE 802.1Q-2011.
The mstpd daemon starts by default when the switch boots and logs errors to /var/log/syslog.
mstpctl is the preferred utility for interacting with STP on Cumulus Linux. brctl also provides certain tools for configuring STP; however, they are not as complete and output from brctl is sometimes misleading.
To show the bridge state, run the brctl show command:
cumulus@switch:~$ sudo brctl show
bridge name bridge id STP enabled interfaces
bridge 8000.001401010100 yes swp1
swp4
swp5
To show the mstpd bridge port state, run the mstpctl showport bridge command:
Storm control provides protection against excessive inbound BUM (broadcast, unknown unicast, multicast) traffic on layer 2 switch port interfaces, which can cause poor network performance.
Configure Storm Control
To configure storm control settings, you can either run NVUE commands or manually edit the /etc/cumulus/switchd.conf file.
The following command example enables broadcast storm control for swp4 at 400 packets per second (pps), multicast storm control at 3000 pps, and unknown unicast at 2000 pps.
cumulus@switch:~$ nv set interface swp4 storm-control broadcast 400
cumulus@switch:~$ nv set interface swp4 storm-control multicast 3000
cumulus@switch:~$ nv set interface swp4 storm-control unknown-unicast 2000
cumulus@switch:~$ nv config apply
The storm control settings require a switchd reload. Before applying the settings, NVUE indicates if it requires a switchd reload and prompts you for confirmation. When the switchd service reloads, there is no interruption to network services.
The following example command disables multicast storm control on swp4:
Edit the /etc/cumulus/switchd.conf file and uncomment the storm_control.broadcast, storm_control.multicast, and storm_control.unknown_unicast lines:
cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
...
# Storm Control setting on a port, in pps
interface.swp4.storm_control.broadcast = 400
interface.swp4.storm_control.multicast = 3000
interface.swp4.storm_control.unknown_unicast = 2000
...
When you change the storm control settings, you must reload switchd with the sudo systemctl reload switchd.service command for the changes to take effect. The reload does not interrupt network services.
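If you manage several ports, you might generate the switchd.conf lines instead of typing them. The sketch below renders lines in the same form as the example above; the helper name storm_control_lines and its arguments are hypothetical, and the sketch only builds text. It does not write the file or reload switchd.

```python
def storm_control_lines(interface: str, limits: dict[str, int]) -> list[str]:
    """Render switchd.conf storm control lines (values are in pps)."""
    allowed = ("broadcast", "multicast", "unknown_unicast")
    unknown = set(limits) - set(allowed)
    if unknown:
        raise ValueError(f"unsupported traffic types: {sorted(unknown)}")
    lines = []
    for kind in allowed:
        if kind in limits:
            pps = limits[kind]
            if pps < 0:
                raise ValueError(f"{kind} limit must not be negative")
            lines.append(f"interface.{interface}.storm_control.{kind} = {pps}")
    return lines


for line in storm_control_lines(
    "swp4", {"broadcast": 400, "multicast": 3000, "unknown_unicast": 2000}
):
    print(line)
```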
Show Storm Control Settings
To show the current storm control settings for a layer 2 interface, run the nv show interface <interface> storm-control command.
cumulus@switch:~$ nv show interface swp4 storm-control
applied description
--------------- ------- ----------------------------------------------------------
broadcast 400 Configure storm control for broadcast traffic in pps
multicast 3000 Configure storm control for multicast traffic in pps
unknown-unicast 2000 Configure storm control for unknown unicast traffic in pps
Bonding - Link Aggregation
Linux bonding provides a way to aggregate multiple network interfaces (slaves) into a single logical bonded interface (bond). Link aggregation is useful for linear scaling of bandwidth, load balancing, and failover protection.
Cumulus Linux supports two bonding modes:
IEEE 802.3ad link aggregation mode combines one or more links to form a link aggregation group (LAG) so that a media access control (MAC) client can treat the group as a single link. IEEE 802.3ad link aggregation is the default mode.
Balance-xor mode balances outgoing traffic across active ports according to the hashed protocol header information and accepts incoming traffic from any active port. All slave interfaces are active for load balancing and fault tolerance. This is useful for MLAG deployments.
Cumulus Linux uses version 1 of the LAG control protocol (LACP).
NVUE does not accept a bond name starting with an interface type ID, such as sw, eth, vlan, lo, ib, fnm, or vrrp. For example, you cannot name a bond login123, eth2, sw1, or vlan10.
An interface cannot belong to multiple bonds.
A bond can have subinterfaces, but subinterfaces cannot have a bond.
A bond cannot enslave VLAN subinterfaces.
All slave ports within a bond must have the same speed and duplex, and must match the slave ports of the link partner.
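The naming rule for NVUE is easy to trip over because it matches prefixes, not whole names. The sketch below checks a candidate bond name against the prefixes listed above. The list is taken from this section, which introduces it with "such as", so treat it as possibly incomplete; the function name bond_name_allowed is hypothetical.

```python
# Interface type IDs named in this section; the real list might be longer.
RESERVED_PREFIXES = ("sw", "eth", "vlan", "lo", "ib", "fnm", "vrrp")


def bond_name_allowed(name: str) -> bool:
    """Return False when the name starts with a reserved interface type ID."""
    return not name.startswith(RESERVED_PREFIXES)


for candidate in ("bond1", "login123", "eth2", "sw1", "vlan10"):
    print(candidate, bond_name_allowed(candidate))
```

Note that login123 fails only because it begins with lo.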
Create a Bond
To create a bond, specify the bond members. In the example below, the front panel port interfaces swp1 through swp4 are members of bond1, but swp5 and swp6 are not part of bond1.
cumulus@switch:~$ nv set interface bond1 bond member swp1-4
cumulus@switch:~$ nv config apply
In NVUE, if you create the bond interface with a name that starts with bond, NVUE automatically sets the interface type to bond. If you create a bond interface with a name that does not start with bond, you must set the interface type to bond with the nv set interface <interface-name> type bond command.
Edit the /etc/network/interfaces file to add a stanza for the bond, then run the ifreload -a command.
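As an illustration of what such a stanza contains, the following sketch renders one from a bond name and its members. It uses only the bond-slaves and clag-id keywords that appear elsewhere in this guide; the helper name render_bond_stanza is hypothetical, and the sketch returns text without touching /etc/network/interfaces.

```python
def render_bond_stanza(name, members, clag_id=None):
    """Return an /etc/network/interfaces stanza for a bond as a string."""
    lines = [
        f"auto {name}",
        f"iface {name}",
        f"    bond-slaves {' '.join(members)}",
    ]
    if clag_id is not None:
        # Optional MLAG ID, used only for dual-connected bonds.
        lines.append(f"    clag-id {clag_id}")
    return "\n".join(lines)


print(render_bond_stanza("bond1", ["swp1", "swp2", "swp3", "swp4"]))
```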
By default, the bond uses IEEE 802.3ad link aggregation mode. To configure the bond in balance-xor mode, see Optional Configuration below.
If the bond is not going to be part of a bridge, you must specify an IP address.
Make sure the name of the bond adheres to Linux interface naming conventions and is unique within the switch.
To temporarily bring up a bond even when there is no LACP partner, use LACP Bypass.
When you start networking, the switch creates bond1 as MASTER and interfaces swp1 through swp4 come up in SLAVE mode:
cumulus@switch:~$ ip link show
...
3: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
4: swp2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
5: swp3: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
6: swp4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT qlen 500
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
...
55: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
link/ether 44:38:39:00:03:c1 brd ff:ff:ff:ff:ff:ff
All slave interfaces within a bond have the same MAC address as the bond. Typically, the first slave you add to the bond donates its MAC address as the bond MAC address. The bond MAC address is the source MAC address for all traffic leaving the bond and provides a single destination MAC address to address traffic to the bond.
If you remove the bond slave interface from which the bond derives its MAC address, the bond interface flaps to update the MAC address, which affects traffic.
Optional Configuration
You can set these configuration options for a bond.
Option
Description
Link aggregation mode
Cumulus Linux supports IEEE 802.3ad link aggregation mode (802.3ad) and balance-xor mode. The default mode is 802.3ad. Set balance-xor mode only if you cannot use LACP; LACP can detect mismatched link attributes between bond members and can even detect misconnections.
When you use balance-xor mode to dual-connect host-facing bonds in an MLAG environment, you must configure the MLAG ID with the same value on both MLAG switches. Otherwise, the MLAG switch pair treats the bonds as single-connected.
MII link monitoring frequency
How often (in milliseconds) you want to inspect the link state of each slave for failures. You can specify a value between 0 and 255. The default value is 100.
miimon link status mode
The miimon link status mode. You can set the mode to either netif_carrier_ok(), or MII or ethtool ioctls. The default setting is netif_carrier_ok().
LACP bypass
Set LACP bypass on a bond in 802.3ad mode so that it becomes active and forwards traffic even when there is no LACP partner. You can specify on or off. The default setting is off. See LACP Bypass.
Transmit rate
The rate at which the link partner transmits LACP control packets. You can specify slow or fast. The default setting is fast.
Minimum number of links
The minimum number of links that must be active before the bond goes into service. You can set a value between 0 and 255. The default value is 1, which indicates that the bond must have at least one active member.
Use a value greater than 1 if you need higher level services to ensure a minimum aggregate bandwidth level before activating a bond.
If the number of active members drops below this setting, the bond appears to upper-level protocols as link-down. When the number of active links returns to greater than or equal to this value, the bond becomes link-up.
Cumulus Linux sets the bond configuration options to the recommended values by default; use caution when changing settings.
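The minimum number of links option can be read as a simple threshold. The sketch below models how the bond appears to upper-level protocols for a given number of active members. The function name bond_link_state is hypothetical, and treating a value of 0 the same as 1 is an assumption based on Linux bonding behavior, where a bond cannot be active without at least one available link.

```python
def bond_link_state(active_members: int, min_links: int = 1) -> str:
    """Return "up" or "down" as seen by upper-level protocols."""
    if not 0 <= min_links <= 255:
        raise ValueError("min_links must be between 0 and 255")
    # Assumption: 0 behaves like 1, because a bond needs one active member.
    required = max(min_links, 1)
    return "up" if active_members >= required else "down"


print(bond_link_state(active_members=2, min_links=3))  # down
print(bond_link_state(active_members=3, min_links=3))  # up
```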
To set the link aggregation mode on bond1 to balance-xor mode:
cumulus@switch:~$ nv set interface bond1 bond mode static
cumulus@switch:~$ nv config apply
To reset the link aggregation mode for bond1 to the default value of 802.3ad, run the nv set interface bond1 bond mode lacp command.
Edit the /etc/network/interfaces file and add the balance-xor parameter to the bond stanza, then run the ifreload -a command:
The switch distributes egress traffic through a bond to a slave based on a packet hash calculation, providing load balancing over the slaves; the switch distributes conversation flows over all available slaves to load balance the total traffic. Traffic for a single conversation flow always hashes to the same slave. In a failover event, the switch adjusts the hash calculation to steer traffic over available slaves.
The hash calculation uses packet header data to choose the slave on which to transmit the packet:
For IP traffic, the switch uses IP header source and destination fields in the calculation.
For IP and TCP or UDP traffic, the switch includes source and destination ports in the hash calculation.
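The following sketch illustrates the idea of hash-based distribution described above. It is not the algorithm the switch hardware uses: the CRC32 hash, the field names, and the function name select_slave are assumptions chosen for illustration. It shows that one conversation flow always maps to the same slave and that removing a slave recomputes the choice over the remaining slaves.

```python
import zlib


def select_slave(flow: dict, slaves: list[str]) -> str:
    """Pick a slave for a flow from a hash of its header fields."""
    fields = ("sip", "dip", "ip_prot", "sport", "dport")
    key = "|".join(str(flow.get(field, "")) for field in fields)
    return slaves[zlib.crc32(key.encode()) % len(slaves)]


flow = {"sip": "10.1.1.1", "dip": "10.2.2.2", "ip_prot": 6,
        "sport": 49152, "dport": 443}
slaves = ["swp1", "swp2", "swp3", "swp4"]
chosen = select_slave(flow, slaves)

# Failover: hash again over the slaves that are still available.
remaining = [s for s in slaves if s != chosen]
print(chosen, select_slave(flow, remaining))
```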
For load balancing between multiple interfaces that are members of the same bond, you can hash on these fields:
Field | Default Setting | NVUE Command | traffic.conf
IP protocol | on | nv set system forwarding lag-hash ip-protocol on; nv set system forwarding lag-hash ip-protocol off | lag_hash_config.ip_prot
Source MAC address | on | nv set system forwarding lag-hash source-mac on; nv set system forwarding lag-hash source-mac off | lag_hash_config.smac
Destination MAC address | on | nv set system forwarding lag-hash destination-mac on; nv set system forwarding lag-hash destination-mac off | lag_hash_config.dmac
Source IP address | on | nv set system forwarding lag-hash source-ip on; nv set system forwarding lag-hash source-ip off | lag_hash_config.sip
Destination IP address | on | nv set system forwarding lag-hash destination-ip on; nv set system forwarding lag-hash destination-ip off | lag_hash_config.dip
Source port | on | nv set system forwarding lag-hash source-port on; nv set system forwarding lag-hash source-port off | lag_hash_config.sport
Destination port | on | nv set system forwarding lag-hash destination-port on; nv set system forwarding lag-hash destination-port off | lag_hash_config.dport
The following example commands omit the source MAC address and destination MAC address from the hash calculation:
cumulus@switch:~$ nv set system forwarding lag-hash source-mac off
cumulus@switch:~$ nv set system forwarding lag-hash destination-mac off
cumulus@switch:~$ nv config apply
Use the instructions below when NVUE is not enabled. If you are using NVUE to configure your switch, the NVUE commands change the settings in /etc/cumulus/datapath/nvue_traffic.conf which takes precedence over the settings in /etc/cumulus/datapath/traffic.conf.
Edit the /etc/cumulus/datapath/traffic.conf file:
Uncomment the lag_hash_config.enable option.
Set the lag_hash_config.smac and lag_hash_config.dmac options to false.
cumulus@switch:~$ sudo nano /etc/cumulus/datapath/traffic.conf
...
#LAG HASH config
#HASH config for LACP to enable custom fields
#Fields will be applicable for LAG hash
#calculation
#Uncomment to enable custom fields configured below
lag_hash_config.enable = true
lag_hash_config.smac = false
lag_hash_config.dmac = false
lag_hash_config.sip = true
lag_hash_config.dip = true
lag_hash_config.ether_type = true
lag_hash_config.vlan_id = true
lag_hash_config.sport = true
lag_hash_config.dport = true
lag_hash_config.ip_prot = true
#GTP-U teid
lag_hash_config.gtp_teid = false
...
Run the echo 1 > /cumulus/switchd/ctrl/hash_config_reload command. This command does not cause any traffic interruptions.
Cumulus Linux enables symmetric hashing by default. Make sure that the settings for the source IP and destination IP fields match, and that the settings for the source port and destination port fields match; otherwise Cumulus Linux disables symmetric hashing automatically. If necessary, you can disable symmetric hashing manually in the /etc/cumulus/datapath/traffic.conf file by setting symmetric_hash_enable = FALSE.
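To see why the source and destination settings must match, consider this sketch of a symmetric hash. It orders each source and destination pair before hashing so that both directions of a conversation produce the same value. The ordering trick, the CRC32 hash, and the function name symmetric_flow_hash are assumptions for illustration and do not describe the switch implementation.

```python
import zlib


def symmetric_flow_hash(sip: str, dip: str, sport: int, dport: int) -> int:
    """Hash a flow so that A->B and B->A give the same result."""
    ip_pair = sorted((sip, dip))        # direction-independent address pair
    port_pair = sorted((sport, dport))  # direction-independent port pair
    key = f"{ip_pair[0]}|{ip_pair[1]}|{port_pair[0]}|{port_pair[1]}"
    return zlib.crc32(key.encode())


forward = symmetric_flow_hash("10.1.1.1", "10.2.2.2", 49152, 443)
reverse = symmetric_flow_hash("10.2.2.2", "10.1.1.1", 443, 49152)
print(forward == reverse)  # True
```

If only one field of a pair took part in the hash, the two directions would hash on different values, which is why mismatched settings rule out symmetric hashing.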
You can also set a unique hash seed for each switch to avoid hash polarization. See Unique Hash Seed.
GTP Hashing
GTP carries mobile data within the core of the mobile operator’s network. Traffic in the 5G Mobility core cluster, from cell sites to compute nodes, has the same source and destination IP address. The only way to identify individual flows is with the GTP TEID. Enabling GTP hashing adds the TEID as a hash parameter and helps the Cumulus Linux switches in the network to distribute mobile data traffic evenly across ECMP routes.
Cumulus Linux supports TEID-based load balancing for traffic egressing a bond. TEID-based load balancing applies only if the outer header egressing the port is GTP encapsulated and the ingress packet is either a GTP-U packet or a VXLAN encapsulated GTP-U packet.
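The effect of adding the TEID to the hash can be illustrated with a small model. In the sketch below, every flow shares one source and destination IP address, so a hash on addresses alone sends all flows to one bond member, while including the TEID spreads them out. The CRC32 hash, the key layout, and the function name pick_member are assumptions for illustration only.

```python
import struct
import zlib


def pick_member(sip: str, dip: str, teid: int, members: list[str],
                use_teid: bool) -> str:
    """Pick a bond member from the IP addresses and, optionally, the TEID."""
    key = f"{sip}|{dip}".encode()
    if use_teid:
        key += struct.pack(">I", teid)  # TEID is a 32-bit value
    return members[zlib.crc32(key) % len(members)]


members = ["swp1", "swp2", "swp3", "swp4"]
teids = range(16)
without = {pick_member("10.0.0.1", "10.0.0.2", t, members, False) for t in teids}
with_teid = {pick_member("10.0.0.1", "10.0.0.2", t, members, True) for t in teids}
print(len(without), len(with_teid))
```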
Cumulus Linux supports GTP Hashing on NVIDIA Spectrum-2 and later.
cumulus@switch:~$ nv set system forwarding lag-hash gtp-teid on
cumulus@switch:~$ nv config apply
To disable TEID-based load balancing, run the nv set system forwarding lag-hash gtp-teid off command.
Use the instructions below when NVUE is not enabled. If you are using NVUE to configure your switch, the NVUE commands change the settings in /etc/cumulus/datapath/nvue_traffic.conf which takes precedence over the settings in /etc/cumulus/datapath/traffic.conf.
Edit the /etc/cumulus/datapath/traffic.conf file:
Uncomment the lag_hash_config.enable = true line.
Change the lag_hash_config.gtp_teid parameter to true.
To disable TEID-based load balancing, set the lag_hash_config.gtp_teid parameter to false, then reload the configuration.
Troubleshooting
To show information for a bond, run the NVUE nv show interface <bond> bond command:
cumulus@leaf01:mgmt:~$ nv show interface bond1 bond
operational applied description
----------- ----------- ------- ------------------------------------------------------
down-delay 0 0 bond down delay
lacp-bypass on on lacp bypass
lacp-rate fast fast lacp rate
mode lacp bond mode
up-delay 0 0 bond up delay
[member] swp1 swp1 Set of bond members
mlag
enable on Turn the feature 'on' or 'off'. The default is 'off'.
id 1 1 MLAG id
status single Mlag Interface status
You can also run the Linux sudo cat /proc/net/bonding/<bond> command:
cumulus@leaf01:mgmt:~$ sudo cat /proc/net/bonding/bond1
...
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0
802.3ad info
LACP rate: fast
Min links: 1
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 44:38:39:be:ef:aa
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 1
Actor Key: 9
Partner Key: 1
Partner Mac Address: 00:00:00:00:00:00
Slave Interface: swp1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 44:38:39:00:00:37
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 2
...
To show specific bond information, use the nv show interface <bond> <option> commands:
cumulus@switch:~$ nv show interface bond1 TAB
acl bridge ip lldp ptp router
bond evpn link pluggable qos
cumulus@leaf02:mgmt:~$ nv show interface bond1 link
operational applied description
--------------------- ----------------- ------- ----------------------------------------------------------------------
auto-negotiate off on Link speed and characteristic auto negotiation
duplex full full Link duplex
fec auto Link forward error correction mechanism
mtu 9000 9000 interface mtu
speed 1G auto Link speed
dot1x
mab off bypass MAC authentication
parking-vlan off VLAN for unauthorized MAC addresses
state up up The state of the interface
stats
carrier-transitions 1 Number of times the interface state has transitioned between up and...
in-bytes 0 Bytes total number of bytes received on the interface
in-drops 0 number of received packets dropped
in-errors 0 number of received packets with errors
in-pkts 0 total number of packets received on the interface
out-bytes 3.65 MB total number of bytes transmitted out of the interface
out-drops 0 The number of outbound packets that were chosen to be discarded eve...
out-errors 0 The number of outbound packets that could not be transmitted becaus...
out-pkts 51949 total number of packets transmitted out of the interface
mac 44:38:39:00:00:37 MAC Address on an interface
MLAG or CLAG: Other vendors refer to the Cumulus Linux implementation of MLAG as CLAG, MC-LAG or VPC. You even see references to CLAG in Cumulus Linux, including the management daemon, named clagd, and other options in the code, such as clag-id, which exist for historical purposes. The Cumulus Linux implementation is truly a multi-chassis link aggregation protocol so this document uses MLAG.
MLAG enables a server or switch with a two-port bond, such as a link aggregation group (LAG), EtherChannel, port group or trunk, to connect those ports to different switches and operate as if they connect to a single, logical switch. This provides greater redundancy and greater system throughput.
Dual-connected devices can create LACP bonds that contain links to each physical switch; Cumulus Linux supports active-active links from the dual-connected devices even though they connect to two different physical switches.
How Does MLAG Work?
A basic MLAG configuration looks like this:
The two switches, leaf01 and leaf02, known as peer switches, appear as a single device to the bond on server01.
server01 distributes traffic between the two links to leaf01 and leaf02 in the way you configure on the host.
Traffic inbound to server01 can traverse leaf01 or leaf02 and arrive at server01.
More elaborate configurations are also possible. The number of links between the host and the switches can be greater than two and does not have to be symmetrical. Also, because the two peer switches appear as a single switch to other bonding devices, you can also connect pairs of MLAG switches to each other in a switch-to-switch MLAG configuration:
leaf01 and leaf02 are also MLAG peer switches and present a two-port bond from a single logical system to spine01 and spine02.
LACP and Dual-connected Links
Link Aggregation Control Protocol (LACP), the IEEE standard protocol for managing bonds, verifies dual-connectedness. LACP runs on the dual-connected devices and on each of the MLAG peer switches. On a dual-connected device, the only configuration requirement is to create a bond that LACP manages.
On each of the peer switches, you must place the links that connect to the dual-connected host or switch in the bond. This is true even if the links are a single port on each peer switch, where each port is in a bond, as shown below:
The dual-connected bonds on the peer switches have their system ID set to the MLAG system ID. Therefore, from the point of view of the hosts, each of the links in its bond connects to the same system and so the host uses both links.
Each peer switch periodically makes a list of the LACP partner MAC addresses for its bonds and sends that list to its peer (using the clagd service). The LACP partner MAC address is the MAC address of the system at the other end of a bond (server01, server02, and server03 in the figure above). When a switch receives this list from its peer, it compares the list to the LACP partner MAC addresses on its switch. If there are any matches and the clag-id for those bonds match, then that bond is a dual-connected bond.
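The comparison described above can be sketched as a set lookup. In this sketch, each switch is represented by a mapping from bond name to a tuple of LACP partner MAC address and clag-id; that data shape, the function name dual_connected_bonds, and the sample addresses are hypothetical and do not reflect how the clagd service stores or exchanges its state.

```python
def dual_connected_bonds(local: dict, peer: dict) -> set:
    """Return local bonds whose partner MAC and clag-id also appear on the peer."""
    peer_entries = {(mac.lower(), clag_id) for mac, clag_id in peer.values()}
    return {
        bond
        for bond, (mac, clag_id) in local.items()
        if (mac.lower(), clag_id) in peer_entries
    }


leaf01 = {
    "bond1": ("44:38:39:00:00:31", 1),
    "bond2": ("44:38:39:00:00:33", 2),
}
leaf02 = {
    "bond1": ("44:38:39:00:00:31", 1),
    "bond2": ("44:38:39:00:00:33", 5),  # clag-id mismatch
}
print(dual_connected_bonds(leaf01, leaf02))  # {'bond1'}
```

In the example, bond2 is treated as single-connected because the partner MAC address matches but the clag-id does not.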
Requirements
MLAG has these requirements:
The two peer switches with MLAG must be directly connected. This is typically a bond for increased reliability and bandwidth.
There must be only two peer switches in one MLAG configuration, but you can have multiple configurations in a network for switch-to-switch MLAG.
Both switches in the MLAG pair must be identical; they must both be the same model of switch and run the same Cumulus Linux release. See Upgrading Cumulus Linux.
The dual-connected devices (servers or switches) can use LACP (IEEE 802.3ad or 802.1ax) to form the bond. In this case, the peer switches must also use LACP.
MLAG is not supported in a multiple VLAN-aware bridge configuration.
Both MLAG peers must use the same VXLAN device type (single or traditional).
Basic Configuration
To configure MLAG, you need to create a bond that uses LACP on the dual-connected devices and configure the interfaces (including bonds, VLANs, bridges, and peer links) on each peer switch. Follow these steps on each peer switch in the MLAG pair:
On the dual-connected device, such as a host or server that sends traffic to and from the switch, create a bond that uses LACP. The method you use varies with the type of device you are configuring.
If you cannot use LACP in your environment, you can configure the bonds in balance-xor mode.
Place every interface that connects to the MLAG pair from a dual-connected device into a bond, even if the bond contains only a single link on a single physical switch.
The following examples place swp1 in bond1 and swp2 in bond2.
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond1 description bond1-on-swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond2 description bond2-on-swp2
cumulus@leaf01:~$ nv config apply
Add the following lines to the /etc/network/interfaces file. The example also adds a description for the bonds (an alias), which is optional.
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto bond1
iface bond1
alias bond1 on swp1
bond-slaves swp1
...
auto bond2
iface bond2
alias bond2 on swp2
bond-slaves swp2
...
Add a unique MLAG ID to each bond.
You must specify a unique MLAG ID (clag-id) for every dual-connected bond on each peer switch so that switches know which links dual-connect or connect to the same host or switch. The value must be between 1 and 65535 and must be the same on both peer switches. A value of 0 disables MLAG on the bond.
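Because MLAG does not synchronize configuration between peers, a quick consistency check of the MLAG IDs can catch mistakes early. The sketch below validates the range and uniqueness on each switch and reports IDs that appear on only one peer. The function name check_mlag_ids and the bond-to-ID mappings are hypothetical, and a bond that exists on only one peer is reported for review even though it might be intentional.

```python
def check_mlag_ids(local: dict, peer: dict) -> list:
    """Return a list of problems found in two bond -> clag-id mappings."""
    problems = []
    for switch, bonds in (("local", local), ("peer", peer)):
        owner_by_id = {}
        for bond, clag_id in bonds.items():
            if clag_id == 0:
                continue  # 0 disables MLAG on the bond
            if not 1 <= clag_id <= 65535:
                problems.append(f"{switch} {bond}: clag-id {clag_id} is out of range")
            elif clag_id in owner_by_id:
                problems.append(
                    f"{switch} {bond}: clag-id {clag_id} already used by {owner_by_id[clag_id]}"
                )
            else:
                owner_by_id[clag_id] = bond
    local_ids = {i for i in local.values() if i != 0}
    peer_ids = {i for i in peer.values() if i != 0}
    for clag_id in sorted(local_ids ^ peer_ids):
        problems.append(f"clag-id {clag_id} is configured on only one peer")
    return problems


print(check_mlag_ids({"bond1": 1, "bond2": 2}, {"bond1": 1, "bond2": 2}))  # []
```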
The example commands below add an MLAG ID of 1 to bond1 and 2 to bond2:
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv config apply
In the /etc/network/interfaces file, add the line clag-id 1 to the auto bond1 stanza and clag-id 2 to the auto bond2 stanza:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto bond1
iface bond1
alias bond1 on swp1
bond-slaves swp1
clag-id 1
auto bond2
iface bond2
alias bond2 on swp2
bond-slaves swp2
clag-id 2
...
Add the bonds you created above to a bridge. The example commands below add bond1 and bond2 to a VLAN-aware bridge.
You must add all VLANs configured on the MLAG bond to the bridge so that traffic to the downstream device connected in MLAG redirects over the peerlink in case the MLAG bond fails.
Create the inter-chassis bond and the peer link VLAN (as a VLAN subinterface). You also need to provide the peer link IP address, the MLAG bond interfaces, the MLAG system MAC address, and the backup interface.
By default, Cumulus Linux configures the inter-chassis bond with the name peerlink and the peer link VLAN with the name peerlink.4094. Use peerlink.4094 to ensure that the VLAN is independent of the bridge and spanning tree forwarding decisions.
The peer link IP address is a link-local address that provides layer 3 connectivity between the peer switches.
NVIDIA provides a reserved range of MAC addresses for MLAG (between 44:38:39:ff:00:00 and 44:38:39:ff:ff:ff). Use a MAC address from this range to prevent conflicts with other interfaces in the same bridged network.
Do not use a multicast MAC address.
Do not use the same MAC address for different MLAG pairs; make sure you specify a different MAC address for each MLAG pair in the network.
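The three rules for the MLAG system MAC address can be checked mechanically. The sketch below tests whether an address falls in the reserved range, whether it is a multicast address (the least significant bit of the first octet is set), and whether another MLAG pair already uses it. The function names and the sample addresses are hypothetical.

```python
RESERVED_FIRST = 0x443839FF0000  # 44:38:39:ff:00:00
RESERVED_LAST = 0x443839FFFFFF   # 44:38:39:ff:ff:ff


def mac_to_int(mac: str) -> int:
    """Convert a colon-separated MAC address to an integer."""
    raw = bytes.fromhex(mac.replace(":", ""))
    if len(raw) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac}")
    return int.from_bytes(raw, "big")


def check_mlag_sys_mac(mac: str, macs_of_other_pairs=()) -> list:
    """Return a list of rule violations for an MLAG system MAC address."""
    value = mac_to_int(mac)
    problems = []
    if (value >> 40) & 0x01:
        problems.append("multicast address")
    if not RESERVED_FIRST <= value <= RESERVED_LAST:
        problems.append("outside the reserved MLAG range")
    if value in {mac_to_int(other) for other in macs_of_other_pairs}:
        problems.append("already used by another MLAG pair")
    return problems


print(check_mlag_sys_mac("44:38:39:ff:00:01"))  # []
```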
The backup IP address is any layer 3 backup interface for the peer link, which the switch uses when the peer link goes down. You must add the backup IP address, which must be different than the peer link IP address. Make sure that any route that does not use the peer link can reach the backup IP address. Use the loopback or management IP address of the switch.
Loopback or Management IP Address?
If your MLAG configuration has bridged uplinks (such as a campus network or a large, flat layer 2 network), use the peer switch eth0 address. When the peer link is down, the secondary switch routes towards the eth0 address using the OOB network (provided you have implemented an OOB network).
If your MLAG configuration has routed uplinks (a modern approach to the data center fabric network), use the peer switch loopback address. When the peer link is down, the secondary switch routes towards the loopback address using uplinks (towards the spine layer). If the primary switch also has a more significant problem (for example, switchd does not respond or stops), the secondary switch promotes itself to primary and the traffic flows.
When using BGP, to ensure IP connectivity between the loopbacks, the MLAG peer switches must use unique BGP ASNs; if they use the same ASN, you must bypass the BGP loop prevention check on the AS_PATH attribute.
The following examples show commands for both MLAG peers (leaf01 and leaf02).
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv config apply
To configure the backup link to a VRF, include the name of the VRF with the backup-ip parameter. The following example configures the backup link to VRF mgmt:
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv config apply
To configure the backup link to a VRF, include the name of the VRF with the backup-ip parameter. The following example configures the backup link to VRF mgmt:
Edit the /etc/network/interfaces file to add the following parameters, then run the sudo ifreload -a command.
The inter-chassis bond (peerlink) with two ports in the bond (swp49 and swp50 in the example command below)
The peerlink bond to the bridge
The peer link VLAN (peerlink.4094) with the backup IP address, the peer link IP address (linklocal), and the MLAG system MAC address (from the reserved range of addresses).
To configure the backup link to a VRF, include the name of the VRF with the clagd-backup-ip parameter. The following example configures the backup link to VRF RED:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 10.10.10.2 vrf RED
clagd-peer-ip linklocal
clagd-sys-mac 44:38:39:BE:EF:AA
...
Run the sudo ifreload -a command to apply all the configuration changes:
cumulus@leaf01:~$ sudo ifreload -a
To configure the backup link to a VRF, include the name of the VRF with the clagd-backup-ip parameter. The following example configures the backup link to VRF RED:
cumulus@leaf02:~$ sudo nano /etc/network/interfaces
...
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 10.10.10.1 vrf RED
clagd-peer-ip linklocal
clagd-sys-mac 44:38:39:BE:EF:AA
...
Run the sudo ifreload -a command to apply all the configuration changes:
cumulus@leaf02:~$ sudo ifreload -a
Do not add VLAN 4094 to the bridge VLAN list; you cannot configure VLAN 4094 for the peer link subinterface as a bridged VLAN with bridge VIDs under the bridge.
Do not use 169.254.0.1 as the MLAG peer link IP address; Cumulus Linux uses this address for BGP unnumbered interfaces.
When you configure MLAG manually in the /etc/network/interfaces file, the changes take effect when you bring the peer link interface up with the sudo ifreload -a command. Do not use systemctl restart clagd.service to apply the new configuration.
The MLAG bond does not support layer 3 configuration.
MLAG synchronizes the dynamic state between the two peer switches but it does not synchronize the switch configurations. After modifying the configuration of one peer switch, you must make the same changes to the configuration on the other peer switch. This applies to all configuration changes, including:
Port configuration, such as VLAN membership, MTU and bonding parameters.
Bridge configuration, such as spanning tree parameters or bridge properties.
Static address entries, such as static FDB entries and static IGMP entries.
QoS configuration, such as ACL entries.
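Because MLAG does not synchronize switch configuration, a small script can help spot drift between the two peers. The following Python sketch is illustrative only and is not an NVIDIA tool: it parses two /etc/network/interfaces excerpts and reports attributes that differ for the same interface. The function names (parse_stanzas, find_drift) and the set of attributes that are expected to differ between peers are assumptions made for this example.

```python
def parse_stanzas(text):
    """Parse ifupdown2-style text into {interface: {attribute: value}}."""
    stanzas = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line == "...":
            continue
        words = line.split()
        if words[0] == "auto":
            continue
        if words[0] == "iface" and len(words) > 1:
            # A new stanza starts; later attribute lines belong to it.
            current = stanzas.setdefault(words[1], {})
            continue
        if current is not None:
            current[words[0]] = " ".join(words[1:])
    return stanzas


# Attributes that legitimately differ between peers (assumed list).
EXPECTED_TO_DIFFER = {"address", "hwaddress", "clagd-backup-ip"}


def find_drift(local_text, peer_text):
    """Return sorted (interface, attribute, local, peer) tuples that differ."""
    local, peer = parse_stanzas(local_text), parse_stanzas(peer_text)
    drift = []
    for name in sorted(set(local) & set(peer)):
        for key in sorted(set(local[name]) | set(peer[name])):
            if key in EXPECTED_TO_DIFFER:
                continue
            a, b = local[name].get(key), peer[name].get(key)
            if a != b:
                drift.append((name, key, a, b))
    return drift


leaf01 = """
auto bond1
iface bond1
    bond-slaves swp1
    mtu 9216
    clag-id 1
"""
leaf02 = """
auto bond1
iface bond1
    bond-slaves swp1
    mtu 1500
    clag-id 1
"""
for item in find_drift(leaf01, leaf02):
    print(item)
```

The sketch only compares interfaces that exist on both switches and keeps the last value when an attribute repeats within a stanza, so treat its output as a hint rather than a complete audit.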
Optional Configuration
This section describes optional configuration procedures.
Set Roles and Priority
Each MLAG-enabled switch in the pair has a role. When the two switches establish the peering relationship, one switch takes the primary role and the other takes the secondary role. When an MLAG-enabled switch is in the secondary role, it does not send STP BPDUs on dual-connected links; it only sends BPDUs on single-connected links. The switch in the primary role sends STP BPDUs on all single- and dual-connected links.
By default, the switch determines the role by comparing the MAC addresses of the two sides of the peering link; the switch with the lower MAC address assumes the primary role. You can override this by setting the priority option for the peer link:
cumulus@leaf01:~$ nv set mlag priority 2084
cumulus@leaf01:~$ nv config apply
Edit the /etc/network/interfaces file and add the clagd-priority option, then run the ifreload -a command.
The switch with the lower priority value is in the primary role; the default value is 32768 and the range is between 0 and 65535.
When the MLAG service exits during switch reboot or if you stop the service on the primary switch, the peer switch that is in the secondary role becomes the primary.
However, if the primary switch goes down without stopping the MLAG service or if the peer link goes down, the secondary switch does not change its role. If the peer switch is not alive, the switch in the secondary role rolls back the LACP system ID to be the bond interface MAC address instead of the MLAG system MAC address (clagd-sys-mac). The switch in the primary role uses the MLAG system MAC address as the LACP system ID on the bonds.
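The role election described above can be sketched in a few lines of Python. This is an illustration, not the clagd implementation: it assumes that the priority is compared first and that the MAC address breaks a tie, which is consistent with the text (lower priority wins; with default priorities the lower MAC address wins) but is not stated explicitly. The function names are hypothetical.

```python
def mac_to_int(mac):
    """Convert a colon-separated MAC address to an integer for comparison."""
    return int(mac.replace(":", ""), 16)


def elect_roles(local, peer):
    """Return (local_role, peer_role) for two (priority, mac) tuples.

    Lower priority wins; on a tie the lower MAC address wins (assumed order).
    """
    local_key = (local[0], mac_to_int(local[1]))
    peer_key = (peer[0], mac_to_int(peer[1]))
    if local_key <= peer_key:
        return ("primary", "secondary")
    return ("secondary", "primary")


# Default priorities on both sides: the lower MAC address takes the primary role.
print(elect_roles((32768, "44:38:39:00:00:11"), (32768, "44:38:39:00:00:12")))

# The peer sets priority 2084, which overrides the MAC address comparison.
print(elect_roles((32768, "44:38:39:00:00:11"), (2084, "44:38:39:00:00:12")))
```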
Set clagctl Timers
The clagd service has several timers that you can tune for enhanced performance.
Timer
Description
--reloadTimer <seconds>
The number of seconds to wait for the peer switch to become active. If the peer switch does not become active after the timer expires, the MLAG bonds leave the initialization (protodown) state and become active. This provides clagd with sufficient time to determine whether the peer switch is coming up or if it is permanently unreachable. The default is 300 seconds.
--peerTimeout <seconds>
The number of seconds clagd waits without receiving any messages from the peer switch before it determines that the peer is no longer active. At this point, the switch reverts all configuration changes so that it operates as a standard non-MLAG switch. This includes removing all statically assigned MAC addresses, clearing the egress forwarding mask, and allowing addresses to move from any port to the peer port. After a message is again received from the peer, MLAG operation restarts. If this parameter is not specified, clagd uses ten times the local lacpPoll value.
--initDelay <seconds>
The number of seconds clagd delays bringing up MLAG bonds and anycast IP addresses. The default is 180 seconds. NVIDIA recommends you set this parameter to 300 seconds in a scaled environment. The timer resets to 0 automatically under the following conditions:
When the peer is not alive and the backup link is not active after a reload timeout
When the peer sends a goodbye (through the peer link or the backup link)
When both MLAG sessions come up at the same time
--sendTimeout <seconds>
The number of seconds clagd waits until the sending socket times out. If it takes longer than the sendTimeout value to send data to the peer, clagd generates an exception. The default is 30 seconds.
--lacpPoll <seconds>
The number of seconds clagd waits before obtaining local LACP information. The default is 2 seconds.
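To make the defaults in the table above easier to reason about, the following Python sketch computes the effective timer values, including the derived peerTimeout (ten times lacpPoll when you do not set it). It also renders a clagd-args line in the form used in the /etc/network/interfaces file. The helper names are hypothetical, and the sketch renders one timer per line because this section only shows that form.

```python
# Defaults taken from the timer descriptions above (seconds).
DEFAULTS = {"reloadTimer": 300, "initDelay": 180, "sendTimeout": 30, "lacpPoll": 2}
KNOWN_TIMERS = set(DEFAULTS) | {"peerTimeout"}


def effective_timers(overrides=None):
    """Merge overrides with the defaults and derive peerTimeout if unset."""
    timers = dict(DEFAULTS)
    timers.update(overrides or {})
    if "peerTimeout" not in timers:
        # When not specified, clagd uses ten times the local lacpPoll value.
        timers["peerTimeout"] = 10 * timers["lacpPoll"]
    return timers


def clagd_args_line(timer, seconds):
    """Render a single clagd-args line for the peerlink.4094 stanza."""
    if timer not in KNOWN_TIMERS:
        raise ValueError(f"unknown clagd timer: {timer}")
    return f"clagd-args --{timer} {seconds}"


print(effective_timers())
print(effective_timers({"lacpPoll": 3})["peerTimeout"])
print(clagd_args_line("initDelay", 100))
```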
The only timer you can set with NVUE is the initial delay timer. The following example NVUE command sets the initial delay to 100 seconds:
cumulus@leaf01:~$ nv set mlag init-delay 100
cumulus@leaf01:~$ nv config apply
To set the clagd timers, edit the /etc/network/interfaces file to add the clagd-args --<timer> line to the peerlink.4094 stanza, then run the ifreload -a command.
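The following example sets the initial delay timer to 100 seconds:
...
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.2
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 100
...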
To configure MLAG with a traditional mode bridge instead of a VLAN-aware mode bridge, you must configure the peer link and all dual-connected links as untagged (native) ports on a bridge (note the absence of any VLANs in the bridge-ports line and the lack of the bridge-vlan-aware parameter below):
...
auto br0
iface br0
bridge-ports peerlink bond1 bond2
...
The following example shows you how to allow VLAN 10 across the peer link:
...
auto br0.10
iface br0.10
bridge-ports peerlink.10 bond1.10 bond2.10
bridge-stp on
...
In an MLAG and traditional bridge configuration, NVIDIA recommends that you set bridge learning to off on all VLANs over the peerlink except for the layer 3 peer link subinterface; for example:
...
auto peerlink
iface peerlink
bridge-learning off
auto peerlink.1510
iface peerlink.1510
bridge-learning off
auto peerlink.4094
iface peerlink.4094
...
Configure a Backup UDP Port
By default, Cumulus Linux uses UDP port 5342 with the backup IP address. To change the backup UDP port, edit the /etc/network/interfaces file to add clagd-args --backupPort <port> to the auto peerlink.4094 stanza. For example:
Run the sudo ifreload -a command to apply all the configuration changes:
cumulus@leaf01:~$ sudo ifreload -a
Unconfigure MLAG
To unconfigure MLAG:
Run the following commands to unset MLAG, the peerlink, and the peerlink VLAN subinterface that Cumulus Linux creates automatically. You must apply all the unset commands together with a single nv config apply command.
Remove the auto peerlink stanza; for example, remove lines similar to the following:
...
auto peerlink
iface peerlink
bond-slaves swp49 swp50
auto peerlink.4094
iface peerlink.4094
clagd-backup-ip 10.10.10.2
clagd-peer-ip linklocal
clagd-sys-mac 44:38:39:BE:EF:AA
...
Remove the clag-id line from the bond stanzas. In the following example, remove clag-id 1 from the auto bond1 stanza and clag-id 2 from the auto bond2 stanza:
...
auto bond1
iface bond1
alias bond1 on swp1
bond-slaves swp1
clag-id 1
auto bond2
iface bond2
alias bond2 on swp2
bond-slaves swp2
clag-id 2
...
Remove peerlink from the bridge-ports line of the bridge stanza. In the following example, remove peerlink from the auto br_default stanza:
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink
bridge-vlan-aware yes
Run the sudo ifreload -a command:
cumulus@leaf01:~$ sudo ifreload -a
Best Practices
Follow these best practices when configuring MLAG on your switches.
MTU and MLAG
The bridge MTU determines the MTU in MLAG traffic. The lowest MTU setting of an interface that is a member of the bridge determines the bridge MTU. If you want to set an MTU other than the default of 9216 bytes, you must configure the MTU on each physical interface and the bond interface that is a member of every MLAG bridge in the entire bridged domain.
The following example commands set an MTU of 1500 for each of the bond interfaces (peer link, uplink, bond1, bond2), which are members of the bridge br_default:
cumulus@leaf01:~$ nv set interface peerlink.4094 link mtu 1500
cumulus@leaf01:~$ nv set interface uplink link mtu 1500
cumulus@leaf01:~$ nv set interface bond1 link mtu 1500
cumulus@leaf01:~$ nv set interface bond2 link mtu 1500
cumulus@leaf01:~$ nv config apply
Edit the /etc/network/interfaces file, then run the ifreload -a command. For example:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto br_default
iface br_default
bridge-ports peerlink uplink bond1 bond2
auto peerlink
iface peerlink
mtu 1500
auto bond1
iface bond1
mtu 1500
auto bond2
iface bond2
mtu 1500
auto uplink
iface uplink
mtu 1500
...
cumulus@leaf01:~$ sudo ifreload -a
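The rule that the lowest member MTU determines the bridge MTU is easy to check programmatically. The following Python sketch is a minimal illustration under the assumption that you already have the MTU of each bridge member in a dictionary; the function names are hypothetical and the script does not read anything from the switch.

```python
def bridge_mtu(member_mtus):
    """Return the bridge MTU: the lowest MTU among the member interfaces."""
    if not member_mtus:
        raise ValueError("bridge has no member interfaces")
    return min(member_mtus.values())


def mismatched_members(member_mtus, wanted):
    """List the members whose MTU differs from the MTU you intend to use."""
    return sorted(name for name, mtu in member_mtus.items() if mtu != wanted)


# bond2 is still at the default of 9216 while the other members use 1500.
members = {"peerlink": 1500, "uplink": 1500, "bond1": 1500, "bond2": 9216}
print(bridge_mtu(members))
print(mismatched_members(members, 1500))
```

Running the same check against both MLAG peers shows whether every member of the bridge carries the MTU you intend.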
STP and MLAG
Always enable STP in your layer 2 network and BPDU Guard on the host-facing bond interfaces.
The STP global configuration must be the same on both peer switches.
The STP configuration for dual-connected ports must be the same on both peer switches.
The STP priority must be the same on both peer switches.
In a multiple bridge configuration, the STP priority must be the same on all bridges on both peer switches.
To minimize convergence times when a link transitions to the forwarding state, configure the edge ports (for tagged and untagged frames) with PortAdminEdge and BPDU guard enabled.
Do not use a multicast MAC address for the LACP ID on systems connected to MLAG bonds; the switch drops STP BPDUs from a multicast MAC address.
Peer Link Sizing
The peer link carries little traffic when compared to the bandwidth consumed by data plane traffic. In a typical MLAG configuration, most connections between the two switches in the MLAG pair are dual-connected; the only traffic going across the peer link is traffic from the clagd process and some LLDP or LACP traffic. The switch does not forward traffic received on the peer link out of the dual-connected bonds.
However, there are some instances where a host connects to only one switch in the MLAG pair; for example:
You have a hardware limitation on the host where there is only one PCIe slot, and therefore, one NIC on the system, so the host is only single-connected across that interface.
The host does not support 802.3ad and you cannot create a bond on it.
You are accounting for a link failure, where the host becomes single connected until the failure resolves.
Determine how much bandwidth is traveling across the single-connected interfaces and allocate half of that bandwidth to the peer link. On average, one half of the traffic destined to the single-connected host arrives on the switch directly connected to the single-connected host and the other half arrives on the switch that is not directly connected to the single-connected host. When this happens, only the traffic that arrives on the switch that is not directly connected to the single-connected host needs to traverse the peer link.
In addition, you can add extra links to the peer link bond to handle link failures in the peer link bond itself.
Each host has two 10G links, one to each switch in the MLAG pair.
Each host has 20G of dual-connected bandwidth; all three hosts have a total of 60G of dual-connected bandwidth.
Set at least 15G of bandwidth for each peer link bond, which represents half of the single-connected bandwidth.
When planning for link failures for a full rack, you need only set enough bandwidth to meet your site strategy for handling failure scenarios. For example, for a full rack with 40 servers and two switches, you can plan for four to six servers to lose connectivity to a single switch and become single connected before you respond to the event. Therefore, if you have 40 hosts each with 20G of bandwidth dual-connected to the MLAG pair, you can set between 20G and 30G of bandwidth to the peer link, which accounts for half of the single-connected bandwidth for four to six hosts.
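The sizing rule above reduces to simple arithmetic: the peer link needs half of the bandwidth that single-connected hosts can send. The following Python sketch reproduces the numbers in this section; the function name is hypothetical, and it assumes that every single-connected host has one remaining link of the same speed.

```python
def peer_link_bandwidth_gbps(single_connected_hosts, link_gbps):
    """Return the peer link bandwidth to plan for, in Gbps.

    Half of the traffic to a single-connected host arrives on the switch that
    is not directly connected to it, so only that half crosses the peer link.
    """
    single_connected = single_connected_hosts * link_gbps
    return single_connected / 2


# Three hosts, each left with one 10G link.
print(peer_link_bandwidth_gbps(3, 10))

# Full rack: plan for four to six hosts to become single connected.
print(peer_link_bandwidth_gbps(4, 10), peer_link_bandwidth_gbps(6, 10))
```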
Peer Link Routing
When enabling a routing protocol in an MLAG environment, it is also necessary to manage the uplinks; by default, MLAG is not aware of layer 3 uplink interfaces. If there is a peer link failure, MLAG does not remove static routes or bring down a BGP or OSPF adjacency unless you use a separate link state daemon such as ifplugd.
When you use MLAG with VRR, set up a routed adjacency across the peerlink.4094 interface. If a routed connection is not built across the peer link, during an uplink failure on one of the switches in the MLAG pair, egress traffic does not forward if the destination is on the switch whose uplinks are down.
To set up the adjacency, configure a BGP or OSPF unnumbered peering, as appropriate for your network.
For switches with the Spectrum ASIC, the MLAG loop avoidance mechanism also drops routed traffic that arrives on an MLAG peer link interface and routes to a dual-connected VNI. If you need to route unencapsulated traffic to an MLAG peer switch for VXLAN forwarding to accommodate uplink failures or other design needs, configure a routing adjacency across a separate routed interface that is not the MLAG peerlink.
Switches with the Spectrum-2 ASIC and later allow packets arriving on the peer link to route to a VNI for VXLAN encapsulation.
cumulus@leaf01:~$ nv set interface peerlink.4094 router ospf area 0.0.0.1
cumulus@leaf01:~$ nv config apply
MLAG Routing Support
In addition to the routing adjacency over the peer link, Cumulus Linux supports routing adjacencies from attached network devices to MLAG switches under the following conditions:
The router must physically attach to a single interface of a switch.
The attached router must peer directly to a local address on the physically connected switch.
The router cannot:
Attach to the switch over an MLAG bond interface.
Form routing adjacencies to a virtual address (VRR or VRRP).
Troubleshooting
Use the following troubleshooting tips to check MLAG configuration.
Check MLAG Status
To verify MLAG configuration, run the nv show mlag command:
cumulus@leaf01:mgmt:~$ nv show mlag
                operational              applied            description
--------------  -----------------------  -----------------  ------------------------------------------------------
enable                                   on                 Turn the feature 'on' or 'off'. The default is 'off'.
debug                                    off                Enable MLAG debugging
init-delay                               100                The delay, in seconds, before bonds are brought up.
mac-address     44:38:39:be:ef:aa        44:38:39:BE:EF:AA  Override anycast-mac and anycast-id
peer-ip         fe80::4638:39ff:fe00:5a  linklocal          Peer Ip Address
priority        32768                    32768              Mlag Priority
[backup]        10.10.10.2               10.10.10.2         Set of MLAG backups
backup-active   False                                       Mlag Backup Status
backup-reason                                               Mlag Backup Reason
local-id        44:38:39:00:00:59                           Mlag Local Unique Id
local-role      primary                                     Mlag Local Role
peer-alive      True                                        Mlag Peer Alive Status
peer-id         44:38:39:00:00:5a                           Mlag Peer Unique Id
peer-interface  peerlink.4094                               Mlag Peerlink Interface
peer-priority   32768                                       Mlag Peer Priority
peer-role       secondary                                   Mlag Peer Role
Run the net show mlag command or the clagctl command to show the MLAG interface information:
cumulus@leaf01:mgmt:~$ net show clag
The peer is alive
Our Priority, ID, and Role: 32768 44:38:39:00:00:11 primary
Peer Priority, ID, and Role: 32768 44:38:39:00:00:12 secondary
Peer Interface and IP: peerlink.4094 fe80::4638:39ff:fe00:12 (linklocal)
Backup IP: 10.10.10.2 (active)
System MAC: 44:38:39:be:ef:aa
CLAG Interfaces
Our Interface      Peer Interface     CLAG Id   Conflicts              Proto-Down Reason
----------------   ----------------   -------   --------------------   -----------------
bond1              bond1              1         -                      -
bond2              bond2              2         -                      -
bond3              bond3              3         -                      -
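If you collect this output from many switches, a short parser can turn the summary lines into fields that a monitoring script can check. The following Python sketch is an assumption-laden illustration: it relies only on the labels shown in the example output above, and the function name and dictionary keys are hypothetical.

```python
def parse_clag_summary(output):
    """Extract peer state, roles, and backup status from clagctl-style text."""
    summary = {"peer_alive": False}
    for raw in output.splitlines():
        line = raw.strip()
        if line == "The peer is alive":
            summary["peer_alive"] = True
            continue
        label, sep, value = line.partition(":")
        if not sep:
            continue  # table rows and separators carry no label
        fields = value.split()
        if label == "Our Priority, ID, and Role" and len(fields) == 3:
            summary["local_priority"] = int(fields[0])
            summary["local_id"] = fields[1]
            summary["local_role"] = fields[2]
        elif label == "Peer Priority, ID, and Role" and len(fields) == 3:
            summary["peer_priority"] = int(fields[0])
            summary["peer_id"] = fields[1]
            summary["peer_role"] = fields[2]
        elif label == "Backup IP" and fields:
            summary["backup_ip"] = fields[0]
            summary["backup_active"] = "(active)" in fields
        elif label == "System MAC" and fields:
            summary["system_mac"] = fields[0]
    return summary


sample = """The peer is alive
Our Priority, ID, and Role: 32768 44:38:39:00:00:11 primary
Peer Priority, ID, and Role: 32768 44:38:39:00:00:12 secondary
Peer Interface and IP: peerlink.4094 fe80::4638:39ff:fe00:12 (linklocal)
Backup IP: 10.10.10.2 (active)
System MAC: 44:38:39:be:ef:aa
"""
print(parse_clag_summary(sample))
```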
Show All MLAG Settings
To see all MLAG settings, run the clagctl params command:
By default, when running, the clagd service logs status messages to the /var/log/clagd.log file and to syslog:
cumulus@spine01:~$ sudo tail /var/log/clagd.log
2016-10-03T20:31:50.471400+00:00 spine01 clagd[1235]: Initial config loaded
2016-10-03T20:31:52.479769+00:00 spine01 clagd[1235]: The peer switch is active.
2016-10-03T20:31:52.496490+00:00 spine01 clagd[1235]: Initial data sync to peer done.
2016-10-03T20:31:52.540186+00:00 spine01 clagd[1235]: Role is now primary; elected
2016-10-03T20:31:54.250572+00:00 spine01 clagd[1235]: HealthCheck: role via backup is primary
2016-10-03T20:31:54.252642+00:00 spine01 clagd[1235]: HealthCheck: backup active
2016-10-03T20:31:54.537967+00:00 spine01 clagd[1235]: Initial data sync from peer done.
2016-10-03T20:31:54.538435+00:00 spine01 clagd[1235]: Initial handshake done.
2016-10-03T22:47:35.255317+00:00 spine01 clagd[1235]: leaf01-02 is now dual connected.
Monitor the clagd Service
Due to the critical nature of the clagd service, systemd continuously monitors its status by receiving notify messages every 30 seconds. If the clagd service terminates or becomes unresponsive for any reason and systemd receives no messages after 60 seconds, systemd restarts the clagd service. systemd logs these failures in the /var/log/syslog file and, on the first failure, also generates a cl-supportfile.
Monitoring occurs automatically as long as:
You enable the clagd service.
You configure the peer IP address (clagd-peer-ip), the MLAG system MAC address (clagd-sys-mac), and the backup IP address (clagd-backup-ip) for an interface.
The clagd service is running. If you stop clagd with the systemctl stop clagd.service command, clagd monitoring also stops.
You can check if clagd is running with the systemctl status command:
cumulus@leaf01:~$ systemctl status clagd.service
● clagd.service - Cumulus Linux Multi-Chassis LACP Bonding Daemon
Loaded: loaded (/lib/systemd/system/clagd.service; enabled)
Active: active (running) since Fri 2021-06-11 16:17:19 UTC; 12min ago
Docs: man:clagd(8)
Main PID: 27078 (clagd)
CGroup: /system.slice/clagd.service
└─27078 /usr/bin/python3 /usr/sbin/clagd --daemon linklocal peerlink.4094 44:38:39:BE:EF:AA --priority 32768
Peer Link Consistency Check
When you make an MLAG configuration change, Cumulus Linux automatically validates the corresponding parameters on both MLAG peers and takes action based on the type of conflict it sees. For every conflict, the /var/log/clagd.log file records a log message.
The following table shows the conflict types and actions that Cumulus Linux takes.
Conflict
Type
Action
Bridge STP mode
Global
Protodown only the MLAG bonds on the secondary switch when there is an STP mode mismatch across peers.
MLAG native VLAN
Interface
Protodown only the MLAG bonds on the secondary switch when there is a native VLAN mismatch.
STP root bridge priority
Global
Protodown the MLAG bonds and VNIs on the secondary switch when there is an STP priority mismatch across peers.
MLAG system MAC address
Global
Protodown the MLAG bonds and VNIs on the secondary switch when there is an MLAG system MAC address mismatch across peers.
Peer IP
Global
Protodown the MLAG bonds and VNIs on the secondary switch when there is an IP address mismatch within the same subnet between peers. The consistency checker does not trigger an IP address mismatch between the linklocal keyword and a static IPv4 address, or between IPv4 addresses across subnets.
Peer link MTU
Global
Protodown the MLAG bonds and VNIs on the secondary switch when there is a peer link MTU mismatch across peers.
Peer link native VLAN
Global
Protodown the MLAG bonds and VNIs on the secondary switch when there is a peer link VLAN mismatch across peers. Protodown the MLAG bonds and VNIs on the secondary switch when there is no PVID.
VXLAN anycast IP address
Global
Protodown the MLAG bonds and VNIs on the secondary switch when there is an anycast IP address mismatch across peers. Protodown the MLAG bonds and VNIs on the node where there is no configured anycast IP address.
Peer link bridge member
Global
Protodown the MLAG bonds and VNIs on the MLAG switch where there is a peer link bridge member conflict.
The peer value always displays NOT-SYNCED for this consistency check because Cumulus Linux does not enforce the same interface name for the peerlink and because of limitations with traditional bridges.
MLAG bond bridge member
Interface
Protodown the MLAG bonds and VNIs on the MLAG switch if the MLAG bond is not a bridge member.
The peer value always displays NOT-SYNCED for this consistency check because Cumulus Linux does not enforce the same interface name for the peerlink and because of limitations with traditional bridges.
LACP partner MAC address
Interface
Protodown the MLAG bonds on the MLAG switch if there is an LACP partner MAC address mismatch or if there is a duplicate LACP partner MAC address.
MLAG VLANs
Interface
Suspend the inconsistent VLANs on either MLAG peer if the VLANs are not part of the peer link or if there is a mismatch of VLANs configured on the MLAG bonds between the MLAG peers.
Peer link VLANs
Global
Suspend the inconsistent VLANs on either MLAG peer on all the dual-connected MLAG bonds and VXLAN interfaces.
MLAG protocol version
Global
The consistency check records an MLAG protocol version mismatch between the MLAG peers. Cumulus Linux does not take any disruptive action.
MLAG package version
Global
The consistency check records an MLAG package version mismatch between the MLAG peers. Cumulus Linux does not take any disruptive action.
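The table above can be condensed into a lookup that answers two practical questions: is a conflict global or per interface, and does it disrupt traffic? The following Python sketch restates the table as data. The action labels are shortened summaries chosen for this example, not strings that Cumulus Linux prints, and the helper names are hypothetical.

```python
PROTODOWN_BONDS = "protodown MLAG bonds"
PROTODOWN_BONDS_VNIS = "protodown MLAG bonds and VNIs"
SUSPEND_VLANS = "suspend inconsistent VLANs"
LOG_ONLY = "log only"

# conflict name -> (type, summarized action), following the table above
CONFLICTS = {
    "Bridge STP mode": ("Global", PROTODOWN_BONDS),
    "MLAG native VLAN": ("Interface", PROTODOWN_BONDS),
    "STP root bridge priority": ("Global", PROTODOWN_BONDS_VNIS),
    "MLAG system MAC address": ("Global", PROTODOWN_BONDS_VNIS),
    "Peer IP": ("Global", PROTODOWN_BONDS_VNIS),
    "Peer link MTU": ("Global", PROTODOWN_BONDS_VNIS),
    "Peer link native VLAN": ("Global", PROTODOWN_BONDS_VNIS),
    "VXLAN anycast IP address": ("Global", PROTODOWN_BONDS_VNIS),
    "Peer link bridge member": ("Global", PROTODOWN_BONDS_VNIS),
    "MLAG bond bridge member": ("Interface", PROTODOWN_BONDS_VNIS),
    "LACP partner MAC address": ("Interface", PROTODOWN_BONDS),
    "MLAG VLANs": ("Interface", SUSPEND_VLANS),
    "Peer link VLANs": ("Global", SUSPEND_VLANS),
    "MLAG protocol version": ("Global", LOG_ONLY),
    "MLAG package version": ("Global", LOG_ONLY),
}


def is_disruptive(conflict):
    """Return True when the conflict makes the switch take a disruptive action."""
    return CONFLICTS[conflict][1] != LOG_ONLY


def conflicts_with_action(action):
    """Return the sorted names of all conflicts that trigger the given action."""
    return sorted(name for name, (_, act) in CONFLICTS.items() if act == action)


print(is_disruptive("Peer link MTU"))
print(conflicts_with_action(SUSPEND_VLANS))
```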
You can also manually check for MLAG inconsistencies with the following commands:
The following example command shows global MLAG settings for each peer and indicates that the MLAG system MAC address does not match.
The actions that Cumulus Linux takes when there is a conflict are disruptive. If you prefer, you can configure the switch to not take any action when there is a conflict. Edit the /etc/network/interfaces file to add the clagd-args --gracefulConsistencyCheck FALSE parameter in the peer link stanza.
You can expect a large volume of packet drops across one of the peer link interfaces. These drops serve to prevent looping of BUM (broadcast, unknown unicast, multicast) packets. When the switch receives a packet across the peer link, if the destination lookup results in an egress interface that is a dual-connected bond, the switch does not forward the packet (to prevent loops). The peer link records a dropped packet.
To check packet drops across peer link interfaces, run the ethtool -S <interface> command:
In addition to the standard UP and DOWN administrative states, an interface that is a member of an MLAG bond can also be in a protodown state. When MLAG detects a problem that can result in connectivity issues, it puts that interface into protodown state. Such connectivity issues include:
When the peer link goes down but the peer switch is up (the backup link is active).
When the bond has an MLAG ID but the clagd service is not running (you either stop the service or it crashes).
When an MLAG-enabled node boots or reboots, the switch puts the MLAG bonds in a protodown state until the node establishes a connection to its peer switch or until five minutes elapse.
When an interface goes into a protodown state, it results in a local OPER DOWN (carrier down) on the interface.
To show an interface in protodown state, run the Linux ip link show command or the net show bridge link command. For example:
cumulus@leaf01:mgmt:~$ ip link show
3: swp1 state DOWN: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 9216 master pfifo_fast master host-bond1 state DOWN mode DEFAULT qlen 500 protodown on
link/ether 44:38:39:00:69:84 brd ff:ff:ff:ff:ff:ff
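To find every interface in protodown state without reading the output by eye, you can filter the ip link show text for the protodown on flag. The following Python sketch is a minimal illustration that works on captured text; the function name is hypothetical and the sample lines are abbreviated stand-ins for real output.

```python
import re

# An interface header line starts with an index, then the interface name.
_HEADER = re.compile(r"^\d+:\s+([^\s:@]+)")


def protodown_interfaces(ip_link_output):
    """Return the names of interfaces whose header line shows 'protodown on'."""
    names = []
    for line in ip_link_output.splitlines():
        match = _HEADER.match(line)
        if match and "protodown on" in line:
            names.append(match.group(1))
    return names


sample = """3: swp1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 9216 master bond1 state DOWN mode DEFAULT qlen 500 protodown on
    link/ether 44:38:39:00:69:84 brd ff:ff:ff:ff:ff:ff
4: swp2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9216 master bond2 state UP mode DEFAULT qlen 500
    link/ether 44:38:39:00:69:85 brd ff:ff:ff:ff:ff:ff
"""
print(protodown_interfaces(sample))
```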
LACP Partner MAC Address Duplicate or Mismatch
Cumulus Linux puts interfaces in a protodown state under the following conditions:
When there is an LACP partner MAC address mismatch. For example, a bond comes up with a clag-id and the peer uses a bond with the same clag-id but a different LACP partner MAC address. The NVUE nv show mlag lacp-conflict or the Linux clagctl command output shows the protodown reason as a partner-mac-mismatch.
When there is a duplicate LACP partner MAC address. For example, there are multiple LACP bonds between the same two LACP endpoints. The NVUE nv show mlag lacp-conflict or the Linux clagctl command output shows the protodown reason as a duplicate-partner-mac.
To prevent a bond from coming up when an MLAG bond with an LACP partner MAC address already in use comes up, use the --clag-args --allowPartnerMacDup False option. This option puts the slaves of that bond interface in a protodown state and the clagctl output shows the protodown reason as a duplicate-partner-mac.
After you make the necessary cable or configuration changes to avoid the protodown state and you want MLAG to reevaluate the LACP partners, run the NVUE nv action clear mlag lacp-conflict command or the Linux clagctl clearconflictstate command to remove duplicate-partner-mac or partner-mac-mismatch from the protodown bonds, allowing them to come back up.
Configuration Example
The example below shows a basic MLAG configuration, where:
leaf01 and leaf02 are MLAG peers
MLAG is on three bonds, each with a single port, a peer link that is a bond with two member ports, and three VLANs on each port
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1-3,swp49-51
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv set mlag init-delay 100
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp1-3,swp49-51
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:~$ nv set interface bond3 bond member swp3
cumulus@leaf02:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf02:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:~$ nv set interface vlan30 ip address 10.1.30.3/24
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf02:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv set mlag init-delay 100
cumulus@leaf02:~$ nv config apply
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1-2
cumulus@spine01:~$ nv config apply
auto lo
iface lo inet loopback
address 10.10.10.1/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto bond2
iface bond2
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 2
auto bond3
iface bond3
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 3
auto vlan10
iface vlan10
address 10.1.10.2/24
hwaddress 44:38:39:22:01:b1
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.2/24
hwaddress 44:38:39:22:01:b1
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.2/24
hwaddress 44:38:39:22:01:b1
vlan-raw-device br_default
vlan-id 30
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.2
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 100
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
auto lo
iface lo inet loopback
address 10.10.10.2/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto bond2
iface bond2
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 2
auto bond3
iface bond3
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 3
auto vlan10
iface vlan10
address 10.1.10.3/24
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.3/24
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.3/24
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 30
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.1
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 100
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
This simulation starts with the example MLAG configuration. The demo is pre-configured using NVUE commands.
To validate the configuration, run the commands listed in the troubleshooting section above.
In Cumulus Linux, LACP bypass allows a bond configured in 802.3ad mode to become active and forward traffic even when there is no LACP partner. For example, you can enable a host that does not have the capability to run LACP to PXE boot while connected to a switch on a bond configured in 802.3ad mode. After the pre-boot process completes and the host is capable of running LACP, the normal 802.3ad link aggregation operation takes over.
LACP Bypass All-active Mode
In all-active mode, when a bond has multiple slave interfaces, each bond slave interface operates as an active link while the bond is in bypass mode. This is useful during PXE boot of a server with multiple NICs, when you cannot determine beforehand which port needs to be active.
All-active mode is not supported on bonds that are not specified as bridge ports on the switch.
STP does not run on the individual bond slave interfaces when the LACP bond is in all-active mode. Only use all-active mode on host-facing LACP bonds. Configure STP BPDU guard together with all-active mode.
In an MLAG deployment where bond slaves of a host connect to two switches and the bond is in all-active mode, all the slaves of the bond are active on both the primary and secondary MLAG nodes. If multiple physical NIC interfaces or more than one physical NIC is present on the physical host, NVIDIA recommends that you define which physical NIC or interface runs the PXE boot inside the PXE boot configuration file. If you do not define a specific NIC or interface, the switch sends a PXE boot request on all the interfaces in the bond and the PXE request fails.
Cumulus Linux does not support priority mode, bond-lacp-bypass-period, bond-lacp-bypass-priority, or bond-lacp-bypass-all-active.
Configure LACP Bypass
To enable LACP bypass on the host-facing bond:
The following commands create a VLAN-aware bridge with LACP bypass enabled:
cumulus@leaf01:~$ nv set interface bond1 bond member swp1-2
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv config apply
Edit the /etc/network/interfaces file to set the bond-lacp-bypass-allow option to yes, then run the ifreload -a command. The following configuration creates a VLAN-aware bridge with LACP bypass enabled.
To show the bond configuration, run the nv show interface <bond> command.
cumulus@leaf01:mgmt:~$ nv show interface bond1
operational applied description
----------------------- ----------------- ---------- ----------------------------------------------------------------------
type bond bond The type of interface
[acl] Interface ACL rules
bond
down-delay 0 0 bond down delay
lacp-bypass on lacp bypass
lacp-rate fast fast lacp rate
mode lacp bond mode
up-delay 0 0 bond up delay
[member] swp1 swp1 Set of bond members
mlag
enable on Turn the feature 'on' or 'off'. The default is 'off'.
id 1 1 MLAG id
peer-interface bond1 Peer interface
status dual Mlag Interface status
bridge
[domain] br_default br_default Bridge domains on this interface
...
To check the status of the link, run the Linux ip link show command on the bond and its slave interfaces:
cumulus@switch:~$ ip link show bond1
164: bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue master br0 state UP mode DORMANT group default
link/ether c4:54:44:f6:44:5a brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show swp1
55: swp1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT group default qlen 1000
link/ether c4:54:44:f6:44:5a brd ff:ff:ff:ff:ff:ff
cumulus@switch:~$ ip link show swp2
56: swp2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master bond1 state UP mode DEFAULT group default qlen 1000
link/ether c4:54:44:f6:44:5a brd ff:ff:ff:ff:ff:ff
Virtual Router Redundancy - VRR and VRRP
Cumulus Linux provides the option of using VRR or VRRP.
VRR enables hosts to communicate with any redundant switch without reconfiguration and without running dynamic routing protocols or router redundancy protocols. Redundant switches respond to ARP requests from hosts. The switches respond in an identical manner, but if one fails, the other redundant switches continue to respond. You use VRR with MLAG.
Use VRR when you connect multiple devices to a single logical connection, such as an MLAG bond. A device that connects to an MLAG bond believes there is a single device on the other end of the bond and only forwards one copy of the transit frames. If the destination of this frame is the virtual MAC address and you are running VRRP, the frame can go to the link connected to the VRRP standby device, which does not forward the frame to the right destination. With the virtual MAC active on both MLAG devices, either MLAG device handles the frame it receives.
VRRP allows two or more network devices in an active or standby configuration to share a single virtual default gateway. The physical VRRP switch that forwards packets at any given time is the master. If this VRRP switch fails, another VRRP standby switch automatically takes over as master. You use VRRP without MLAG.
Use VRRP when you have multiple distinct devices that connect to a layer 2 segment through multiple logical connections (not through a single bond). VRRP elects a single active forwarder that owns the virtual MAC address while it is active. This prevents the forwarding database of the layer 2 domain from continuously updating in response to MAC flaps because the switch receives frames sourced from the virtual MAC address from discrete logical connections.
You cannot configure both VRR and VRRP on the same switch.
VRR
The diagram below illustrates a basic VRR-enabled network configuration.
The network includes three servers and two Cumulus Linux switches. The switches use MLAG.
As the bridges in each of the redundant switches connect, they each receive and reply to ARP requests for the virtual router IP address.
Each ARP request by a server receives replies from each switch; these replies are identical, and the server receiving the replies either ignores replies after the first, or accepts them and overwrites the previous identical reply.
VRR uses the default fabric-wide MAC address 00:00:5E:00:01:01. If necessary, you can change the VRR MAC address.
Configure the Switches
The switches implement the layer 2 network interconnecting the servers and the redundant switches. To configure the switches, add a bridge with the following interfaces to each switch:
One bond interface or switch port interface to each server. For networks using MLAG, use bond interfaces. Otherwise, use switch port interfaces.
One or more interfaces to each peer switch. To accommodate higher bandwidth between the switches and to offer link redundancy, multiple inter-peer links are typically bonded interfaces. The VLAN interface must have a unique IP address for both the physical and virtual interface; the switch uses the unique address when it initiates an ARP request.
Cumulus Linux only supports VRR on an SVI. You cannot configure VRR on a physical interface or virtual subinterface.
The example commands below create a VLAN-aware bridge interface for a VRR-enabled network. The example assumes you have already configured a VLAN-aware bridge with VLAN 10 and that VLAN 10 has an IP address and uses the default fabric-wide VRR MAC address 00:00:5e:00:01:01.
cumulus@switch:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@switch:~$ nv set interface vlan10 ip vrr state up
cumulus@switch:~$ nv config apply
Use the same commands for IPv6 addresses; for example:
cumulus@switch:~$ nv set interface vlan10 ip vrr address 2001:db8::1/32
cumulus@switch:~$ nv set interface vlan10 ip vrr state up
Edit the /etc/network/interfaces file, then run the ifreload -a command.
Cumulus Linux sets a fabric-wide MAC address to ensure consistency across VRR switches, which is especially useful in an EVPN multi-fabric environment. If you prefer, you can change the VRR MAC address globally with one NVUE command. You can also override the global setting for a specific VLAN.
To set the VRR MAC address globally with one NVUE command, either:
Set the fabric-wide VRR MAC address to a value in the reserved range between 00:00:5E:00:01:00 and 00:00:5E:00:01:FF. Be sure to use an address in this reserved range to prevent MAC address conflicts with other interfaces in the same bridged network.
Set a fabric ID, from which Cumulus Linux derives the MAC address. You can specify a number between 1 and 255. Cumulus Linux adds the number to the MAC address 00:00:5E:00:01:00 in hex. For example, if you specify 255, the VRR MAC address is 00:00:5E:00:01:FF.
The default VRR MAC address is 00:00:5E:00:01:01, which the switch derives from a fabric ID setting of 1.
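The fabric ID arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the documented rule (the fabric ID is added to 00:00:5E:00:01:00); the function name fabric_id_to_vrr_mac is hypothetical and is not part of Cumulus Linux or NVUE.

```python
# Base of the reserved VRR range, 00:00:5E:00:01:00, as a 48-bit integer.
VRR_BASE_MAC = 0x00005E000100


def fabric_id_to_vrr_mac(fabric_id: int) -> str:
    """Return the VRR MAC address derived from a fabric ID (1-255)."""
    if not 1 <= fabric_id <= 255:
        raise ValueError("fabric ID must be between 1 and 255")
    value = VRR_BASE_MAC + fabric_id
    # Render the 48-bit value as six colon-separated hex octets.
    return ":".join(f"{octet:02X}" for octet in value.to_bytes(6, "big"))


if __name__ == "__main__":
    print(fabric_id_to_vrr_mac(1))    # 00:00:5E:00:01:01 (the default)
    print(fabric_id_to_vrr_mac(255))  # 00:00:5E:00:01:FF
```

A helper like this can be useful for checking that a planned fabric ID yields the MAC address you expect before you apply it to both MLAG peers.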
To change a VRR MAC address globally on the switch, run the nv set system global fabric-mac <mac-address> command:
cumulus@switch:mgmt:~$ nv set system global fabric-mac 00:00:5E:00:01:FF
cumulus@switch:mgmt:~$ nv config apply
To set a fabric ID, run the nv set system global fabric-id <number> command:
cumulus@switch:mgmt:~$ nv set system global fabric-id 255
cumulus@switch:mgmt:~$ nv config apply
To override the global setting for a specific VLAN, run the nv set interface <vlan> ip vrr mac-address <mac-address> command:
cumulus@switch:mgmt:~$ nv set interface vlan10 ip vrr mac-address 00:00:5E:00:01:00
cumulus@switch:mgmt:~$ nv config apply
To change the VRR MAC address manually, edit the /etc/network/interfaces file and update the MAC address in the address-virtual line for each VLAN. Cumulus Linux does not provide a fabric ID option in the /etc/network/interfaces file.
The following example shows vlan10, vlan20, and vlan30:
cumulus@switch:mgmt:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan10
address 10.1.10.5/24
address-virtual 00:00:5E:00:01:FF 10.1.10.1/24
hwaddress 44:38:39:22:01:c1
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.5/24
address-virtual 00:00:5E:00:01:FF 10.1.20.1/24
hwaddress 44:38:39:22:01:c1
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.5/24
address-virtual 00:00:5E:00:01:FF 10.1.30.1/24
hwaddress 44:38:39:22:01:c1
vrf BLUE
vlan-raw-device br_default
vlan-id 30
...
Make sure to set the same VRR MAC address on both MLAG peers.
EVPN Routing with VRR
In an EVPN routing environment, if you want to configure multiple subnets as VRR addresses on a VLAN, you must configure them with the same VRR MAC address.
The following example commands configure both 10.1.10.1/24 and 10.1.11.1/24 on VLAN 10 using the default fabric-wide VRR MAC address 00:00:5e:00:01:01.
cumulus@switch:mgmt:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@switch:mgmt:~$ nv set interface vlan10 ip vrr address 10.1.11.1/24
cumulus@switch:mgmt:~$ nv config apply
Edit the /etc/network/interfaces file; for example:
To reduce BGP EVPN processing during convergence, NVIDIA recommends that you use the same fabric-wide MAC address across all VLANs and VRR subnets.
Configure the Servers
Each server must have two network interfaces. The switches configure the interfaces as bonds running LACP; the servers must also configure the two interfaces using teaming, port aggregation, port group, or EtherChannel running LACP. Configure the servers either statically or with DHCP, with a gateway address that is the IP address of the virtual router; this default gateway address never changes.
Configure the links between the servers and the switches in active-active mode for FHRP.
Troubleshooting
To verify the configuration on the switch, run the net show interface command:
cumulus@leaf01:mgmt:~$ net show interface
State Name Spd MTU Mode LLDP Summary
----- ------------- --- ----- ------------ ----------------------- -----------------------
UP lo N/A 65536 Loopback IP: 127.0.0.1/8
lo IP: 10.10.10.1/32
lo IP: ::1/128
UP eth0 1G 1500 Mgmt oob-mgmt-switch (swp10) Master: mgmt(UP)
eth0 IP: 192.168.200.11/24
UP swp1 1G 9216 BondMember Master: bond1(UP)
UP swp2 1G 9216 BondMember Master: bond2(UP)
UP swp49 1G 9216 BondMember Master: peerlink(UP)
UP swp50 1G 9216 BondMember Master: peerlink(UP)
UP swp51 1G 9216 Default
UP bond1 1G 9216 802.3ad Master: br_default(UP)
bond1 Bond Members: swp1(UP)
UP bond2 1G 9216 802.3ad Master: br_default(UP)
bond2 Bond Members: swp2(UP)
UP br_default N/A 9216 Bridge/L2
UP mgmt N/A 65536 VRF IP: 127.0.0.1/8
mgmt IP: ::1/128
UP peerlink 2G 9216 802.3ad Master: br_default(UP)
peerlink Bond Members: swp49(UP)
peerlink Bond Members: swp50(UP)
UP peerlink.4094 2G 9216 Default
UP vlan10 N/A 9216 Interface/L3 IP: 10.1.10.2/24
UP vlan10-v0 N/A 9216 Interface/L3 IP: 10.1.10.1/24
...
Configuration Example
The following example creates an MLAG configuration that incorporates VRR.
cumulus@leaf01:mgmt:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:mgmt:~$ nv set interface swp1-3,swp49-51
cumulus@leaf01:mgmt:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:mgmt:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:mgmt:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:mgmt:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:mgmt:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:mgmt:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf01:mgmt:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:mgmt:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:mgmt:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:mgmt:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:mgmt:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:mgmt:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:mgmt:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:mgmt:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf01:mgmt:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf01:mgmt:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:mgmt:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf01:mgmt:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf01:mgmt:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:mgmt:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf01:mgmt:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf01:mgmt:~$ nv config apply
cumulus@leaf02:mgmt:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:mgmt:~$ nv set interface swp1-3,swp49-51
cumulus@leaf02:mgmt:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:mgmt:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:mgmt:~$ nv set interface bond3 bond member swp3
cumulus@leaf02:mgmt:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:mgmt:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:mgmt:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf02:mgmt:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf02:mgmt:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:mgmt:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:mgmt:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:mgmt:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:mgmt:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf02:mgmt:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:mgmt:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf02:mgmt:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf02:mgmt:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:mgmt:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf02:mgmt:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf02:mgmt:~$ nv set interface vlan30 ip address 10.1.30.3/24
cumulus@leaf02:mgmt:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf02:mgmt:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf02:mgmt:~$ nv config apply
To validate the configuration, run the nv show interface <vlan> ip vrr command:
cumulus@leaf02:mgmt:~$ nv show interface vlan10 ip vrr
operational applied description
----------- ----------------- ----------------- ------------------------------------------------------
enable on Turn the feature 'on' or 'off'. The default is 'off'.
mac-address 00:00:5e:00:01:00 00:00:5e:00:01:00 Override anycast-mac
mac-id none Override anycast-id
[address] 10.1.10.1/24 10.1.10.1/24 Virtual addresses with prefixes
state up up The state of the interface
VRRP
VRRP allows two or more network devices in an active standby configuration to share a single virtual default gateway. The VRRP router that forwards packets at any given time is the master. If this VRRP router fails, another VRRP standby router automatically takes over as master. The master sends VRRP advertisements to other VRRP routers in the same virtual router group, which include the priority and state of the master. VRRP router priority determines the role that each virtual router plays and who becomes the new master if the master fails.
All virtual routers use 00:00:5E:00:01:XX for IPv4 gateways or 00:00:5E:00:02:XX for IPv6 gateways as their MAC address. The last byte of the address is the Virtual Router IDentifier (VRID), which is different for each virtual router in the network. Only one physical router uses this MAC address at a time. The router replies with this address when it receives ARP requests or neighbor solicitation packets for the IP addresses of the virtual router.
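As a rough illustration of the addressing rule above, the following Python sketch builds the virtual MAC address from a VRID. The function name vrrp_virtual_mac is hypothetical; the sketch only encodes the 00:00:5E:00:01:XX (IPv4) and 00:00:5E:00:02:XX (IPv6) patterns, with the VRID written as the last byte in hexadecimal.

```python
def vrrp_virtual_mac(vrid: int, ipv6: bool = False) -> str:
    """Return the VRRP virtual MAC address for a VRID between 1 and 255."""
    if not 1 <= vrid <= 255:
        raise ValueError("VRID must be between 1 and 255")
    # The fifth octet selects the address family: 01 for IPv4, 02 for IPv6.
    family_octet = 0x02 if ipv6 else 0x01
    # The sixth octet is the VRID itself, in hexadecimal.
    return f"00:00:5e:00:{family_octet:02x}:{vrid:02x}"


if __name__ == "__main__":
    print(vrrp_virtual_mac(44))             # 00:00:5e:00:01:2c
    print(vrrp_virtual_mac(44, ipv6=True))  # 00:00:5e:00:02:2c
```

Note that the VRID is decimal in the configuration but hexadecimal in the MAC address, so VRID 44 appears as 2c.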
Cumulus Linux supports both VRRPv2 and VRRPv3. The default protocol version is VRRPv3.
You can configure a maximum of 255 virtual routers on a switch.
The following example illustrates a basic VRRP configuration.
Configure VRRP
To configure VRRP, specify the following information on each switch:
A virtual router ID (VRID) that identifies the group of VRRP routers. You must specify the same ID across all virtual routers in the group.
One or more virtual IP addresses for the virtual router group. These IP addresses do not directly connect to a specific interface. The switch redirects inbound packets destined for a virtual IP address to a physical network interface.
You can also set these optional parameters:
Optional Parameter | Default Value | Description
priority | 100 | The priority level of the virtual router within the virtual router group, which determines the role that each virtual router plays and what happens if the master fails. Virtual routers have a priority between 1 and 254; the router with the highest priority becomes the master.
advertisement interval | 1000 milliseconds | The advertisement interval is the interval between successive advertisements by the master in a virtual router group. You can specify a value between 10 and 40950.
preempt | enabled | Preempt mode lets the router take over as master for a virtual router group if it has a higher priority than the current master. Preempt mode is on by default. To disable preempt mode, edit the /etc/frr/frr.conf file to add the line no vrrp <VRID> preempt to the interface stanza, then restart the FRR service.
version | 3 | The VRRP protocol version. You can specify a value of either 2 or 3.
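To give a feel for how priority and advertisement interval interact, the sketch below computes how long a backup router waits before it declares the master down. The formulas come from the VRRPv3 specification (RFC 5798), not from this guide, and the function names are hypothetical; the values an implementation reports can differ because of rounding or unit handling, so treat this as an approximation.

```python
def skew_time_cs(priority: int, master_adver_interval_cs: int) -> int:
    """Skew time in centiseconds, per RFC 5798: higher priority means less skew."""
    return ((256 - priority) * master_adver_interval_cs) // 256


def master_down_interval_cs(priority: int, master_adver_interval_cs: int) -> int:
    """Time in centiseconds a backup waits without advertisements before taking over."""
    return 3 * master_adver_interval_cs + skew_time_cs(priority, master_adver_interval_cs)


if __name__ == "__main__":
    # Assumed example: a backup with the default priority (100) and a master
    # advertising every 5000 ms (500 centiseconds).
    interval_cs = 5000 // 10
    down_cs = master_down_interval_cs(100, interval_cs)
    print(f"master down interval: {down_cs * 10} ms")
```

Because the skew shrinks as priority grows, the highest-priority backup times out first and is the most likely to take over as master.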
The following example commands configure two switches (spine01 and spine02) that form one virtual router group (VRID 44) with IPv4 address 10.0.0.1/24 and IPv6 address 2001:0db8::1/64. spine01 is the master; it has a priority of 254. spine02 is the backup VRRP router.
The parent interface must use a primary address as the source address on VRRP advertisement packets.
cumulus@spine01:~$ nv set interface swp1 ip address 10.0.0.2/24
cumulus@spine01:~$ nv set interface swp1 ip address 2001:0db8::2/64
cumulus@spine01:~$ nv set interface swp1 ip vrrp virtual-router 44 address 10.0.0.1
cumulus@spine01:~$ nv set interface swp1 ip vrrp virtual-router 44 address 2001:0db8::1
cumulus@spine01:~$ nv set interface swp1 ip vrrp virtual-router 44 priority 254
cumulus@spine01:~$ nv set interface swp1 ip vrrp virtual-router 44 advertisement-interval 5000
cumulus@spine01:~$ nv config apply
cumulus@spine02:~$ nv set interface swp1 ip address 10.0.0.3/24
cumulus@spine02:~$ nv set interface swp1 ip address 2001:0db8::3/64
cumulus@spine02:~$ nv set interface swp1 ip vrrp virtual-router 44 address 10.0.0.1
cumulus@spine02:~$ nv set interface swp1 ip vrrp virtual-router 44 address 2001:0db8::1
cumulus@spine02:~$ nv config apply
Edit the /etc/network/interfaces file to assign an IP address to the parent interface; for example:
cumulus@spine01:~$ sudo vi /etc/network/interfaces
...
auto swp1
iface swp1
address 10.0.0.2/24
address 2001:0db8::2/64
Enable the vrrpd daemon, then start the FRR service. See FRRouting.
To show virtual router information on a switch, run the vtysh show vrrp <VRID> command or the net show vrrp <VRID> command. For example:
cumulus@spine01:~$ sudo vtysh -c "show vrrp 44"
Virtual Router ID 44
Protocol Version 3
Autoconfigured No
Shutdown No
Interface swp1
VRRP interface (v4) vrrp4-3-1
VRRP interface (v6) vrrp6-3-1
Primary IP (v4) 10.0.0.2
Primary IP (v6) 2001:0db8::2
Virtual MAC (v4) 00:00:5e:00:01:01
Virtual MAC (v6) 00:00:5e:00:02:01
Status (v4) Master
Status (v6) Master
Priority 254
Effective Priority (v4) 254
Effective Priority (v6) 254
Preempt Mode Yes
Accept Mode Yes
Advertisement Interval 5000 ms
Master Advertisement Interval (v4) 0 ms
Master Advertisement Interval (v6) 5000 ms
Advertisements Tx (v4) 17
Advertisements Tx (v6) 17
Advertisements Rx (v4) 0
Advertisements Rx (v6) 0
Gratuitous ARP Tx (v4) 1
Neigh. Adverts Tx (v6) 1
State transitions (v4) 2
State transitions (v6) 2
Skew Time (v4) 0 ms
Skew Time (v6) 0 ms
Master Down Interval (v4) 0 ms
Master Down Interval (v6) 0 ms
IPv4 Addresses 1
. . . . . . . . . . . . . . . . . . 10.0.0.1
IPv6 Addresses 1
. . . . . . . . . . . . . . . . . . 2001:0db8::1
IGMP and MLD Snooping
IGMP and MLD snooping prevent hosts on a local network from receiving traffic for a multicast group they have not explicitly joined. IGMP snooping is for IPv4 environments and MLD snooping is for IPv6 environments.
The bridge driver in the Cumulus Linux kernel includes IGMP and MLD snooping. If you disable IGMP or MLD snooping, multicast traffic floods to all the bridge ports in the bridge. Similarly, in the absence of receivers in a VLAN, multicast traffic floods to all ports in the VLAN.
Configure the IGMP and MLD Querier
Without a multicast router, a single switch in an IP subnet can coordinate multicast traffic flows. This switch is the querier or the designated router. The querier generates query messages to check group membership, and processes membership reports and leave messages.
To configure the querier on the switch for a VLAN-aware bridge, enable the multicast querier on the bridge and add the source IP address of the queries to the VLAN.
The following configuration example enables the multicast querier and sets the source IP address of the queries to 10.10.10.1 (the loopback address of the switch).
cumulus@switch:~$ nv set bridge domain br_default multicast snooping querier enable on
cumulus@switch:~$ nv set bridge domain br_default vlan 10 multicast snooping querier source-ip 10.10.10.1
cumulus@switch:~$ nv config apply
NVUE commands for a bridge in traditional mode are not supported.
Edit the /etc/network/interfaces file to add bridge-mcquerier 1 to the bridge stanza (this enables the multicast querier on the bridge) and add bridge-igmp-querier-src <ip-address> to the VLAN stanza (this is the source IP address of the queries).
Run the ifreload -a command to reload the configuration:
cumulus@switch:~$ sudo ifreload -a
To configure the querier on the switch for a bridge in traditional mode, edit the bridge stanza in the /etc/network/interfaces file to add bridge-mcquerier 1 (this enables the multicast querier on the bridge) and bridge-mcqifaddr 1 (this configures the source IP address of the queries to be the bridge IP address).
...
auto br0
iface br0
address 10.10.10.10/24
bridge-ports swp1 swp2 swp3
bridge-vlan-aware no
bridge-mcquerier 1
bridge-mcqifaddr 1
...
Run the ifreload -a command to reload the configuration:
cumulus@switch:~$ sudo ifreload -a
Optimized Multicast Flooding (OMF)
IGMP snooping restricts multicast forwarding only to the ports that receive IGMP report messages. If the ports do not receive IGMP reports, multicast traffic floods to all ports in the bridge domain (also known as unregistered multicast (URMC) traffic). To restrict this flooding to only mrouter ports, you can enable OMF.
In the IGMP snooping unregistered L2 multicast flood control section of the /etc/cumulus/switchd.conf file, uncomment and change these settings to TRUE, then restart switchd.
bridge.unreg_mcast_init
bridge.unreg_v4_mcast_prune
bridge.unreg_v6_mcast_prune
cumulus@switch:~$ sudo nano /etc/cumulus/switchd.conf
...
#IGMP snooping unregistered L2 multicast flood control
#
#Initialize prune module:
bridge.unreg_mcast_init = TRUE
#
#Note:
#Below configuration allowed only when bridge.unreg_mcast_init is set to TRUE
#
#Set below to TRUE to enable unregistered L2 multicast prune to mrouter ports.
#Default is to flood the unregistered L2 multicast
#
bridge.unreg_v4_mcast_prune = TRUE
bridge.unreg_v6_mcast_prune = TRUE
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
When IGMP reports go to a multicast group, OMF has no effect; normal IGMP snooping occurs.
When you enable OMF, you can configure a bridge port as an mrouter port to forward unregistered multicast traffic to that port.
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/network/interfaces file to add bridge-portmcrouter enabled to the swp1 stanza.
You can configure a bridge to use IGMPv2 or IGMPv3. IGMPv2 is the default version. To change the IGMP version, add the bridge-igmp-version <version> parameter to the bridge stanza in the /etc/network/interfaces file. For example, to change the IGMP version to IGMPv3:
Run the ifreload -a command to reload the configuration:
cumulus@switch:~$ sudo ifreload -a
Troubleshooting
To show the IGMP and MLD snooping bridge state, run the brctl showstp <bridge> command:
cumulus@switch:~$ sudo brctl showstp bridge
bridge
bridge id 8000.7072cf8c272c
designated root 8000.7072cf8c272c
root port 0 path cost 0
max age 20.00 bridge max age 20.00
hello time 2.00 bridge hello time 2.00
forward delay 15.00 bridge forward delay 15.00
ageing time 300.00
hello timer 0.00 tcn timer 0.00
topology change timer 0.00 gc timer 263.70
hash elasticity 4096 hash max 4096
mc last member count 2 mc init query count 2
mc router 1 mc snooping 1
mc last member timer 1.00 mc membership timer 260.00
mc querier timer 255.00 mc query interval 125.00
mc response interval 10.00 mc init query interval 31.25
mc querier 0 mc query ifaddr 0
flags
swp1 (1)
port id 8001 state forwarding
designated root 8000.7072cf8c272c path cost 2
designated bridge 8000.7072cf8c272c message age timer 0.00
designated port 8001 forward delay timer 0.00
designated cost 0 hold timer 0.00
mc router 1 mc fast leave 0
flags
swp2 (2)
port id 8002 state forwarding
designated root 8000.7072cf8c272c path cost 2
designated bridge 8000.7072cf8c272c message age timer 0.00
designated port 8002 forward delay timer 0.00
designated cost 0 hold timer 0.00
mc router 1 mc fast leave 0
flags
swp3 (3)
port id 8003 state forwarding
designated root 8000.7072cf8c272c path cost 2
designated bridge 8000.7072cf8c272c message age timer 0.00
designated port 8003 forward delay timer 8.98
designated cost 0 hold timer 0.00
mc router 1 mc fast leave 0
flags
Cumulus Linux tracks multicast group and port state in the MDB. To show the groups and bridge port state, run the Linux sudo bridge mdb show command. To show detailed router ports and group information, run the sudo bridge -d -s mdb show command:
cumulus@switch:~$ sudo bridge -d -s mdb show
dev bridge port swp2 grp 234.10.10.10 temp 241.67
dev bridge port swp1 grp 238.39.20.86 permanent 0.00
dev bridge port swp1 grp 234.1.1.1 temp 235.43
dev bridge port swp2 grp ff1a::9 permanent 0.00
router ports on bridge: swp3
Scale Considerations
The number of unique multicast groups supported in the MDB is 4096 by default. To increase the maximum number of multicast groups in the MDB, edit the /etc/network/interfaces file to add a bridge-hashmax value to the bridge stanza:
The supported values for bridge-hashmax are 512, 1024, 2048, 4096, 8192, 16384, 32768, 65536.
Spectrum 1 switches limit multicast groups to 16300 in the MDB with OMF disabled and 14800 multicast groups with OMF enabled.
On Spectrum 1 switches, to support this upper limit you must change the forwarding resource profile to rash-custom-profile1, then restart switchd.
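When planning a bridge-hashmax value, one simple approach is to pick the smallest supported value that covers the number of groups you expect. The Python sketch below does only that; the function name smallest_hashmax is hypothetical, and the sketch assumes that the value must be at least the expected group count. It does not model the platform limits described above.

```python
# Supported bridge-hashmax values, as listed in this section.
SUPPORTED_HASHMAX = (512, 1024, 2048, 4096, 8192, 16384, 32768, 65536)


def smallest_hashmax(expected_groups: int) -> int:
    """Return the smallest supported bridge-hashmax that fits expected_groups."""
    if expected_groups < 0:
        raise ValueError("expected_groups must not be negative")
    for value in SUPPORTED_HASHMAX:
        if value >= expected_groups:
            return value
    raise ValueError("expected_groups exceeds the largest supported bridge-hashmax")


if __name__ == "__main__":
    print(smallest_hashmax(4096))  # 4096, the default is already sufficient
    print(smallest_hashmax(5000))  # 8192
```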
DIP-based Multicast Forwarding
Cumulus Linux does not support DIP-based multicast forwarding. Do not configure the 224.0.0.x through 239.0.0.x and 224.128.0.x through 239.128.0.x IP ranges as multicast groups, which map to link-local MAC addresses (01:00:5e:00:00:xx).
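The reason these ranges collide is the standard IPv4 multicast mapping: only the low 23 bits of the group address are copied into the MAC address after the 01:00:5e prefix. The Python sketch below illustrates the mapping; the function name multicast_mac is hypothetical and the code is an illustration of the mapping rule, not of how the switch programs its hardware.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group address to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    # Keep only the low 23 bits of the group address; the rest is discarded.
    low23 = int(addr) & 0x7FFFFF
    mac = 0x01005E000000 | low23
    return ":".join(f"{octet:02x}" for octet in mac.to_bytes(6, "big"))


if __name__ == "__main__":
    # Both groups land on the same link-local MAC address.
    print(multicast_mac("224.0.0.5"))    # 01:00:5e:00:00:05
    print(multicast_mac("239.128.0.5"))  # 01:00:5e:00:00:05
    # A group outside the listed ranges maps elsewhere.
    print(multicast_mac("239.1.1.1"))    # 01:00:5e:01:01:01
```

Because the top bit of the second octet is dropped, x.0.0.y and x.128.0.y always map to the same MAC address.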
VXLAN is a standard overlay protocol that abstracts logical virtual networks from the physical network underneath. You can deploy simple and scalable layer 3 Clos architectures while extending layer 2 segments over that layer 3 network.
VXLAN uses a VLAN-like encapsulation technique to encapsulate MAC-based layer 2 Ethernet frames within layer 3 UDP packets. Each virtual network is a VXLAN logical layer 2 segment. For multi-tenancy, VXLAN scales to 16 million segments by using a 24-bit VXLAN network identifier (VNI ID) in the VXLAN header.
Hosts on a given virtual network join together through an overlay protocol that initiates and terminates tunnels at the edge of the multi-tenant network, typically the hypervisor vSwitch or top of rack. These edge points are the VXLAN tunnel end points (VTEP).
Cumulus Linux can start and stop VTEPs in hardware and supports wire-rate VXLAN. VXLAN provides an efficient hashing scheme across the IP fabric during the encapsulation process; the source UDP port is unique, with the hash based on layer 2 through layer 4 information from the original frame. The UDP destination port is the standard port 4789.
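The idea behind the source port selection can be sketched as follows: hash fields of the original frame and map the result into a UDP port range, so that packets of one flow always take the same path while different flows spread across the fabric. This Python sketch is only a conceptual model. The CRC32 hash, the chosen fields, and the 49152 to 65535 range (the dynamic port range that RFC 7348 recommends) are assumptions for illustration; the hash that the switch ASIC actually uses is not described in this guide.

```python
import zlib

VXLAN_DEST_PORT = 4789   # standard VXLAN destination port
SOURCE_PORT_BASE = 49152  # assumed start of the source port range
SOURCE_PORT_SPAN = 16384  # 49152 through 65535


def vxlan_source_port(src_mac: str, dst_mac: str, src_ip: str, dst_ip: str,
                      protocol: int, src_port: int, dst_port: int) -> int:
    """Derive a per-flow UDP source port from inner layer 2 to layer 4 fields."""
    key = "|".join(
        str(field)
        for field in (src_mac, dst_mac, src_ip, dst_ip, protocol, src_port, dst_port)
    ).encode()
    # CRC32 stands in for the hardware hash; it is deterministic per flow.
    return SOURCE_PORT_BASE + (zlib.crc32(key) % SOURCE_PORT_SPAN)


if __name__ == "__main__":
    flow = ("44:38:39:00:00:01", "44:38:39:00:00:02",
            "10.1.10.101", "10.1.10.104", 6, 33000, 443)
    print(vxlan_source_port(*flow), "->", VXLAN_DEST_PORT)
```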
Cumulus Linux does not support VXLAN encapsulation over layer 3 subinterfaces (for example, swp3.111) or SVIs because traffic transiting through the switch drops, even if you use the subinterface only for underlay traffic and it does not perform VXLAN encapsulation. Only configure VXLAN uplinks as layer 3 interfaces without any subinterfaces (for example, swp3).
The VXLAN tunnel endpoints cannot share a common subnet; there must be at least one layer 3 hop between the VXLAN source and destination.
Considerations
Cut-through Mode and Store and Forward Switching
On switches with the NVIDIA Spectrum ASICs, Cumulus Linux supports cut-through mode for VXLANs but does not support store and forward switching.
MTU Size for Virtual Network Interfaces
The maximum transmission unit (MTU) size for a virtual network interface should be 50 bytes smaller than the MTU for the physical interfaces on the switch. For more information on setting MTU, read Layer 1 and Switch Port Attributes.
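The 50-byte figure matches the usual VXLAN overhead for an IPv4 underlay: the outer IPv4, UDP, and VXLAN headers plus the encapsulated Ethernet header all have to fit inside the physical interface MTU. The Python sketch below shows the arithmetic. The breakdown assumes an IPv4 underlay and an untagged inner frame, and the function name virtual_interface_mtu is hypothetical.

```python
# Bytes that VXLAN encapsulation adds inside the physical interface MTU
# (assumes an IPv4 underlay and no VLAN tag on the inner frame).
OUTER_IPV4_HEADER = 20
OUTER_UDP_HEADER = 8
VXLAN_HEADER = 8
INNER_ETHERNET_HEADER = 14

VXLAN_OVERHEAD = (OUTER_IPV4_HEADER + OUTER_UDP_HEADER
                  + VXLAN_HEADER + INNER_ETHERNET_HEADER)


def virtual_interface_mtu(physical_mtu: int) -> int:
    """Return the largest MTU for a virtual network interface on this underlay."""
    if physical_mtu <= VXLAN_OVERHEAD:
        raise ValueError("physical MTU is too small to carry VXLAN traffic")
    return physical_mtu - VXLAN_OVERHEAD


if __name__ == "__main__":
    print(VXLAN_OVERHEAD)               # 50
    print(virtual_interface_mtu(9216))  # 9166
    print(virtual_interface_mtu(1500))  # 1450
```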
Layer 3 and Layer 2 VNIs Cannot Share the Same ID
A layer 3 VNI and a layer 2 VNI cannot have the same ID. If the VNI IDs are the same, the layer 2 VNI does not get created.
VXLAN enables layer 2 segments to extend over an IP core (the underlay). The initial definition of VXLAN (RFC 7348) did not include any control plane and relied on a flood-and-learn approach for MAC address learning.
EVPN is a standards-based control plane for VXLAN defined in RFC 7432 and draft-ietf-bess-evpn-overlay that allows for building and deploying VXLANs at scale. It relies on multi-protocol BGP (MP-BGP) to exchange information and uses BGP-MPLS IP VPNs (RFC 4364). It enables not only bridging between end systems in the same layer 2 segment but also routing between different segments (subnets). There is also inherent support for multi-tenancy.
Cumulus Linux installs the routing control plane (including EVPN) as part of the FRR package. For more information about FRR, refer to FRRouting.
Key Features
Cumulus Linux fully supports EVPN as the control plane for VXLAN, including for both intra-subnet bridging and inter-subnet routing, and provides these key features:
VNI membership exchange between VTEPs using EVPN type-3 (Inclusive multicast Ethernet tag) routes.
Host MAC and IP address exchange using EVPN type-2 (MAC and IP advertisement) routes.
Host/VM mobility support (MAC and IP moves) through exchange of the MAC Mobility Extended community.
ECMP for overlay networks on NVIDIA Spectrum-A1 ASICs. ECMP occurs in the overlay when there are multiple next hops.
Head end replication is on by default.
Cumulus Linux supports the EVPN address family with both eBGP and iBGP peering. If you configure underlay routing with eBGP, you can use the same eBGP session to carry EVPN routes. In a typical 2-tier Clos network where the leafs are VTEPs, if you use eBGP sessions between the leafs and spines for underlay routing, the same sessions exchange EVPN routes. The spine switches act as route forwarders and do not install any forwarding state as they are not VTEPs. When the switch exchanges EVPN routes over iBGP peering, you can use OSPF as the IGP or resolve next hops using iBGP.
Cumulus Linux disables data plane MAC learning by default on VXLAN interfaces. Do not enable MAC learning on VXLAN interfaces: EVPN installs remote MACs.
Basic Configuration
The following sections provide the basic configuration needed to use EVPN as the control plane for VXLAN in a BGP-EVPN-based layer 2 extension deployment. For layer 3 multi-tenancy configuration, see Inter-subnet Routing. For additional EVPN configuration, see EVPN Enhancements.
Basic EVPN Configuration Commands
Basic configuration in a BGP-EVPN-based layer 2 extension deployment requires you to:
Configure VXLAN interfaces
Configure BGP
Activate the EVPN address family and enable EVPN between BGP neighbors
For a non-VTEP device that is only participating in EVPN route exchange, such as a spine switch where the network deployment uses hop-by-hop eBGP or the switch is acting as an iBGP route reflector, configuring VXLAN interfaces is not required.
Configure VXLAN Interfaces. The following example creates a single VXLAN interface (vxlan0), maps VLAN 10 to vni10 and VLAN 20 to vni20, adds the VXLAN device to the default bridge br_default, and sets the VXLAN local tunnel IP address to 10.10.10.1.
To create a traditional VXLAN device, where each VNI represents a separate device instead of a set of VNIs in a single device model, see VXLAN-Devices.
Configure BGP. The following example commands assign an ASN and router ID to leaf01 and spine01, specify the interfaces between the two BGP peers, and the prefixes to originate. For complete information on how to configure BGP, see Border Gateway Protocol - BGP.
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.1/32
cumulus@leaf01:~$ nv config apply
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.101/32
cumulus@spine01:~$ nv config apply
Activate the EVPN address family and enable EVPN between BGP neighbors. The following example commands enable EVPN between leaf01 and spine01:
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv config apply
With NVUE, you do not need to set the advertise-all-vni option to enable the BGP control plane for all VNIs configured on the switch. FRR is aware of any local VNIs and MACs, and of the hosts (neighbors) associated with those VNIs.
After you run nv config save, the NVUE commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:
Edit the /etc/network/interfaces file to create a single VXLAN device, attach it to a bridge, map the VLANs to the VNIs, and set the VXLAN local tunnel IP address. The example below creates a single VXLAN interface (vxlan0), maps VLAN 10 to vni10 and VLAN 20 to vni20, and sets the VXLAN local tunnel IP address to 10.10.10.1.
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
vxlan-local-tunnelip 10.10.10.1
...
auto vxlan0
iface vxlan0
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports swp1 swp2 vxlan0
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
To create a traditional VXLAN device, where each VNI represents a separate device instead of a set of VNIs in a single device model, see VXLAN-Devices.
Configure BGP with vtysh commands. The following example commands assign an ASN and router ID to leaf01 and spine01, specify the interfaces between the two BGP peers, and the prefixes to originate. For complete information on how to configure BGP, see Border Gateway Protocol - BGP.
Activate the EVPN address family and enable EVPN between BGP neighbors. The following example commands enable EVPN between leaf01 and spine01. The commands automatically provision all locally configured VNIs so the BGP control plane can advertise them.
You only need to set the advertise-all-vni option on leafs that are VTEPs. The switch accepts EVPN routes from a BGP peer even without this option. The routes are in the global EVPN routing table but Cumulus Linux only imports them into the per-VNI routing table and installs the appropriate entries in the kernel when the VNI corresponding to the received route is locally known.
EVPN and VXLAN Active-active Mode
For EVPN in VXLAN active-active mode, both switches in the MLAG pair establish EVPN peering with other EVPN speakers (for example, with spine switches if using hop-by-hop eBGP) and advertise their locally known VNIs and MACs. When MLAG is active, both switches announce this information with the shared anycast IP address.
For active-active configuration, make sure that:
The clagd-vxlan-anycast-ip and vxlan-local-tunnelip parameters are under the loopback stanza on both peers.
Both peers advertise the anycast address to the routed fabric.
The VNI configuration is identical on both peers.
The peerlink belongs to the bridge.
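The checklist above can be sketched as a small consistency check. This is only an illustration: the dictionary layout (`loopback`, `bridge_ports`, `peerlink`, `vlan_vni_map`) is a hypothetical representation of each peer's configuration, not a Cumulus Linux data format, and the check does not verify that the peers actually advertise the anycast address to the fabric.

```python
def check_active_active(peer_a: dict, peer_b: dict) -> list:
    """Return a list of problems; an empty list means the checks passed.

    Each peer is a hypothetical dict with the keys 'loopback',
    'bridge_ports', 'peerlink' and 'vlan_vni_map'.
    """
    problems = []
    for name, peer in (("peer A", peer_a), ("peer B", peer_b)):
        # Both parameters must be under the loopback stanza.
        for key in ("clagd-vxlan-anycast-ip", "vxlan-local-tunnelip"):
            if key not in peer["loopback"]:
                problems.append(f"{name}: {key} missing from the loopback stanza")
        # The peerlink must belong to the bridge.
        if peer["peerlink"] not in peer["bridge_ports"]:
            problems.append(f"{name}: peerlink is not a bridge port")
    # The anycast IP address is shared by both peers.
    anycast_a = peer_a["loopback"].get("clagd-vxlan-anycast-ip")
    anycast_b = peer_b["loopback"].get("clagd-vxlan-anycast-ip")
    if anycast_a != anycast_b:
        problems.append("anycast IP address differs between the peers")
    # The VNI configuration must be identical on both peers.
    if peer_a["vlan_vni_map"] != peer_b["vlan_vni_map"]:
        problems.append("VNI configuration differs between the peers")
    return problems
```

An empty result only means that these static checks pass; it does not replace verifying the MLAG and EVPN state on the switches.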
MLAG synchronizes information between the two switches in the MLAG pair; EVPN does not synchronize.
For type-5 routes in an EVPN symmetric configuration with VXLAN active-active mode, Cumulus Linux uses Primary IP Address Advertisement. For information on configuring Primary IP Address Advertisement, see Advertise Primary IP Address.
For information about active-active VTEPs and anycast IP behavior, and for failure scenarios, see VXLAN Active-active Mode.
Considerations
When you enable EVPN on a VTEP, the switch advertises all its locally defined VNIs and other information, such as MAC addresses, to EVPN peers. There is no provision to only announce certain VNIs.
You can only use ND suppression on Spectrum-A1 and above.
Cumulus Linux enables ARP suppression by default. However, in a VXLAN active-active configuration, if the switch does not suppress ARPs, the control plane does not synchronize neighbor entries between the two switches operating in active-active mode. You do not see any impact on forwarding.
You must configure the overlay (tenants) in a specific VRF and separate from the underlay, which resides in the default VRF. Cumulus Linux does not support layer 3 VNI mapping for the default VRF.
You cannot configure EVPN with Redistribute Neighbor. Enabling both features simultaneously causes instability in IPv4 and IPv6 neighbor entries.
To conform to RFC 6514, Cumulus Linux implements a stricter check on a received type-3 route to ensure that the PMSI attribute is ingress-replication.
When FRR learns about a local VNI and there is no explicit configuration for that VNI in FRR, the switch derives the RD and import and export RTs for this VNI automatically. The RD uses RouterId:VNI-Index and the import and export RTs use AS:VNI. For routes that come from a layer 2 VNI (type-2 and type-3), the RD uses the VXLAN local tunnel IP address (vxlan-local-tunnelip) from the layer 2 VNI interface instead of the RouterId (vxlan-local-tunnelip:VNI). EVPN route exchange uses the RD and RTs.
The RD disambiguates EVPN routes in different VNIs (they can have the same MAC and IP address) while the RTs describe the VPN membership for the route. The VNI-Index for the RD is a unique number that the switch generates. It only has local significance; on remote switches, its only role is for route disambiguation. The switch uses this number instead of the VNI value itself because this number has to be less than or equal to 65535. In the RT, the AS is always a 2-byte value to allow room for a large VNI. If the router has a 4-byte AS, it only uses the lower 2 bytes. This ensures a unique RT for different VNIs while having the same RT for the same VNI across routers in the same AS.
For eBGP EVPN peering, the peers are in a different AS so using an automatic RT of AS:VNI does not work for route import. Therefore, Cumulus Linux treats the import RT as *:VNI to determine which received routes apply to a particular VNI. This only applies when the switch auto-derives the import RT.
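The derivation rules described above can be illustrated with a short sketch. It only mirrors the formats stated in the text (RouterId:VNI-Index for the RD, AS:VNI for the RT using the lower 2 bytes of the AS, and an import RT treated as *:VNI); the function names are hypothetical and the sketch is not how FRR encodes these values internally.

```python
def derive_rd(router_id: str, vni_index: int) -> str:
    """Auto-derived RD in RouterId:VNI-Index form.

    The VNI-Index is a locally generated number that must fit in 2 bytes.
    """
    if not 0 <= vni_index <= 0xFFFF:
        raise ValueError("VNI-Index must be between 0 and 65535")
    return f"{router_id}:{vni_index}"


def derive_rt(asn: int, vni: int) -> str:
    """Auto-derived RT in AS:VNI form; only the lower 2 bytes of the AS are kept."""
    return f"{asn & 0xFFFF}:{vni}"


def import_matches(received_rt: str, local_vni: int) -> bool:
    """Treat the auto-derived import RT as *:VNI, so only the VNI part is compared."""
    _, _, vni_part = received_rt.rpartition(":")
    return vni_part == str(local_vni)


print(derive_rd("10.10.10.1", 3))        # 10.10.10.1:3
print(derive_rt(65101, 20))              # 65101:20
print(import_matches("65102:20", 20))    # True
```

The last call shows why the wildcard matters for eBGP: a route exported by a peer in AS 65102 still matches local VNI 20 even though the local AS is different.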
If you do not want to derive RDs and RTs automatically, you can define them manually. The following example commands are per VNI.
cumulus@leaf01:~$ nv set evpn vni 10 rd 10.10.10.1:20
cumulus@leaf01:~$ nv set evpn vni 10 route-target export 65101:10
cumulus@leaf01:~$ nv set evpn vni 10 route-target import 65102:10
cumulus@leaf01:~$ nv config apply
cumulus@leaf03:~$ nv set evpn vni 10 rd 10.10.10.3:20
cumulus@leaf03:~$ nv set evpn vni 10 route-target export 65102:10
cumulus@leaf03:~$ nv set evpn vni 10 route-target import 65101:10
cumulus@leaf03:~$ nv config apply
If you delete the RD or RT later, it reverts to its corresponding default value.
Route target auto derivation does not support 4-byte AS numbers; if the router has a 4-byte AS, you must define the RTs manually.
You can configure multiple RT values. In addition, you can configure both the import and export route targets with a single command by using route-target both:
cumulus@leaf01:~$ nv set evpn vni 10 route-target import 65102:10
cumulus@leaf01:~$ nv set evpn vni 10 route-target import 65102:20
cumulus@leaf01:~$ nv set evpn vni 20 route-target both 65101:10
cumulus@leaf01:~$ nv config apply
cumulus@leaf03:~$ nv set evpn vni 10 route-target import 65101:10
cumulus@leaf03:~$ nv set evpn vni 10 route-target import 65101:20
cumulus@leaf03:~$ nv set evpn vni 20 route-target both 65102:10
cumulus@leaf03:~$ nv config apply
Enable EVPN in an iBGP Environment with an OSPF Underlay
You can use EVPN with an OSPF or static route underlay. This is a more complex configuration than using eBGP. In this case, iBGP advertises EVPN routes directly between VTEPs and the spines are unaware of EVPN or BGP.
The leafs peer with each other in a full mesh within the EVPN address family without using route reflectors. The leafs generally peer to their loopback addresses, which OSPF advertises. The receiving VTEP imports routes into a specific VNI with a matching route target community.
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 10.10.10.2 remote-as internal
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 10.10.10.3 remote-as internal
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 10.10.10.4 remote-as internal
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 10.10.10.2 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 10.10.10.3 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 10.10.10.4 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router ospf router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router ospf area 0 network 10.10.10.1/32
cumulus@leaf01:~$ nv set interface lo router ospf passive on
cumulus@leaf01:~$ nv set interface swp49 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp50 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp51 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp52 router ospf area 0.0.0.0
cumulus@leaf01:~$ nv set interface swp49 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp50 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp51 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp52 router ospf network-type point-to-point
cumulus@leaf01:~$ nv config apply
After you run nv config save, the NVUE commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:
cumulus@leaf01:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    interface:
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        router:
          ospf:
            area: 0
            enable: on
            passive: on
        type: loopback
      swp49:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
            network-type: point-to-point
        type: swp
      swp50:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
            network-type: point-to-point
        type: swp
      swp51:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
            network-type: point-to-point
        type: swp
      swp52:
        router:
          ospf:
            area: 0.0.0.0
            enable: on
            network-type: point-to-point
        type: swp
    bridge:
      domain:
        br_default:
          multicast:
            snooping:
              enable: off
              querier:
                enable: on
    router:
      bgp:
        autonomous-system: 65101
        enable: on
        router-id: 10.10.10.1
      ospf:
        router-id: 10.10.10.1
        enable: on
    vrf:
      default:
        router:
          bgp:
            peer:
              10.10.10.2:
                remote-as: internal
                type: numbered
                address-family:
                  l2vpn-evpn:
                    enable: on
              10.10.10.3:
                remote-as: internal
                type: numbered
                address-family:
                  l2vpn-evpn:
                    enable: on
              10.10.10.4:
                remote-as: internal
                type: numbered
                address-family:
                  l2vpn-evpn:
                    enable: on
            enable: on
            address-family:
              l2vpn-evpn:
                enable: on
    evpn:
      enable: on
    nve:
      vxlan:
        enable: on
cumulus@leaf01:~$ sudo vtysh
...
leaf01# configure terminal
leaf01(config)# router bgp 65101
leaf01(config-router)# neighbor 10.10.10.2 remote-as internal
leaf01(config-router)# neighbor 10.10.10.3 remote-as internal
leaf01(config-router)# neighbor 10.10.10.4 remote-as internal
leaf01(config-router)# address-family l2vpn evpn
leaf01(config-router-af)# neighbor 10.10.10.2 activate
leaf01(config-router-af)# neighbor 10.10.10.3 activate
leaf01(config-router-af)# neighbor 10.10.10.4 activate
leaf01(config-router-af)# advertise-all-vni
leaf01(config-router-af)# exit
leaf01(config-router)# exit
leaf01(config)# router ospf
leaf01(config-router)# router-id 10.10.10.1
leaf01(config-router)# passive-interface lo
leaf01(config-router)# exit
leaf01(config)# interface lo
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# exit
leaf01(config)# interface swp49
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# exit
leaf01(config)# interface swp50
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# exit
leaf01(config)# interface swp51
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# exit
leaf01(config)# interface swp52
leaf01(config-if)# ip ospf area 0.0.0.0
leaf01(config-if)# ospf network point-to-point
leaf01(config-if)# end
leaf01# write memory
leaf01# exit
The vtysh commands create the following configuration snippet in the /etc/frr/frr.conf file.
...
interface lo
ip ospf area 0.0.0.0
!
interface swp49
ip ospf area 0.0.0.0
ip ospf network point-to-point
!
interface swp50
ip ospf area 0.0.0.0
ip ospf network point-to-point
!
interface swp51
ip ospf area 0.0.0.0
ip ospf network point-to-point
!
interface swp52
ip ospf area 0.0.0.0
ip ospf network point-to-point
!
router bgp 65101
neighbor 10.10.10.2 remote-as internal
neighbor 10.10.10.3 remote-as internal
neighbor 10.10.10.4 remote-as internal
!
address-family l2vpn evpn
neighbor 10.10.10.2 activate
neighbor 10.10.10.3 activate
neighbor 10.10.10.4 activate
advertise-all-vni
exit-address-family
!
router ospf
ospf router-id 10.10.10.1
passive-interface lo
...
ARP and ND Suppression
ARP suppression with EVPN allows a VTEP to suppress ARP flooding over VXLAN tunnels as much as possible. A local proxy handles ARP requests from locally attached hosts for remote hosts. ARP suppression is for IPv4; ND suppression is for IPv6.
Cumulus Linux enables ARP and ND suppression by default on all VNIs to reduce ARP and ND packet flooding over VXLAN tunnels; however, you must configure layer 3 interfaces (SVIs) for ARP and ND suppression to work with EVPN.
ARP and ND suppression only suppresses the flooding of known hosts. To disable all flooding, refer to the Disable BUM Flooding section.
NVIDIA recommends that you keep ARP and ND suppression enabled on all VXLAN interfaces on the switch. If you must disable suppression for a special use case, you cannot disable ARP and ND suppression on some VXLAN interfaces and keep it enabled on others.
In a centralized routing deployment, you must configure layer 3 interfaces even if you configure the switch only for layer 2 (you are not using VXLAN routing). To avoid installing unnecessary layer 3 information, you can turn off IP forwarding.
The following example commands turn off IPv4 and IPv6 forwarding on VLAN 10 and VLAN 20.
cumulus@leaf01:~$ nv set interface vlan10 ip ipv4 forward off
cumulus@leaf01:~$ nv set interface vlan10 ip ipv6 forward off
cumulus@leaf01:~$ nv set interface vlan20 ip ipv4 forward off
cumulus@leaf01:~$ nv set interface vlan20 ip ipv6 forward off
cumulus@leaf01:~$ nv config apply
Edit the /etc/network/interfaces file.
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan10
ip6-forward off
ip-forward off
vlan-id 10
vlan-raw-device bridge
auto vlan20
iface vlan20
ip6-forward off
ip-forward off
vlan-id 20
vlan-raw-device bridge
auto vni10
iface vni10
bridge-access 10
vxlan-id 10
bridge-learning off
auto vni20
iface vni20
bridge-access 20
vxlan-id 20
bridge-learning off
...
For a bridge in traditional mode, you must edit the bridge configuration in the /etc/network/interfaces file using a text editor:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto bridge1
iface bridge1
bridge-ports swp1.10 swp2.10 vni10
ip6-forward off
ip-forward off
...
When deploying EVPN and VXLAN using a hardware profile other than the default Forwarding Table Profile, ensure that the Linux kernel ARP sysctl settings gc_thresh2 and gc_thresh3 both have a value larger than the number of neighbor (ARP and ND) entries you expect in the deployment. To configure these settings, edit the /etc/sysctl.d/neigh.conf file, then reboot the switch. If your network has more hosts than the values in the example below, change the sysctl entries accordingly.
Keep ARP and ND suppression on to reduce ARP and ND packet flooding over VXLAN tunnels. However, if you need to disable ARP and ND suppression, follow the example commands below.
cumulus@leaf01:~$ nv set nve vxlan arp-nd-suppress off
cumulus@leaf01:~$ nv config apply
Edit the /etc/network/interfaces file to set bridge-arp-nd-suppress off on the VXLAN device, then run the ifreload -a command.
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30 4036=4002 4024=4001
bridge-learning off
bridge-arp-nd-suppress off
...
cumulus@leaf01:~$ sudo ifreload -a
The neighbor manager service relies on ARP and ND suppression to snoop on packets and update forwarding entries based on neighbor changes. If you disable suppression, you must enable the neighbor manager snooper manually:
Create the systemd override configuration file /etc/systemd/system/neighmgrd.service with the following content:
Reload the systemd unit configuration with the sudo systemctl daemon-reload command.
Restart the neighmgrd service with the sudo systemctl restart neighmgrd.service command.
Configure Static MAC Addresses
To pin a MAC address to a particular VTEP, configure it on that VTEP as a static bridge FDB entry. EVPN picks up these MAC addresses and advertises them to peers as remote static MACs. You configure static bridge FDB entries for MAC addresses under the bridge configuration:
Cumulus Linux does not provide NVUE commands for this configuration.
Edit the /etc/network/interfaces file. For example:
When you use EVPN with MLAG, EVPN might install local MAC addresses or neighbor entries as remote entries. To prevent EVPN from taking ownership of local MAC addresses or neighbor entries from MLAG, you can associate all local layer 2 VNIs with a unique site ID, which represents an MLAG pair.
When you configure a site ID, Cumulus Linux:
Adds a Site-of-Origin extended community encoded with the local site ID to EVPN routes that originate from local layer 2 VNIs. Cumulus Linux adds the Site-of-Origin extended community when creating the route.
Filters all received EVPN routes with a Site-of-Origin extended community that matches the local site ID. Cumulus Linux filters the routes when importing the routes from the global table to the layer 2 VNI or layer 3 VNI table.
The site ID is in the format <IPv4 address>:<2-byte Value>, where the IPv4 address is the anycast IP address (a virtual IP address for VXLAN data-path termination) and the 2-byte value is an integer between 0 and 65535. For example: 10.0.1.12:10
NVUE does not provide commands for this feature.
To configure a unique site ID, run the following vtysh commands:
NVIDIA recommends you do not configure a site ID on a standalone or multihoming VTEP.
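As an illustration of the site ID format and the filtering behavior described in this section, the following sketch validates a site ID string and drops routes whose Site-of-Origin matches the local site ID. The route dictionaries and the `soo` key are hypothetical and exist only for this example; they do not correspond to an FRR data structure.

```python
import ipaddress


def parse_site_id(site_id: str):
    """Split '<IPv4 address>:<2-byte value>' into (address, value).

    Raises ValueError if either part is malformed or out of range.
    """
    addr, sep, value = site_id.rpartition(":")
    if not sep:
        raise ValueError("site ID must be <IPv4 address>:<2-byte value>")
    ipaddress.IPv4Address(addr)  # raises a ValueError subclass if invalid
    number = int(value)
    if not 0 <= number <= 65535:
        raise ValueError("2-byte value must be between 0 and 65535")
    return addr, number


def filter_by_site(routes, local_site_id: str):
    """Keep only the routes whose Site-of-Origin differs from the local site ID."""
    parse_site_id(local_site_id)  # validate before filtering
    return [route for route in routes if route.get("soo") != local_site_id]


print(parse_site_id("10.0.1.12:10"))  # ('10.0.1.12', 10)
```

Routes without a Site-of-Origin community are kept, because only a matching community triggers the filter.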
Filter EVPN Routes
It is common to subdivide the data center into multiple pods with full host mobility within a pod but only do prefix-based routing across pods. You can achieve this by only exchanging EVPN type-5 routes across pods.
The following example commands configure EVPN to advertise type-5 routes:
cumulus@leaf01:~$ nv set router policy route-map map1 rule 10 match type ipv4
cumulus@leaf01:~$ nv set router policy route-map map1 rule 10 match evpn-route-type ip-prefix
cumulus@leaf01:~$ nv set router policy route-map map1 rule 10 action permit
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast route-export to-evpn route-map map1
cumulus@leaf01:~$ nv config apply
You must apply the route map for the configuration to take effect. See Route Maps for more information.
In many situations, it is also desirable to only exchange EVPN routes carrying a particular VXLAN ID.
For example, if data centers or pods within a data center only share certain tenants, you can use a route map to control the EVPN routes exchanged based on the VNI.
The following example configures a route map that only advertises EVPN routes from VNI 1000:
cumulus@switch:~$ nv set router policy route-map map1 rule 10 match evpn-vni 1000
cumulus@switch:~$ nv set router policy route-map map1 rule 10 action permit
cumulus@switch:~$ nv config apply
You can only match type-2 and type-5 routes based on VNI.
Advertise SVI IP Addresses
In a typical EVPN deployment, you reuse SVI IP addresses on VTEPs across multiple racks. However, if you use unique SVI IP addresses across multiple racks and you want the local SVI IP address to be reachable via remote VTEPs, you can enable the advertise SVI IP and MAC address option. This option advertises the SVI IP and MAC address as a type-2 route and eliminates the need for any flooding over VXLAN to reach the IP address from a remote VTEP or rack.
When you enable the advertise SVI IP and MAC address option, the anycast IP and MAC address pair is not advertised. Be sure not to enable both the advertise-svi-ip option and the advertise-default-gw option at the same time. (The advertise-default-gw option configures the gateway VTEPs to advertise their IP and MAC address. See Advertising the Default Gateway.)
Disable BUM Flooding
By default, the VTEP floods all broadcast, and unknown unicast and multicast packets (such as ARP, NS, or DHCP) it receives to all interfaces (except for the incoming interface) and to all VXLAN tunnel interfaces in the same broadcast domain. When the switch receives such packets on a VXLAN tunnel interface, it floods the packets to all interfaces in the packet’s broadcast domain.
You can disable BUM flooding over VXLAN tunnels so that EVPN does not advertise type-3 routes for each local VNI and stops taking action on received type-3 routes.
Disabling BUM flooding is useful in a deployment with a controller or orchestrator, where the switch is pre-provisioned and there is no need to flood any ARP, NS, or DHCP packets.
To show that BUM flooding is off, run the vtysh show bgp l2vpn evpn vni command or the net show bgp l2vpn evpn vni command. For example:
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show bgp l2vpn evpn vni
Advertise Gateway Macip: Disabled
Advertise SVI Macip: Disabled
Advertise All VNI flag: Enabled
BUM flooding: Disabled
Number of L2 VNIs: 3
Number of L3 VNIs: 2
Flags: * - Kernel
VNI Type RD Import RT Export RT Tenant VRF
* 20 L2 10.10.10.1:3 65101:20 65101:20 RED
* 30 L2 10.10.10.1:4 65101:30 65101:30 BLUE
* 10 L2 10.10.10.1:6 65101:10 65101:10 RED
* 4002 L3 10.1.30.2:2 65101:4002 65101:4002 BLUE
* 4001 L3 10.1.20.2:5 65101:4001 65101:4001 RED
Run the vtysh show bgp l2vpn evpn route type multicast command or the net show bgp l2vpn evpn route type multicast command to make sure there are no EVPN type-3 routes that originate locally.
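If you want to check the flooding state from a script, a minimal sketch is to scan the text of the show bgp l2vpn evpn vni output for the BUM flooding line. The function name is hypothetical, and the sketch assumes the output format shown in the example above; you would need to capture the output yourself, for example with vtysh -c.

```python
def bum_flooding_disabled(output: str) -> bool:
    """Return True if the 'BUM flooding' line in the output reads 'Disabled'."""
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "BUM flooding":
            return value.strip() == "Disabled"
    raise ValueError("no 'BUM flooding' line found in the output")


sample = """Advertise Gateway Macip: Disabled
Advertise SVI Macip: Disabled
Advertise All VNI flag: Enabled
BUM flooding: Disabled
Number of L2 VNIs: 3
"""
print(bum_flooding_disabled(sample))  # True
```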
Extended Mobility
Cumulus Linux supports scenarios where the IP to MAC binding for a host or virtual machine changes across the move. In addition to the simple mobility scenario where a host or virtual machine with a binding of IP1, MAC1 moves from one rack to another, Cumulus Linux supports additional scenarios where a host or virtual machine with a binding of IP1, MAC1 moves and takes on a new binding of IP2, MAC1 or IP1, MAC2. The EVPN protocol mechanism to handle extended mobility continues to use the MAC mobility extended community and is the same as the standard mobility procedures. Extended mobility defines how to compute the sequence number in this attribute when binding changes occur.
Extended mobility supports not only virtual machine moves, but also scenarios where one virtual machine shuts down and you provision another on a different rack that uses the IP address or the MAC address of the previous virtual machine. For example, in an EVPN deployment with OpenStack, where virtual machines for a tenant provision and shut down dynamically, a new virtual machine can use the same IP address as an earlier virtual machine but with a different MAC address.
To reuse the same distributed gateway on VLANs fabric wide, you can set the fabric-wide MAC address; see Change the VRR MAC address.
Cumulus Linux enables extended mobility by default.
To examine the sequence numbers for a host or virtual machine MAC address and IP address, run the vtysh show evpn mac vni <vni> mac <address> command or the net show evpn mac vni <vni> mac <address> command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show evpn mac vni 10100 mac 00:02:00:00:00:42
MAC: 00:02:00:00:00:42
Remote VTEP: 10.0.0.2
Local Seq: 0 Remote Seq: 3
Neighbors:
10.1.1.74 Active
switch# show evpn arp-cache vni 10100 ip 10.1.1.74
IP: 10.1.1.74
Type: local
State: active
MAC: 44:39:39:ff:00:24
Local Seq: 2 Remote Seq: 3
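To compare sequence numbers across several switches, you can extract them from the command output. The following sketch assumes the "Local Seq: N Remote Seq: M" layout shown in the example above; the function name is hypothetical.

```python
import re

# Matches the 'Local Seq: N Remote Seq: M' line from the example output.
SEQ_PATTERN = re.compile(r"Local Seq:\s*(\d+)\s+Remote Seq:\s*(\d+)")


def parse_sequences(output: str):
    """Return (local_seq, remote_seq) as integers, or None if the line is absent."""
    match = SEQ_PATTERN.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_sequences("Remote VTEP: 10.0.0.2\nLocal Seq: 0 Remote Seq: 3"))  # (0, 3)
```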
Duplicate Address Detection
Cumulus Linux can detect duplicate MAC and IPv4 or IPv6 addresses on hosts or virtual machines in a VXLAN-EVPN configuration. The Cumulus Linux switch (VTEP) considers a host MAC or IP address to be duplicate if the address moves across the network more than a certain number of times within a certain number of seconds (five moves within 180 seconds by default). In addition to legitimate host or VM mobility scenarios, address movement can occur when you configure IP addresses incorrectly on a host or when packet looping occurs in the network due to faulty configuration or behavior.
Cumulus Linux enables duplicate address detection by default, which triggers when:
Two hosts have the same MAC address (the host IP addresses are the same or different)
Two hosts have the same IP address but different MAC addresses
By default, when the switch detects a duplicate address, it flags the address as a duplicate and generates an error in syslog so that you can troubleshoot the reason and address the fault, then clear the duplicate address flag. The switch does not take any functional action on the address.
If the switch flags a MAC address as duplicate, it also flags all IP addresses associated with that MAC as duplicates. However, in an MLAG configuration, sometimes only one of the MLAG peers flags the associated IP addresses as duplicates.
In an MLAG configuration, MAC mobility detection runs independently on each switch in the MLAG pair. Depending on the sequence in which local learning or route withdrawal from the remote VTEP (or both) occurs, the MAC mobility counter for a type-2 route increments only on one of the switches in the MLAG pair. In rare cases, it is possible for neither VTEP to increment the MAC mobility counter for the type-2 prefix.
Duplicate address detection is not supported in an EVPN multihoming configuration.
When Does Duplicate Address Detection Trigger?
The VTEP that sees an address move from remote to local begins the detection process by starting a timer. Each VTEP runs duplicate address detection independently. Detection always starts with the first mobility event from remote to local. If the address is initially remote, the detection count can start with the first move for the address. If the address is initially local, the detection count starts only with the second or higher move for the address. If an address is undergoing a mobility event between remote VTEPs, duplicate detection does not start.
The following illustration shows VTEP-A, VTEP-B, and VTEP-C in an EVPN configuration. Duplicate address detection triggers on VTEP-A when there is a duplicate MAC address for two hosts attached to VTEP-A and VTEP-B. However, duplicate detection does not trigger on VTEP-A when mobility events occur between two remote VTEPs (VTEP-B and VTEP-C).
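The detection logic described above can be sketched as a counter with a time window. This is a simplified model, not the FRR implementation: the class and method names are hypothetical, the caller is assumed to report only moves that count toward detection (remote to local on this VTEP), and the address is assumed to be flagged when the count reaches the threshold within the window.

```python
class DuplicateDetector:
    """Simplified per-address model of duplicate address detection."""

    def __init__(self, max_moves: int = 5, window: float = 180.0):
        self.max_moves = max_moves  # default: five moves
        self.window = window        # default: 180 seconds
        self.start = None           # time at which the current window started
        self.count = 0              # moves seen in the current window
        self.duplicate = False

    def record_move(self, now: float) -> bool:
        """Record one qualifying move at time 'now' (seconds); return the flag."""
        if self.start is None or now - self.start > self.window:
            # First move, or the previous window expired: start a new window.
            self.start = now
            self.count = 0
        self.count += 1
        if self.count >= self.max_moves:
            self.duplicate = True
        return self.duplicate

    def clear(self) -> None:
        """Model clearing the duplicate flag after the fault is fixed."""
        self.start = None
        self.count = 0
        self.duplicate = False


detector = DuplicateDetector()
print([detector.record_move(t) for t in (0, 10, 20, 30, 40)])
# [False, False, False, False, True]
```

With moves spread further apart than the window allows, the counter restarts and the address is never flagged.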
Configure Duplicate Address Detection
You can configure the threshold for MAC and IP address moves. The maximum number of moves allowed can be between 2 and 1000 and the detection time interval can be between 2 and 1800 seconds.
The following example commands set the maximum number of address moves allowed to 10 and the duplicate address detection time interval to 1200 seconds.
cumulus@switch:~$ nv set evpn dad mac-move-threshold 10
cumulus@switch:~$ nv set evpn dad move-window 1200
cumulus@switch:~$ nv config apply
The following example shows the syslog message that Cumulus Linux generates when it detects a MAC address as a duplicate during a local update:
2018/11/06 18:55:29.463327 ZEBRA: [EC 4043309149] VNI 1001: MAC 00:01:02:03:04:11 detected as duplicate during local update, last VTEP 172.16.0.16
The following example shows the syslog message that Cumulus Linux generates when it detects an IP address as a duplicate during a remote update:
2018/11/09 22:47:15.071381 ZEBRA: [EC 4043309151] VNI 1002: MAC aa:22:aa:aa:aa:aa IP 10.0.0.9 detected as duplicate during remote update, from VTEP 172.16.0.16
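If you collect syslog centrally, you can extract the fields from these messages with a regular expression. The pattern below is a sketch that assumes the two message layouts shown above; other message variants might need adjustments. The function name is hypothetical.

```python
import re

# Matches both example messages: the IP field is optional, and the VTEP
# is introduced by 'last VTEP' (local update) or 'from VTEP' (remote update).
DUPLICATE_PATTERN = re.compile(
    r"VNI (?P<vni>\d+): MAC (?P<mac>[0-9a-fA-F:]{17})"
    r"(?: IP (?P<ip>\S+))? detected as duplicate during "
    r"(?P<update>local|remote) update, (?:last|from) VTEP (?P<vtep>\S+)"
)


def parse_duplicate_event(line: str):
    """Return a dict with vni, mac, ip, update and vtep, or None if no match."""
    match = DUPLICATE_PATTERN.search(line)
    if match is None:
        return None
    event = match.groupdict()
    event["vni"] = int(event["vni"])
    return event


line = ("2018/11/09 22:47:15.071381 ZEBRA: [EC 4043309151] VNI 1002: "
        "MAC aa:22:aa:aa:aa:aa IP 10.0.0.9 detected as duplicate during "
        "remote update, from VTEP 172.16.0.16")
print(parse_duplicate_event(line)["ip"])  # 10.0.0.9
```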
Freeze a Detected Duplicate Address
Cumulus Linux provides a freeze option that takes action on a detected duplicate address. You can freeze the address permanently (until you intervene) or for a defined amount of time, after which it clears automatically.
When you enable the freeze option and the switch detects a duplicate address:
If the switch learned the MAC or IP address from a remote VTEP at the time of the freeze, the forwarding information in the kernel and hardware does not update and stays in its prior state. The switch processes future remote updates but does not reflect them in the kernel entry. If the remote VTEP sends a MAC-IP route withdrawal, the local VTEP removes the frozen remote entry. Then, if the local VTEP already has a locally learned entry in its kernel, FRR originates a corresponding MAC-IP route and advertises it to all remote VTEPs.
If this VTEP learned the MAC or IP address locally at the time of the freeze, the switch does not advertise the address to remote VTEPs. The switch processes future local updates but does not advertise them to remote VTEPs. If FRR receives a local entry delete event, it removes the frozen entry from the FRR database. Any remote updates (from other VTEPs) change the state of the entry to remote, but the switch does not install the entry in the kernel until the freeze clears.
To recover from a freeze, shut down the faulty host or VM or fix any other misconfiguration in the network. If the address freezes permanently, run the clear command on the VTEP where the address is duplicate. If the address freezes for a defined period of time, it clears automatically after the timer expires (you can clear the duplicate address before the timer expires with the clear command).
If you run the clear command or the timer expires before you address the fault, duplicate address detection can continue to occur.
After you clear a frozen address, if it is present behind a remote VTEP, the kernel and hardware forwarding tables update. If this VTEP learns the address locally, the address advertises to remote VTEPs. All VTEPs get the correct address as soon as the host communicates. The switch only learns silent hosts after the faulty entries age out, or you intervene and clear the faulty MAC and ARP table entries.
Configure the Freeze Option
You can enable Cumulus Linux to freeze detected duplicate addresses. The duration can be any number of seconds between 30 and 3600.
The following example command freezes duplicate addresses for a period of 1000 seconds, after which it clears automatically:
cumulus@switch:~$ nv set evpn dad duplicate-action freeze duration 1000
cumulus@switch:~$ nv config apply
Set the freeze timer to be three times the duplicate address detection window. For example, if the duplicate address detection window is 180 seconds, set the freeze timer to 540 seconds.
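The sizing guidance above is simple arithmetic. The following Python sketch is illustrative only: the function name recommended_freeze_duration and the constants are hypothetical and are not part of Cumulus Linux. It multiplies the detection window by three and checks the result against the 30 to 3600 second range that the freeze duration accepts.

```python
# Hypothetical helper, not a Cumulus Linux API: derive a freeze duration
# from the duplicate address detection window (three times the window).
FREEZE_MIN_SECONDS = 30
FREEZE_MAX_SECONDS = 3600


def recommended_freeze_duration(detection_window_seconds: int) -> int:
    """Return three times the detection window, validated against the allowed range."""
    duration = 3 * detection_window_seconds
    if not FREEZE_MIN_SECONDS <= duration <= FREEZE_MAX_SECONDS:
        raise ValueError(
            f"freeze duration {duration}s is outside "
            f"{FREEZE_MIN_SECONDS}-{FREEZE_MAX_SECONDS}s"
        )
    return duration


if __name__ == "__main__":
    # A 180 second window gives a 540 second freeze timer.
    print(recommended_freeze_duration(180))
```

You can pass the resulting value as the duration in the nv set evpn dad duplicate-action freeze duration command.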
The following example command freezes duplicate addresses permanently (until you run the clear command):
cumulus@switch:~$ nv set evpn dad duplicate-action freeze duration permanent
cumulus@switch:~$ nv config apply
In an MLAG configuration, you need to run the clear command on both the MLAG primary and secondary switch.
When you clear a duplicate MAC address, all its associated IP addresses also clear. However, you cannot clear an associated IP address if its MAC address is still in a duplicate state.
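The relationship between clearing a MAC address and clearing its IP addresses can be pictured with a toy model. The Python sketch below is an assumption-laden illustration, not how FRR stores its state: the class DuplicateTable, its method names, and its fields are all hypothetical.

```python
# Toy model of the clear rules described above. All names are hypothetical.
class DuplicateTable:
    def __init__(self):
        self.duplicate_macs = set()   # MAC addresses currently marked duplicate
        self.duplicate_ips = {}       # duplicate IP address -> associated MAC

    def clear_mac(self, mac: str) -> None:
        """Clearing a MAC also clears every IP address associated with it."""
        self.duplicate_macs.discard(mac)
        for ip in [ip for ip, owner in self.duplicate_ips.items() if owner == mac]:
            del self.duplicate_ips[ip]

    def clear_ip(self, ip: str) -> bool:
        """Refuse to clear an IP while its MAC is still marked duplicate."""
        if self.duplicate_ips.get(ip) in self.duplicate_macs:
            return False
        self.duplicate_ips.pop(ip, None)
        return True


if __name__ == "__main__":
    table = DuplicateTable()
    table.duplicate_macs.add("00:01:02:03:04:11")
    table.duplicate_ips["10.0.0.9"] = "00:01:02:03:04:11"
    print(table.clear_ip("10.0.0.9"))    # refused while the MAC is duplicate
    table.clear_mac("00:01:02:03:04:11")
    print(table.duplicate_ips)           # the associated IP is gone too
```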
Disable Duplicate Address Detection
Duplicate address detection is on by default. The switch generates a syslog error when it detects a duplicate address. To disable duplicate address detection, run the following command.
cumulus@switch:~$ nv set evpn dad enable off
cumulus@switch:~$ nv config apply
When you disable duplicate address detection, Cumulus Linux clears the configuration and all existing duplicate addresses.
Show Detected Duplicate Address Information
During the duplicate address detection process, you can see the start time and current detection count with the vtysh show evpn mac vni <vni_id> mac <mac_addr> command. The following command example shows that detection starts for MAC address 00:01:02:03:04:11 for VNI 1001 on Tuesday, Nov 6 at 18:55:05 and Cumulus Linux detects one move.
cumulus@switch:~$ sudo vtysh
...
switch# show evpn mac vni 1001 mac 00:01:02:03:04:11
MAC: 00:01:02:03:04:11
Intf: hostbond3(15) VLAN: 1001
Local Seq: 1 Remote Seq: 0
Duplicate detection started at Tue Nov 6 18:55:05 2018, detection count 1
Neighbors:
10.0.1.26 Active
After the duplicate MAC address clears, the vtysh show evpn mac vni <vni_id> mac <mac_addr> command shows:
MAC: 00:01:02:03:04:11
Remote VTEP: 172.16.0.16
Local Seq: 13 Remote Seq: 14
Duplicate, detected at Tue Nov 6 18:55:29 2018
Neighbors:
10.0.1.26 Active
To display information for a duplicate IP address, run the vtysh show evpn arp-cache vni <vni_id> ip <ip_addr> command. The following command example shows information for IP address 10.0.0.9 for VNI 1001.
cumulus@switch:~$ sudo vtysh
...
switch# show evpn arp-cache vni 1001 ip 10.0.0.9
IP: 10.0.0.9
Type: remote
State: inactive
MAC: 00:01:02:03:04:11
Remote VTEP: 10.0.0.34
Local Seq: 0 Remote Seq: 14
Duplicate, detected at Tue Nov 6 18:55:29 2018
To show a list of MAC addresses detected as duplicate for a specific VNI or for all VNIs, run the vtysh show evpn mac vni <vni-id|all> duplicate command or the net show evpn mac vni <vni-id|all> duplicate command. The following example command shows a list of duplicate MAC addresses for VNI 1001:
cumulus@switch:~$ sudo vtysh
...
switch# show evpn mac vni 1001 duplicate
Number of MACs (local and remote) known for this VNI: 16
MAC Type Intf/Remote VTEP VLAN
aa:bb:cc:dd:ee:ff local hostbond3 1001
To show a list of IP addresses detected as duplicate for a specific VNI or for all VNIs, run the vtysh show evpn arp-cache vni <vni-id|all> duplicate command or the net show evpn arp-cache vni <vni-id|all> duplicate command. The following example command shows a list of duplicate IP addresses for VNI 1001:
cumulus@switch:~$ sudo vtysh
...
switch# show evpn arp-cache vni 1001 duplicate
Number of ARPs (local and remote) known for this VNI: 20
IP Type State MAC Remote VTEP
10.0.0.8 local active aa:11:aa:aa:aa:aa
10.0.0.9 local active aa:11:aa:aa:aa:aa
10.10.0.12 remote active aa:22:aa:aa:aa:aa 172.16.0.16
To show configured duplicate address detection parameters, run the vtysh show evpn command or the net show evpn command:
cumulus@switch:~$ sudo vtysh
...
switch# show evpn
L2 VNIs: 4
L3 VNIs: 2
Advertise gateway mac-ip: No
Duplicate address detection: Enable
Detection max-moves 7, time 300
Detection freeze permanent
Inter-subnet Routing
EVPN includes multiple models for routing between different subnets (VLANs), also known as inter-VLAN routing. The model you choose depends on whether every VTEP acts as a layer 3 gateway and performs routing or only specific VTEPs perform routing, and on whether routing occurs only at the ingress of the VXLAN tunnel or at both the ingress and the egress.
Cumulus Linux supports these models:
Centralized routing: Specific VTEPs act as designated layer 3 gateways and perform routing between subnets; other VTEPs just perform bridging.
Distributed asymmetric routing: Every VTEP participates in routing, but all routing is done at the ingress VTEP; the egress VTEP only performs bridging.
Distributed symmetric routing: Every VTEP participates in routing and performs routing at both the ingress VTEP and the egress VTEP.
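As a compact way to compare the three models, the following Python sketch records which VTEP roles route in each one. It is only a summary of the list above; the names ROUTING_POINTS and routing_vteps are hypothetical.

```python
# Illustrative summary of the three models; the dictionary and function
# names are hypothetical and do not correspond to any Cumulus Linux API.
ROUTING_POINTS = {
    "centralized": ("gateway",),          # only designated gateway VTEPs route
    "asymmetric": ("ingress",),           # ingress routes, egress bridges
    "symmetric": ("ingress", "egress"),   # both ends route
}


def routing_vteps(model: str) -> tuple:
    """Return the VTEP roles that route (rather than bridge) in a model."""
    try:
        return ROUTING_POINTS[model]
    except KeyError:
        raise ValueError(f"unknown routing model: {model!r}") from None


if __name__ == "__main__":
    for name in ROUTING_POINTS:
        print(name, "->", ", ".join(routing_vteps(name)))
```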
You typically deploy distributed routing with the VTEPs that have an anycast IP and MAC address for each subnet; each VTEP that has a particular subnet has the same IP/MAC for that subnet. Such a model facilitates easy host and VM mobility as there is no need to change the host or VM configuration when it moves from one VTEP to another.
All routing occurs in the context of a tenant VRF (virtual routing and forwarding). Cumulus Linux provisions a VRF instance for each tenant and associates the subnets of the tenant with that VRF (the corresponding SVI attaches to the VRF). Inter-subnet routing for each tenant occurs within the context of the VRF for that tenant and is separate from the routing for other tenants.
Centralized Routing
In centralized routing, you configure a specific VTEP to act as the default gateway for all the hosts in a particular subnet throughout the EVPN fabric. It is common to provision a pair of VTEPs in active-active mode as the default gateway using an anycast IP and MAC address for each subnet. You need to configure all subnets on such a gateway VTEP. When a host in one subnet wants to communicate with a host in another subnet, it addresses the packets to the gateway VTEP. The ingress VTEP (to which the source host attaches) bridges the packets to the gateway VTEP over the corresponding VXLAN tunnel. The gateway VTEP routes to the destination host and, post-routing, the packet bridges to the egress VTEP (to which the destination host attaches). The egress VTEP then bridges the packet on to the destination host.
To enable centralized routing, you must configure the gateway VTEPs to advertise their IP and MAC address.
cumulus@leaf01:~$ nv set evpn route-advertise default-gateway on
cumulus@leaf01:~$ nv config apply
You can deploy centralized routing at the VNI level, where you configure the advertise-default-gw command per VNI; you then use centralized routing for certain VNIs and distributed symmetric routing (described below) for other VNIs. NVIDIA does not recommend this type of configuration.
When you use centralized routing, even if the source host and destination host attach to the same VTEP, the packets travel to the gateway VTEP, the switch routes the packets, then the packets come back.
Asymmetric Routing
In distributed asymmetric routing, each VTEP acts as a layer 3 gateway, performing routing for its attached hosts. The routing is called asymmetric because only the ingress VTEP performs routing; the egress VTEP only performs bridging. You can achieve asymmetric routing with only host routing, which does not involve any interconnecting VNIs. However, you must provision each VTEP with all VLANs and corresponding VNIs (the subnets between which communication takes place); this is required even if there are no locally attached hosts for a particular VLAN.
The only additional configuration required to implement asymmetric routing, beyond the standard configuration for a layer 2 VTEP described above, is to ensure that each VTEP has all VLANs (and corresponding VNIs) provisioned and that the SVI for each VLAN is configured with an anycast IP and MAC address.
NVIDIA recommends you use symmetric or centralized routing instead of asymmetric routing.
Asymmetric routing does not support EVPN multihoming.
Symmetric Routing
In distributed symmetric routing, each VTEP acts as a layer 3 gateway, performing routing for its attached hosts; however, both the ingress VTEP and egress VTEP route the packets (similar to the traditional routing behavior of routing to a next hop router). In the VXLAN encapsulated packet, the inner destination MAC address is the router MAC address of the egress VTEP to indicate that the egress VTEP is the next hop and also needs to perform routing. All routing happens in the context of a tenant (VRF). For a packet that the ingress VTEP receives from a locally attached host, the SVI interface corresponding to the VLAN determines the VRF. For a packet that the egress VTEP receives over the VXLAN tunnel, the VNI in the packet has to specify the VRF. For symmetric routing, this is a VNI corresponding to the tenant and is different from either the source VNI or the destination VNI. This VNI is a layer 3 VNI or interconnecting VNI. The regular VNI, which maps a VLAN, is the layer 2 VNI.
Cumulus Linux supports symmetric routing on NVIDIA Spectrum-A1 and later.
Cumulus Linux uses a one-to-one mapping between a layer 3 VNI and a tenant (VRF).
The VRF to layer 3 VNI mapping has to be consistent across all VTEPs.
A layer 3 VNI and a layer 2 VNI cannot have the same ID. If the VNI IDs are the same, Cumulus Linux does not create the layer 2 VNI.
In an MLAG configuration, the SVI for the layer 3 VNI cannot be part of the bridge. This ensures that the switch does not forward traffic tagged with that VLAN ID on the peer link or other trunks.
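Several of the rules above can be checked before you push a configuration. The following Python sketch is a hypothetical pre-deployment check, not a Cumulus Linux tool: the function check_vni_plan and its input shapes are assumptions made for illustration. It flags a layer 3 VNI that reuses a layer 2 VNI ID, a layer 3 VNI shared by two VRFs, and a VRF to layer 3 VNI mapping that differs between VTEPs.

```python
# Hypothetical planning check; the data layout is an assumption for illustration.
def check_vni_plan(l2_vnis, vrf_to_l3vni_by_vtep):
    """Return a list of human-readable problems found in a VNI plan.

    l2_vnis: set of layer 2 VNI IDs used in the fabric.
    vrf_to_l3vni_by_vtep: {vtep_name: {vrf_name: l3_vni_id}}.
    """
    problems = []
    reference_vtep = None
    reference_map = None
    for vtep in sorted(vrf_to_l3vni_by_vtep):
        mapping = vrf_to_l3vni_by_vtep[vtep]
        owner_by_vni = {}
        for vrf, vni in sorted(mapping.items()):
            # One-to-one mapping between a layer 3 VNI and a tenant VRF.
            if vni in owner_by_vni:
                problems.append(
                    f"{vtep}: L3 VNI {vni} maps to both {owner_by_vni[vni]} and {vrf}"
                )
            owner_by_vni[vni] = vrf
            # A layer 3 VNI and a layer 2 VNI cannot have the same ID.
            if vni in l2_vnis:
                problems.append(f"{vtep}: L3 VNI {vni} for {vrf} is also an L2 VNI")
        # The VRF to layer 3 VNI mapping must be the same on every VTEP.
        if reference_map is None:
            reference_vtep, reference_map = vtep, mapping
        elif mapping != reference_map:
            problems.append(f"{vtep}: VRF to L3 VNI mapping differs from {reference_vtep}")
    return problems


if __name__ == "__main__":
    plan = {
        "leaf01": {"RED": 4001, "BLUE": 4002},
        "leaf02": {"RED": 4001, "BLUE": 4002},
    }
    print(check_vni_plan({10, 20, 30}, plan))
```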
In an EVPN symmetric routing configuration, when the switch announces a type-2 (MAC/IP) route, in addition to containing two VNIs (the layer 2 VNI and the layer 3 VNI), the route also contains separate RTs for layer 2 and layer 3. The layer 3 RT associates the route with the tenant VRF. By default, this is auto-derived using the layer 3 VNI instead of the layer 2 VNI; however you can also configure it.
Specify the VRF to layer 3 VNI mapping. This configuration is for the BGP control plane.
cumulus@leaf01:~$ nv set vrf RED evpn vni 4001
cumulus@leaf01:~$ nv config apply
When you run the nv set vrf RED evpn vni 4001 command, NVUE:
Creates a layer 3 VNI called vni4001
Assigns the vni4001 a VLAN automatically from the reserved VLAN range and adds _l3 (layer 3) at the end (for example vlan220_l3)
Creates a layer 3 bridge called br_l3vni
Adds vni4001 to the br_l3vni bridge
Assigns the automatically created VLAN interface (vlan220_l3 in this example) to VRF RED
cumulus@leaf01:~$ sudo cat /etc/network/interfaces
...
auto vni4001
iface vni4001
bridge-access 220
bridge-learning off
vxlan-id 4001
auto vlan220_l3
iface vlan220_l3
vrf RED
vlan-raw-device br_l3vni
address-virtual 44:38:39:BE:EF:AA
vlan-id 220
...
Configure a per-tenant VXLAN interface that specifies the layer 3 VNI for the tenant. This VXLAN interface is part of the bridge and the router MAC address of the remote VTEP installs over this interface.
Edit the /etc/network/interfaces file. For example:
Configure an SVI (layer 3 interface) corresponding to the per-tenant VXLAN interface. This attaches to the VRF of the tenant. Remote host routes for symmetric routing install over this SVI.
Edit the /etc/network/interfaces file. For example:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto vlan10
iface vlan10
vlan-id 10
vlan-raw-device bridge
vrf RED
...
Specify the VRF to layer 3 VNI mapping. This configuration is for the BGP control plane.
Do not add the layer 3 VNI VLAN IDs to the bridge vids list in the layer 2 bridge configuration.
When two VTEPs are operating in VXLAN active-active mode and performing symmetric routing, you need to configure the router MAC corresponding to each layer 3 VNI to ensure both VTEPs use the same MAC address. Specify the address-virtual (MAC address) for the SVI corresponding to the layer 3 VNI. Use the same address on both switches in the MLAG pair. Use the MLAG system MAC address. See Advertise Primary IP Address.
Configure RD and RTs for the Tenant VRF
If you do not want Cumulus Linux to derive the RD and RTs (layer 3 RTs) for the tenant VRF automatically, you can configure them manually by specifying them under the l2vpn evpn address family for that specific VRF.
You can configure the RD, the RTs to attach to host or prefix routes when exporting them into EVPN, and the RTs to match when importing routes from EVPN into the VRF.
The tenant VRF RD and RTs are different from the RD and RTs for the layer 2 VNI. To define the RD and RTs for the layer 2 VNI, see Define RDs and RTs.
cumulus@leaf01:~$ nv set vrf RED router bgp rd 10.1.20.2:5
cumulus@leaf01:~$ nv set vrf RED router bgp route-import from-evpn route-target 65102:4001
cumulus@leaf01:~$ nv set vrf RED router bgp route-export to-evpn route-target 65101:4002
cumulus@leaf01:~$ nv config apply
Symmetric routing presents a problem in the presence of silent hosts. If the ingress VTEP does not have the destination subnet and the host route does not advertise for the destination host, the ingress VTEP cannot route the packet to its destination. You can overcome this problem by having VTEPs announce the subnet prefixes corresponding to their connected subnets in addition to announcing host routes. Cumulus Linux announces these routes as EVPN prefix (type-5) routes.
Ensure that the routes corresponding to the connected subnets are in the BGP VRF routing table by injecting them using the network command or redistributing them using the redistribute connected command.
Use this configuration only if you have silent hosts and only on one VTEP per subnet (or two for redundancy).
Prefix-based Routing
EVPN in Cumulus Linux supports prefix-based routing using EVPN type-5 (prefix) routes. Type-5 routes (or prefix routes) primarily route to destinations outside of the data center fabric.
EVPN prefix routes carry the layer 3 VNI and router MAC address and follow the symmetric routing model to route to the destination prefix.
When connecting to a WAN edge router to reach destinations outside the data center, deploy specific border or exit leaf switches to originate the type-5 routes.
On switches with Spectrum ASICs, centralized routing, symmetric routing, and prefix-based routing only work with Spectrum-A1 and later.
Install EVPN Type-5 Routes
For a switch to install EVPN type-5 routes into the routing table, you must configure layer 3 VNI related information. This configuration is the same as for symmetric routing. You need to:
Configure a per-tenant VXLAN interface that specifies the layer 3 VNI for the tenant. This VXLAN interface is part of the bridge; router MAC addresses of remote VTEPs install over this interface.
Configure an SVI (layer 3 interface) corresponding to the per-tenant VXLAN interface. This attaches to the VRF of the tenant. The remote prefix routes install over this SVI.
Specify the mapping of the VRF to layer 3 VNI. This configuration is for the BGP control plane.
Announce EVPN Type-5 Routes
The tenant VRF requires the following configuration to announce IP prefixes in the BGP RIB as EVPN type-5 routes.
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn enable on
cumulus@leaf01:~$ nv config apply
The commands create the following snippet in the /etc/frr/frr.conf file:
...
router bgp 65101 vrf RED
address-family l2vpn evpn
advertise ipv4 unicast
exit-address-family
end
...
Control RIB Routes
By default, when announcing IP prefixes in the BGP RIB as EVPN type-5 routes, the switch selects all routes in the BGP RIB to advertise as EVPN type-5 routes. You can use a route map to allow selective route advertisement from the BGP RIB.
The following commands add a route map filter to IPv4 EVPN type-5 route advertisement:
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn route-map map1
cumulus@leaf01:~$ nv config apply
Cumulus Linux supports originating EVPN default type-5 routes. The default type-5 route originates from a border (exit) leaf and advertises to all the other leafs within the pod. Any leaf within the pod follows the default route towards the border leaf for all external traffic (towards the Internet or a different pod).
To originate a default type-5 route in EVPN:
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn default-route-origination on
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv6-unicast route-export to-evpn default-route-origination on
cumulus@leaf01:~$ nv config apply
Advertise Primary IP address (VXLAN Active-Active Mode)
In EVPN symmetric routing configurations with VXLAN active-active (MLAG), all EVPN routes advertise with the anycast IP address as the next hop IP address and the anycast MAC address as the router MAC address. In a failure scenario, the switch might forward traffic to a leaf switch that does not have the destination routes. To prevent dropped traffic in this failure scenario, Cumulus Linux enables the Advertise Primary IP address feature by default so that the switch handles the next hop IP address of the VTEP conditionally depending on the route type: host type-2 (MAC/IP advertisement) or type-5 (IP prefix route).
For host type-2 routes, the anycast IP address is the next hop IP address and the anycast MAC address is the router MAC address.
For type-5 routes, the system IP address (the unique primary loopback IP address of the VTEP) is the next hop IP address and the unique router MAC address of the VTEP is the router MAC address.
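The conditional next hop handling described above can be sketched in a few lines of Python. This is a simplified illustration and not FRR code: the class VtepAddresses, the function next_hop_for, and the sample addresses are hypothetical. The fallback when the feature is off follows the statement that all EVPN routes otherwise advertise with the anycast IP and MAC address.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class VtepAddresses:
    """Addresses a VTEP in an MLAG pair can advertise (sample values only)."""
    anycast_ip: str
    anycast_mac: str
    system_ip: str
    system_mac: str


def next_hop_for(route_type: int, vtep: VtepAddresses,
                 advertise_primary_ip: bool = True) -> Tuple[str, str]:
    """Return (next hop IP, router MAC) for an EVPN type-2 or type-5 route."""
    if route_type not in (2, 5):
        raise ValueError("only type-2 and type-5 routes are modeled here")
    if route_type == 5 and advertise_primary_ip:
        # Type-5 routes use the unique system IP and router MAC of the VTEP.
        return vtep.system_ip, vtep.system_mac
    # Type-2 routes, and all routes when the feature is off, use the anycast pair.
    return vtep.anycast_ip, vtep.anycast_mac


if __name__ == "__main__":
    leaf = VtepAddresses("10.0.1.12", "44:38:39:ff:00:ff",
                         "10.10.10.1", "44:38:39:22:01:7a")
    print(next_hop_for(2, leaf))
    print(next_hop_for(5, leaf))
```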
You set the anycast MAC address on both switches in the MLAG pair.
NVUE provides two commands to set the anycast MAC address globally. You can either:
Set the anycast MAC address to a value in the reserved range between 44:38:39:ff:00:00 and 44:38:39:ff:ff:ff. Be sure to use an address in this reserved range to prevent MAC address conflicts with other interfaces in the same bridged network.
Set an anycast MAC ID, from which Cumulus Linux derives the MAC address. You can specify a number between 1 and 65535. Cumulus Linux converts the number to hexadecimal and adds it to the MAC address 44:38:39:ff:00:00. For example, if you specify 255, the anycast MAC address is 44:38:39:ff:00:ff.
If you use Linux commands to configure the switch instead of NVUE, add the address-virtual <anycast-mac> option under every VLAN interface in the /etc/network/interfaces file. Cumulus Linux does not provide a global anycast MAC address or MAC ID option in the /etc/network/interfaces file.
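The derivation of the anycast MAC address from an anycast MAC ID is an addition on the base address 44:38:39:ff:00:00. The following Python sketch reproduces that arithmetic for illustration; the function name anycast_mac_from_id is hypothetical and the code is not part of NVUE.

```python
# Base of the reserved anycast MAC range, written as a 48-bit integer.
ANYCAST_MAC_BASE = 0x443839FF0000


def anycast_mac_from_id(anycast_id: int) -> str:
    """Add the ID (1-65535) to 44:38:39:ff:00:00 and format the result."""
    if not 1 <= anycast_id <= 65535:
        raise ValueError("anycast MAC ID must be between 1 and 65535")
    raw = f"{ANYCAST_MAC_BASE + anycast_id:012x}"
    # Split the 12 hex digits into six colon-separated octets.
    return ":".join(raw[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    print(anycast_mac_from_id(255))    # 44:38:39:ff:00:ff
    print(anycast_mac_from_id(65535))  # 44:38:39:ff:ff:ff
```

Because the ID is limited to 16 bits, every derived address stays inside the reserved range 44:38:39:ff:00:00 to 44:38:39:ff:ff:ff.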
To set the anycast MAC address:
cumulus@leaf01:~$ nv set system global anycast-mac 44:38:39:ff:00:ff
cumulus@leaf01:~$ nv config apply
To set the anycast MAC ID:
cumulus@leaf01:~$ nv set system global anycast-id 255
cumulus@leaf01:~$ nv config apply
Edit the /etc/network/interfaces file and add address-virtual <anycast-mac> under each VLAN interface. For example:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto vlan4001
iface vlan4001
address-virtual 44:38:39:BE:EF:AA
vrf RED
vlan-raw-device bridge
vlan-id 4001
...
The anycast MAC address is different from the fabric-wide VRR MAC address, which distributes the same VRR gateway on VLAN interfaces across switches fabric-wide. The following diagram shows the relationship between the anycast MAC address or ID, which is unique for each active-active pair, and the fabric MAC address or ID, which is consistent across the entire fabric.
When configuring third party networking devices using MLAG and EVPN for interoperability, you must configure and announce a single shared router MAC value for each advertised next hop IP address.
Disable Advertise Primary IP Address
Each switch in the MLAG pair advertises type-5 routes with its own system IP, which creates an additional next hop at the remote VTEPs. In a large multi-tenancy EVPN deployment, where additional resources are a concern, you can disable this feature.
To show Advertise Primary IP Address parameters, run the vtysh show bgp l2vpn evpn vni <vni> command or the net show bgp l2vpn evpn vni <vni> command. For example:
To show EVPN routes with Primary IP Advertisement, run the vtysh show bgp l2vpn evpn route command or the net show bgp l2vpn evpn route command. For example:
To show the learned route from an external router injected as a type-5 route, run the vtysh show bgp vrf <vrf> ipv4 unicast command or the net show bgp vrf <vrf> ipv4 unicast command.
Downstream VNI
Downstream VNI (symmetric EVPN route leaking) enables you to assign a VNI from a downstream remote VTEP through EVPN routes instead of configuring layer 3 VNIs globally across the network.
To configure a downstream VNI, you configure tenant VRFs as usual; however, to configure the desired route leaking, you define a route target import statement, export statement, or both.
Configure Route Targets
The route target import or export statement is in the format route-target import|export <asn>:<vni>; for example, route-target import 65101:6000. For route target import statements, you can use route-target import ANY:<vni> for NVUE commands or route-target import *:<vni> in the /etc/frr/frr.conf file. ANY in NVUE commands or the asterisk (*) in the /etc/frr/frr.conf file uses any ASN as a wildcard.
The NVUE commands are as follows:
To configure a route import statement: nv set vrf <vrf> router bgp route-import from-evpn route-target <asn>:<vni>
To configure a route export statement: nv set vrf <vrf> router bgp route-export to-evpn route-target <asn>:<vni>
EVPN symmetric mode supports downstream VNI with layer 3 VNIs and single VXLAN devices only.
You can configure multiple import and export route targets in a VRF.
You can configure selective route targets for individual prefixes with routing policies.
You cannot leak (import) overlapping tenant prefixes into the same destination VRF.
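To make the <asn>:<vni> format and the wildcard concrete, the following Python sketch parses a route target and decides whether an import statement matches the route target carried by a route. It is a simplified illustration, not the FRR matching code: the function names parse_route_target and import_matches are hypothetical, and only the plain <asn>:<vni> form shown above is handled.

```python
# Hypothetical helpers for illustration; they handle only the <asn>:<vni> form.
def parse_route_target(text: str):
    """Return (asn, vni); asn is None for the ANY or * wildcard."""
    asn, sep, vni = text.partition(":")
    if not sep or not vni.isdigit():
        raise ValueError(f"expected <asn>:<vni>, got {text!r}")
    if asn in ("ANY", "*"):
        return None, int(vni)
    if not asn.isdigit():
        raise ValueError(f"invalid ASN in {text!r}")
    return int(asn), int(vni)


def import_matches(import_rt: str, route_rt: str) -> bool:
    """Check whether an import statement accepts a route's route target."""
    import_asn, import_vni = parse_route_target(import_rt)
    route_asn, route_vni = parse_route_target(route_rt)
    if route_asn is None:
        raise ValueError("a route target attached to a route cannot be a wildcard")
    # The VNI part must match; the ASN part must match unless it is a wildcard.
    return import_vni == route_vni and import_asn in (None, route_asn)


if __name__ == "__main__":
    print(import_matches("65163:6000", "65163:6000"))
    print(import_matches("ANY:6000", "65101:6000"))
    print(import_matches("65163:6000", "65101:6000"))
```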
The following example shows a configuration with downstream VNI on leaf01 through leaf04 and border01.
Traffic Flow between VRF RED and VRF10
server01 forwards traffic to leaf01.
leaf01 encapsulates the packet with the VNI in its route-target import statement (6000) and tunnels the traffic over to border01.
border01 uses the VNI received from leaf01 to forward the packet.
The reverse traffic from border01 to server01 is encapsulated with the VNI in the route-target import statement on border01 (4001) and tunneled over to leaf01, where routing occurs in VRF RED.
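The traffic flow above comes down to one rule: the ingress VTEP encapsulates with the VNI learned from the remote (downstream) VTEP, not with its own layer 3 VNI. The following Python sketch models that lookup for the two directions in this example. It is an illustration only: the table IMPORTED_ROUTES and the function encapsulate are hypothetical, and the prefixes are the VLAN 10 subnet in VRF RED on leaf01 and the VLAN 2010 subnet in VRF10 on border01 from the example configuration.

```python
# Hypothetical per-VTEP view of imported routes:
# destination prefix -> (remote VTEP, VNI advertised by that VTEP).
IMPORTED_ROUTES = {
    "leaf01": {"10.1.210.0/24": ("border01", 6000)},
    "border01": {"10.1.10.0/24": ("leaf01", 4001)},
}


def encapsulate(ingress_vtep: str, destination_prefix: str) -> dict:
    """Pick the tunnel endpoint and VNI from the route imported on the ingress VTEP."""
    remote_vtep, downstream_vni = IMPORTED_ROUTES[ingress_vtep][destination_prefix]
    return {"tunnel_to": remote_vtep, "vni": downstream_vni}


if __name__ == "__main__":
    # Forward direction: leaf01 uses the VNI assigned by border01.
    print(encapsulate("leaf01", "10.1.210.0/24"))
    # Reverse direction: border01 uses the VNI assigned by leaf01 for VRF RED.
    print(encapsulate("border01", "10.1.10.0/24"))
```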
The configuration for the example is below.
On leaf01, you can see the route target (route-target import 65163:6000) under the router bgp 65101 vrf RED and router bgp 65101 vrf BLUE stanza of the /etc/frr/frr.conf file.
On border01, you can see the route targets (route-target import 65101:4001 and route-target import 65101:4002) under the router bgp 65163 vrf VRF10 stanza of the /etc/frr/frr.conf file.
Because the configuration is similar on all the leafs, the example only shows configuration files for leaf01 and border01. For brevity, the example does not show the spine configuration files.
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1-3,swp51-52
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond1 link mtu 9000
cumulus@leaf01:~$ nv set interface bond2 link mtu 9000
cumulus@leaf01:~$ nv set interface bond3 link mtu 9000
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf01:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf01:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf01:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf01:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf01:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf01:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf01:~$ nv set vrf RED
cumulus@leaf01:~$ nv set vrf BLUE
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf01:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf01:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf01:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf01:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf01:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf01:~$ nv set nve vxlan source address 10.10.10.1
cumulus@leaf01:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf01:~$ nv set vrf RED evpn vni 4001
cumulus@leaf01:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf01:~$ nv set system global anycast-mac 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf RED router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set vrf RED router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf01:~$ nv set vrf RED router bgp route-import from-evpn route-target 65163:6000
cumulus@leaf01:~$ nv set vrf BLUE router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set vrf BLUE router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf01:~$ nv set vrf BLUE router bgp route-import from-evpn route-target 65163:6000
cumulus@leaf01:~$ nv set evpn multihoming enable on
cumulus@leaf01:~$ nv set interface bond1 evpn multihoming segment local-id 1
cumulus@leaf01:~$ nv set interface bond2 evpn multihoming segment local-id 2
cumulus@leaf01:~$ nv set interface bond3 evpn multihoming segment local-id 3
cumulus@leaf01:~$ nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set interface bond1-3 evpn multihoming segment df-preference 50000
cumulus@leaf01:~$ nv set interface swp51-52 evpn multihoming uplink on
cumulus@leaf01:~$ nv config apply
cumulus@border01:~$ nv set interface lo ip address 10.10.10.63/32
cumulus@border01:~$ nv set interface swp1-3,swp51-52
cumulus@border01:~$ nv set interface bond1 bond member swp1
cumulus@border01:~$ nv set interface bond2 bond member swp2
cumulus@border01:~$ nv set interface bond3 bond member swp3
cumulus@border01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@border01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@border01:~$ nv set interface bond3 bond lacp-bypass on
cumulus@border01:~$ nv set interface bond1 link mtu 9000
cumulus@border01:~$ nv set interface bond2 link mtu 9000
cumulus@border01:~$ nv set interface bond3 link mtu 9000
cumulus@border01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@border01:~$ nv set interface bond1 bridge domain br_default access 2001
cumulus@border01:~$ nv set interface bond2 bridge domain br_default access 2002
cumulus@border01:~$ nv set interface bond3 bridge domain br_default access 2010
cumulus@border01:~$ nv set interface vlan2001 ip address 10.1.201.1/24
cumulus@border01:~$ nv set interface vlan2002 ip address 10.1.202.1/24
cumulus@border01:~$ nv set interface vlan2010 ip address 10.1.210.1/24
cumulus@border01:~$ nv set bridge domain br_default vlan 2001,2002,2010
cumulus@border01:~$ nv set vrf VRF10
cumulus@border01:~$ nv set vrf EXTERNAL1
cumulus@border01:~$ nv set vrf EXTERNAL2
cumulus@border01:~$ nv set bridge domain br_default vlan 2001 vni 2001
cumulus@border01:~$ nv set bridge domain br_default vlan 2002 vni 2002
cumulus@border01:~$ nv set bridge domain br_default vlan 2010 vni 2010
cumulus@border01:~$ nv set interface vlan2001 ip vrf EXTERNAL1
cumulus@border01:~$ nv set interface vlan2002 ip vrf EXTERNAL2
cumulus@border01:~$ nv set interface vlan2010 ip vrf VRF10
cumulus@border01:~$ nv set nve vxlan source address 10.10.10.63
cumulus@border01:~$ nv set nve vxlan arp-nd-suppress on
cumulus@border01:~$ nv set vrf VRF10 evpn vni 6000
cumulus@border01:~$ nv set system global anycast-mac 44:38:39:BE:EF:FF
cumulus@border01:~$ nv set evpn enable on
cumulus@border01:~$ nv set router bgp autonomous-system 65163
cumulus@border01:~$ nv set router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@border01:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@border01:~$ nv set vrf VRF10 router bgp autonomous-system 65163
cumulus@border01:~$ nv set vrf VRF10 router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf VRF10 router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@border01:~$ nv set vrf VRF10 router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border01:~$ nv set vrf VRF10 router bgp address-family ipv4-unicast route-export to-evpn
cumulus@border01:~$ nv set vrf VRF10 router bgp route-import from-evpn route-target 65101:4001
cumulus@border01:~$ nv set vrf VRF10 router bgp route-import from-evpn route-target 65101:4002
cumulus@border01:~$ nv set vrf EXTERNAL1 router bgp autonomous-system 65163
cumulus@border01:~$ nv set vrf EXTERNAL1 router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf EXTERNAL1 router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@border01:~$ nv set vrf EXTERNAL1 router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border01:~$ nv set vrf EXTERNAL1 router bgp address-family ipv4-unicast route-export to-evpn
cumulus@border01:~$ nv set vrf EXTERNAL2 router bgp autonomous-system 65163
cumulus@border01:~$ nv set vrf EXTERNAL2 router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf EXTERNAL2 router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@border01:~$ nv set vrf EXTERNAL2 router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border01:~$ nv set vrf EXTERNAL2 router bgp address-family ipv4-unicast route-export to-evpn
cumulus@border01:~$ nv config apply
cumulus@leaf01:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
      multihoming:
        enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
        bridge:
          domain:
            br_default:
              access: 10
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 1
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
        bridge:
          domain:
            br_default:
              access: 20
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 2
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
        bridge:
          domain:
            br_default:
              access: 30
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 3
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp51:
        evpn:
          multihoming:
            uplink: on
        type: swp
      swp52:
        evpn:
          multihoming:
            uplink: on
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.2/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.2/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.2/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        source:
          address: 10.10.10.1
    router:
      bgp:
        autonomous-system: 65101
        enable: on
        router-id: 10.10.10.1
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:AA
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65101
            enable: on
            route-import:
              from-evpn:
                route-target:
                  65163:6000: {}
            router-id: 10.10.10.1
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65101
            enable: on
            route-import:
              from-evpn:
                route-target:
                  65163:6000: {}
            router-id: 10.10.10.1
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@border01:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '2001':
              vni:
                '2001': {}
            '2002':
              vni:
                '2002': {}
            '2010':
              vni:
                '2010': {}
    evpn:
      enable: on
      multihoming: {}
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
        bridge:
          domain:
            br_default:
              access: 2001
        evpn:
          multihoming: {}
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
        bridge:
          domain:
            br_default:
              access: 2002
        evpn:
          multihoming: {}
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
        bridge:
          domain:
            br_default:
              access: 2010
        evpn:
          multihoming: {}
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.63/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp51:
        evpn:
          multihoming: {}
        type: swp
      swp52:
        evpn:
          multihoming: {}
        type: swp
      vlan2001:
        ip:
          address:
            10.1.201.1/24: {}
          vrf: EXTERNAL1
        type: svi
        vlan: 2001
      vlan2002:
        ip:
          address:
            10.1.202.1/24: {}
          vrf: EXTERNAL2
        type: svi
        vlan: 2002
      vlan2010:
        ip:
          address:
            10.1.210.1/24: {}
          vrf: VRF10
        type: svi
        vlan: 2010
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        source:
          address: 10.10.10.63
    router:
      bgp:
        autonomous-system: 65163
        enable: on
        router-id: 10.10.10.63
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:FF
        system-mac: 44:38:39:22:01:74
      hostname: border01
    vrf:
      EXTERNAL1:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65163
            enable: on
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
            router-id: 10.10.10.63
      EXTERNAL2:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65163
            enable: on
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
            router-id: 10.10.10.63
      VRF10:
        evpn:
          enable: on
          vni:
            '6000': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65163
            enable: on
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
            route-import:
              from-evpn:
                route-target:
                  65101:4001: {}
                  65101:4002: {}
            router-id: 10.10.10.63
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@leaf01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 10
auto bond2
iface bond2
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 20
auto bond3
iface bond3
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 30
auto vlan10
iface vlan10
address 10.1.10.2/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:b1
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.2/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:b1
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.2/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:b1
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30
bridge-vids 10 20 30
bridge-learning off
auto vlan220_l3
iface vlan220_l3
vrf RED
vlan-raw-device br_l3vni
vlan-id 220
auto vlan297_l3
iface vlan297_l3
vrf BLUE
vlan-raw-device br_l3vni
vlan-id 297
auto vxlan99
iface vxlan99
bridge-vlan-vni-map 220=4001 297=4002
bridge-vids 220 297
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 vxlan48
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
auto br_l3vni
iface br_l3vni
bridge-ports vxlan99
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
cumulus@border01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.63/32
vxlan-local-tunnelip 10.10.10.63
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto EXTERNAL1
iface EXTERNAL1
vrf-table auto
auto EXTERNAL2
iface EXTERNAL2
vrf-table auto
auto VRF10
iface VRF10
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 2001
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 2002
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 2010
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp51
iface swp51
auto swp52
iface swp52
auto vlan2001
iface vlan2001
address 10.1.201.1/24
hwaddress 44:38:39:22:01:74
vrf EXTERNAL1
vlan-raw-device br_default
vlan-id 2001
auto vlan2002
iface vlan2002
address 10.1.202.1/24
hwaddress 44:38:39:22:01:74
vrf EXTERNAL2
vlan-raw-device br_default
vlan-id 2002
auto vlan2010
iface vlan2010
address 10.1.210.1/24
hwaddress 44:38:39:22:01:74
vrf VRF10
vlan-raw-device br_default
vlan-id 2010
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 2001=2001 2002=2002 2010=2010
bridge-learning off
auto vlan336_l3
iface vlan336_l3
vrf VRF10
vlan-raw-device br_l3vni
vlan-id 336
auto vxlan99
iface vxlan99
bridge-vlan-vni-map 336=6000
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 vxlan48
hwaddress 44:38:39:22:01:74
bridge-vlan-aware yes
bridge-vids 2001 2002 2010
bridge-pvid 1
auto br_l3vni
iface br_l3vni
bridge-ports vxlan99
hwaddress 44:38:39:22:01:74
bridge-vlan-aware yes
This simulation starts with the example downstream VNI configuration. To simplify the example, only one spine is in the topology. The demo is pre-configured using NVUE commands.
fw1 has IP address 10.1.210.254 configured beyond border01 in VRF10.
server01 has IP address 10.1.10.101 as in the example.
To validate the configuration, run the verification commands shown below.
Verify Configuration
To verify the configuration, check that the routes are properly received and tagged:
The following vtysh command on leaf01 shows the route from border01 tagged with route target 6000:
The following Linux command on leaf01 shows the encapsulated ID (6000) on the routes:
cumulus@leaf01:mgmt:~$ ip route show vrf RED 10.1.210.0/24
10.1.210.0/24 encap ip id 6000 src 0.0.0.0 dst 10.10.10.63 ttl 0 tos 0 via 10.10.10.63 dev vxlan99 proto bgp metric 20 onlink
The following vtysh command on border01 shows the routes from leaf01 tagged with route targets 4001 and 4002:
The following Linux command on border01 shows the encapsulated IDs (4001 and 4002) on the routes:
cumulus@border01:mgmt:~$ ip route show vrf VRF10
10.1.10.0/24 encap ip id 4001 src 0.0.0.0 dst 10.10.10.1 ttl 0 tos 0 via 10.10.10.1 dev vxlan99 proto bgp metric 20 onlink
10.1.20.0/24 encap ip id 4001 src 0.0.0.0 dst 10.10.10.1 ttl 0 tos 0 via 10.10.10.1 dev vxlan99 proto bgp metric 20 onlink
10.1.30.0/24 encap ip id 4002 src 0.0.0.0 dst 10.10.10.1 ttl 0 tos 0 via 10.10.10.1 dev vxlan99 proto bgp metric 20 onlink
...
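When you check many prefixes, it can help to pull the encapsulation ID out of the ip route show output programmatically. The following Python sketch is illustrative only: the function name encap_vnis and the regular expression are assumptions based on the output format shown above, not a Cumulus Linux tool.

```python
import re

# Matches the prefix, the VXLAN encapsulation ID, and the remote VTEP in
# lines shaped like the "ip route show vrf ..." output above.
ROUTE_RE = re.compile(
    r"^(?P<prefix>\S+) encap ip id (?P<vni>\d+) .*? dst (?P<vtep>\S+) "
)


def encap_vnis(ip_route_output):
    """Return {prefix: (vni, remote_vtep)} for every encapsulated route."""
    routes = {}
    for line in ip_route_output.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match:  # lines without an encap header (or "...") are skipped
            routes[match["prefix"]] = (int(match["vni"]), match["vtep"])
    return routes


sample = """\
10.1.10.0/24 encap ip id 4001 src 0.0.0.0 dst 10.10.10.1 ttl 0 tos 0 via 10.10.10.1 dev vxlan99 proto bgp metric 20 onlink
10.1.20.0/24 encap ip id 4001 src 0.0.0.0 dst 10.10.10.1 ttl 0 tos 0 via 10.10.10.1 dev vxlan99 proto bgp metric 20 onlink
10.1.30.0/24 encap ip id 4002 src 0.0.0.0 dst 10.10.10.1 ttl 0 tos 0 via 10.10.10.1 dev vxlan99 proto bgp metric 20 onlink
...
"""

for prefix, (vni, vtep) in encap_vnis(sample).items():
    print(prefix, vni, vtep)
```

In the border01 example, the routes for 10.1.10.0/24 and 10.1.20.0/24 should map to VNI 4001 and the route for 10.1.30.0/24 to VNI 4002.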
Considerations
Centralized Routing with ARP Suppression Enabled on the Gateway
In an EVPN centralized routing configuration where the layer 2 network extends beyond VTEPs (for example, a host with bridges), the gateway MAC address does not refresh in the network when ARP suppression exists on the gateway. To work around this issue, disable ARP suppression on the centralized gateway.
Symmetric Routing and the Same SVI IP Address Across Racks
In EVPN symmetric routing, if you use the same SVI IP address across racks (for example, if the SVI IP address for a specific VLAN interface (such as vlan100) is the same on all VTEPs where this SVI is present):
You cannot use ping between SVI IP addresses to verify connectivity between VTEPs because either the local rack itself uses the ping destination IP address or remote racks use the ping destination IP address.
If you use ping from a host to the SVI IP address, the local VTEP (gateway) sometimes does not reply if the host has an ARP entry from a remote gateway.
Host-to-host traffic does not have these issues.
EVPN Multihoming
EVPN multihoming (EVPN-MH) provides support for all-active server redundancy. It is a standards-based replacement for MLAG in data centers deploying Clos topologies. Replacing MLAG provides these benefits:
Eliminates the need for peerlinks or inter-switch links between the top of rack switches
Allows more than two ToR switches in a redundancy group
Provides a single BGP-EVPN control plane
Allows multi-vendor interoperability
EVPN-MH uses BGP-EVPN type-1, type-2, and type-4 routes to discover Ethernet segments (ES) and to forward traffic to those Ethernet segments. The MAC and neighbor databases synchronize between the Ethernet segment peers through these routes as well. An Ethernet segment is a group of switch links that attach to the same server. Each Ethernet segment has a unique Ethernet segment ID (ESI) across the entire PoD.
To configure EVPN-MH, you set an Ethernet segment system MAC address and a local Ethernet segment ID on a static or LACP bond. These two parameters generate the unique MAC-based ESI value (type-3) automatically:
The Ethernet segment system MAC address is the LACP system identifier.
The local Ethernet segment ID configuration defines a local discriminator to uniquely enumerate each bond that shares the same Ethernet segment system MAC address.
The resulting 10-byte ESI value has the following format, where the MMs denote the 6-byte Ethernet segment system MAC address and the XXs denote the 3-byte local Ethernet segment ID value:
03:MM:MM:MM:MM:MM:MM:XX:XX:XX
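To make the format concrete, the following Python sketch assembles the ESI string from the two configured values. It is an illustration of the layout described above, not the code that FRR runs; the function name build_type3_esi is hypothetical.

```python
def build_type3_esi(system_mac, local_id):
    """Build the 10-byte type-3 ESI string from a system MAC and a local ID."""
    mac_bytes = bytes.fromhex(system_mac.replace(":", ""))
    if len(mac_bytes) != 6:
        raise ValueError("the Ethernet segment system MAC must be 6 bytes")
    if not 0 <= local_id < 2 ** 24:
        raise ValueError("the local Ethernet segment ID must fit in 3 bytes")
    # 1 type byte (0x03) + 6 MAC bytes + 3 local-ID bytes = 10 bytes
    esi = bytes([0x03]) + mac_bytes + local_id.to_bytes(3, "big")
    return ":".join(f"{octet:02x}" for octet in esi)


print(build_type3_esi("44:38:39:BE:EF:AA", 1))
# prints 03:44:38:39:be:ef:aa:00:00:01
```

With the system MAC address 44:38:39:BE:EF:AA and the local IDs 1, 2, and 3, the sketch produces the same ESI values that appear in the show evpn es output later in this section.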
While you can specify a different system MAC address on different Ethernet segments attached to the same switch, the Ethernet segment system MAC address must be the same on the downlinks attached to the same server.
On Spectrum-2 and Spectrum-3 switches, an Ethernet segment can span more than two switches. Each Ethernet segment is a distinct redundancy group. However, on Spectrum A1 switches, you can include a maximum of two switches in a redundancy group or Ethernet segment.
Required and Supported Features
This section describes features that you must enable to use EVPN multihoming. Other supported and unsupported features are also described.
Required Features
You must enable the following features to use EVPN-MH:
Cumulus Linux uses Head End Replication by default with EVPN multihoming. If you prefer to use EVPN BUM traffic handling with EVPN-PIM on multihomed sites via Type-4/ESR routes, configure EVPN-PIM as described in EVPN BUM Traffic with PIM-SM.
To use EVPN-MH, you must remove any MLAG configuration on the switch:
Remove the clag-id from all interfaces in the /etc/network/interfaces file.
Remove the peerlink interfaces in the /etc/network/interfaces file.
Remove any existing hwaddress (from a Cumulus Linux 3.x MLAG configuration) or address-virtual (from a Cumulus Linux 4.x MLAG configuration) entries from all SVIs corresponding to a layer 3 VNI in the /etc/network/interfaces file.
Remove any clagd-vxlan-anycast-ip configuration in the /etc/network/interfaces file.
Run the sudo ifreload command to reload the configuration.
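Before you reload, you can scan the file for leftover MLAG settings. The following Python sketch is a rough check under stated assumptions: the helper name find_mlag_leftovers is hypothetical, and it assumes the peerlink interface name starts with peerlink. It does not detect the hwaddress or address-virtual entries on layer 3 VNI SVIs, which you still need to review manually.

```python
def find_mlag_leftovers(interfaces_text):
    """Return (line_number, line) pairs that still carry MLAG settings."""
    leftovers = []
    for number, raw in enumerate(interfaces_text.splitlines(), start=1):
        line = raw.strip()
        words = line.split()
        if not words or line.startswith("#"):
            continue  # skip blank lines and comments
        is_clag = words[0] in ("clag-id", "clagd-vxlan-anycast-ip")
        # Assumption: the peerlink bond and its subinterface are named peerlink*
        is_peerlink = (
            words[0] in ("auto", "iface")
            and len(words) > 1
            and words[1].startswith("peerlink")
        )
        if is_clag or is_peerlink:
            leftovers.append((number, line))
    return leftovers


sample = """\
auto bond1
iface bond1
    bond-slaves swp1
    clag-id 1
auto peerlink
iface peerlink
    bond-slaves swp49 swp50
"""

for number, line in find_mlag_leftovers(sample):
    print(f"line {number}: {line}")
```

An empty result means the sketch found none of the attributes it knows about; it is not proof that the file is free of MLAG configuration.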
Supported Features
Known unicast traffic multihoming through type-1/EAD (Ethernet auto discovery) routes and type-2 (non-zero ESI) routes. Includes all-active redundancy using aliasing and support for fast failover.
When an EVPN-MH bond enters LACP bypass state, BGP stops advertising EVPN type-1 and type-4 routes for that bond. The switch disables split-horizon and designated forwarder filters.
When an EVPN-MH bond exits the LACP bypass state, BGP starts advertising EVPN type-1 and type-4 routes for that bond. The switch enables split-horizon and designated forwarder filters.
EVI - Cumulus Linux supports VLAN-based service only, so the EVI is just a layer 2 VNI.
Supported ASICs include NVIDIA Spectrum A1, Spectrum-2 and Spectrum-3.
Supported EVPN Route Types
EVPN multihoming supports the following route types.
Multihomed networks, such as STP bridge domains that are MH connected. EVPN-MH bonds are intended for multihomed end-node device (server) connectivity.
Basic Configuration
To configure EVPN-MH, you must complete all the following steps:
Enable EVPN multihoming.
Configure an ESI on each EVPN-MH bond interface.
Configure multihoming uplinks.
You can associate static and LACP bonds with an ESI.
The switch selects a designated forwarder (DF) for each Ethernet segment. The DF forwards flooded traffic received through the VXLAN overlay to the locally attached Ethernet segment. Specify a preference on an Ethernet segment for the DF election, as this leads to predictable failure scenarios. The EVPN VTEP with the highest DF preference setting becomes the DF. The DF preference setting defaults to 32767.
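The preference rule can be sketched in a few lines of Python. This is a simplified model, not the FRR implementation: the function name elect_df is hypothetical, and the tie-break on the lowest VTEP IP address is an assumption that the text above does not state.

```python
from ipaddress import IPv4Address

DEFAULT_DF_PREFERENCE = 32767  # default stated above


def elect_df(vtep_preferences):
    """Pick the DF from a {vtep_ip: df_preference_or_None} mapping."""

    def rank(item):
        vtep_ip, preference = item
        if preference is None:
            preference = DEFAULT_DF_PREFERENCE
        # Highest preference wins; the lowest IP breaks ties (assumption).
        return (-preference, int(IPv4Address(vtep_ip)))

    return min(vtep_preferences.items(), key=rank)[0]


# leaf01 sets df-preference 50000; leaf02 keeps the default.
print(elect_df({"10.10.10.1": 50000, "10.10.10.2": None}))
```

Setting an explicit preference on one Ethernet segment peer, as in the example, makes the outcome independent of any tie-break behavior.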
NVUE generates the EVPN-MH configuration and reloads FRR and ifupdown2. The configuration appears in both the /etc/network/interfaces file and the /etc/frr/frr.conf file.
When you enable EVPN-MH, all SVI MAC addresses advertise as type-2 routes. You do not need to configure a unique SVI IP address or configure the BGP EVPN address family with advertise-svi-ip.
Enable EVPN-MH
NVIDIA recommends that you enable EVPN-MH on all VTEPs throughout the fabric to avoid duplicate packets.
cumulus@leaf01:~$ nv set evpn multihoming enable on
cumulus@leaf01:~$ nv config apply
When you enable multihoming on the Spectrum A1 switch with the nv set evpn multihoming enable on command, NVUE restarts the switchd service, which causes all network ports to reset in addition to resetting the switch hardware configuration.
Set the evpn.multihoming.enable variable in the /etc/cumulus/switchd.conf file to TRUE. Cumulus Linux disables this variable by default.
On the Spectrum A1 switch, you must restart switchd with the sudo systemctl restart switchd.service command after you enable multihoming.
Configure the EVPN-MH Bonds
To configure bond interfaces for EVPN-MH:
You can either set both the local Ethernet segment ID and the segment MAC address to generate a unique ESI automatically or set the 10-byte Ethernet segment ID manually, then set the segment MAC address. You can see both options below.
The following example commands configure each bond interface with the local Ethernet segment ID and the segment MAC address to generate a unique ESI automatically:
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 evpn multihoming segment local-id 1
cumulus@leaf01:~$ nv set interface bond2 evpn multihoming segment local-id 2
cumulus@leaf01:~$ nv set interface bond3 evpn multihoming segment local-id 3
cumulus@leaf01:~$ nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set interface bond1-3 evpn multihoming segment df-preference 50000
cumulus@leaf01:~$ nv config apply
The following example commands configure each bond interface with the Ethernet segment ID manually. The ID must be a 10-byte (80-bit) integer and must be unique. When you configure the 10-byte Ethernet segment ID, ensure that the local ID is not present. You must also configure the segment MAC address. The example configures a global segment MAC address for use on all the Ethernet segment bonds.
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 evpn multihoming segment identifier 00:44:38:39:BE:EF:AA:00:00:01
cumulus@leaf01:~$ nv set interface bond2 evpn multihoming segment identifier 00:44:38:39:BE:EF:AA:00:00:02
cumulus@leaf01:~$ nv set interface bond3 evpn multihoming segment identifier 00:44:38:39:BE:EF:AA:00:00:03
cumulus@leaf01:~$ nv set interface bond1-3 evpn multihoming segment df-preference 50000
cumulus@leaf01:~$ nv set evpn multihoming segment mac-address 44:38:39:ff:ff:01
cumulus@leaf01:~$ nv config apply
Configure the ESI on each bond interface with the local Ethernet segment ID and the segment MAC address:
The following example commands configure each bond interface with the Ethernet segment ID manually. The ID must be a 10-byte (80-bit) integer and must be unique. When you configure the 10-byte Ethernet segment ID, ensure that the local ID is not present. You must also configure the segment MAC address separately. The example configures a global segment MAC address for use on all the Ethernet segment bonds.
Configure each bond interface with the Ethernet segment ID manually:
When all uplinks go down, the VTEP loses connectivity to the VXLAN overlay. To prevent traffic loss, Cumulus Linux tracks the operational state of the uplink. When all the uplinks are down, the Ethernet segment bonds on the switch are in a protodown or error-disabled state. An MH uplink is any routed interface to which the switch routes locally encapsulated VXLAN traffic (after encapsulation) or any routed interface receiving VXLAN traffic (before decapsulation) that the local device decapsulates.
Split-horizon and Designated-Forwarder filters only apply to interfaces that are MH uplinks.
If you configure EVPN-MH without MH uplinks, BUM traffic duplicates or loops back to the same ES. This can cause MAC flaps or other issues on multihomed devices.
cumulus@leaf01:~$ nv set interface swp51-54 evpn multihoming uplink on
cumulus@leaf01:~$ nv config apply
If you are configuring EVPN multihoming with EVPN-PIM, be sure to configure PIM on the interfaces.
mac-holdtime specifies the duration for which a switch maintains SYNC MAC entries after the switch deletes the EVPN type-2 route of the Ethernet segment peer. During this time, the switch attempts to independently establish reachability of the MAC address on the local Ethernet segment. The hold time can be between 0 and 86400 seconds. The default is 1080 seconds.
neigh-holdtime specifies the duration for which a switch maintains SYNC neighbor entries after the switch deletes the EVPN type-2 route of the Ethernet segment peer. During this time, the switch attempts to independently establish reachability of the host on the local Ethernet segment. The hold time can be between 0 and 86400 seconds. The default is 1080 seconds.
redirect-off disables fast failover of traffic destined to the access port through the VXLAN overlay. This only applies to Cumulus VX.
startup-delay specifies the duration for which a switch holds the Ethernet segment-bond in a protodown state after a reboot or process restart. This allows the initialization of the VXLAN overlay to complete. The delay can be between 0 and 216000 seconds. The default is 180 seconds.
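If you generate these settings from automation, a small range check can catch out-of-range values before you apply them. The following Python sketch encodes only the ranges and defaults listed above; the names TIMER_LIMITS and check_timer are hypothetical.

```python
# (minimum, maximum, default) in seconds, taken from the descriptions above.
TIMER_LIMITS = {
    "mac-holdtime": (0, 86400, 1080),
    "neigh-holdtime": (0, 86400, 1080),
    "startup-delay": (0, 216000, 180),
}


def check_timer(name, seconds=None):
    """Return a valid value for the timer, or its default when unset."""
    low, high, default = TIMER_LIMITS[name]
    if seconds is None:
        return default
    if not low <= seconds <= high:
        raise ValueError(f"{name} must be between {low} and {high} seconds")
    return seconds


print(check_timer("mac-holdtime", 1000))
print(check_timer("startup-delay"))
```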
To configure a MAC hold time for 1000 seconds, run the following commands:
You can add debug statements to the /etc/frr/frr.conf file to debug the Ethernet segments, routes, and routing protocols (via Zebra).
Cumulus Linux does not provide NVUE commands for FRR debugging; however, you can create a snippet to enable FRR debugging. Refer to /etc/frr/frr.conf snippets.
When an Ethernet segment link goes down, the attached VTEP notifies all other VTEPs using a single EAD-ES withdraw. Cumulus Linux uses an Ethernet segment bond redirect.
Fast failover also triggers:
When you reboot a leaf switch or VTEP.
When there is an uplink failure. When all uplinks are down, the Ethernet segment bonds on the switch are protodown or error disabled.
Disable Next Hop Group Sharing in the ASIC
When you configure EVPN-MH, container sharing for both layer 2 and layer 3 next hop groups is on by default. The switch stores these settings in the evpn.multihoming.shared_l2_groups and evpn.multihoming.shared_l3_groups variables.
Disabling container sharing allows for faster failover when an Ethernet segment link flaps.
To disable either setting, edit the switchd.conf file, set the variable to FALSE, then restart the switchd service. For example, to disable container sharing for layer 3 next hop groups:
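If you script the change, the edit amounts to rewriting one line. The following Python sketch works on text only and does not touch the switch; it assumes that switchd.conf uses key = value lines and that a setting can appear commented out with a leading #. The helper name set_switchd_option is hypothetical.

```python
def set_switchd_option(conf_text, key, value):
    """Return conf_text with `key = value` set, appending it if absent."""
    new_line = f"{key} = {value}"
    lines = conf_text.splitlines()
    for index, line in enumerate(lines):
        # Compare the setting name whether or not the line is commented out.
        name = line.lstrip("#").split("=", 1)[0].strip()
        if "=" in line and name == key:
            lines[index] = new_line
            break
    else:
        lines.append(new_line)  # the key was not present at all
    return "\n".join(lines) + "\n"


sample = (
    "evpn.multihoming.enable = TRUE\n"
    "#evpn.multihoming.shared_l3_groups = TRUE\n"
)
print(set_switchd_option(sample, "evpn.multihoming.shared_l3_groups", "FALSE"))
```

After you write the result back to the file, restart the switchd service as described above for the change to take effect.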
RFC 7432 requires the switch to advertise type-1/EAD (Ethernet Auto-discovery) routes:
As EAD-per-ES (Ethernet Auto-discovery per Ethernet segment) routes
As EAD-per-EVI (Ethernet Auto-discovery per EVPN instance) routes
Some third party switch vendors do not advertise EAD-per-EVI routes; they only advertise EAD-per-ES routes. To interoperate with these vendors, you need to disable EAD-per-EVI route advertisements.
To remove the dependency on EAD-per-EVI routes and activate the VTEP upon receiving the EAD-per-ES route:
cumulus@switch:~$ nv set evpn multihoming ead-evi-route rx off
cumulus@switch:~$ nv config apply
To suppress the advertisement of EAD-per-EVI routes, run:
cumulus@switch:~$ nv set evpn multihoming ead-evi-route tx off
cumulus@switch:~$ nv config apply
Use the following commands to troubleshoot your EVPN multihoming configuration.
Show Global EVPN-MH Information
To show global EVPN-MH information, such as the uplink count, startup delay timer, neighbor hold time, and MAC entry hold time, run the NVUE nv show evpn multihoming command:
cumulus@switch:~$ nv show evpn multihoming
operational applied
------------------- ----------- -------
enable on
mac-holdtime 1080 1080
neighbor-holdtime 1080 1080
startup-delay 180 180
ead-evi-route
rx on
tx on
segment
df-preference 32767
startup-delay-timer --:--:--
uplink-active 2
uplink-count 2
Show Ethernet Segment Information
To display the Ethernet segments across all VNIs, run the nv show evpn multihoming esi -o json command or the vtysh show evpn es command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show evpn es
Type: B bypass, L local, R remote, N non-DF
ESI Type ES-IF VTEPs
03:44:38:39:be:ef:aa:00:00:01 LB bond1
03:44:38:39:be:ef:aa:00:00:02 LB bond2
03:44:38:39:be:ef:aa:00:00:03 LB bond3
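When you troubleshoot, it can be useful to map an ESI in this output back to the values you configured. The following Python sketch splits a type-3 ESI into the Ethernet segment system MAC address and the local Ethernet segment ID; the function name split_type3_esi is hypothetical, and the sketch rejects ESIs that do not start with 03.

```python
def split_type3_esi(esi):
    """Split a type-3 ESI string into (system_mac, local_id)."""
    octets = esi.lower().split(":")
    if len(octets) != 10 or octets[0] != "03":
        raise ValueError("not a 10-byte type-3 ESI")
    system_mac = ":".join(octets[1:7])        # bytes 2-7: system MAC
    local_id = int("".join(octets[7:]), 16)   # bytes 8-10: local ID
    return system_mac, local_id


for esi in (
    "03:44:38:39:be:ef:aa:00:00:01",
    "03:44:38:39:be:ef:aa:00:00:02",
    "03:44:38:39:be:ef:aa:00:00:03",
):
    print(esi, split_type3_esi(esi))
```

For the three bonds in the example, the sketch returns the system MAC address 44:38:39:be:ef:aa with local IDs 1, 2, and 3.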
Show Ethernet Segment per VNI Information
To display the Ethernet segments learned for each VNI, run the vtysh show evpn es-evi command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show evpn es-evi
Type: L local, R remote
VNI ESI Type
20 03:44:38:39:be:ef:aa:00:00:02 L
30 03:44:38:39:be:ef:aa:00:00:03 L
10 03:44:38:39:be:ef:aa:00:00:01 L
To display the Ethernet segments for a specific VNI, run the NVUE nv show evpn vni <vni> multihoming esi command. For example:
cumulus@switch:~$ nv show evpn vni 10 multihoming esi
type.local type.remote
----------------------------- ---------- -----------
03:44:38:39:be:ef:aa:00:00:01 on
Show BGP Ethernet Segment Information
To display the Ethernet segments across all VNIs learned via type-1 and type-4 routes, run the NVUE nv show evpn multihoming bgp-info esi -o json command or the vtysh show bgp l2vpn evpn es command. For example:
To display the Ethernet segments per VNI learned via type-1 and type-4 routes, run the vtysh show bgp l2vpn evpn es-evi command.
cumulus@switch:~$ sudo vtysh
...
switch# show bgp l2vpn evpn es-evi
Flags: L local, R remote, I inconsistent
VTEP-Flags: E EAD-per-ES, V EAD-per-EVI
VNI ESI Flags VTEPs
20 03:44:38:39:be:ef:aa:00:00:02 LR 10.10.10.2(V)
20 03:44:38:39:be:ef:bb:00:00:02 R 10.10.10.3(V),10.10.10.4(V)
30 03:44:38:39:be:ef:aa:00:00:03 LR 10.10.10.2(V)
30 03:44:38:39:be:ef:bb:00:00:03 R 10.10.10.3(V),10.10.10.4(V)
10 03:44:38:39:be:ef:aa:00:00:01 LR 10.10.10.2(V)
10 03:44:38:39:be:ef:bb:00:00:01 R 10.10.10.3(V),10.10.10.4(V)
...
Show EAD Route Types
To view type-1 EAD routes, run the vtysh show bgp l2vpn evpn route type ead command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show bgp l2vpn evpn route type ead
BGP table version is 3, local router ID is 10.10.10.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
Extended Community
Route Distinguisher: 10.10.10.1:2
*> [1]:[0]:[03:44:38:39:be:ef:aa:00:00:02]:[128]:[0.0.0.0]
10.10.10.1 32768 i
ET:8 RT:65101:20
Route Distinguisher: 10.10.10.1:6
*> [1]:[0]:[03:44:38:39:be:ef:aa:00:00:03]:[128]:[0.0.0.0]
10.10.10.1 32768 i
ET:8 RT:65101:30
Route Distinguisher: 10.10.10.1:7
*> [1]:[0]:[03:44:38:39:be:ef:aa:00:00:01]:[128]:[0.0.0.0]
10.10.10.1 32768 i
ET:8 RT:65101:10
Route Distinguisher: 10.10.10.2:2
*> [1]:[0]:[03:44:38:39:be:ef:aa:00:00:02]:[32]:[0.0.0.0]
10.10.10.2 0 65199 65102 i
RT:65102:20 ET:8
Route Distinguisher: 10.10.10.2:6
*> [1]:[0]:[03:44:38:39:be:ef:aa:00:00:03]:[32]:[0.0.0.0]
10.10.10.2 0 65199 65102 i
RT:65102:30 ET:8
Route Distinguisher: 10.10.10.2:7
*> [1]:[0]:[03:44:38:39:be:ef:aa:00:00:01]:[32]:[0.0.0.0]
10.10.10.2 0 65199 65102 i
RT:65102:10 ET:8
Route Distinguisher: 10.10.10.3:2
*> [1]:[0]:[03:44:38:39:be:ef:bb:00:00:02]:[32]:[0.0.0.0]
10.10.10.3 0 65199 65103 i
RT:65103:20 ET:8
...
Configuration Example
The following configuration examples use the topology illustrated below and configure EVPN multihoming with head end replication using single VXLAN devices. The examples provide configuration for server01 through server04. The configuration for server05 and server06 is not included for simplicity.
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1-3,swp51-52
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond1 link mtu 9000
cumulus@leaf01:~$ nv set interface bond2 link mtu 9000
cumulus@leaf01:~$ nv set interface bond3 link mtu 9000
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf01:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf01:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf01:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf01:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf01:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf01:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf01:~$ nv set vrf RED
cumulus@leaf01:~$ nv set vrf BLUE
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf01:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf01:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf01:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf01:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf01:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf01:~$ nv set nve vxlan source address 10.10.10.1
cumulus@leaf01:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf01:~$ nv set vrf RED evpn vni 4001
cumulus@leaf01:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf RED router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set vrf RED router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf01:~$ nv set vrf BLUE router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set vrf BLUE router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf01:~$ nv set evpn multihoming enable on
cumulus@leaf01:~$ nv set interface bond1 evpn multihoming segment local-id 1
cumulus@leaf01:~$ nv set interface bond2 evpn multihoming segment local-id 2
cumulus@leaf01:~$ nv set interface bond3 evpn multihoming segment local-id 3
cumulus@leaf01:~$ nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set interface bond1-3 evpn multihoming segment df-preference 50000
cumulus@leaf01:~$ nv set interface swp51-52 evpn multihoming uplink on
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp1-3,swp51-52
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:~$ nv set interface bond3 bond member swp3
cumulus@leaf02:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond1 link mtu 9000
cumulus@leaf02:~$ nv set interface bond2 link mtu 9000
cumulus@leaf02:~$ nv set interface bond3 link mtu 9000
cumulus@leaf02:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf02:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf02:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf02:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf02:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf02:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf02:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf02:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf02:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf02:~$ nv set interface vlan30 ip address 10.1.30.3/24
cumulus@leaf02:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf02:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf02:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf02:~$ nv set vrf RED
cumulus@leaf02:~$ nv set vrf BLUE
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf02:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf02:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf02:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf02:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf02:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf02:~$ nv set nve vxlan source address 10.10.10.2
cumulus@leaf02:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf02:~$ nv set vrf RED evpn vni 4001
cumulus@leaf02:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf02:~$ nv set evpn enable on
cumulus@leaf02:~$ nv set router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf02:~$ nv set vrf RED router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set vrf RED router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf02:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf02:~$ nv set vrf BLUE router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set vrf BLUE router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf02:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf02:~$ nv set evpn multihoming enable on
cumulus@leaf02:~$ nv set interface bond1 evpn multihoming segment local-id 1
cumulus@leaf02:~$ nv set interface bond2 evpn multihoming segment local-id 2
cumulus@leaf02:~$ nv set interface bond3 evpn multihoming segment local-id 3
cumulus@leaf02:~$ nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set interface bond1-3 evpn multihoming segment df-preference 50000
cumulus@leaf02:~$ nv set interface swp51-52 evpn multihoming uplink on
cumulus@leaf02:~$ nv config apply
cumulus@leaf03:~$ nv set interface lo ip address 10.10.10.3/32
cumulus@leaf03:~$ nv set interface swp1-3,swp51-52
cumulus@leaf03:~$ nv set interface bond1 bond member swp1
cumulus@leaf03:~$ nv set interface bond2 bond member swp2
cumulus@leaf03:~$ nv set interface bond3 bond member swp3
cumulus@leaf03:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond1 link mtu 9000
cumulus@leaf03:~$ nv set interface bond2 link mtu 9000
cumulus@leaf03:~$ nv set interface bond3 link mtu 9000
cumulus@leaf03:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf03:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf03:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf03:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf03:~$ nv set interface vlan10 ip address 10.1.10.4/24
cumulus@leaf03:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf03:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf03:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf03:~$ nv set interface vlan20 ip address 10.1.20.4/24
cumulus@leaf03:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf03:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf03:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf03:~$ nv set interface vlan30 ip address 10.1.30.4/24
cumulus@leaf03:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf03:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf03:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf03:~$ nv set vrf RED
cumulus@leaf03:~$ nv set vrf BLUE
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf03:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf03:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf03:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf03:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf03:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf03:~$ nv set nve vxlan source address 10.10.10.3
cumulus@leaf03:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf03:~$ nv set vrf RED evpn vni 4001
cumulus@leaf03:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf03:~$ nv set evpn enable on
cumulus@leaf03:~$ nv set router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf03:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf03:~$ nv set vrf RED router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set vrf RED router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf03:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf03:~$ nv set vrf BLUE router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set vrf BLUE router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf03:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf03:~$ nv set evpn multihoming enable on
cumulus@leaf03:~$ nv set interface bond1 evpn multihoming segment local-id 1
cumulus@leaf03:~$ nv set interface bond2 evpn multihoming segment local-id 2
cumulus@leaf03:~$ nv set interface bond3 evpn multihoming segment local-id 3
cumulus@leaf03:~$ nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:BB
cumulus@leaf03:~$ nv set interface bond1-3 evpn multihoming segment df-preference 50000
cumulus@leaf03:~$ nv set interface swp51-52 evpn multihoming uplink on
cumulus@leaf03:~$ nv config apply
cumulus@leaf04:~$ nv set interface lo ip address 10.10.10.4/32
cumulus@leaf04:~$ nv set interface swp1-3,swp51-52
cumulus@leaf04:~$ nv set interface bond1 bond member swp1
cumulus@leaf04:~$ nv set interface bond2 bond member swp2
cumulus@leaf04:~$ nv set interface bond3 bond member swp3
cumulus@leaf04:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond1 link mtu 9000
cumulus@leaf04:~$ nv set interface bond2 link mtu 9000
cumulus@leaf04:~$ nv set interface bond3 link mtu 9000
cumulus@leaf04:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf04:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf04:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf04:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf04:~$ nv set interface vlan10 ip address 10.1.10.5/24
cumulus@leaf04:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf04:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf04:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf04:~$ nv set interface vlan20 ip address 10.1.20.5/24
cumulus@leaf04:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf04:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf04:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf04:~$ nv set interface vlan30 ip address 10.1.30.5/24
cumulus@leaf04:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf04:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf04:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf04:~$ nv set vrf RED
cumulus@leaf04:~$ nv set vrf BLUE
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf04:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf04:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf04:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf04:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf04:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf04:~$ nv set nve vxlan source address 10.10.10.4
cumulus@leaf04:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf04:~$ nv set vrf RED evpn vni 4001
cumulus@leaf04:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf04:~$ nv set evpn enable on
cumulus@leaf04:~$ nv set router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf04:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf04:~$ nv set vrf RED router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set vrf RED router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf04:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf04:~$ nv set vrf BLUE router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set vrf BLUE router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf04:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf04:~$ nv set evpn multihoming enable on
cumulus@leaf04:~$ nv set interface bond1 evpn multihoming segment local-id 1
cumulus@leaf04:~$ nv set interface bond2 evpn multihoming segment local-id 2
cumulus@leaf04:~$ nv set interface bond3 evpn multihoming segment local-id 3
cumulus@leaf04:~$ nv set interface bond1-3 evpn multihoming segment mac-address 44:38:39:BE:EF:BB
cumulus@leaf04:~$ nv set interface bond1-3 evpn multihoming segment df-preference 50000
cumulus@leaf04:~$ nv set interface swp51-52 evpn multihoming uplink on
cumulus@leaf04:~$ nv config apply
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1-4
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@spine01:~$ nv config apply
cumulus@spine02:~$ nv set interface lo ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp1-4
cumulus@spine02:~$ nv set router bgp autonomous-system 65199
cumulus@spine02:~$ nv set router bgp router-id 10.10.10.102
cumulus@spine02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@spine02:~$ nv config apply
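The four leaf command listings above are identical except for a few values: the loopback address (also used as the VXLAN source address and BGP router ID), the SVI host addresses, the BGP AS number, and the Ethernet segment system MAC address. The following Python sketch makes that pattern explicit. It is only an illustration: `leaf_values` and `leaf_commands` are hypothetical helper names, and the numbering rule is inferred from this example topology, not from any NVUE requirement. The script only prints command strings; it does not run `nv`.

```python
# Hypothetical helpers: derive the per-leaf values used in this example.
# The numbering pattern is inferred from the four leaf listings;
# it is an assumption about this topology, not an NVUE rule.

def leaf_values(n):
    """Return the values that differ on leaf0<n> (n = 1..4)."""
    if n not in (1, 2, 3, 4):
        raise ValueError("this example topology has leaf01 through leaf04")
    return {
        "loopback": f"10.10.10.{n}",
        "asn": 65100 + n,
        # SVI host addresses start at .2 because .1 is the shared VRR gateway.
        "svi_host": n + 1,
        # leaf01 and leaf02 share one segment system MAC, leaf03 and leaf04 another.
        "es_mac": "44:38:39:BE:EF:AA" if n <= 2 else "44:38:39:BE:EF:BB",
    }


def leaf_commands(n):
    """Build the subset of NVUE commands whose arguments vary per leaf."""
    v = leaf_values(n)
    cmds = [f"nv set interface lo ip address {v['loopback']}/32"]
    for vlan in (10, 20, 30):
        cmds.append(
            f"nv set interface vlan{vlan} ip address 10.1.{vlan}.{v['svi_host']}/24"
        )
    cmds += [
        f"nv set nve vxlan source address {v['loopback']}",
        f"nv set router bgp autonomous-system {v['asn']}",
        f"nv set router bgp router-id {v['loopback']}",
    ]
    for vrf in ("RED", "BLUE"):
        cmds.append(f"nv set vrf {vrf} router bgp autonomous-system {v['asn']}")
        cmds.append(f"nv set vrf {vrf} router bgp router-id {v['loopback']}")
    cmds.append(
        f"nv set interface bond1-3 evpn multihoming segment mac-address {v['es_mac']}"
    )
    return cmds


if __name__ == "__main__":
    for line in leaf_commands(2):
        print(line)
```

All other commands in the leaf listings (bond membership, VLAN to VNI mapping, VRF to layer 3 VNI mapping, peer group settings) are the same on every leaf.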
cumulus@leaf01:~$ cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
      multihoming:
        enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
        bridge:
          domain:
            br_default:
              access: 10
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 1
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
        bridge:
          domain:
            br_default:
              access: 20
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 2
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
        bridge:
          domain:
            br_default:
              access: 30
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 3
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp51:
        evpn:
          multihoming:
            uplink: on
        type: swp
      swp52:
        evpn:
          multihoming:
            uplink: on
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.2/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.2/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.2/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        source:
          address: 10.10.10.1
    router:
      bgp:
        autonomous-system: 65101
        enable: on
        router-id: 10.10.10.1
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:AA
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65101
            enable: on
            router-id: 10.10.10.1
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65101
            enable: on
            router-id: 10.10.10.1
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@leaf02:~$ cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
      multihoming:
        enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
        bridge:
          domain:
            br_default:
              access: 10
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 1
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
        bridge:
          domain:
            br_default:
              access: 20
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 2
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
        bridge:
          domain:
            br_default:
              access: 30
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 3
              mac-address: 44:38:39:BE:EF:AA
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.2/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp51:
        evpn:
          multihoming:
            uplink: on
        type: swp
      swp52:
        evpn:
          multihoming:
            uplink: on
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.3/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.3/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.3/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        source:
          address: 10.10.10.2
    router:
      bgp:
        autonomous-system: 65102
        enable: on
        router-id: 10.10.10.2
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:AA
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65102
            enable: on
            router-id: 10.10.10.2
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65102
            enable: on
            router-id: 10.10.10.2
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@leaf03:~$ cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
      multihoming:
        enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
        bridge:
          domain:
            br_default:
              access: 10
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 1
              mac-address: 44:38:39:BE:EF:BB
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
        bridge:
          domain:
            br_default:
              access: 20
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 2
              mac-address: 44:38:39:BE:EF:BB
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
        bridge:
          domain:
            br_default:
              access: 30
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 3
              mac-address: 44:38:39:BE:EF:BB
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.3/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp51:
        evpn:
          multihoming:
            uplink: on
        type: swp
      swp52:
        evpn:
          multihoming:
            uplink: on
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.4/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.4/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.4/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        source:
          address: 10.10.10.3
    router:
      bgp:
        autonomous-system: 65103
        enable: on
        router-id: 10.10.10.3
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:AA
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65103
            enable: on
            router-id: 10.10.10.3
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65103
            enable: on
            router-id: 10.10.10.3
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@leaf04:~$ cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
      multihoming:
        enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
        bridge:
          domain:
            br_default:
              access: 10
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 1
              mac-address: 44:38:39:BE:EF:BB
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
        bridge:
          domain:
            br_default:
              access: 20
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 2
              mac-address: 44:38:39:BE:EF:BB
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
        bridge:
          domain:
            br_default:
              access: 30
        evpn:
          multihoming:
            segment:
              df-preference: 50000
              enable: on
              local-id: 3
              mac-address: 44:38:39:BE:EF:BB
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.4/32: {}
        type: loopback
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp51:
        evpn:
          multihoming:
            uplink: on
        type: swp
      swp52:
        evpn:
          multihoming:
            uplink: on
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.5/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.5/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.5/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        source:
          address: 10.10.10.4
    router:
      bgp:
        autonomous-system: 65104
        enable: on
        router-id: 10.10.10.4
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:AA
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65104
            enable: on
            router-id: 10.10.10.4
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65104
            enable: on
            router-id: 10.10.10.4
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
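The segment `mac-address` and `local-id` values in the startup.yaml files determine which bonds belong to the same Ethernet segment. The following Python sketch shows how a type 3 Ethernet segment ID (ESI) is laid out according to RFC 7432: one type octet (0x03), the six-octet system MAC address, then a three-octet local discriminator. It assumes that the switch builds the ESI from these two settings in exactly this way; `type3_esi` is a hypothetical helper name, not part of NVUE.

```python
def type3_esi(es_sys_mac, local_id):
    """Build a 10-octet type 3 ESI string from a system MAC and a local ID.

    Layout (RFC 7432, ESI type 3): 0x03 | 6-octet MAC | 3-octet discriminator.
    """
    mac_bytes = bytes(int(part, 16) for part in es_sys_mac.split(":"))
    if len(mac_bytes) != 6:
        raise ValueError("system MAC must have six octets")
    if not 0 <= local_id < 2 ** 24:
        raise ValueError("local ID must fit in three octets")
    raw = bytes([0x03]) + mac_bytes + local_id.to_bytes(3, "big")
    return ":".join(f"{octet:02x}" for octet in raw)


if __name__ == "__main__":
    # bond1 on leaf01 and leaf02 uses the same MAC and local ID.
    print(type3_esi("44:38:39:BE:EF:AA", 1))
    # bond1 on leaf03 and leaf04 uses a different MAC.
    print(type3_esi("44:38:39:BE:EF:BB", 1))
```

Under this assumption, bond1 on leaf01 and bond1 on leaf02 map to the same ESI and form one Ethernet segment, while bond1 on leaf03 and leaf04 forms a separate segment because the system MAC address differs.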
cumulus@leaf01:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 10
auto bond2
iface bond2
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 20
auto bond3
iface bond3
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 30
auto vlan10
iface vlan10
address 10.1.10.2/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:b1
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.2/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:b1
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.2/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:b1
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30
bridge-vids 10 20 30
bridge-learning off
auto vlan220_l3
iface vlan220_l3
vrf RED
vlan-raw-device br_l3vni
vlan-id 220
auto vlan297_l3
iface vlan297_l3
vrf BLUE
vlan-raw-device br_l3vni
vlan-id 297
auto vxlan99
iface vxlan99
bridge-vlan-vni-map 220=4001 297=4002
bridge-vids 220 297
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 vxlan48
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
auto br_l3vni
iface br_l3vni
bridge-ports vxlan99
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
cumulus@leaf02:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.2/32
vxlan-local-tunnelip 10.10.10.2
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 10
auto bond2
iface bond2
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 20
auto bond3
iface bond3
mtu 9000
es-sys-mac 44:38:39:BE:EF:AA
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 30
auto vlan10
iface vlan10
address 10.1.10.3/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:af
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.3/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:af
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.3/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:af
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30
bridge-vids 10 20 30
bridge-learning off
auto vlan220_l3
iface vlan220_l3
vrf RED
vlan-raw-device br_l3vni
vlan-id 220
auto vlan297_l3
iface vlan297_l3
vrf BLUE
vlan-raw-device br_l3vni
vlan-id 297
auto vxlan99
iface vxlan99
bridge-vlan-vni-map 220=4001 297=4002
bridge-vids 220 297
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
auto br_l3vni
iface br_l3vni
bridge-ports vxlan99
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
cumulus@leaf03:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.3/32
vxlan-local-tunnelip 10.10.10.3
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
mtu 9000
es-sys-mac 44:38:39:BE:EF:BB
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 10
auto bond2
iface bond2
mtu 9000
es-sys-mac 44:38:39:BE:EF:BB
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 20
auto bond3
iface bond3
mtu 9000
es-sys-mac 44:38:39:BE:EF:BB
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 30
auto vlan10
iface vlan10
address 10.1.10.4/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:bb
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.4/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:bb
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.4/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:bb
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30
bridge-vids 10 20 30
bridge-learning off
auto vlan220_l3
iface vlan220_l3
vrf RED
vlan-raw-device br_l3vni
vlan-id 220
auto vlan297_l3
iface vlan297_l3
vrf BLUE
vlan-raw-device br_l3vni
vlan-id 297
auto vxlan99
iface vxlan99
bridge-vlan-vni-map 220=4001 297=4002
bridge-vids 220 297
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 vxlan48
hwaddress 44:38:39:22:01:bb
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
auto br_l3vni
iface br_l3vni
bridge-ports vxlan99
hwaddress 44:38:39:22:01:bb
bridge-vlan-aware yes
cumulus@leaf04:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.4/32
vxlan-local-tunnelip 10.10.10.4
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
mtu 9000
es-sys-mac 44:38:39:BE:EF:BB
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 10
auto bond2
iface bond2
mtu 9000
es-sys-mac 44:38:39:BE:EF:BB
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 20
auto bond3
iface bond3
mtu 9000
es-sys-mac 44:38:39:BE:EF:BB
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
bridge-access 30
auto vlan10
iface vlan10
address 10.1.10.5/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:c1
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.5/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:c1
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.5/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:c1
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30
bridge-vids 10 20 30
bridge-learning off
auto vlan220_l3
iface vlan220_l3
vrf RED
vlan-raw-device br_l3vni
vlan-id 220
auto vlan297_l3
iface vlan297_l3
vrf BLUE
vlan-raw-device br_l3vni
vlan-id 297
auto vxlan99
iface vxlan99
bridge-vlan-vni-map 220=4001 297=4002
bridge-vids 220 297
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 vxlan48
hwaddress 44:38:39:22:01:c1
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
auto br_l3vni
iface br_l3vni
bridge-ports vxlan99
hwaddress 44:38:39:22:01:c1
bridge-vlan-aware yes
cumulus@spine01:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
cumulus@spine02:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
cumulus@server01:~$ sudo cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
# The OOB network interface
auto eth0
iface eth0 inet dhcp
# The data plane network interfaces
auto eth1
iface eth1 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth1
auto eth2
iface eth2 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth2
auto uplink
iface uplink inet static
address 10.1.10.101
netmask 255.255.255.0
mtu 9000
bond-slaves eth1 eth2
bond-mode 802.3ad
bond-miimon 100
bond-lacp-rate 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
post-up ip route add 10.0.0.0/8 via 10.1.10.1
cumulus@server02:~$ sudo cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
# The OOB network interface
auto eth0
iface eth0 inet dhcp
# The data plane network interfaces
auto eth1
iface eth1 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth1
auto eth2
iface eth2 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth2
auto uplink
iface uplink inet static
address 10.1.20.102
netmask 255.255.255.0
mtu 9000
bond-slaves eth1 eth2
bond-mode 802.3ad
bond-miimon 100
bond-lacp-rate 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
post-up ip route add 10.0.0.0/8 via 10.1.20.1
cumulus@server03:~$ sudo cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
# The OOB network interface
auto eth0
iface eth0 inet dhcp
# The data plane network interfaces
auto eth1
iface eth1 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth1
auto eth2
iface eth2 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth2
auto uplink
iface uplink inet static
address 10.1.30.103
netmask 255.255.255.0
mtu 9000
bond-slaves eth1 eth2
bond-mode 802.3ad
bond-miimon 100
bond-lacp-rate 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
post-up ip route add 10.0.0.0/8 via 10.1.30.1
cumulus@server04:~$ sudo cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
# The OOB network interface
auto eth0
iface eth0 inet dhcp
# The data plane network interfaces
auto eth1
iface eth1 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth1
auto eth2
iface eth2 inet manual
# Required for Vagrant
post-up ip link set promisc on dev eth2
auto uplink
iface uplink inet static
address 10.1.10.104
netmask 255.255.255.0
mtu 9000
bond-slaves eth1 eth2
bond-mode 802.3ad
bond-miimon 100
bond-lacp-rate 1
bond-min-links 1
bond-xmit-hash-policy layer3+4
post-up ip route add 10.0.0.0/8 via 10.1.10.1
This simulation starts with the EVPN-MH with Head End Replication configuration. The demo is pre-configured using NVUE commands.
Run the vtysh show evpn es command to show the Ethernet segments across all VNIs.
Run the vtysh show bgp l2vpn evpn route type ead command to show the type-1 EAD routes.
To further validate the configuration, run the commands shown in the troubleshooting section below.
When you run the nv set vrf RED evpn vni 4001 and the nv set vrf BLUE evpn vni 4002 commands, NVUE creates the following in the /etc/network/interfaces file:
Creates a single VXLAN device (vxlan99)
Assigns two VLANs automatically from the reserved VLAN range and adds _l3 (layer 3) at the end (for example vlan220_l3 and vlan297_l3)
Maps the VLANs to the VNIs (bridge-vlan-vni-map 220=4001 297=4002)
Creates a layer 3 bridge called br_l3vni
Reserves and assigns a dedicated hardware address for the layer 3 bridge from the pool of MAC addresses available on the switch
Adds the VXLAN device to the br_l3vni bridge
Assigns vlan220_l3 to vrf RED and vlan297_l3 to vrf BLUE
cumulus@leaf01:~$ sudo cat /etc/network/interfaces
...
auto vlan220_l3
iface vlan220_l3
vrf RED
vlan-raw-device br_l3vni
vlan-id 220
auto vlan297_l3
iface vlan297_l3
vrf BLUE
vlan-raw-device br_l3vni
vlan-id 297
auto vxlan99
iface vxlan99
bridge-vlan-vni-map 220=4001 297=4002
bridge-vids 220 297
bridge-learning off
auto br_l3vni
iface br_l3vni
bridge-ports vxlan99
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
...
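As a minimal sketch of how the VLAN to VNI mapping in the excerpt above can be read programmatically, the following Python snippet parses a bridge-vlan-vni-map line into a dictionary. The function name parse_vlan_vni_map is hypothetical and is not part of Cumulus Linux or NVUE; the sample line is the one from the excerpt.

```python
def parse_vlan_vni_map(line: str) -> dict[int, int]:
    """Return {vlan_id: vni} from a 'bridge-vlan-vni-map' line."""
    tokens = line.split()
    if not tokens or tokens[0] != "bridge-vlan-vni-map":
        raise ValueError("not a bridge-vlan-vni-map line")
    mapping = {}
    for pair in tokens[1:]:
        # Each pair has the form <vlan>=<vni>, for example 220=4001.
        vlan, vni = pair.split("=")
        mapping[int(vlan)] = int(vni)
    return mapping


if __name__ == "__main__":
    print(parse_vlan_vni_map("bridge-vlan-vni-map 220=4001 297=4002"))
```

The result maps VLAN 220 to VNI 4001 and VLAN 297 to VNI 4002, which you can compare against the VNIs you set for vrf RED and vrf BLUE.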
EVPN BUM Traffic with PIM-SM
Without EVPN and PIM-SM, HER is the default way to replicate BUM traffic to remote VTEPs: for each overlay BUM packet, the ingress VTEP generates as many copies as there are remote VTEPs. In certain deployments, this is not optimal.
The following example shows an EVPN-PIM configuration, where underlay multicast distributes BUM traffic. An MDT optimizes the flow of overlay BUM traffic in the underlay network.
In the above example, host01 sends an ARP request to resolve host03. leaf01 (in addition to flooding the packet to host02) sends an encapsulated packet over the underlay network, which the spine forwards using the MDT to leaf02 and leaf03.
For PIM-SM, type-3 routes do not result in any forwarding entries. Cumulus Linux does not advertise type-3 routes for a layer 2 VNI when BUM mode for that VNI is PIM-SM.
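The difference between the two replication modes can be illustrated with a small calculation. The following Python sketch uses a deliberately simplified model, which is an assumption for illustration only: with HER the ingress VTEP sends one copy per remote VTEP, and with PIM-SM it sends a single encapsulated packet to the multicast group and the underlay replicates it along the MDT. The function name ingress_copies is hypothetical.

```python
def ingress_copies(remote_vteps: int, mode: str) -> int:
    """Copies the ingress VTEP sends into the underlay for one BUM packet.

    Simplified model (assumption): 'her' sends one copy per remote VTEP,
    'pim' sends one packet to the multicast group.
    """
    if remote_vteps < 0:
        raise ValueError("remote_vteps must not be negative")
    if mode == "her":
        return remote_vteps
    if mode == "pim":
        return 1
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for vteps in (2, 16, 64):
        print(vteps, ingress_copies(vteps, "her"), ingress_copies(vteps, "pim"))
```

Under this model, the HER cost at the ingress VTEP grows with the number of remote VTEPs, while the PIM-SM cost stays constant.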
Configure Multicast VXLAN Tunnels
To configure multicast VXLAN tunnels, you need to configure PIM-SM in the underlay:
Enable PIM-SM on the appropriate layer 3 interfaces.
In addition to the PIM-SM configuration, you need to run the following commands on each VTEP to provide the layer 2 VNI to MDT mapping.
Run the nv set nve vxlan flooding multicast-group <ip-address> command. For example:
cumulus@switch:~$ nv set nve vxlan flooding multicast-group 224.0.0.10
Edit the /etc/network/interfaces file and add vxlan-mcastgrp <ip-address> to the interface stanza. For example:
cumulus@switch:~$ sudo vi /etc/network/interfaces
...
auto vxlan10
iface vxlan10
vxlan-id 10
vxlan-mcastgrp 224.0.0.10
...
Run the ifreload -a command to load the new configuration:
cumulus@switch:~$ ifreload -a
One multicast group per layer 2 VNI is the optimal configuration for underlay bandwidth utilization. However, you can specify the same multicast group for more than one layer 2 VNI.
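To review a planned VNI to multicast group mapping before you apply it, a short script can flag groups that more than one layer 2 VNI shares. The following Python sketch is an illustration only; the function name shared_groups and the input dictionary are hypothetical, and the addresses are the ones used in the examples in this section.

```python
import ipaddress
from collections import defaultdict


def shared_groups(vni_to_group: dict[int, str]) -> dict[str, list[int]]:
    """Return the multicast groups that map to more than one VNI."""
    by_group = defaultdict(list)
    for vni, group in vni_to_group.items():
        # Reject addresses that are not in a multicast range.
        if not ipaddress.ip_address(group).is_multicast:
            raise ValueError(f"{group} is not a multicast address")
        by_group[group].append(vni)
    return {g: sorted(v) for g, v in by_group.items() if len(v) > 1}


if __name__ == "__main__":
    plan = {10: "224.0.0.10", 20: "224.0.0.20", 30: "224.0.0.10"}
    print(shared_groups(plan))
```

A shared group is a valid configuration; the report only shows where the mapping departs from one group per VNI.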
Verify EVPN-PIM
Run the net show mroute command or the vtysh show ip mroute command to review the multicast route information in FRR. When using EVPN-PIM, every VTEP acts as both source and destination for a VNI-MDT group; therefore, mroute entries on each VTEP should look like this:
cumulus@switch:~$ sudo vtysh
...
switch# show ip mroute
IP Multicast Routing Table
Flags: S - Sparse, C - Connected, P - Pruned
R - RP-bit set, F - Register flag, T - SPT-bit set
Source Group Flags Proto Input Output TTL Uptime
* 224.0.0.10 S IGMP swp54 pimreg 1 23:20:54
ipmr-lo 1
10.10.10.1 224.0.0.10 SFT PIM lo swp51 1 23:20:56
* 224.0.0.20 S IGMP swp53 pimreg 1 23:20:54
ipmr-lo 1
10.10.10.1 224.0.0.20 SFT PIM lo swp52 1 23:20:56
* 224.0.0.30 S IGMP swp51 pimreg 1 23:20:54
ipmr-lo 1
10.10.10.1 224.0.0.30 SFT PIM lo swp53 1 23:20:56
(*,G) entries should show ipmr-lo in the OIL (Outgoing Interface List) and (S,G) entries should show lo as the Source interface or incoming interface and ipmr-lo in the OIL.
Run the ip mroute command to review the multicast route information in the kernel. The kernel information should match the FRR information.
Run the bridge fdb show | grep 00:00:00:00:00:00 command to verify that all zero MAC addresses for every VXLAN device point to the correct multicast group destination.
cumulus@switch:~$ bridge fdb show | grep 00:00:00:00:00:00
00:00:00:00:00:00 dev vxlan10 dst 224.0.0.10 self permanent
00:00:00:00:00:00 dev vxlan20 dst 224.0.0.20 self permanent
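The check above can be sketched in Python as follows. The snippet assumes the bridge fdb show line layout shown in the sample output; the function names and the expected mapping are hypothetical and only serve as an illustration.

```python
def _value_after(fields: list[str], key: str):
    """Return the token that follows key, or None if it is missing."""
    if key in fields:
        index = fields.index(key)
        if index + 1 < len(fields):
            return fields[index + 1]
    return None


def flood_destinations(fdb_output: str) -> dict[str, str]:
    """Map each device to the dst of its all-zero MAC entry."""
    result = {}
    for line in fdb_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "00:00:00:00:00:00":
            continue
        dev = _value_after(fields, "dev")
        dst = _value_after(fields, "dst")
        if dev is not None and dst is not None:
            result[dev] = dst
    return result


def mismatched_devices(fdb_output: str, expected: dict[str, str]) -> list[str]:
    """Return devices whose flood destination differs from the expected group."""
    actual = flood_destinations(fdb_output)
    return sorted(dev for dev, group in expected.items() if actual.get(dev) != group)


if __name__ == "__main__":
    sample = (
        "00:00:00:00:00:00 dev vxlan10 dst 224.0.0.10 self permanent\n"
        "00:00:00:00:00:00 dev vxlan20 dst 224.0.0.20 self permanent\n"
    )
    print(mismatched_devices(sample, {"vxlan10": "224.0.0.10", "vxlan20": "224.0.0.20"}))
```

An empty list means every listed VXLAN device points to the multicast group you expect.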
The show ip mroute count command, often used to check multicast packet counts, does not update for encapsulated BUM traffic originating or terminating on the VTEPs.
Run the net show evpn vni <vni> command or the vtysh show evpn vni <vni> command to ensure that your layer 2 VNI has the correct flooding information:
cumulus@switch:~$ sudo vtysh
switch# show evpn vni 10
VNI: 10
Type: L2
Tenant VRF: default
VxLAN interface: vni10
VxLAN ifIndex: 18
Local VTEP IP: 10.10.10.1
Mcast group: 224.0.0.10 <<<<<<<
Remote VTEPs for this VNI:
10.10.10.3 flood: -
Number of MACs (local and remote) known for this VNI: 6
Number of ARPs (IPv4 and IPv6, local and remote) known for this VNI: 14
Advertise-gw-macip: No
Example Configuration
The following example shows an EVPN-PIM configuration on the VTEP, where:
PIM is on swp51 through swp54 and the loopback interface (see the example /etc/frr/frr.conf file below).
The group mapping 10.10.100.100 is for a static RP (see the top of the /etc/frr/frr.conf file example below).
Multicast group 224.0.0.10 maps to VNI10, multicast group 224.0.0.20 maps to VNI20, and multicast group 224.0.0.30 maps to VNI30 (see the example /etc/network/interfaces file below).
cumulus@leaf01:~$ sudo cat /etc/frr/frr.conf
...
ip pim rp 10.10.100.100
ip pim keep-alive-timer 3600
ip pim ecmp
service integrated-vtysh-config
vrf BLUE
vni 4002
exit-vrf
vrf RED
vni 4001
exit-vrf
vrf mgmt
ip route 0.0.0.0/0 192.168.200.1
exit-vrf
interface swp51
ip pim
interface swp52
ip pim
interface swp53
ip pim
interface swp54
ip pim
interface lo
ip igmp
ip pim
ip pim use-source 10.10.10.1
router bgp 65101
bgp router-id 10.10.10.1
neighbor underlay peer-group
neighbor underlay remote-as external
neighbor swp51 interface peer-group underlay
neighbor swp52 interface peer-group underlay
neighbor swp53 interface peer-group underlay
neighbor swp54 interface peer-group underlay
!
address-family ipv4 unicast
redistribute connected
exit-address-family
!
address-family l2vpn evpn
neighbor underlay activate
advertise-all-vni
exit-address-family
!
router bgp 65101 vrf RED
bgp router-id 10.10.10.1
!
address-family ipv4 unicast
redistribute connected
exit-address-family
!
address-family l2vpn evpn
advertise ipv4 unicast
exit-address-family
!
router bgp 65101 vrf BLUE
bgp router-id 10.10.10.1
!
address-family ipv4 unicast
redistribute connected
exit-address-family
!
address-family l2vpn evpn
advertise ipv4 unicast
exit-address-family
cumulus@leaf01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
vxlan-local-tunnelip 10.10.10.1
auto eth0
iface eth0
vrf mgmt
address 192.168.200.11/24
auto mgmt
iface mgmt
vrf-table auto
address 127.0.0.1/8
address ::1/128
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto bridge
iface bridge
bridge-ports bond1 bond2 bond3
bridge-ports vni10 vni20 vni30 vniRED vniBLUE
bridge-vids 10 20 30
bridge-vlan-aware yes
auto vni10
iface vni10
bridge-access 10
vxlan-id 10
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
bridge-learning off
bridge-arp-nd-suppress on
vxlan-mcastgrp 224.0.0.10
auto vni20
iface vni20
bridge-access 20
vxlan-id 20
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
bridge-learning off
bridge-arp-nd-suppress on
vxlan-mcastgrp 224.0.0.20
auto vni30
iface vni30
bridge-access 30
vxlan-id 30
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
bridge-learning off
bridge-arp-nd-suppress on
vxlan-mcastgrp 224.0.0.30
auto vniRED
iface vniRED
bridge-access 4001
vxlan-id 4001
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
bridge-learning off
bridge-arp-nd-suppress on
auto vniBLUE
iface vniBLUE
bridge-access 4002
vxlan-id 4002
mstpctl-portbpdufilter yes
mstpctl-bpduguard yes
bridge-learning off
bridge-arp-nd-suppress on
auto vlan10
iface vlan10
address 10.1.10.2/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
vrf RED
vlan-raw-device bridge
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.2/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
vrf RED
vlan-raw-device bridge
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.2/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
vrf BLUE
vlan-raw-device bridge
vlan-id 30
auto vlan4001
iface vlan4001
hwaddress 44:38:39:BE:EF:AA
vrf RED
vlan-raw-device bridge
vlan-id 4001
auto vlan4002
iface vlan4002
hwaddress 44:38:39:BE:EF:AA
vrf BLUE
vlan-raw-device bridge
vlan-id 4002
auto swp51
iface swp51
alias to spine
auto swp52
iface swp52
alias to spine
auto swp53
iface swp53
alias to spine
auto swp54
iface swp54
alias to spine
auto swp1
iface swp1
alias bond member of bond1
auto bond1
iface bond1
bond-slaves swp1
bridge-access 10
mtu 9000
bond-lacp-bypass-allow yes
mstpctl-bpduguard yes
mstpctl-portadminedge yes
auto swp2
iface swp2
alias bond member of bond2
auto bond2
iface bond2
bond-slaves swp2
bridge-access 20
mtu 9000
bond-lacp-bypass-allow yes
mstpctl-bpduguard yes
mstpctl-portadminedge yes
auto swp3
iface swp3
alias bond member of bond3
auto bond3
iface bond3
bond-slaves swp3
bridge-access 30
mtu 9000
bond-lacp-bypass-allow yes
mstpctl-bpduguard yes
mstpctl-portadminedge yes
Configure EVPN-PIM in VXLAN Active-active Mode
To configure EVPN-PIM with an MLAG pair in VXLAN active-active mode, enable PIM on the peer link subinterface of each MLAG peer switch (in addition to the configuration described in Configure Multicast VXLAN Tunnels, above).
Run the NVUE nv set interface <peerlink> router pim command, or run the equivalent vtysh commands. For example, in vtysh:
cumulus@switch:~$ sudo vtysh
switch# configure terminal
switch(config)# interface peerlink.4094
switch(config-if)# ip pim
switch(config-if)# end
switch# write memory
switch# exit
cumulus@switch:~$
Troubleshooting EVPN
This section provides various commands to help you examine your EVPN configuration and provides troubleshooting tips.
General Commands
You can use various NVUE or Linux commands to examine interfaces, VLAN mappings and the bridge MAC forwarding database known to the Linux kernel. You can also use these commands to examine the neighbor cache and the routing table (for the underlay or for a specific tenant VRF). Some of the key commands are:
ip [-d] link show type vxlan (Linux)
nv show bridge domain <domain> mac-table (NVUE) or bridge [-s] fdb show (Linux)
nv show bridge domain <domain> vlan (NVUE) or bridge vlan show (Linux)
ip neighbor show (Linux)
ip route show [table <vrf-name>] (Linux)
The sample output below shows ip -d link show type vxlan command output for one VXLAN interface. Relevant parameters are the VNI value, the state, the local IP address for the VXLAN tunnel, the UDP port number (4789) and the bridge of which the interface is part (bridge in the example below). The output also shows that MAC learning is off on the VXLAN interface.
cumulus@leaf01:~$ ip -d link show type vxlan
14: vni10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc noqueue master bridge state UP mode DEFAULT group default qlen 1000
link/ether 42:83:73:20:46:ba brd ff:ff:ff:ff:ff:ff promiscuity 1 minmtu 68 maxmtu 65535
vxlan id 10 local 10.0.1.1 srcport 0 0 dstport 4789 nolearning ttl 64 ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx
bridge_slave state forwarding priority 8 cost 100 hairpin off guard off root_block off fastleave off learning off flood on port_id 0x8005 port_no 0x5 designated_port 32773 designated_cost 0 designated_bridge 8000.76:ed:2a:8a:67:24 designated_root 8000.76:ed:2a:8a:67:24 hold_timer 0.00 message_age_timer 0.00 forward_delay_timer 0.00 topology_change_ack 0 config_pending 0 proxy_arp off proxy_arp_wifi off mcast_router 1 mcast_fast_leave off mcast_flood on neigh_suppress on group_fwd_mask 0x0 group_fwd_mask_str 0x0 group_fwd_maskhi 0x0 group_fwd_maskhi_str 0x0 vlan_tunnel off isolated off addrgenmode eui64 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
...
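To pull the relevant parameters out of the ip -d link show type vxlan output, you can use a short parser. The following Python sketch assumes the field order shown in the sample above (vxlan id, then local, then dstport); the function name parse_vxlan_details is hypothetical.

```python
import re

# Matches the 'vxlan id ... local ... dstport ...' detail line.
VXLAN_RE = re.compile(
    r"vxlan id (?P<vni>\d+)\s+local (?P<local>\S+).*?dstport (?P<dstport>\d+)"
)


def parse_vxlan_details(line: str) -> dict:
    """Extract the VNI, local tunnel IP, UDP port, and learning state."""
    match = VXLAN_RE.search(line)
    if match is None:
        raise ValueError("no vxlan details found")
    return {
        "vni": int(match["vni"]),
        "local": match["local"],
        "dstport": int(match["dstport"]),
        # The 'nolearning' keyword means MAC learning is off.
        "learning": "nolearning" not in line.split(),
    }


if __name__ == "__main__":
    sample = (
        "vxlan id 10 local 10.0.1.1 srcport 0 0 dstport 4789 nolearning "
        "ttl 64 ageing 300 udpcsum noudp6zerocsumtx noudp6zerocsumrx"
    )
    print(parse_vxlan_details(sample))
```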
The following shows example output for the nv show bridge domain <domain> mac-table command:
bond1 is in VLAN ID 10.
48:b0:2d:d8:33 is the host MAC address learned on bond1.
A remote VTEP that participates in VLAN ID 10 is 10.0.1.34 (the FDB entries have a MAC address of 48:b0:2d:b4:4e). BUM traffic replication uses these entries.
44:38:39:22:01 is a remote host MAC reachable over the VXLAN tunnel via VTEP 10.0.1.2.
The following example output for the net show neighbor command shows:
10.1.10.101 is a locally attached host, server01, on VLAN 10. Interface vlan10-v0 is the VRR virtual interface for VLAN 10.
10.1.10.104 is a remote host, server04, on VLAN 10. The STATE zebra shows that it is an EVPN learned entry. Use net show bridge macs to see information about which VTEP the host is behind.
If you use BGP for the underlay routing, run the vtysh show bgp summary command or the net show bgp summary command to view a summary of the layer 3 fabric connectivity:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show bgp summary
IPv4 Unicast Summary
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
BGP table version 13
RIB entries 25, using 4800 bytes of memory
Peers 5, using 106 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
spine01(swp51) 4 65199 814 805 0 0 0 00:37:34 7
spine02(swp52) 4 65199 814 805 0 0 0 00:37:34 7
spine03(swp53) 4 65199 814 805 0 0 0 00:37:34 7
spine04(swp54) 4 65199 814 805 0 0 0 00:37:34 7
leaf02(peerlink.4094) 4 65101 766 768 0 0 0 00:37:35 12
Total number of neighbors 5
show bgp ipv6 unicast summary
=============================
% No BGP neighbors found
show bgp l2vpn evpn summary
===========================
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
BGP table version 0
RIB entries 23, using 4416 bytes of memory
Peers 4, using 85 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
spine01(swp51) 4 65199 814 805 0 0 0 00:37:35 34
spine02(swp52) 4 65199 814 805 0 0 0 00:37:35 34
spine03(swp53) 4 65199 814 805 0 0 0 00:37:35 34
spine04(swp54) 4 65199 814 805 0 0 0 00:37:35 34
Total number of neighbors 4
Run the vtysh show ip route command or the net show route command to examine the underlay routing and determine how the switch reaches remote VTEPs. The following example shows output from a leaf switch:
This is the global (underlay) routing table. Use the vrf keyword to see routes for specific VRFs where the hosts reside.
cumulus@leaf01:mgmt:~$ sudo vtysh
leaf01# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route
C>* 10.0.1.1/32 is directly connected, lo, 00:40:02
B>* 10.0.1.2/32 [20/0] via fe80::2ef3:45ff:fef4:6f5f, swp53, weight 1, 00:40:04
* via fe80::ae56:f0ff:fef3:590c, swp54, weight 1, 00:40:04
* via fe80::c299:6bff:fec0:e1ca, swp52, weight 1, 00:40:04
* via fe80::f208:5fff:fe12:cc8c, swp51, weight 1, 00:40:04
B>* 10.0.1.254/32 [20/0] via fe80::2ef3:45ff:fef4:6f5f, swp53, weight 1, 00:35:18
* via fe80::ae56:f0ff:fef3:590c, swp54, weight 1, 00:35:18
* via fe80::c299:6bff:fec0:e1ca, swp52, weight 1, 00:35:18
* via fe80::f208:5fff:fe12:cc8c, swp51, weight 1, 00:35:18
C>* 10.10.10.1/32 is directly connected, lo, 00:42:58
B>* 10.10.10.2/32 [200/0] via fe80::c28a:e6ff:fe03:96d0, peerlink.4094, weight 1, 00:42:56
B>* 10.10.10.3/32 [20/0] via fe80::2ef3:45ff:fef4:6f5f, swp53, weight 1, 00:42:55
* via fe80::ae56:f0ff:fef3:590c, swp54, weight 1, 00:42:55
* via fe80::c299:6bff:fec0:e1ca, swp52, weight 1, 00:42:55
* via fe80::f208:5fff:fe12:cc8c, swp51, weight 1, 00:42:55
B>* 10.10.10.4/32 [20/0] via fe80::2ef3:45ff:fef4:6f5f, swp53, weight 1, 00:42:55
* via fe80::ae56:f0ff:fef3:590c, swp54, weight 1, 00:42:55
* via fe80::c299:6bff:fec0:e1ca, swp52, weight 1, 00:42:55
* via fe80::f208:5fff:fe12:cc8c, swp51, weight 1, 00:42:55
B>* 10.10.10.63/32 [20/0] via fe80::2ef3:45ff:fef4:6f5f, swp53, weight 1, 00:42:55
* via fe80::ae56:f0ff:fef3:590c, swp54, weight 1, 00:42:55
* via fe80::c299:6bff:fec0:e1ca, swp52, weight 1, 00:42:55
* via fe80::f208:5fff:fe12:cc8c, swp51, weight 1, 00:42:55
B>* 10.10.10.64/32 [20/0] via fe80::2ef3:45ff:fef4:6f5f, swp53, weight 1, 00:38:07
* via fe80::ae56:f0ff:fef3:590c, swp54, weight 1, 00:38:07
* via fe80::c299:6bff:fec0:e1ca, swp52, weight 1, 00:38:07
* via fe80::f208:5fff:fe12:cc8c, swp51, weight 1, 00:38:07
B>* 10.10.10.101/32 [20/0] via fe80::f208:5fff:fe12:cc8c, swp51, weight 1, 00:42:56
B>* 10.10.10.102/32 [20/0] via fe80::c299:6bff:fec0:e1ca, swp52, weight 1, 00:42:56
B>* 10.10.10.103/32 [20/0] via fe80::2ef3:45ff:fef4:6f5f, swp53, weight 1, 00:42:56
B>* 10.10.10.104/32 [20/0] via fe80::ae56:f0ff:fef3:590c, swp54, weight 1, 00:42:56
Show EVPN Address Family Peers
Run the vtysh show bgp l2vpn evpn summary command or the net show bgp l2vpn evpn summary command to see the BGP peers participating in the EVPN address family and their states. The following example output from a leaf switch shows eBGP peering with four spine switches to exchange EVPN routes; all peering sessions are in the established state.
cumulus@leaf01:mgmt:~$ sudo vtysh
leaf01# show bgp l2vpn evpn summary
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
BGP table version 0
RIB entries 23, using 4416 bytes of memory
Peers 4, using 85 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
spine01(swp51) 4 65199 958 949 0 0 0 00:44:46 34
spine02(swp52) 4 65199 958 949 0 0 0 00:44:46 34
spine03(swp53) 4 65199 958 949 0 0 0 00:44:46 34
spine04(swp54) 4 65199 958 949 0 0 0 00:44:46 34
Total number of neighbors 4
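To check the summary output in a script, you can read the last column of each neighbor row: FRR prints the number of received prefixes when the session is established and a state name (such as Idle or Active) otherwise. The following Python sketch assumes the ten-column layout shown above; the function name peers_not_established is hypothetical.

```python
def peers_not_established(summary: str) -> list[str]:
    """Return the neighbors whose State/PfxRcd column is not a prefix count."""
    down = []
    in_table = False
    for line in summary.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Neighbor":
            in_table = True  # header row, neighbor rows follow
            continue
        if fields[0] == "Total":
            in_table = False  # 'Total number of neighbors' ends the table
            continue
        if in_table and len(fields) >= 10:
            state = " ".join(fields[9:])
            if not state.isdigit():
                down.append(fields[0])
    return down


if __name__ == "__main__":
    sample = """\
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
spine01(swp51) 4 65199 958 949 0 0 0 00:44:46 34
spine02(swp52) 4 65199 958 949 0 0 0 00:44:46 34
Total number of neighbors 2
"""
    print(peers_not_established(sample))
```

An empty list corresponds to the case described above, where all peering sessions are in the established state.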
Show EVPN VNIs
To display the configured VNIs on a network device participating in BGP EVPN, run the vtysh show bgp l2vpn evpn vni command. This command is only relevant on a VTEP. For symmetric routing, this command displays the special layer 3 VNIs for each tenant VRF.
cumulus@leaf01:mgmt:~$ sudo vtysh
leaf01# show bgp l2vpn evpn vni
Advertise Gateway Macip: Disabled
Advertise SVI Macip: Disabled
Advertise All VNI flag: Enabled
BUM flooding: Head-end replication
Number of L2 VNIs: 3
Number of L3 VNIs: 2
Flags: * - Kernel
VNI Type RD Import RT Export RT Tenant VRF
* 20 L2 10.10.10.1:4 65101:20 65101:20 RED
* 30 L2 10.10.10.1:6 65101:30 65101:30 BLUE
* 10 L2 10.10.10.1:3 65101:10 65101:10 RED
* 4002 L3 10.1.30.2:2 65101:4002 65101:4002 BLUE
* 4001 L3 10.1.20.2:5 65101:4001 65101:4001 RED
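In the sample output above, every import and export route target has the form <local AS>:<VNI>. As a minimal sketch, the following Python snippet parses the VNI table and reports VNIs whose route targets do not follow that pattern. The function names are hypothetical, and the parser assumes the column layout shown above.

```python
def parse_vni_table(output: str) -> list[dict]:
    """Parse the rows of the 'show bgp l2vpn evpn vni' table."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == "*":
            fields = fields[1:]  # drop the kernel flag
        if len(fields) != 6 or not fields[0].isdigit() or fields[1] not in ("L2", "L3"):
            continue
        vni, vni_type, rd, import_rt, export_rt, vrf = fields
        rows.append({
            "vni": int(vni), "type": vni_type, "rd": rd,
            "import_rt": import_rt, "export_rt": export_rt, "vrf": vrf,
        })
    return rows


def rts_not_matching(rows: list[dict], local_as: int) -> list[int]:
    """Return VNIs whose route targets differ from <local_as>:<vni>."""
    bad = []
    for row in rows:
        expected = f"{local_as}:{row['vni']}"
        if row["import_rt"] != expected or row["export_rt"] != expected:
            bad.append(row["vni"])
    return bad


if __name__ == "__main__":
    sample = """\
Flags: * - Kernel
VNI Type RD Import RT Export RT Tenant VRF
* 20 L2 10.10.10.1:4 65101:20 65101:20 RED
* 4001 L3 10.1.20.2:5 65101:4001 65101:4001 RED
"""
    print(rts_not_matching(parse_vni_table(sample), 65101))
```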
Run the vtysh show evpn vni command to see a summary of all VNIs and the number of MAC or ARP entries associated with each VNI.
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show evpn vni
VNI Type VxLAN IF # MACs # ARPs # Remote VTEPs Tenant VRF
20 L2 vni20 8 5 1 RED
30 L2 vni30 8 4 1 BLUE
10 L2 vni10 8 6 1 RED
4001 L3 vniRED 1 1 n/a RED
4002 L3 vniBLUE 0 0 n/a BLUE
You can also show the above information with the NVUE nv show evpn vni and nv show vrf <vrf> evpn vni commands.
Run the NVUE nv show evpn vni <vni> command or the vtysh show evpn vni <vni> command to examine EVPN information for a specific VNI in detail. The following example output shows details for the layer 2 VNI 10. The output shows the remote VTEPs that contain that VNI.
cumulus@leaf01:mgmt:~$ nv show evpn vni 10
operational applied
----------------- ----------- -------
route-advertise
default-gateway off
svi-ip off
bridge-domain br_default
host-count 3
local-vtep 10.10.10.1
mac-count 7
remote-vtep-count 2
tenant-vrf RED
vlan 10
vxlan-interface vxlan48
remote-vtep
==============
flood
--------- -----
10.0.1.12 HER
10.0.1.34 HER
To show VNI BGP information run the NVUE nv show evpn vni <id> bgp-info and nv show vrf <vrf_id> evpn bgp-info commands, or the vtysh show bgp l2vpn evpn vni <vni> command.
Run the NVUE nv show evpn vni <vni> mac command or the vtysh show evpn mac vni <vni> command to examine all local and remote MAC addresses for a VNI. This command is only relevant for a layer 2 VNI:
cumulus@leaf01:mgmt:~$ nv show evpn vni 10 mac
LocMobSeq - local mobility sequence, RemMobSeq - remote mobility sequence,
RemoteVtep - Remote Vtep address, Esi - Remote Esi
MAC address Type State LocMobSeq RemMobSeq Interface RemoteVtep Esi
----------------- ------ ----- --------- --------- --------- ---------- ---
44:38:39:22:01:7a local 0 0 vlan10
44:38:39:22:01:8a remote 0 0 10.0.1.34
44:38:39:22:01:84 remote 0 0 10.0.1.34
48:b0:2d:0c:a9:f4 remote 0 0 10.0.1.12
48:b0:2d:3a:a3:38 local 1408 1407 bond1
48:b0:2d:d2:ac:68 remote 0 0 10.0.1.34
48:b0:2d:eb:26:6e remote 1 0 10.0.1.34
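To summarize where the MAC addresses for a VNI reside, you can group the table rows by their last column. The following Python sketch assumes rows like the ones in the sample above, where the last column is either the local interface or the remote VTEP; the function name macs_by_location is hypothetical.

```python
import re
from collections import defaultdict

MAC_RE = re.compile(r"^(?:[0-9a-f]{2}:){5}[0-9a-f]{2}$", re.IGNORECASE)


def macs_by_location(table: str) -> dict[str, list[str]]:
    """Group MAC addresses by local interface or remote VTEP."""
    groups = defaultdict(list)
    for line in table.splitlines():
        fields = line.split()
        # Skip the legend, header, and separator lines.
        if len(fields) < 3 or not MAC_RE.match(fields[0]):
            continue
        if fields[1] not in ("local", "remote"):
            continue
        groups[fields[-1]].append(fields[0])
    return dict(groups)


if __name__ == "__main__":
    sample = """\
44:38:39:22:01:7a local 0 0 vlan10
44:38:39:22:01:8a remote 0 0 10.0.1.34
48:b0:2d:0c:a9:f4 remote 0 0 10.0.1.12
48:b0:2d:3a:a3:38 local 1408 1407 bond1
"""
    for location, macs in macs_by_location(sample).items():
        print(location, len(macs))
```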
Run the vtysh show evpn mac vni all command or the net show evpn mac vni all command to examine MAC addresses for all VNIs.
You can examine the details for a specific MAC address or query all remote MAC addresses behind a specific VTEP:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show evpn mac vni 10 mac 94:8e:1c:0d:77:93
MAC: 94:8e:1c:0d:77:93
Remote VTEP: 10.0.1.2
Sync-info: neigh#: 0
Local Seq: 0 Remote Seq: 0
Neighbors:
No Neighbors
leaf01# show evpn mac vni 20 vtep 10.0.1.2
VNI 20
MAC Type Flags Intf/Remote ES/VTEP VLAN Seq #'s
12:15:9a:9c:f2:e1 remote 10.0.1.2 1/0
50:88:b2:3c:08:f9 remote 10.0.1.2 0/0
f8:4f:db:ef:be:8b remote 10.0.1.2 0/0
c8:7d:bc:96:71:f3 remote 10.0.1.2 0/0
Examine Local and Remote Neighbors for a VNI
Run the vtysh show evpn arp-cache vni <vni> command or the net show evpn arp-cache vni <vni> command to examine all local and remote neighbors (ARP entries) for a VNI. This command is only relevant for a layer 2 VNI and the output shows both IPv4 and IPv6 neighbor entries:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show evpn arp-cache vni 10
Number of ARPs (local and remote) known for this VNI: 6
Flags: I=local-inactive, P=peer-active, X=peer-proxy
Neighbor Type Flags State MAC Remote ES/VTEP Seq #'s
10.1.10.2 local active 76:ed:2a:8a:67:24 0/0
fe80::968e:1cff:fe0d:7793 remote active 68:0f:31:ae:3d:7a 10.0.1.2 0/0
10.1.10.101 local active 26:76:e6:93:32:78 0/0
fe80::9465:45ff:fe6d:4890 local active 26:76:e6:93:32:78 0/0
10.1.10.104 remote active 68:0f:31:ae:3d:7a 10.0.1.2 0/0
fe80::74ed:2aff:fe8a:6724 local active 76:ed:2a:8a:67:24 0/0
...
Run the vtysh show evpn arp-cache vni all command or the net show evpn arp-cache vni all command to examine neighbor entries for all VNIs.
Examine Remote Router MAC Addresses
To examine the router MAC addresses corresponding to all remote VTEPs for symmetric routing, run the NVUE nv show vrf <vrf> evpn remote-router-mac command or the vtysh show evpn rmac vni all command. This command is only relevant for a layer 3 VNI:
cumulus@border01:mgmt:~$ nv show vrf RED evpn remote-router-mac
MAC address remote-vtep
----------------- -----------
44:38:39:22:01:7a 10.10.10.1
44:38:39:22:01:7c 10.10.10.64
44:38:39:22:01:8a 10.10.10.4
44:38:39:22:01:78 10.10.10.2
44:38:39:22:01:84 10.10.10.3
44:38:39:be:ef:aa 10.0.1.12
Examine Gateway Next Hops
To examine the gateway next hops for symmetric routing, run the NVUE nv show vrf <vrf> evpn nexthop-vtep command or the vtysh show evpn next-hops vni all command. This command is only relevant for a layer 3 VNI. The gateway next hop IP addresses correspond to the remote VTEP IP addresses. Cumulus Linux installs the remote host and prefix routes using these next hops.
To show the VTEP IP addresses for the next hop groups, run the nv show evpn l2-nhg vtep-ip command.
Show Access VLANs
To show access VLANs on the switch and their corresponding VNI, run the NVUE nv show evpn access-vlan-info vlan command or the vtysh show evpn access-vlan command.
You can drill down and show information about a specific VLAN with the nv show evpn access-vlan-info vlan <vlan> command.
Show the VRF Routing Table in FRR
Run the vtysh show ip route vrf <vrf-name> command or the net show route vrf <vrf-name> command to examine the VRF routing table. Use this command for symmetric routing to verify that remote host and prefix routes are in the VRF routing table and point to the appropriate gateway next hop.
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ip route vrf RED
show ip route vrf RED
======================
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route
VRF RED:
K>* 0.0.0.0/0 [255/8192] unreachable (ICMP unreachable), 00:53:46
C * 10.1.10.0/24 [0/1024] is directly connected, vlan10-v0, 00:53:46
C>* 10.1.10.0/24 is directly connected, vlan10, 00:53:46
B>* 10.1.10.104/32 [20/0] via 10.0.1.2, vlan4001 onlink, weight 1, 00:43:55
C * 10.1.20.0/24 [0/1024] is directly connected, vlan20-v0, 00:53:46
C>* 10.1.20.0/24 is directly connected, vlan20, 00:53:46
B>* 10.1.20.105/32 [20/0] via 10.0.1.2, vlan4001 onlink, weight 1, 00:20:07
...
In the output above, EVPN specifies the next hops for these routes to be onlink, or reachable over the specified SVI. This is necessary because this interface does not need to have an IP address. Even if the interface has an IP address, the next hop is not on the same subnet because it is typically the IP address of the remote VTEP (part of the underlay IP network).
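The reasoning behind onlink can be illustrated with a subnet check. The following Python sketch tests whether a next hop falls inside any subnet configured on the outgoing interface; if it does not, the route needs the onlink attribute. The function name needs_onlink is hypothetical, and the addresses are taken from the examples in this section.

```python
import ipaddress


def needs_onlink(next_hop: str, interface_subnets: list[str]) -> bool:
    """True when next_hop is outside every subnet configured on the interface."""
    hop = ipaddress.ip_address(next_hop)
    return not any(
        hop in ipaddress.ip_network(subnet, strict=False)
        for subnet in interface_subnets
    )


if __name__ == "__main__":
    # The layer 3 VNI SVI has no IP address, so no subnet contains the VTEP.
    print(needs_onlink("10.0.1.2", []))
    # Even an SVI with an address is on a different subnet than the VTEP.
    print(needs_onlink("10.0.1.2", ["10.1.10.2/24"]))
```

Both calls return True, which matches the two cases described above.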
Show the Global BGP EVPN Routing Table
Run the vtysh show bgp l2vpn evpn route command or the net show bgp l2vpn evpn route command to display all EVPN routes, both local and remote. The output groups the routes by RD because they span VNIs and VRFs:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show bgp l2vpn evpn route
BGP table version is 6, local router ID is 10.10.10.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
Extended Community
Route Distinguisher: 10.10.10.1:3
*> [2]:[0]:[48]:[00:60:08:69:97:ef]
10.0.1.1 32768 i
ET:8 RT:65101:10 RT:65101:4001 Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[26:76:e6:93:32:78]
10.0.1.1 32768 i
ET:8 RT:65101:10 RT:65101:4001 Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[26:76:e6:93:32:78]:[32]:[10.1.10.101]
10.0.1.1 32768 i
ET:8 RT:65101:10 RT:65101:4001 Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[26:76:e6:93:32:78]:[128]:[fe80::9465:45ff:fe6d:4890]
10.0.1.1 32768 i
ET:8 RT:65101:10
*> [2]:[0]:[48]:[c0:8a:e6:03:96:d0]
10.0.1.1 32768 i
ET:8 RT:65101:10 RT:65101:4001 MM:0, sticky MAC Rmac:44:38:39:be:ef:aa
*> [3]:[0]:[32]:[10.0.1.1]
10.0.1.1 32768 i
ET:8 RT:65101:10
Route Distinguisher: 10.10.10.1:4
*> [2]:[0]:[48]:[c0:8a:e6:03:96:d0]
10.0.1.1 32768 i
ET:8 RT:65101:20 RT:65101:4001 MM:0, sticky MAC Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[cc:6e:fa:8d:ff:92]
10.0.1.1 32768 i
ET:8 RT:65101:20 RT:65101:4001 Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[f0:9d:d0:59:60:5d]
10.0.1.1 32768 i
ET:8 RT:65101:20 RT:65101:4001 Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[f0:9d:d0:59:60:5d]:[128]:[fe80::ce6e:faff:fe8d:ff92]
10.0.1.1 32768 i
ET:8 RT:65101:20
*> [3]:[0]:[32]:[10.0.1.1]
10.0.1.1 32768 i
ET:8 RT:65101:20
Route Distinguisher: 10.10.10.1:6
*> [2]:[0]:[48]:[c0:8a:e6:03:96:d0]
10.0.1.1 32768 i
ET:8 RT:65101:30 RT:65101:4002 MM:0, sticky MAC Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[de:02:3b:17:c9:6d]
10.0.1.1 32768 i
ET:8 RT:65101:30 RT:65101:4002 Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[de:02:3b:17:c9:6d]:[128]:[fe80::dc02:3bff:fe17:c96d]
10.0.1.1 32768 i
ET:8 RT:65101:30
*> [2]:[0]:[48]:[ea:77:bb:f1:a7:ca]
10.0.1.1 32768 i
ET:8 RT:65101:30 RT:65101:4002 Rmac:44:38:39:be:ef:aa
*> [3]:[0]:[32]:[10.0.1.1]
10.0.1.1 32768 i
ET:8 RT:65101:30
Route Distinguisher: 10.10.10.3:3
*> [2]:[0]:[48]:[12:15:9a:9c:f2:e1]
10.0.1.2 0 65199 65102 i
RT:65102:20 RT:65102:4001 ET:8 Rmac:44:38:39:be:ef:bb
* [2]:[0]:[48]:[12:15:9a:9c:f2:e1]
10.0.1.2 0 65199 65102 i
RT:65102:20 RT:65102:4001 ET:8 Rmac:44:38:39:be:ef:bb
...
You can filter the routing table based on EVPN route type. The available options are:
ead: EAD (Type-1) route
es: Ethernet Segment (Type-4) route
macip: MAC-IP (Type-2) route
multicast: Multicast (Type-3) route
prefix: IPv4 or IPv6 prefix (Type-5) route
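When you post-process this output in a script, the bracketed type-2 prefix can be split into the fields named in the legend ([2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]). The following is a minimal Python sketch; the function name `parse_type2_prefix` and the dictionary keys are hypothetical and are not part of any Cumulus Linux or FRR tool.

```python
import re
from typing import Dict

# Field names follow the type-2 legend in the command output.
TYPE2_FIELDS = ("route_type", "eth_tag", "mac_len", "mac", "ip_len", "ip")


def parse_type2_prefix(prefix: str) -> Dict[str, str]:
    """Split a type-2 (MAC-IP) prefix string into named fields."""
    values = re.findall(r"\[([^\]]*)\]", prefix)
    if not values or values[0] != "2":
        raise ValueError(f"not a type-2 prefix: {prefix!r}")
    if len(values) not in (4, 6):
        raise ValueError(f"unexpected number of fields in {prefix!r}")
    # zip stops at the shorter sequence, so MAC-only routes have no IP keys.
    return dict(zip(TYPE2_FIELDS, values))


route = parse_type2_prefix("[2]:[0]:[48]:[26:76:e6:93:32:78]:[32]:[10.1.10.101]")
print(route["mac"], route["ip"])
```

A MAC-only route such as [2]:[0]:[48]:[00:60:08:69:97:ef] yields a dictionary without the `ip_len` and `ip` keys.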
Show a Specific EVPN Route
To drill down on a specific route for more information, run the vtysh show bgp l2vpn evpn route rd <rd-value> command or the net show bgp l2vpn evpn route rd <rd-value> command. This command displays all EVPN routes with that RD, together with the path attribute details for each path. You can filter the output further by route type or by specifying the MAC address, the IP address, or both.
The following example shows the specific MAC/IP route of server05. The output shows that this remote host is behind VTEP 10.10.10.3 and is reachable through four paths, one through each spine switch. This example is from a symmetric routing configuration, so the route shows both the layer 2 VNI (20) and the layer 3 VNI (4001), the EVPN route target attributes that correspond to each VNI, and the associated router MAC address.
cumulus@leaf01:mgmt:~$ sudo vtysh
leaf01# show bgp l2vpn evpn route rd 10.10.10.3:3 mac 12:15:9a:9c:f2:e1 ip 10.1.20.105
BGP routing table entry for 10.10.10.3:3:[2]:[0]:[48]:[12:15:9a:9c:f2:e1]:[32]:[10.1.20.105]
Paths: (4 available, best #1)
Advertised to non peer-group peers:
spine01(swp51) spine02(swp52) spine03(swp53) spine04(swp54)
Route [2]:[0]:[48]:[12:15:9a:9c:f2:e1]:[32]:[10.1.20.105] VNI 20/4001
65199 65102
10.0.1.2 from spine01(swp51) (10.10.10.101)
Origin IGP, valid, external, bestpath-from-AS 65199, best (Router ID)
Extended Community: RT:65102:20 RT:65102:4001 ET:8 Rmac:44:38:39:be:ef:bb
Last update: Fri Jan 15 08:16:24 2021
Route [2]:[0]:[48]:[12:15:9a:9c:f2:e1]:[32]:[10.1.20.105] VNI 20/4001
65199 65102
10.0.1.2 from spine04(swp54) (10.10.10.104)
Origin IGP, valid, external
Extended Community: RT:65102:20 RT:65102:4001 ET:8 Rmac:44:38:39:be:ef:bb
Last update: Fri Jan 15 08:16:24 2021
Route [2]:[0]:[48]:[12:15:9a:9c:f2:e1]:[32]:[10.1.20.105] VNI 20/4001
65199 65102
10.0.1.2 from spine02(swp52) (10.10.10.102)
Origin IGP, valid, external
Extended Community: RT:65102:20 RT:65102:4001 ET:8 Rmac:44:38:39:be:ef:bb
Last update: Fri Jan 15 08:16:24 2021
Route [2]:[0]:[48]:[12:15:9a:9c:f2:e1]:[32]:[10.1.20.105] VNI 20/4001
65199 65102
10.0.1.2 from spine03(swp53) (10.10.10.103)
Origin IGP, valid, external
Extended Community: RT:65102:20 RT:65102:4001 ET:8 Rmac:44:38:39:be:ef:bb
Last update: Fri Jan 15 08:16:24 2021
Displayed 4 paths for requested prefix
Only use global VNIs. Even though the switch exchanges VNI values in the type-2 and type-5 routes, Cumulus Linux does not use the received values when installing the routes into the forwarding plane but uses the local configuration instead. Ensure that the VLAN to VNI mappings and the layer 3 VNI assignment for a tenant VRF are the same throughout the network.
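One way to check the consistency requirement above is to compare the VLAN to VNI mappings collected from each switch. The following Python sketch assumes you have already gathered the mappings into a dictionary; the function name `find_vni_mismatches` and the sample data are hypothetical.

```python
from typing import Dict


def find_vni_mismatches(
    mappings: Dict[str, Dict[int, int]]
) -> Dict[int, Dict[str, int]]:
    """Return {vlan: {switch: vni}} for every VLAN mapped to more than one VNI."""
    by_vlan: Dict[int, Dict[str, int]] = {}
    for switch, vlan_to_vni in mappings.items():
        for vlan, vni in vlan_to_vni.items():
            by_vlan.setdefault(vlan, {})[switch] = vni
    return {
        vlan: per_switch
        for vlan, per_switch in by_vlan.items()
        if len(set(per_switch.values())) > 1
    }


# Hypothetical inventory: leaf03 maps VLAN 20 to a different VNI.
inventory = {
    "leaf01": {10: 10, 20: 20},
    "leaf02": {10: 10, 20: 20},
    "leaf03": {10: 10, 20: 2000},
}
print(find_vni_mismatches(inventory))
```

An empty result means that every VLAN maps to the same VNI on all switches in the inventory. The same comparison applies to the layer 3 VNI of each tenant VRF.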
If the remote host is dual attached, the next hop for the EVPN route is the anycast IP address of the remote MLAG pair when MLAG is active.
Show the VNI EVPN Routing Table
The switch maintains the received EVPN routes in the global EVPN routing table (described above), even if there are no appropriate local VNIs to import them into. For example, a spine maintains the global EVPN routing table even though no VNIs are present on it. When local VNIs are present, the switch imports received EVPN routes into the per-VNI routing tables according to the route target attributes. You can examine the per-VNI routing table with the vtysh show bgp vni <vni> command:
leaf01# show bgp vni 10
BGP table version is 351, local router ID is 10.10.10.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]:[Frag-id]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
*> [2]:[0]:[48]:[44:38:39:00:00:32]:[32]:[10.1.10.101]
10.0.1.12 (leaf01)
32768 i
ET:8 RT:65101:10 RT:65101:4001 Rmac:44:38:39:be:ef:aa
*> [2]:[0]:[48]:[44:38:39:00:00:32]:[128]:[fe80::4638:39ff:fe00:32]
10.0.1.12 (leaf01)
32768 i
ET:8 RT:65101:10
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (leaf02)
0 65102 65199 65104 i
RT:65104:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (leaf02)
0 65102 65199 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine02)
0 65199 65104 i
RT:65104:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine02)
0 65199 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine04)
0 65199 65104 i
RT:65104:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine04)
0 65199 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine03)
0 65199 65104 i
RT:65104:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine03)
0 65199 65103 i
RT:65103:10 ET:8
* [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine01)
0 65199 65104 i
RT:65104:10 ET:8
*> [2]:[0]:[48]:[44:38:39:00:00:3e]:[128]:[fe80::4638:39ff:fe00:3e]
10.0.1.34 (spine01)
0 65199 65103 i
RT:65103:10 ET:8
...
To display the VNI routing table for all VNIs, run the vtysh show bgp l2vpn evpn route vni all command.
To view the EVPN RIB with NVUE, run the nv show vrf <vrf> router bgp address-family l2vpn-evpn loc-rib rd <rd> route-type <type> route command.
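The import step described in this section can be sketched in Python as follows. This is a simplified model, not the FRR implementation: it assumes that route targets use the ASN:VNI form seen in the output and that a route is imported when the VNI portion of any route target matches a local VNI. The function name `import_into_vnis` is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def import_into_vnis(
    routes: Iterable[Tuple[str, List[str]]], local_vnis: Set[int]
) -> Dict[int, List[str]]:
    """Place each route prefix into the table of every local VNI its RTs match."""
    tables: Dict[int, List[str]] = {vni: [] for vni in local_vnis}
    for prefix, route_targets in routes:
        for rt in route_targets:
            # Assumption: the value after the last colon is the VNI.
            vni = int(rt.rsplit(":", 1)[1])
            if vni in tables and prefix not in tables[vni]:
                tables[vni].append(prefix)
    return tables


received = [
    ("[2]:[0]:[48]:[12:15:9a:9c:f2:e1]", ["65102:20", "65102:4001"]),
    ("[3]:[0]:[32]:[10.0.1.2]", ["65102:10"]),
]

# A leaf with local VNIs 10 and 20 imports both routes.
print(import_into_vnis(received, {10, 20}))

# A spine has no local VNIs, so no per-VNI tables exist.
print(import_into_vnis(received, set()))
```

The second call returns an empty dictionary, which matches the spine case: the routes remain in the global table only.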
Show the VRF BGP Routing Table
For symmetric routing, the switch imports received type-2 and type-5 routes into the VRF routing table (according to address family: IPv4 unicast or IPv6 unicast) based on a match on the route target attributes. To examine the BGP VRF routing table, run the vtysh show bgp vrf <vrf-name> ipv4 unicast command or the vtysh show bgp vrf <vrf-name> ipv6 unicast command. You can also run the net show bgp vrf <vrf-name> ipv4 unicast command or the net show bgp vrf <vrf-name> ipv6 unicast command.
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show bgp vrf RED ipv4 unicast
BGP table version is 2, local router ID is 10.1.20.2, vrf id 24
Default local pref 100, local AS 65101
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
* 10.1.10.104/32 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
*> 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.1.20.105/32 10.0.1.2< 0 65199 65102 i
*> 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
* 10.0.1.2< 0 65199 65102 i
Displayed 2 routes and 16 total paths
Support for EVPN Neighbor Discovery (ND) Extended Community
In an EVPN VXLAN deployment with ARP and ND suppression, where you configure the VTEPs for layer 2 only, EVPN needs to carry additional information for the attached devices so that proxy ND can provide the correct information to attached hosts. Without this information, hosts cannot configure their default routers or they lose their existing default router information. Cumulus Linux supports the EVPN Neighbor Discovery (ND) Extended Community with a type field value of 0x06, a subtype field value of 0x08 (ND Extended Community), and a router flag. The router flag enables the switch to determine if a particular IPv6-MAC pair belongs to a host or a router.
The following configurations use the router flag (R-bit):
Centralized VXLAN routing with a gateway router.
A layer 2 switch with ARP and ND suppression.
When the MAC/IP (type-2) route contains the IPv6-MAC pair with the R-bit set, the route belongs to a router. If the R-bit is zero, the route belongs to a host. If the router is in a local LAN segment, the switch implementing the proxy ND function learns this information by snooping on neighbor advertisement messages for the associated IPv6 address. Other EVPN peers exchange this information by using the ND extended community in BGP updates.
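The host or router decision can be sketched as a small Python decoder. The type and subtype values come from the text above; the 8-octet layout, the position of the flags octet (third octet), and the R-bit mask (0x01, the least significant bit) are assumptions based on the standard BGP extended community encoding and should be verified against the ND extended community specification. The function name `is_router` is hypothetical.

```python
EVPN_TYPE = 0x06      # type field value from the text above
ND_SUBTYPE = 0x08     # subtype field value from the text above
ROUTER_FLAG = 0x01    # assumption: R-bit is the least significant flag bit


def is_router(community: bytes) -> bool:
    """Return True if the ND extended community has the R-bit set."""
    if len(community) != 8:
        raise ValueError("an extended community is expected to be 8 octets")
    if community[0] != EVPN_TYPE or community[1] != ND_SUBTYPE:
        raise ValueError("not an EVPN ND extended community")
    # Assumption: the flags octet directly follows the type and subtype.
    return bool(community[2] & ROUTER_FLAG)


print(is_router(bytes([0x06, 0x08, 0x01, 0, 0, 0, 0, 0])))  # R-bit set
print(is_router(bytes([0x06, 0x08, 0x00, 0, 0, 0, 0, 0])))  # R-bit clear
```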
To display the EVPN ARP cache entry for an address and check whether an IPv6-MAC entry belongs to a router, run the vtysh show evpn arp-cache vni <vni> ip <address> command or the net show evpn arp-cache vni <vni> ip <address> command. For example:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show evpn arp-cache vni 20 ip 10.1.20.105
IP: 10.1.20.105
Type: remote
State: active
MAC: 12:15:9a:9c:f2:e1
Sync-info: -
Remote VTEP: 10.0.1.2
Local Seq: 0 Remote Seq: 0
Examine MAC Moves
The first time a MAC moves from behind one VTEP to behind another, BGP associates a MAC Mobility (MM) extended community attribute with a sequence number of 1 with the type-2 route for that MAC. After that, each time the MAC moves to a new VTEP, the MM sequence number increments by 1. You can examine the MM sequence number associated with a MAC's type-2 route with the vtysh show bgp l2vpn evpn route vni <vni> mac <mac> command or the net show bgp l2vpn evpn route vni <vni> mac <mac> command. The example output below shows the type-2 route for a MAC that has moved three times:
cumulus@switch:~$ sudo vtysh
...
switch# show bgp l2vpn evpn route vni 10109 mac 00:02:22:22:22:02
BGP routing table entry for [2]:[0]:[0]:[48]:[00:02:22:22:22:02]
Paths: (1 available, best #1)
Not advertised to any peer
Route [2]:[0]:[0]:[48]:[00:02:22:22:22:02] VNI 10109
Local
6.0.0.184 from 0.0.0.0 (6.0.0.184)
Origin IGP, localpref 100, weight 32768, valid, sourced, local, bestpath-from-AS Local, best
Extended Community: RT:650184:10109 ET:8 MM:3
AddPath ID: RX 0, TX 10350121
Last update: Tue Feb 14 18:40:37 2017
Displayed 1 paths for requested prefix
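The sequence number behavior described in this section can be modeled with a short Python sketch. This is only an illustration of the counting rule, not how BGP or zebra track MACs; the class name `MacMobilityTracker` and the VTEP names are hypothetical.

```python
from typing import Dict, Tuple


class MacMobilityTracker:
    """Toy model of the MM sequence number for each MAC address."""

    def __init__(self) -> None:
        # mac -> (current VTEP, MM sequence number)
        self._state: Dict[str, Tuple[str, int]] = {}

    def learn(self, mac: str, vtep: str) -> int:
        """Record that mac is behind vtep and return its sequence number."""
        if mac not in self._state:
            # First learn: the MAC has not moved yet.
            self._state[mac] = (vtep, 0)
            return 0
        current_vtep, sequence = self._state[mac]
        if vtep != current_vtep:
            # Each move to a different VTEP increments the sequence by 1.
            sequence += 1
            self._state[mac] = (vtep, sequence)
        return sequence


tracker = MacMobilityTracker()
for vtep in ("vtep-a", "vtep-b", "vtep-a", "vtep-c"):
    sequence = tracker.learn("00:02:22:22:22:02", vtep)
print(sequence)
```

After the initial learn and three moves, the model reports a sequence number of 3, which corresponds to the MM:3 value in the example output.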
Examine Static MAC Addresses
You can identify static or sticky MACs in EVPN by the presence of MM:0, sticky MAC in the Extended Community line of the output from the vtysh show bgp l2vpn evpn route vni <vni> mac <mac> command or the net show bgp l2vpn evpn route vni <vni> mac <mac> command.
cumulus@switch:~$ sudo vtysh
...
switch# show bgp l2vpn evpn route vni 10101 mac 00:02:00:00:00:01
BGP routing table entry for [2]:[0]:[0]:[48]:[00:02:00:00:00:01]
Paths: (1 available, best #1)
Not advertised to any peer
Route [2]:[0]:[0]:[48]:[00:02:00:00:00:01] VNI 10101
Local
172.16.130.18 from 0.0.0.0 (172.16.130.18)
Origin IGP, localpref 100, weight 32768, valid, sourced, local, bestpath-from-AS Local, best
Extended Community: ET:8 RT:60176:10101 MM:0, sticky MAC
AddPath ID: RX 0, TX 46
Last update: Tue Apr 11 21:44:02 2017
Displayed 1 paths for requested prefix
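If you collect this output in a script, you can extract the MM sequence number and the sticky marker from the Extended Community line. The following Python sketch assumes the line format shown in the examples; the function name `parse_mobility` is hypothetical.

```python
import re
from typing import Optional, Tuple


def parse_mobility(ext_community_line: str) -> Tuple[Optional[int], bool]:
    """Return (MM sequence number or None, True if the MAC is sticky)."""
    match = re.search(r"\bMM:(\d+)", ext_community_line)
    sequence = int(match.group(1)) if match else None
    sticky = "sticky MAC" in ext_community_line
    return sequence, sticky


print(parse_mobility("Extended Community: RT:650184:10109 ET:8 MM:3"))
print(parse_mobility("Extended Community: ET:8 RT:60176:10101 MM:0, sticky MAC"))
```

A line without an MM attribute, such as one for a MAC that has never moved, returns None for the sequence number.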
Enable FRR Debug Logs
To troubleshoot EVPN, enable FRR debug logs. The relevant debug options are:
debug zebra vxlan: Traces VNI addition and deletion (local and remote) as well as MAC and neighbor addition and deletion (local and remote).
debug zebra kernel: Traces actual netlink messages exchanged with the kernel, which includes everything, not just EVPN.
debug bgp updates: Traces BGP update exchanges, including all updates. The output also shows EVPN specific information.
debug bgp zebra: Traces interactions between BGP and zebra for EVPN (and other) routes.
ICMP echo Replies and the ping Command
When you run the ping -I command and specify an interface, you do not receive an ICMP echo reply. However, when you run the ping command without the -I option, the switch receives echo replies as expected.
ping -I command example:
cumulus@switch:default:~:# ping -I swp1.5 10.0.10.1
PING 10.0.10.1 (10.0.10.1) from 10.0.0.2 swp1.5: 56(84) bytes of data.
ping command example:
cumulus@switch:default:~:# ping 10.0.10.1
PING 10.0.10.1 (10.0.10.1) 56(84) bytes of data.
64 bytes from 10.0.10.1: icmp_req=1 ttl=63 time=4.00 ms
64 bytes from 10.0.10.1: icmp_req=2 ttl=63 time=0.000 ms
64 bytes from 10.0.10.1: icmp_req=3 ttl=63 time=0.000 ms
64 bytes from 10.0.10.1: icmp_req=4 ttl=63 time=0.000 ms
^C
--- 10.0.10.1 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3004ms
rtt min/avg/max/mdev = 0.000/1.000/4.001/1.732 ms
When you use the ping -I command to send an ICMP echo request to an IP address that is not in the same subnet, Cumulus Linux creates a failed ARP entry for the destination IP address.
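To spot such entries, you can scan the output of the Linux ip neigh command for neighbors in the FAILED state. The following Python sketch parses text in the usual ip neigh format; the function name `failed_neighbors` is hypothetical and the sample output is illustrative, not captured from a switch.

```python
from typing import List, Tuple


def failed_neighbors(ip_neigh_output: str) -> List[Tuple[str, str]]:
    """Return (address, interface) pairs for neighbor entries marked FAILED."""
    failed = []
    for line in ip_neigh_output.splitlines():
        fields = line.split()
        # Expected shape: <address> dev <interface> [lladdr <mac>] <state>
        if len(fields) >= 4 and fields[1] == "dev" and fields[-1] == "FAILED":
            failed.append((fields[0], fields[2]))
    return failed


# Illustrative sample only.
sample = """\
10.0.0.1 dev swp1.5 lladdr 44:38:39:00:00:01 REACHABLE
10.0.10.1 dev swp1.5  FAILED
"""
print(failed_neighbors(sample))
```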
This section shows the following EVPN configuration examples:
Layer 2 EVPN with external routing
EVPN centralized routing
EVPN symmetric routing
Layer 2 EVPN with External Routing
The following example configures a network infrastructure that creates a layer 2 extension between racks. An external device routes traffic between VXLANs.
MLAG is between leaf01 and leaf02, and leaf03 and leaf04
BGP unnumbered is in the underlay (configured on all leafs and spines)
Server gateways are outside the VXLAN fabric
The following images show traffic flow between tenants. For simplicity, the images do not show spines and other devices.
Traffic Flow between server01 and server04
server01 and server04 are in the same VLAN but are across different leafs.
server01 makes a LACP hash decision and forwards traffic to leaf01.
leaf01 does a layer 2 lookup, has the MAC address for server04, and forwards the packet out VNI10, towards leaf04.
The VXLAN encapsulated frame arrives on leaf04, which does a layer 2 lookup and has the MAC address for server04 in VLAN10.
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1-2,swp49-54
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond1 link mtu 9000
cumulus@leaf01:~$ nv set interface bond2 link mtu 9000
cumulus@leaf01:~$ nv set interface bond1-2 bridge domain br_default
cumulus@leaf01:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf01:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv set mlag priority 1000
cumulus@leaf01:~$ nv set mlag init-delay 10
cumulus@leaf01:~$ nv set interface vlan10
cumulus@leaf01:~$ nv set interface vlan20
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf01:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf01:~$ nv set nve vxlan mlag shared-address 10.0.1.12
cumulus@leaf01:~$ nv set nve vxlan source address 10.10.10.1
cumulus@leaf01:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp1-2,swp49-54
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond1 link mtu 9000
cumulus@leaf02:~$ nv set interface bond2 link mtu 9000
cumulus@leaf02:~$ nv set interface bond1-2 bridge domain br_default
cumulus@leaf02:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf02:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv set mlag priority 2000
cumulus@leaf02:~$ nv set mlag init-delay 10
cumulus@leaf02:~$ nv set interface vlan10
cumulus@leaf02:~$ nv set interface vlan20
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf02:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf02:~$ nv set nve vxlan mlag shared-address 10.0.1.12
cumulus@leaf02:~$ nv set nve vxlan source address 10.10.10.2
cumulus@leaf02:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf02:~$ nv set evpn enable on
cumulus@leaf02:~$ nv set router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf02:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf02:~$ nv config apply
cumulus@leaf03:~$ nv set interface lo ip address 10.10.10.3/32
cumulus@leaf03:~$ nv set interface swp1-2,swp49-54
cumulus@leaf03:~$ nv set interface bond1 bond member swp1
cumulus@leaf03:~$ nv set interface bond2 bond member swp2
cumulus@leaf03:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf03:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf03:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond1 link mtu 9000
cumulus@leaf03:~$ nv set interface bond2 link mtu 9000
cumulus@leaf03:~$ nv set interface bond1-2 bridge domain br_default
cumulus@leaf03:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf03:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf03:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf03:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf03:~$ nv set mlag backup 10.10.10.4
cumulus@leaf03:~$ nv set mlag peer-ip linklocal
cumulus@leaf03:~$ nv set mlag priority 1000
cumulus@leaf03:~$ nv set mlag init-delay 10
cumulus@leaf03:~$ nv set interface vlan10
cumulus@leaf03:~$ nv set interface vlan20
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf03:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf03:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf03:~$ nv set nve vxlan source address 10.10.10.3
cumulus@leaf03:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf03:~$ nv set evpn enable on
cumulus@leaf03:~$ nv set router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf03:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf03:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf03:~$ nv config apply
cumulus@leaf04:~$ nv set interface lo ip address 10.10.10.4/32
cumulus@leaf04:~$ nv set interface swp1-2,swp49-54
cumulus@leaf04:~$ nv set interface bond1 bond member swp1
cumulus@leaf04:~$ nv set interface bond2 bond member swp2
cumulus@leaf04:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf04:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf04:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond1 link mtu 9000
cumulus@leaf04:~$ nv set interface bond2 link mtu 9000
cumulus@leaf04:~$ nv set interface bond1-2 bridge domain br_default
cumulus@leaf04:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf04:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf04:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf04:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf04:~$ nv set mlag backup 10.10.10.3
cumulus@leaf04:~$ nv set mlag peer-ip linklocal
cumulus@leaf04:~$ nv set mlag priority 2000
cumulus@leaf04:~$ nv set mlag init-delay 10
cumulus@leaf04:~$ nv set interface vlan10
cumulus@leaf04:~$ nv set interface vlan20
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf04:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf04:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf04:~$ nv set nve vxlan source address 10.10.10.4
cumulus@leaf04:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf04:~$ nv set evpn enable on
cumulus@leaf04:~$ nv set router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf04:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf04:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf04:~$ nv config apply
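The four leaf configurations above differ only in a handful of values: the loopback address, the MLAG MAC address, backup IP address and priority, the VXLAN shared address, and the BGP AS number. The following Python sketch makes those differences explicit by generating only the per-leaf commands from a table of values taken from the examples. The names `LEAFS` and `leaf_specific_commands` are hypothetical, and the sketch is a reading aid, not an NVIDIA provisioning tool.

```python
from typing import Dict, List

# Per-leaf values copied from the leaf01 to leaf04 examples above.
LEAFS: Dict[str, Dict[str, object]] = {
    "leaf01": {"loopback": "10.10.10.1", "mlag_mac": "44:38:39:BE:EF:AA",
               "backup": "10.10.10.2", "priority": 1000,
               "shared": "10.0.1.12", "asn": 65101},
    "leaf02": {"loopback": "10.10.10.2", "mlag_mac": "44:38:39:BE:EF:AA",
               "backup": "10.10.10.1", "priority": 2000,
               "shared": "10.0.1.12", "asn": 65102},
    "leaf03": {"loopback": "10.10.10.3", "mlag_mac": "44:38:39:BE:EF:BB",
               "backup": "10.10.10.4", "priority": 1000,
               "shared": "10.0.1.34", "asn": 65103},
    "leaf04": {"loopback": "10.10.10.4", "mlag_mac": "44:38:39:BE:EF:BB",
               "backup": "10.10.10.3", "priority": 2000,
               "shared": "10.0.1.34", "asn": 65104},
}


def leaf_specific_commands(name: str) -> List[str]:
    """Return the NVUE commands whose values differ from leaf to leaf."""
    p = LEAFS[name]
    return [
        f"nv set interface lo ip address {p['loopback']}/32",
        f"nv set mlag mac-address {p['mlag_mac']}",
        f"nv set mlag backup {p['backup']}",
        f"nv set mlag priority {p['priority']}",
        f"nv set nve vxlan mlag shared-address {p['shared']}",
        f"nv set nve vxlan source address {p['loopback']}",
        f"nv set router bgp autonomous-system {p['asn']}",
        f"nv set router bgp router-id {p['loopback']}",
    ]


for command in leaf_specific_commands("leaf03"):
    print(command)
```

Both members of an MLAG pair share the MLAG MAC address and the VXLAN shared address, and each member uses the loopback address of its peer as the MLAG backup IP address.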
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1-6
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine01:~$ nv config apply
cumulus@spine02:~$ nv set interface lo ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp1-6
cumulus@spine02:~$ nv set router bgp autonomous-system 65199
cumulus@spine02:~$ nv set router bgp router-id 10.10.10.102
cumulus@spine02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine02:~$ nv config apply
cumulus@spine03:~$ nv set interface lo ip address 10.10.10.103/32
cumulus@spine03:~$ nv set interface swp1-6
cumulus@spine03:~$ nv set router bgp autonomous-system 65199
cumulus@spine03:~$ nv set router bgp router-id 10.10.10.103
cumulus@spine03:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine03:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine03:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine03:~$ nv config apply
cumulus@spine04:~$ nv set interface lo ip address 10.10.10.104/32
cumulus@spine04:~$ nv set interface swp1-6
cumulus@spine04:~$ nv set router bgp autonomous-system 65199
cumulus@spine04:~$ nv set router bgp router-id 10.10.10.104
cumulus@spine04:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine04:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine04:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine04:~$ nv config apply
cumulus@border01:~$ nv set interface lo ip address 10.10.10.63/32
cumulus@border01:~$ nv set interface swp3,swp49-54
cumulus@border01:~$ nv set interface bond3 bond member swp3
cumulus@border01:~$ nv set interface bond3 bond mlag id 1
cumulus@border01:~$ nv set interface bond3 bond lacp-bypass on
cumulus@border01:~$ nv set interface bond3 link mtu 9000
cumulus@border01:~$ nv set interface bond3 bridge domain br_default
cumulus@border01:~$ nv set interface peerlink bond member swp49-50
cumulus@border01:~$ nv set mlag mac-address 44:38:39:BE:EF:FF
cumulus@border01:~$ nv set mlag backup 10.10.10.64
cumulus@border01:~$ nv set mlag peer-ip linklocal
cumulus@border01:~$ nv set mlag priority 1000
cumulus@border01:~$ nv set mlag init-delay 10
cumulus@border01:~$ nv set interface vlan10
cumulus@border01:~$ nv set interface vlan20
cumulus@border01:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@border01:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@border01:~$ nv set interface bond3 bridge domain br_default vlan 10,20
cumulus@border01:~$ nv set nve vxlan mlag shared-address 10.0.1.255
cumulus@border01:~$ nv set nve vxlan source address 10.10.10.63
cumulus@border01:~$ nv set nve vxlan arp-nd-suppress on
cumulus@border01:~$ nv set evpn enable on
cumulus@border01:~$ nv set router bgp autonomous-system 65253
cumulus@border01:~$ nv set router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@border01:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@border01:~$ nv config apply
cumulus@border02:~$ nv set interface lo ip address 10.10.10.64/32
cumulus@border02:~$ nv set interface swp3,swp49-54
cumulus@border02:~$ nv set interface bond3 bond member swp3
cumulus@border02:~$ nv set interface bond3 bond mlag id 1
cumulus@border02:~$ nv set interface bond3 bond lacp-bypass on
cumulus@border02:~$ nv set interface bond3 link mtu 9000
cumulus@border02:~$ nv set interface bond3 bridge domain br_default
cumulus@border02:~$ nv set interface peerlink bond member swp49-50
cumulus@border02:~$ nv set mlag mac-address 44:38:39:BE:EF:FF
cumulus@border02:~$ nv set mlag backup 10.10.10.63
cumulus@border02:~$ nv set mlag peer-ip linklocal
cumulus@border02:~$ nv set mlag priority 2000
cumulus@border02:~$ nv set mlag init-delay 10
cumulus@border02:~$ nv set interface vlan10
cumulus@border02:~$ nv set interface vlan20
cumulus@border02:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@border02:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@border02:~$ nv set interface bond3 bridge domain br_default vlan 10,20
cumulus@border02:~$ nv set nve vxlan mlag shared-address 10.0.1.255
cumulus@border02:~$ nv set nve vxlan source address 10.10.10.64
cumulus@border02:~$ nv set nve vxlan arp-nd-suppress on
cumulus@border02:~$ nv set evpn enable on
cumulus@border02:~$ nv set router bgp autonomous-system 65254
cumulus@border02:~$ nv set router bgp router-id 10.10.10.64
cumulus@border02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@border02:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@border02:~$ nv config apply
cumulus@leaf02:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.2/32
clagd-vxlan-anycast-ip 10.0.1.12
vxlan-local-tunnelip 10.10.10.2
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.1
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@leaf03:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.3/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.3
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 1000
clagd-backup-ip 10.10.10.4
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:bb
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:bb
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink vxlan48
hwaddress 44:38:39:22:01:bb
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@leaf04:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.4/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.4
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.3
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:c1
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:c1
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink vxlan48
hwaddress 44:38:39:22:01:c1
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@spine01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@spine02:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@spine03:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.103/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@spine04:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.104/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@border01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.63/32
clagd-vxlan-anycast-ip 10.0.1.255
vxlan-local-tunnelip 10.10.10.63
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-vids 10 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 1000
clagd-backup-ip 10.10.10.64
clagd-sys-mac 44:38:39:BE:EF:FF
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:ab
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:ab
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:ab
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@border02:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.64/32
clagd-vxlan-anycast-ip 10.0.1.255
vxlan-local-tunnelip 10.10.10.64
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-vids 10 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.63
clagd-sys-mac 44:38:39:BE:EF:FF
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:b3
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:b3
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:b3
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
```
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
clagd-vxlan-anycast-ip 10.0.1.12
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 1000
clagd-backup-ip 10.10.10.2
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
...
auto lo
iface lo inet loopback
address 10.10.10.2/32
clagd-vxlan-anycast-ip 10.0.1.12
vxlan-local-tunnelip 10.10.10.2
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.1
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
...
auto lo
iface lo inet loopback
address 10.10.10.3/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.3
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 1000
clagd-backup-ip 10.10.10.4
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
...
auto lo
iface lo inet loopback
address 10.10.10.4/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.4
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.3
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 10
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
...
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
...
auto lo
iface lo inet loopback
address 10.10.10.103/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
...
auto lo
iface lo inet loopback
address 10.10.10.104/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
...
auto lo
iface lo inet loopback
address 10.10.10.63/32
clagd-vxlan-anycast-ip 10.0.1.255
vxlan-local-tunnelip 10.10.10.63
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-vids 10 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 1000
clagd-backup-ip 10.10.10.64
clagd-sys-mac 44:38:39:BE:EF:FF
clagd-args --initDelay 10
auto vlan10
iface vlan10
address 10.1.10.2/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:ab
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.2/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:ab
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:ab
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
...
auto lo
iface lo inet loopback
address 10.10.10.64/32
clagd-vxlan-anycast-ip 10.0.1.255
vxlan-local-tunnelip 10.10.10.64
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-vids 10 20
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.63
clagd-sys-mac 44:38:39:BE:EF:FF
clagd-args --initDelay 10
auto vlan10
iface vlan10
address 10.1.10.1/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:b3
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.1/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:b3
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:b3
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
The following example shows an EVPN symmetric routing configuration, where:
MLAG is configured between leaf01 and leaf02, leaf03 and leaf04, and border01 and border02
BGP unnumbered is in the underlay (configured on all leafs and spines)
VRF BLUE and VRF RED are configured on the leafs to isolate tenant traffic
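The MLAG settings differ between the two switches of a pair only in the backup address and the priority. The following Python sketch illustrates that relationship by printing the MLAG-related NVUE commands for one switch. The `MlagPeer` class and the `mlag_commands` function are hypothetical helpers written for this illustration, not part of Cumulus Linux. The command strings and values are copied from the leaf01 and leaf02 commands in this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MlagPeer:
    """Hypothetical record for one switch in an MLAG pair."""
    loopback: str   # address of lo, without the /32 prefix length
    priority: int   # value passed to `nv set mlag priority`


def mlag_commands(me: MlagPeer, peer: MlagPeer,
                  sys_mac: str, anycast_ip: str) -> list[str]:
    """Return the MLAG-related NVUE commands for `me`.

    The system MAC and the VXLAN shared address are the same on both
    switches of a pair; the backup address is the peer's loopback.
    """
    return [
        f"nv set mlag mac-address {sys_mac}",
        f"nv set mlag backup {peer.loopback}",
        "nv set mlag peer-ip linklocal",
        f"nv set mlag priority {me.priority}",
        "nv set mlag init-delay 10",
        f"nv set nve vxlan mlag shared-address {anycast_ip}",
    ]


# Values taken from the leaf01/leaf02 pair in this example.
leaf01 = MlagPeer(loopback="10.10.10.1", priority=1000)
leaf02 = MlagPeer(loopback="10.10.10.2", priority=2000)

for command in mlag_commands(leaf01, leaf02, "44:38:39:BE:EF:AA", "10.0.1.12"):
    print(command)
```

Swapping the first two arguments produces the leaf02 side of the pair. The sketch only prints text and does not apply any configuration to a switch.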
The following images show traffic flow between tenants. The spines and other devices are omitted for simplicity.
Traffic Flow between server01 and server04
server01 and server04 are in the same VRF and the same VLAN but are located across different leafs.
server01 makes an LACP hash decision and forwards traffic to leaf01.
leaf01 does a layer 2 lookup, finds the MAC address for server04, and forwards the packet out VNI10 through leaf04.
The VXLAN encapsulated frame arrives on leaf04, which does a layer 2 lookup and has the MAC address for server04 in VLAN10.
Traffic Flow between server01 and server05
server01 and server05 are in the same VRF, different VLANs, and are located across different leafs.
server01 makes an LACP hash decision to reach the default gateway and forwards traffic to leaf01.
leaf01 does a layer 3 lookup in VRF RED and has a route out VNIRED through leaf04.
The VXLAN encapsulated packet arrives on leaf04, which does a layer 3 lookup in VRF RED and has a route through VLAN20 to server05.
Traffic Flow between server01 and server06
server01 and server06 are in different VRFs, different VLANs, and are located across different leafs.
server01 makes an LACP hash decision to reach the default gateway and forwards traffic to leaf01.
leaf01 does a layer 3 lookup in VRF RED and has a route out VNIRED through border01.
The VXLAN encapsulated packet arrives on border01, which does a layer 3 lookup in VRF RED and has a route through VLAN101 to fw01 (the policy device).
fw01 does a layer 3 lookup (without any VRFs) and has a route in VLAN40, through border02.
border02 does a layer 3 lookup in VRF BLUE and has a route out VNIBLUE, through leaf04.
The VXLAN encapsulated packet arrives on leaf04, which does a layer 3 lookup in VRF BLUE and has a route in VLAN30 to server06.
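The three flows above follow one decision rule: compare the VRF first, then the VLAN. The Python sketch below is a simplified model of that rule, not a description of how the switch implements forwarding. The `Host` class and the `classify_flow` function are hypothetical names. The VLAN and VRF assignments come from the flow descriptions, and the VNI numbers come from the NVUE commands in this example (VNIRED is assumed to be L3 VNI 4001 and VNIBLUE L3 VNI 4002). The model assumes the two servers sit behind different leafs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical record describing where a server attaches."""
    name: str
    vlan: int
    vrf: str


# VLAN-to-L2-VNI and VRF-to-L3-VNI mappings used in this example.
L2_VNI = {10: 10, 20: 20, 30: 30}
L3_VNI = {"RED": 4001, "BLUE": 4002}


def classify_flow(src: Host, dst: Host):
    """Return (action, vnis) for traffic from src to dst.

    action is "bridge", "route", or "route-via-fw01";
    vnis lists the VNIs the packet is carried over, in order.
    """
    if src.vrf != dst.vrf:
        # Different tenants: routed to the border leafs in the source
        # VRF, through fw01, then routed in the destination VRF.
        return ("route-via-fw01", (L3_VNI[src.vrf], L3_VNI[dst.vrf]))
    if src.vlan != dst.vlan:
        # Same tenant, different subnet: symmetric routing over the
        # L3 VNI of the shared VRF.
        return ("route", (L3_VNI[src.vrf],))
    # Same tenant, same subnet: bridged over the L2 VNI of the VLAN.
    return ("bridge", (L2_VNI[src.vlan],))


server01 = Host("server01", vlan=10, vrf="RED")
server04 = Host("server04", vlan=10, vrf="RED")
server05 = Host("server05", vlan=20, vrf="RED")
server06 = Host("server06", vlan=30, vrf="BLUE")

for dst in (server04, server05, server06):
    print(server01.name, "->", dst.name, classify_flow(server01, dst))
```

Checking the VRF before the VLAN mirrors the order of the descriptions: only traffic that stays within one VRF and one VLAN is bridged, and everything else takes a layer 3 lookup on the ingress leaf.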
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1-3,swp49-54
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond1 link mtu 9000
cumulus@leaf01:~$ nv set interface bond2 link mtu 9000
cumulus@leaf01:~$ nv set interface bond3 link mtu 9000
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf01:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf01:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv set mlag priority 1000
cumulus@leaf01:~$ nv set mlag init-delay 10
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf01:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf01:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf01:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf01:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf01:~$ nv set vrf RED
cumulus@leaf01:~$ nv set vrf BLUE
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf01:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf01:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf01:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf01:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf01:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf01:~$ nv set nve vxlan mlag shared-address 10.0.1.12
cumulus@leaf01:~$ nv set nve vxlan source address 10.10.10.1
cumulus@leaf01:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf01:~$ nv set vrf RED evpn vni 4001
cumulus@leaf01:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf01:~$ nv set system global anycast-mac 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf RED router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set vrf RED router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf01:~$ nv set vrf BLUE router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set vrf BLUE router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp1-3,swp49-54
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:~$ nv set interface bond3 bond member swp3
cumulus@leaf02:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf02:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond1 link mtu 9000
cumulus@leaf02:~$ nv set interface bond2 link mtu 9000
cumulus@leaf02:~$ nv set interface bond3 link mtu 9000
cumulus@leaf02:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf02:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf02:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf02:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv set mlag priority 2000
cumulus@leaf02:~$ nv set mlag init-delay 10
cumulus@leaf02:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf02:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf02:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf02:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf02:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf02:~$ nv set interface vlan30 ip address 10.1.30.3/24
cumulus@leaf02:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf02:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf02:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf02:~$ nv set vrf RED
cumulus@leaf02:~$ nv set vrf BLUE
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf02:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf02:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf02:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf02:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf02:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf02:~$ nv set nve vxlan mlag shared-address 10.0.1.12
cumulus@leaf02:~$ nv set nve vxlan source address 10.10.10.2
cumulus@leaf02:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf02:~$ nv set vrf RED evpn vni 4001
cumulus@leaf02:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf02:~$ nv set system global anycast-mac 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set evpn enable on
cumulus@leaf02:~$ nv set router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf02:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf02:~$ nv set vrf RED router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set vrf RED router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf02:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf02:~$ nv set vrf BLUE router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set vrf BLUE router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf02:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf02:~$ nv config apply
cumulus@leaf03:~$ nv set interface lo ip address 10.10.10.3/32
cumulus@leaf03:~$ nv set interface swp1-3,swp49-54
cumulus@leaf03:~$ nv set interface bond1 bond member swp1
cumulus@leaf03:~$ nv set interface bond2 bond member swp2
cumulus@leaf03:~$ nv set interface bond3 bond member swp3
cumulus@leaf03:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf03:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf03:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf03:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf03:~$ nv set interface bond1 link mtu 9000
cumulus@leaf03:~$ nv set interface bond2 link mtu 9000
cumulus@leaf03:~$ nv set interface bond3 link mtu 9000
cumulus@leaf03:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf03:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf03:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf03:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf03:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf03:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf03:~$ nv set mlag backup 10.10.10.4
cumulus@leaf03:~$ nv set mlag peer-ip linklocal
cumulus@leaf03:~$ nv set mlag priority 1000
cumulus@leaf03:~$ nv set mlag init-delay 10
cumulus@leaf03:~$ nv set interface vlan10 ip address 10.1.10.4/24
cumulus@leaf03:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf03:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf03:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf03:~$ nv set interface vlan20 ip address 10.1.20.4/24
cumulus@leaf03:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf03:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf03:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf03:~$ nv set interface vlan30 ip address 10.1.30.4/24
cumulus@leaf03:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf03:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf03:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf03:~$ nv set vrf RED
cumulus@leaf03:~$ nv set vrf BLUE
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf03:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf03:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf03:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf03:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf03:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf03:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf03:~$ nv set nve vxlan source address 10.10.10.3
cumulus@leaf03:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf03:~$ nv set vrf RED evpn vni 4001
cumulus@leaf03:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf03:~$ nv set system global anycast-mac 44:38:39:BE:EF:BB
cumulus@leaf03:~$ nv set evpn enable on
cumulus@leaf03:~$ nv set router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf03:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf03:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf03:~$ nv set vrf RED router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set vrf RED router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf03:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf03:~$ nv set vrf BLUE router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set vrf BLUE router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf03:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf03:~$ nv config apply
cumulus@leaf04:~$ nv set interface lo ip address 10.10.10.4/32
cumulus@leaf04:~$ nv set interface swp1-3,swp49-54
cumulus@leaf04:~$ nv set interface bond1 bond member swp1
cumulus@leaf04:~$ nv set interface bond2 bond member swp2
cumulus@leaf04:~$ nv set interface bond3 bond member swp3
cumulus@leaf04:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf04:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf04:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf04:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf04:~$ nv set interface bond1 link mtu 9000
cumulus@leaf04:~$ nv set interface bond2 link mtu 9000
cumulus@leaf04:~$ nv set interface bond3 link mtu 9000
cumulus@leaf04:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf04:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf04:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf04:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf04:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf04:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf04:~$ nv set mlag backup 10.10.10.3
cumulus@leaf04:~$ nv set mlag peer-ip linklocal
cumulus@leaf04:~$ nv set mlag priority 2000
cumulus@leaf04:~$ nv set mlag init-delay 10
cumulus@leaf04:~$ nv set interface vlan10 ip address 10.1.10.5/24
cumulus@leaf04:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf04:~$ nv set interface vlan10 ip vrr mac-address 00:00:00:00:00:10
cumulus@leaf04:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf04:~$ nv set interface vlan20 ip address 10.1.20.5/24
cumulus@leaf04:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf04:~$ nv set interface vlan20 ip vrr mac-address 00:00:00:00:00:20
cumulus@leaf04:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf04:~$ nv set interface vlan30 ip address 10.1.30.5/24
cumulus@leaf04:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf04:~$ nv set interface vlan30 ip vrr mac-address 00:00:00:00:00:30
cumulus@leaf04:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf04:~$ nv set vrf RED
cumulus@leaf04:~$ nv set vrf BLUE
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf04:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf04:~$ nv set bridge domain br_default vlan 30 vni 30
cumulus@leaf04:~$ nv set interface vlan10 ip vrf RED
cumulus@leaf04:~$ nv set interface vlan20 ip vrf RED
cumulus@leaf04:~$ nv set interface vlan30 ip vrf BLUE
cumulus@leaf04:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf04:~$ nv set nve vxlan source address 10.10.10.4
cumulus@leaf04:~$ nv set nve vxlan arp-nd-suppress on
cumulus@leaf04:~$ nv set vrf RED evpn vni 4001
cumulus@leaf04:~$ nv set vrf BLUE evpn vni 4002
cumulus@leaf04:~$ nv set system global anycast-mac 44:38:39:BE:EF:BB
cumulus@leaf04:~$ nv set evpn enable on
cumulus@leaf04:~$ nv set router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@leaf04:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@leaf04:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf04:~$ nv set vrf RED router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set vrf RED router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf04:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf04:~$ nv set vrf BLUE router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set vrf BLUE router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@leaf04:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@leaf04:~$ nv config apply
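In the commands above, leaf03 and leaf04 form one MLAG pair: both switches use the same MLAG MAC address, VXLAN MLAG shared address, and anycast MAC, while the loopback, backup IP, and priority differ. The following is a minimal Python sketch of a consistency check for such a pair. The helper names (`parse_settings`, `mismatches`) and the idea of reading saved command text are illustrative assumptions, not part of NVUE.

```python
# Hypothetical helper: settings that both MLAG peers are expected to share.
SHARED_KEYS = (
    "mlag mac-address",
    "nve vxlan mlag shared-address",
    "system global anycast-mac",
)

def parse_settings(commands):
    """Return {path: value} for each 'nv set' line whose path is in SHARED_KEYS."""
    settings = {}
    for line in commands.splitlines():
        # Drop the shell prompt (if present) and keep only 'nv set' commands.
        _, _, cmd = line.rpartition("$ ")
        if not cmd.startswith("nv set "):
            continue
        body = cmd[len("nv set "):]
        for key in SHARED_KEYS:
            if body.startswith(key + " "):
                settings[key] = body[len(key) + 1:].strip()
    return settings

def mismatches(peer_a, peer_b):
    """List the shared keys that are missing or differ between two peers."""
    a, b = parse_settings(peer_a), parse_settings(peer_b)
    return [k for k in SHARED_KEYS if a.get(k) is None or a.get(k) != b.get(k)]

# Values copied from the leaf03 and leaf04 commands in this example.
LEAF03 = """\
cumulus@leaf03:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf03:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf03:~$ nv set system global anycast-mac 44:38:39:BE:EF:BB
"""
LEAF04 = """\
cumulus@leaf04:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf04:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf04:~$ nv set system global anycast-mac 44:38:39:BE:EF:BB
"""

print(mismatches(LEAF03, LEAF04))
```

An empty list means the two peers agree on all three settings. The check only compares text and does not contact a switch.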
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1-6
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine01:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine01:~$ nv config apply
cumulus@spine02:~$ nv set interface lo ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp1-6
cumulus@spine02:~$ nv set router bgp autonomous-system 65199
cumulus@spine02:~$ nv set router bgp router-id 10.10.10.102
cumulus@spine02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine02:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine02:~$ nv config apply
cumulus@spine03:~$ nv set interface lo ip address 10.10.10.103/32
cumulus@spine03:~$ nv set interface swp1-6
cumulus@spine03:~$ nv set router bgp autonomous-system 65199
cumulus@spine03:~$ nv set router bgp router-id 10.10.10.103
cumulus@spine03:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine03:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine03:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine03:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine03:~$ nv config apply
cumulus@spine04:~$ nv set interface lo ip address 10.10.10.104/32
cumulus@spine04:~$ nv set interface swp1-6
cumulus@spine04:~$ nv set router bgp autonomous-system 65199
cumulus@spine04:~$ nv set router bgp router-id 10.10.10.104
cumulus@spine04:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp1 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp2 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp3 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp4 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp5 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp neighbor swp6 peer-group underlay
cumulus@spine04:~$ nv set vrf default router bgp address-family l2vpn-evpn enable on
cumulus@spine04:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@spine04:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine04:~$ nv config apply
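The four spine configurations above are identical except for the loopback address and BGP router ID (10.10.10.101 through 10.10.10.104). As a sketch under that observation, the per-spine command list can be rendered from one template. `spine_commands` is a hypothetical helper name; the script only builds and prints text and does not run NVUE.

```python
def spine_commands(index, asn=65199, num_ports=6):
    """Return the NVUE commands for spine<index> as a list of strings."""
    # Loopback and router ID follow the 10.10.10.10x pattern used in this example.
    loopback = f"10.10.10.{100 + index}"
    bgp = "nv set vrf default router bgp"
    cmds = [
        f"nv set interface lo ip address {loopback}/32",
        f"nv set interface swp1-{num_ports}",
        f"nv set router bgp autonomous-system {asn}",
        f"nv set router bgp router-id {loopback}",
        f"{bgp} peer-group underlay remote-as external",
    ]
    # One unnumbered BGP neighbor per downlink port.
    cmds += [
        f"{bgp} neighbor swp{port} peer-group underlay"
        for port in range(1, num_ports + 1)
    ]
    cmds += [
        f"{bgp} address-family l2vpn-evpn enable on",
        f"{bgp} peer-group underlay address-family l2vpn-evpn enable on",
        f"{bgp} address-family ipv4-unicast redistribute connected",
        "nv config apply",
    ]
    return cmds

# Print the command list for spine01 through spine04.
for i in range(1, 5):
    print(f"# spine{i:02d}")
    print("\n".join(spine_commands(i)))
```

Generating the commands from a single function keeps the spines aligned when a shared value, such as the AS number, changes.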
cumulus@border01:~$ nv set interface lo ip address 10.10.10.63/32
cumulus@border01:~$ nv set interface swp3,swp49-54
cumulus@border01:~$ nv set interface bond3 bond member swp3
cumulus@border01:~$ nv set interface bond3 bond mlag id 1
cumulus@border01:~$ nv set interface bond3 bond lacp-bypass on
cumulus@border01:~$ nv set interface bond3 link mtu 9000
cumulus@border01:~$ nv set interface bond3 bridge domain br_default
cumulus@border01:~$ nv set interface bond3 bridge domain br_default vlan 101,102
cumulus@border01:~$ nv set interface peerlink bond member swp49-50
cumulus@border01:~$ nv set mlag mac-address 44:38:39:BE:EF:FF
cumulus@border01:~$ nv set mlag backup 10.10.10.64
cumulus@border01:~$ nv set mlag peer-ip linklocal
cumulus@border01:~$ nv set mlag priority 1000
cumulus@border01:~$ nv set mlag init-delay 10
cumulus@border01:~$ nv set vrf RED
cumulus@border01:~$ nv set vrf BLUE
cumulus@border01:~$ nv set interface vlan101 ip address 10.1.101.64/24
cumulus@border01:~$ nv set interface vlan101 ip vrr address 10.1.101.1/24
cumulus@border01:~$ nv set interface vlan101 ip vrr mac-address 00:00:00:00:00:01
cumulus@border01:~$ nv set interface vlan101 ip vrr state up
cumulus@border01:~$ nv set interface vlan102 ip address 10.1.102.64/24
cumulus@border01:~$ nv set interface vlan102 ip vrr address 10.1.102.1/24
cumulus@border01:~$ nv set interface vlan102 ip vrr mac-address 00:00:00:00:00:02
cumulus@border01:~$ nv set interface vlan102 ip vrr state up
cumulus@border01:~$ nv set bridge domain br_default vlan 101,102
cumulus@border01:~$ nv set interface vlan101 ip vrf RED
cumulus@border01:~$ nv set interface vlan102 ip vrf BLUE
cumulus@border01:~$ nv set nve vxlan mlag shared-address 10.0.1.255
cumulus@border01:~$ nv set nve vxlan source address 10.10.10.63
cumulus@border01:~$ nv set nve vxlan arp-nd-suppress on
cumulus@border01:~$ nv set vrf RED evpn vni 4001
cumulus@border01:~$ nv set vrf BLUE evpn vni 4002
cumulus@border01:~$ nv set system global anycast-mac 44:38:39:BE:EF:FF
cumulus@border01:~$ nv set evpn enable on
cumulus@border01:~$ nv set router bgp autonomous-system 65253
cumulus@border01:~$ nv set router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@border01:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border01:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@border01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@border01:~$ nv set vrf RED router bgp autonomous-system 65253
cumulus@border01:~$ nv set vrf RED router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf RED router static 10.1.30.0/24 via 10.1.101.4
cumulus@border01:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute static
cumulus@border01:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@border01:~$ nv set vrf BLUE router bgp autonomous-system 65253
cumulus@border01:~$ nv set vrf BLUE router bgp router-id 10.10.10.63
cumulus@border01:~$ nv set vrf BLUE router static 10.1.10.0/24 via 10.1.102.4
cumulus@border01:~$ nv set vrf BLUE router static 10.1.20.0/24 via 10.1.102.4
cumulus@border01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute static
cumulus@border01:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@border01:~$ nv config apply
cumulus@border02:~$ nv set interface lo ip address 10.10.10.64/32
cumulus@border02:~$ nv set interface swp3,swp49-54
cumulus@border02:~$ nv set interface bond3 bond member swp3
cumulus@border02:~$ nv set interface bond3 bond mlag id 1
cumulus@border02:~$ nv set interface bond3 bond lacp-bypass on
cumulus@border02:~$ nv set interface bond3 link mtu 9000
cumulus@border02:~$ nv set interface bond3 bridge domain br_default
cumulus@border02:~$ nv set interface bond3 bridge domain br_default vlan 101,102
cumulus@border02:~$ nv set interface peerlink bond member swp49-50
cumulus@border02:~$ nv set mlag mac-address 44:38:39:BE:EF:FF
cumulus@border02:~$ nv set mlag backup 10.10.10.63
cumulus@border02:~$ nv set mlag peer-ip linklocal
cumulus@border02:~$ nv set mlag priority 2000
cumulus@border02:~$ nv set mlag init-delay 10
cumulus@border02:~$ nv set vrf RED
cumulus@border02:~$ nv set vrf BLUE
cumulus@border02:~$ nv set interface vlan101 ip address 10.1.101.65/24
cumulus@border02:~$ nv set interface vlan101 ip vrr address 10.1.101.1/24
cumulus@border02:~$ nv set interface vlan101 ip vrr mac-address 00:00:00:00:00:01
cumulus@border02:~$ nv set interface vlan101 ip vrr state up
cumulus@border02:~$ nv set interface vlan102 ip address 10.1.102.65/24
cumulus@border02:~$ nv set interface vlan102 ip vrr address 10.1.102.1/24
cumulus@border02:~$ nv set interface vlan102 ip vrr mac-address 00:00:00:00:00:02
cumulus@border02:~$ nv set interface vlan102 ip vrr state up
cumulus@border02:~$ nv set bridge domain br_default vlan 101,102
cumulus@border02:~$ nv set interface vlan101 ip vrf RED
cumulus@border02:~$ nv set interface vlan102 ip vrf BLUE
cumulus@border02:~$ nv set nve vxlan mlag shared-address 10.0.1.255
cumulus@border02:~$ nv set nve vxlan source address 10.10.10.64
cumulus@border02:~$ nv set nve vxlan arp-nd-suppress on
cumulus@border02:~$ nv set vrf RED evpn vni 4001
cumulus@border02:~$ nv set vrf BLUE evpn vni 4002
cumulus@border02:~$ nv set system global anycast-mac 44:38:39:BE:EF:FF
cumulus@border02:~$ nv set evpn enable on
cumulus@border02:~$ nv set router bgp autonomous-system 65254
cumulus@border02:~$ nv set router bgp router-id 10.10.10.64
cumulus@border02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@border02:~$ nv set vrf default router bgp neighbor swp51 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp neighbor swp52 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp neighbor swp53 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp neighbor swp54 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp peer-group underlay address-family l2vpn-evpn enable on
cumulus@border02:~$ nv set vrf default router bgp neighbor peerlink.4094 peer-group underlay
cumulus@border02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@border02:~$ nv set vrf RED router bgp autonomous-system 65254
cumulus@border02:~$ nv set vrf RED router bgp router-id 10.10.10.64
cumulus@border02:~$ nv set vrf RED router static 10.1.30.0/24 via 10.1.101.4
cumulus@border02:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute static
cumulus@border02:~$ nv set vrf RED router bgp address-family ipv4-unicast route-export to-evpn
cumulus@border02:~$ nv set vrf BLUE router bgp autonomous-system 65254
cumulus@border02:~$ nv set vrf BLUE router bgp router-id 10.10.10.64
cumulus@border02:~$ nv set vrf BLUE router static 10.1.10.0/24 via 10.1.102.4
cumulus@border02:~$ nv set vrf BLUE router static 10.1.20.0/24 via 10.1.102.4
cumulus@border02:~$ nv set vrf BLUE router bgp address-family ipv4-unicast redistribute static
cumulus@border02:~$ nv set vrf BLUE router bgp address-family ipv4-unicast route-export to-evpn
cumulus@border02:~$ nv config apply
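On the border switches, each VRF has static routes to the subnets of the other VRF: RED reaches 10.1.30.0/24 through 10.1.101.4, and BLUE reaches 10.1.10.0/24 and 10.1.20.0/24 through 10.1.102.4. In this example each next hop sits on the SVI subnet of the same VRF (vlan101 for RED, vlan102 for BLUE). The following Python sketch checks that relationship with the standard `ipaddress` module. The data structures and the function name `off_subnet_next_hops` are illustrative assumptions; the addresses are the border01 values from the commands above.

```python
import ipaddress

# SVI address per VRF on border01 (vlan101 in RED, vlan102 in BLUE).
SVI_BY_VRF = {
    "RED": "10.1.101.64/24",
    "BLUE": "10.1.102.64/24",
}

# (vrf, destination prefix, next hop) taken from the static route commands.
STATIC_ROUTES = [
    ("RED", "10.1.30.0/24", "10.1.101.4"),
    ("BLUE", "10.1.10.0/24", "10.1.102.4"),
    ("BLUE", "10.1.20.0/24", "10.1.102.4"),
]

def off_subnet_next_hops(svis, routes):
    """Return the routes whose next hop is outside the SVI subnet of their VRF."""
    bad = []
    for vrf, prefix, via in routes:
        network = ipaddress.ip_interface(svis[vrf]).network
        if ipaddress.ip_address(via) not in network:
            bad.append((vrf, prefix, via))
    return bad

print(off_subnet_next_hops(SVI_BY_VRF, STATIC_ROUTES))
```

An empty list means every next hop is on the connected SVI subnet of its VRF. A route that points at an address in the other VRF's subnet would appear in the result.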
cumulus@leaf01:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
          mlag:
            enable: on
            id: 1
        bridge:
          domain:
            br_default:
              access: 10
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
          mlag:
            enable: on
            id: 2
        bridge:
          domain:
            br_default:
              access: 20
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
          mlag:
            enable: on
            id: 3
        bridge:
          domain:
            br_default:
              access: 30
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.1/32: {}
        type: loopback
      peerlink:
        bond:
          member:
            swp49: {}
            swp50: {}
        type: peerlink
      peerlink.4094:
        base-interface: peerlink
        type: sub
        vlan: 4094
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp49:
        type: swp
      swp50:
        type: swp
      swp51:
        type: swp
      swp52:
        type: swp
      swp53:
        type: swp
      swp54:
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.2/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.2/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.2/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    mlag:
      backup:
        10.10.10.2: {}
      enable: on
      init-delay: 10
      mac-address: 44:38:39:BE:EF:AA
      peer-ip: linklocal
      priority: 1000
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        mlag:
          shared-address: 10.0.1.12
        source:
          address: 10.10.10.1
    router:
      bgp:
        autonomous-system: 65101
        enable: on
        router-id: 10.10.10.1
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:AA
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65101
            enable: on
            router-id: 10.10.10.1
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65101
            enable: on
            router-id: 10.10.10.1
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              peerlink.4094:
                peer-group: underlay
                type: unnumbered
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
              swp53:
                peer-group: underlay
                type: unnumbered
              swp54:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@leaf02:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
          mlag:
            enable: on
            id: 1
        bridge:
          domain:
            br_default:
              access: 10
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
          mlag:
            enable: on
            id: 2
        bridge:
          domain:
            br_default:
              access: 20
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
          mlag:
            enable: on
            id: 3
        bridge:
          domain:
            br_default:
              access: 30
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.2/32: {}
        type: loopback
      peerlink:
        bond:
          member:
            swp49: {}
            swp50: {}
        type: peerlink
      peerlink.4094:
        base-interface: peerlink
        type: sub
        vlan: 4094
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp49:
        type: swp
      swp50:
        type: swp
      swp51:
        type: swp
      swp52:
        type: swp
      swp53:
        type: swp
      swp54:
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.3/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.3/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.3/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    mlag:
      backup:
        10.10.10.1: {}
      enable: on
      init-delay: 10
      mac-address: 44:38:39:BE:EF:AA
      peer-ip: linklocal
      priority: 2000
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        mlag:
          shared-address: 10.0.1.12
        source:
          address: 10.10.10.2
    router:
      bgp:
        autonomous-system: 65102
        enable: on
        router-id: 10.10.10.2
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:AA
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65102
            enable: on
            router-id: 10.10.10.2
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65102
            enable: on
            router-id: 10.10.10.2
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              peerlink.4094:
                peer-group: underlay
                type: unnumbered
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
              swp53:
                peer-group: underlay
                type: unnumbered
              swp54:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@leaf03:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
          mlag:
            enable: on
            id: 1
        bridge:
          domain:
            br_default:
              access: 10
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
          mlag:
            enable: on
            id: 2
        bridge:
          domain:
            br_default:
              access: 20
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
          mlag:
            enable: on
            id: 3
        bridge:
          domain:
            br_default:
              access: 30
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.3/32: {}
        type: loopback
      peerlink:
        bond:
          member:
            swp49: {}
            swp50: {}
        type: peerlink
      peerlink.4094:
        base-interface: peerlink
        type: sub
        vlan: 4094
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp49:
        type: swp
      swp50:
        type: swp
      swp51:
        type: swp
      swp52:
        type: swp
      swp53:
        type: swp
      swp54:
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.4/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.4/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.4/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    mlag:
      backup:
        10.10.10.4: {}
      enable: on
      init-delay: 10
      mac-address: 44:38:39:BE:EF:BB
      peer-ip: linklocal
      priority: 1000
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        mlag:
          shared-address: 10.0.1.34
        source:
          address: 10.10.10.3
    router:
      bgp:
        autonomous-system: 65103
        enable: on
        router-id: 10.10.10.3
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:BB
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65103
            enable: on
            router-id: 10.10.10.3
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65103
            enable: on
            router-id: 10.10.10.3
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              peerlink.4094:
                peer-group: underlay
                type: unnumbered
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
              swp53:
                peer-group: underlay
                type: unnumbered
              swp54:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@leaf04:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
    bridge:
      domain:
        br_default:
          vlan:
            '10':
              vni:
                '10': {}
            '20':
              vni:
                '20': {}
            '30':
              vni:
                '30': {}
    evpn:
      enable: on
    interface:
      bond1:
        bond:
          lacp-bypass: on
          member:
            swp1: {}
          mlag:
            enable: on
            id: 1
        bridge:
          domain:
            br_default:
              access: 10
        link:
          mtu: 9000
        type: bond
      bond2:
        bond:
          lacp-bypass: on
          member:
            swp2: {}
          mlag:
            enable: on
            id: 2
        bridge:
          domain:
            br_default:
              access: 20
        link:
          mtu: 9000
        type: bond
      bond3:
        bond:
          lacp-bypass: on
          member:
            swp3: {}
          mlag:
            enable: on
            id: 3
        bridge:
          domain:
            br_default:
              access: 30
        link:
          mtu: 9000
        type: bond
      lo:
        ip:
          address:
            10.10.10.4/32: {}
        type: loopback
      peerlink:
        bond:
          member:
            swp49: {}
            swp50: {}
        type: peerlink
      peerlink.4094:
        base-interface: peerlink
        type: sub
        vlan: 4094
      swp1:
        type: swp
      swp2:
        type: swp
      swp3:
        type: swp
      swp49:
        type: swp
      swp50:
        type: swp
      swp51:
        type: swp
      swp52:
        type: swp
      swp53:
        type: swp
      swp54:
        type: swp
      vlan10:
        ip:
          address:
            10.1.10.5/24: {}
          vrf: RED
          vrr:
            address:
              10.1.10.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:10
            state:
              up: {}
        type: svi
        vlan: 10
      vlan20:
        ip:
          address:
            10.1.20.5/24: {}
          vrf: RED
          vrr:
            address:
              10.1.20.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:20
            state:
              up: {}
        type: svi
        vlan: 20
      vlan30:
        ip:
          address:
            10.1.30.5/24: {}
          vrf: BLUE
          vrr:
            address:
              10.1.30.1/24: {}
            enable: on
            mac-address: 00:00:00:00:00:30
            state:
              up: {}
        type: svi
        vlan: 30
    mlag:
      backup:
        10.10.10.3: {}
      enable: on
      init-delay: 10
      mac-address: 44:38:39:BE:EF:BB
      peer-ip: linklocal
      priority: 2000
    nve:
      vxlan:
        arp-nd-suppress: on
        enable: on
        mlag:
          shared-address: 10.0.1.34
        source:
          address: 10.10.10.4
    router:
      bgp:
        autonomous-system: 65104
        enable: on
        router-id: 10.10.10.4
      vrr:
        enable: on
    system:
      global:
        anycast-mac: 44:38:39:BE:EF:BB
    vrf:
      BLUE:
        evpn:
          enable: on
          vni:
            '4002': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65104
            enable: on
            router-id: 10.10.10.4
      RED:
        evpn:
          enable: on
          vni:
            '4001': {}
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
                route-export:
                  to-evpn:
                    enable: on
            autonomous-system: 65104
            enable: on
            router-id: 10.10.10.4
      default:
        router:
          bgp:
            address-family:
              ipv4-unicast:
                enable: on
                redistribute:
                  connected:
                    enable: on
            enable: on
            neighbor:
              peerlink.4094:
                peer-group: underlay
                type: unnumbered
              swp51:
                peer-group: underlay
                type: unnumbered
              swp52:
                peer-group: underlay
                type: unnumbered
              swp53:
                peer-group: underlay
                type: unnumbered
              swp54:
                peer-group: underlay
                type: unnumbered
            peer-group:
              underlay:
                address-family:
                  l2vpn-evpn:
                    enable: on
                remote-as: external
cumulus@border01:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
bridge:
domain:
br_default:
vlan:
'101': {}
'102': {}
evpn:
enable: on
interface:
bond3:
bond:
lacp-bypass: on
member:
swp3: {}
mlag:
enable: on
id: 1
bridge:
domain:
br_default:
vlan:
'101': {}
'102': {}
link:
mtu: 9000
type: bond
lo:
ip:
address:
10.10.10.63/32: {}
type: loopback
peerlink:
bond:
member:
swp49: {}
swp50: {}
type: peerlink
peerlink.4094:
base-interface: peerlink
type: sub
vlan: 4094
swp3:
type: swp
swp49:
type: swp
swp50:
type: swp
swp51:
type: swp
swp52:
type: swp
swp53:
type: swp
swp54:
type: swp
vlan101:
ip:
address:
10.1.101.64/24: {}
vrf: RED
vrr:
address:
10.1.101.1/24: {}
enable: on
mac-address: 00:00:00:00:00:01
state:
up: {}
type: svi
vlan: 101
vlan102:
ip:
address:
10.1.102.64/24: {}
vrf: BLUE
vrr:
address:
10.1.102.1/24: {}
enable: on
mac-address: 00:00:00:00:00:02
state:
up: {}
type: svi
vlan: 102
mlag:
backup:
10.10.10.64: {}
enable: on
init-delay: 10
mac-address: 44:38:39:BE:EF:FF
peer-ip: linklocal
priority: 1000
nve:
vxlan:
arp-nd-suppress: on
enable: on
mlag:
shared-address: 10.0.1.255
source:
address: 10.10.10.63
router:
bgp:
autonomous-system: 65253
enable: on
router-id: 10.10.10.63
vrr:
enable: on
system:
global:
anycast-mac: 44:38:39:BE:EF:FF
vrf:
BLUE:
evpn:
enable: on
vni:
'4002': {}
router:
bgp:
address-family:
ipv4-unicast:
enable: on
redistribute:
static:
enable: on
route-export:
to-evpn:
enable: on
autonomous-system: 65253
enable: on
router-id: 10.10.10.63
static:
10.1.10.0/24:
address-family: ipv4-unicast
via:
10.1.102.4:
type: ipv4-address
10.1.20.0/24:
address-family: ipv4-unicast
via:
10.1.102.4:
type: ipv4-address
RED:
evpn:
enable: on
vni:
'4001': {}
router:
bgp:
address-family:
ipv4-unicast:
enable: on
redistribute:
static:
enable: on
route-export:
to-evpn:
enable: on
autonomous-system: 65253
enable: on
router-id: 10.10.10.63
static:
10.1.30.0/24:
address-family: ipv4-unicast
via:
10.1.101.4:
type: ipv4-address
default:
router:
bgp:
address-family:
ipv4-unicast:
enable: on
redistribute:
connected:
enable: on
enable: on
neighbor:
peerlink.4094:
peer-group: underlay
type: unnumbered
swp51:
peer-group: underlay
type: unnumbered
swp52:
peer-group: underlay
type: unnumbered
swp53:
peer-group: underlay
type: unnumbered
swp54:
peer-group: underlay
type: unnumbered
peer-group:
underlay:
address-family:
l2vpn-evpn:
enable: on
remote-as: external
cumulus@border02:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
bridge:
domain:
br_default:
vlan:
'101': {}
'102': {}
evpn:
enable: on
interface:
bond3:
bond:
lacp-bypass: on
member:
swp3: {}
mlag:
enable: on
id: 1
bridge:
domain:
br_default:
vlan:
'101': {}
'102': {}
link:
mtu: 9000
type: bond
lo:
ip:
address:
10.10.10.64/32: {}
type: loopback
peerlink:
bond:
member:
swp49: {}
swp50: {}
type: peerlink
peerlink.4094:
base-interface: peerlink
type: sub
vlan: 4094
swp3:
type: swp
swp49:
type: swp
swp50:
type: swp
swp51:
type: swp
swp52:
type: swp
swp53:
type: swp
swp54:
type: swp
vlan101:
ip:
address:
10.1.101.65/24: {}
vrf: RED
vrr:
address:
10.1.101.1/24: {}
enable: on
mac-address: 00:00:00:00:00:01
state:
up: {}
type: svi
vlan: 101
vlan102:
ip:
address:
10.1.102.65/24: {}
vrf: BLUE
vrr:
address:
10.1.102.1/24: {}
enable: on
mac-address: 00:00:00:00:00:02
state:
up: {}
type: svi
vlan: 102
mlag:
backup:
10.10.10.63: {}
enable: on
init-delay: 10
mac-address: 44:38:39:BE:EF:FF
peer-ip: linklocal
priority: 2000
nve:
vxlan:
arp-nd-suppress: on
enable: on
mlag:
shared-address: 10.0.1.255
source:
address: 10.10.10.64
router:
bgp:
autonomous-system: 65254
enable: on
router-id: 10.10.10.64
vrr:
enable: on
system:
global:
anycast-mac: 44:38:39:BE:EF:FF
vrf:
BLUE:
evpn:
enable: on
vni:
'4002': {}
router:
bgp:
address-family:
ipv4-unicast:
enable: on
redistribute:
static:
enable: on
route-export:
to-evpn:
enable: on
autonomous-system: 65254
enable: on
router-id: 10.10.10.64
static:
10.1.10.0/24:
address-family: ipv4-unicast
via:
10.1.102.4:
type: ipv4-address
10.1.20.0/24:
address-family: ipv4-unicast
via:
10.1.102.4:
type: ipv4-address
RED:
evpn:
enable: on
vni:
'4001': {}
router:
bgp:
address-family:
ipv4-unicast:
enable: on
redistribute:
static:
enable: on
route-export:
to-evpn:
enable: on
autonomous-system: 65254
enable: on
router-id: 10.10.10.64
static:
10.1.30.0/24:
address-family: ipv4-unicast
via:
10.1.101.4:
type: ipv4-address
default:
router:
bgp:
address-family:
ipv4-unicast:
enable: on
redistribute:
connected:
enable: on
enable: on
neighbor:
peerlink.4094:
peer-group: underlay
type: unnumbered
swp51:
peer-group: underlay
type: unnumbered
swp52:
peer-group: underlay
type: unnumbered
swp53:
peer-group: underlay
type: unnumbered
swp54:
peer-group: underlay
type: unnumbered
peer-group:
underlay:
address-family:
l2vpn-evpn:
enable: on
remote-as: external
cumulus@leaf01:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
clagd-vxlan-anycast-ip 10.0.1.12
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 3
bridge-access 30
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 1000
clagd-backup-ip 10.10.10.2
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 10
auto vlan10
iface vlan10
address 10.1.10.2/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:b1
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.2/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:b1
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.2/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:b1
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vlan4024_l3
iface vlan4024_l3
vrf RED
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:AA
vlan-id 4024
auto vlan4036_l3
iface vlan4036_l3
vrf BLUE
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:AA
vlan-id 4036
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30 4024=4001 4036=4002
bridge-vids 10 20 30 4024 4036
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
cumulus@leaf02:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.2/32
clagd-vxlan-anycast-ip 10.0.1.12
vxlan-local-tunnelip 10.10.10.2
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 3
bridge-access 30
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.1
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 10
auto vlan10
iface vlan10
address 10.1.10.3/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:af
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.3/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:af
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.3/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:af
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vlan4024_l3
iface vlan4024_l3
vrf RED
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:AA
vlan-id 4024
auto vlan4036_l3
iface vlan4036_l3
vrf BLUE
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:AA
vlan-id 4036
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30 4024=4001 4036=4002
bridge-vids 10 20 30 4024 4036
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
cumulus@leaf03:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.3/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.3
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 3
bridge-access 30
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.4
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 10
auto vlan10
iface vlan10
address 10.1.10.4/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:bb
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.4/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:bb
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.4/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:bb
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vlan4024_l3
iface vlan4024_l3
vrf RED
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:BB
vlan-id 4024
auto vlan4036_l3
iface vlan4036_l3
vrf BLUE
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:BB
vlan-id 4036
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30 4024=4001 4036=4002
bridge-vids 10 20 30 4024 4036
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:bb
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
cumulus@leaf04:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.4/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.4
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond1
iface bond1
mtu 9000
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-access 10
auto bond2
iface bond2
mtu 9000
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 2
bridge-access 20
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 3
bridge-access 30
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.3
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 10
auto vlan10
iface vlan10
address 10.1.10.5/24
address-virtual 00:00:00:00:00:10 10.1.10.1/24
hwaddress 44:38:39:22:01:c1
vrf RED
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
address 10.1.20.5/24
address-virtual 00:00:00:00:00:20 10.1.20.1/24
hwaddress 44:38:39:22:01:c1
vrf RED
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
address 10.1.30.5/24
address-virtual 00:00:00:00:00:30 10.1.30.1/24
hwaddress 44:38:39:22:01:c1
vrf BLUE
vlan-raw-device br_default
vlan-id 30
auto vlan4024_l3
iface vlan4024_l3
vrf RED
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:BB
vlan-id 4024
auto vlan4036_l3
iface vlan4036_l3
vrf BLUE
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:BB
vlan-id 4036
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20 30=30 4024=4001 4036=4002
bridge-vids 10 20 30 4024 4036
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:c1
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
cumulus@spine01:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@spine02:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@spine03:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.103/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@spine04:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.104/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto swp5
iface swp5
auto swp6
iface swp6
cumulus@border01:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.63/32
clagd-vxlan-anycast-ip 10.0.1.255
vxlan-local-tunnelip 10.10.10.63
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-vids 101 102
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 1000
clagd-backup-ip 10.10.10.64
clagd-sys-mac 44:38:39:BE:EF:FF
clagd-args --initDelay 10
auto vlan101
iface vlan101
address 10.1.101.64/24
address-virtual 00:00:00:00:00:01 10.1.101.1/24
hwaddress 44:38:39:22:01:ab
vrf RED
vlan-raw-device br_default
vlan-id 101
auto vlan102
iface vlan102
address 10.1.102.64/24
address-virtual 00:00:00:00:00:02 10.1.102.1/24
hwaddress 44:38:39:22:01:ab
vrf BLUE
vlan-raw-device br_default
vlan-id 102
auto vlan4024_l3
iface vlan4024_l3
vrf RED
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:FF
vlan-id 4024
auto vlan4036_l3
iface vlan4036_l3
vrf BLUE
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:FF
vlan-id 4036
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 4024=4001 4036=4002
bridge-vids 4024 4036
bridge-learning off
auto br_default
iface br_default
bridge-ports bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:ab
bridge-vlan-aware yes
bridge-vids 101 102
bridge-pvid 1
cumulus@border02:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.64/32
clagd-vxlan-anycast-ip 10.0.1.255
vxlan-local-tunnelip 10.10.10.64
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto swp53
iface swp53
auto swp54
iface swp54
auto bond3
iface bond3
mtu 9000
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow yes
clag-id 1
bridge-vids 101 102
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-priority 2000
clagd-backup-ip 10.10.10.63
clagd-sys-mac 44:38:39:BE:EF:FF
clagd-args --initDelay 10
auto vlan101
iface vlan101
address 10.1.101.65/24
address-virtual 00:00:00:00:00:01 10.1.101.1/24
hwaddress 44:38:39:22:01:b3
vrf RED
vlan-raw-device br_default
vlan-id 101
auto vlan102
iface vlan102
address 10.1.102.65/24
address-virtual 00:00:00:00:00:02 10.1.102.1/24
hwaddress 44:38:39:22:01:b3
vrf BLUE
vlan-raw-device br_default
vlan-id 102
auto vlan4024_l3
iface vlan4024_l3
vrf RED
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:FF
vlan-id 4024
auto vlan4036_l3
iface vlan4036_l3
vrf BLUE
vlan-raw-device br_default
address-virtual 44:38:39:BE:EF:FF
vlan-id 4036
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 4024=4001 4036=4002
bridge-vids 4024 4036
bridge-learning off
auto br_default
iface br_default
bridge-ports bond3 peerlink vxlan48
hwaddress 44:38:39:22:01:b3
bridge-vlan-aware yes
bridge-vids 101 102
bridge-pvid 1
This simulation starts with the example EVPN symmetric routing configuration. The demo is pre-configured using NVUE commands.
To validate the configuration, run the commands listed in the Troubleshooting EVPN section.
VXLAN Devices
Cumulus Linux supports both single and traditional VXLAN devices.
You can configure single VXLAN devices in VLAN-aware bridge mode only.
You cannot use a combination of single and traditional VXLAN devices.
A traditional VXLAN device configuration supports up to 2000 VNIs and a single VXLAN device configuration supports up to 4000 VNIs.
NVIDIA recommends you use single VXLAN devices instead of traditional VXLAN devices.
Single VXLAN Device
With a single VXLAN device, a set of VNIs represents a single device model. The single VXLAN device has a set of attributes that belong to the VXLAN construct. Individual VNIs include a VLAN to VNI mapping and you can specify which VLANs map to the associated VNIs. A single VXLAN device simplifies the configuration and reduces the overhead by replacing multiple traditional VXLAN devices with one device.
Cumulus Linux supports multiple single VXLAN devices when configured with multiple VLAN-aware bridges. You configure multiple single VXLAN devices in the same way you configure a single VXLAN device. Make sure not to duplicate VNIs across single VXLAN device configurations.
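The rule about not duplicating VNIs across single VXLAN devices can be checked offline before you apply a configuration. The following Python sketch is illustrative only: the device names and mappings are hypothetical, and the parser assumes the bridge-vlan-vni-map value is a plain list of vlan=vni pairs with no ranges.

```python
from collections import defaultdict


def parse_vlan_vni_map(value):
    """Parse a bridge-vlan-vni-map value such as '10=10 20=20' into {vlan: vni}."""
    mapping = {}
    for pair in value.split():
        vlan, vni = pair.split("=")
        mapping[int(vlan)] = int(vni)
    return mapping


def find_duplicate_vnis(devices):
    """Return {vni: [device, ...]} for every VNI used by more than one device."""
    owners = defaultdict(set)
    for device, map_value in devices.items():
        for vni in parse_vlan_vni_map(map_value).values():
            owners[vni].add(device)
    return {vni: sorted(names) for vni, names in owners.items() if len(names) > 1}


# Hypothetical single VXLAN devices, one per VLAN-aware bridge.
devices = {
    "vxlan48": "10=10 20=20 30=30",
    "vxlan49": "10=1010 20=20",
}
duplicates = find_duplicate_vnis(devices)
print(duplicates)
```

In this sample, VNI 20 appears on both devices, so the check reports it; an empty result means no VNI is shared.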
You can configure a single VXLAN device with NVUE or by manually editing the /etc/network/interfaces file.
When you configure a single VXLAN device with NVUE, Cumulus Linux creates a unique name for the device in the format vxlan<id>. Cumulus Linux generates the ID using the bridge name as the hash key.
The following static VXLAN example configuration:
Creates a single VXLAN device (vxlan48)
Maps VLAN 10 to VNI 10 and VLAN 20 to VNI 20
Adds the VXLAN device to the default bridge br_default
Sets the flooding multicast group for VNI 10 to 239.1.1.110 and the multicast group for VNI 20 to 239.1.1.120
Edit the /etc/network/interfaces file, then run the ifreload -a command.
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1
bridge-access 10
auto swp2
iface swp2
bridge-access 20
auto vxlan48
iface vxlan48
vxlan-mcastgrp-map 10=239.1.1.110 20=239.1.1.120
bridge-vlan-vni-map 10=10 20=20
bridge-vids 10 20
bridge-learning off
auto br_default
iface br_default
bridge-ports swp1 swp2 vxlan48
hwaddress 44:38:39:22:01:ab
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@leaf01:~$ sudo ifreload -a
Traditional VXLAN Device
With a traditional VXLAN device, each VNI is a separate device (for example, vni10, vni20, vni30).
You can configure traditional VXLAN devices by manually editing the /etc/network/interfaces file.
The following example configuration:
Creates two unique VXLAN devices (vni10 and vni20)
Adds each VXLAN device (vni10 and vni20) to the bridge called bridge
Configures the local tunnel IP address to be the loopback address of the switch
You cannot use NVUE commands to configure traditional VXLAN devices.
Edit the /etc/network/interfaces file, then run the ifreload -a command.
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
vrf-table auto
auto swp1
iface swp1
bridge-access 10
auto swp2
iface swp2
bridge-access 20
auto vni10
iface vni10
bridge-access 10
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 10
auto vni20
iface vni20
bridge-access 20
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 20
auto bridge
iface bridge
bridge-ports swp1 swp2 vni10 vni20
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@leaf01:~$ sudo ifreload -a
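Because each VNI is its own device in the traditional model, the stanzas repeat once per VNI. The following Python sketch generates them from a VLAN to VNI mapping. It is a convenience illustration, not an NVIDIA tool: the function name is hypothetical, the vni<id> naming follows the example above, and the output uses four-space indentation.

```python
def traditional_vni_stanzas(vlan_to_vni):
    """Build one /etc/network/interfaces stanza per traditional VXLAN device."""
    lines = []
    for vlan, vni in sorted(vlan_to_vni.items()):
        name = f"vni{vni}"
        lines += [
            f"auto {name}",
            f"iface {name}",
            f"    bridge-access {vlan}",
            "    mstpctl-bpduguard yes",
            "    mstpctl-portbpdufilter yes",
            f"    vxlan-id {vni}",
        ]
    return "\n".join(lines)


# Same mapping as the example above: VLAN 10 -> VNI 10, VLAN 20 -> VNI 20.
stanzas = traditional_vni_stanzas({10: 10, 20: 20})
print(stanzas)
```

You still need to add each generated device name to the bridge-ports line of the bridge yourself.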
Automatic VLAN to VNI Mapping
In an EVPN VXLAN environment, you need to map individual VLANs to VNIs. For a single VXLAN device, you can do this with a separate NVUE command per VLAN; however, this can be cumbersome if you have to configure many VLANs or need to isolate tenants and reuse VLANs. To simplify the configuration, you can use these two commands instead:
nv set bridge domain <bridge> vlan <vlans> vni auto configures the specified VLANs to use automatic mapping.
nv set bridge domain <bridge> vlan-vni-offset configures the offset you want to use for the VNIs. For example, if you specify an offset of 10000, the VNI is the VLAN plus 10000.
The following commands automatically set the VNIs for VLAN 10, 20, 30, 40, and 50 on the default bridge (br_default) to 10010, 10020, 10030, 10040, and 10050, and set the VNIs for VLAN 10, 20, 30, 40, and 50 on bridge br_01 to 20010, 20020, 20030, 20040, and 20050:
cumulus@switch:mgmt:~$ nv set bridge domain br_default vlan 10,20,30,40,50 vni auto
cumulus@switch:mgmt:~$ nv set bridge domain br_default vlan-vni-offset 10000
cumulus@switch:mgmt:~$ nv set bridge domain br_01 vlan 10,20,30,40,50 vni auto
cumulus@switch:mgmt:~$ nv set bridge domain br_01 vlan-vni-offset 20000
cumulus@switch:mgmt:~$ nv config apply
You cannot use automatic NVUE VLAN to VNI mapping commands to configure static VXLAN tunnels.
The following configuration example configures VLANs 10, 20, and 30. The VLANs map automatically to VNIs with an offset of 10000.
cumulus@switch:mgmt:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@switch:mgmt:~$ nv set interface swp1-2 bridge domain br_default
cumulus@switch:mgmt:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@switch:mgmt:~$ nv set interface vlan10
cumulus@switch:mgmt:~$ nv set interface vlan20
cumulus@switch:mgmt:~$ nv set interface vlan30
cumulus@switch:mgmt:~$ nv set bridge domain br_default vlan 10,20,30 vni auto
cumulus@switch:mgmt:~$ nv set bridge domain br_default vlan-vni-offset 10000
cumulus@switch:mgmt:~$ nv config apply
cumulus@switch:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.1/32
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:ab
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:ab
vlan-raw-device br_default
vlan-id 20
auto vlan30
iface vlan30
hwaddress 44:38:39:22:01:ab
vlan-raw-device br_default
vlan-id 30
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10010 20=10020 30=10030
bridge-learning off
auto br_default
iface br_default
bridge-ports swp1 swp2 vxlan48
hwaddress 44:38:39:22:01:ab
bridge-vlan-aware yes
bridge-vids 10 20 30
bridge-pvid 1
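The automatic mapping is simple arithmetic: each VNI is the VLAN ID plus the configured offset. The following Python sketch reproduces the bridge-vlan-vni-map line shown in the file above. The function names are hypothetical, and the range check assumes only that a VNI is a 24-bit value.

```python
MAX_VNI = 2**24 - 1  # a VNI is a 24-bit identifier


def auto_vni_map(vlans, offset):
    """Return {vlan: vni}, where each VNI is the VLAN ID plus the offset."""
    mapping = {}
    for vlan in vlans:
        vni = vlan + offset
        if not 1 <= vni <= MAX_VNI:
            raise ValueError(f"VNI {vni} for VLAN {vlan} is out of range")
        mapping[vlan] = vni
    return mapping


def format_vlan_vni_map(mapping):
    """Render the mapping as a bridge-vlan-vni-map value."""
    return " ".join(f"{vlan}={vni}" for vlan, vni in sorted(mapping.items()))


mapping = auto_vni_map([10, 20, 30], 10000)
line = "bridge-vlan-vni-map " + format_vlan_vni_map(mapping)
print(line)
```

With VLANs 10, 20, and 30 and an offset of 10000, the sketch prints the same mapping that appears under vxlan48 in the example file.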
VXLAN UDP Port
You can change the UDP port that Cumulus Linux uses for VXLAN encapsulation. The default port is 4789.
The following example changes the UDP port for VXLAN encapsulation to 1024:
cumulus@switch:mgmt:~$ nv set nve vxlan port 1024
TC Filters
NVIDIA recommends you run TC filter commands on each VLAN interface on the VTEP to install rules to protect the UDP port that Cumulus Linux uses for VXLAN encapsulation against VXLAN hopping vulnerabilities. If you have VRR configured on the VLAN, add a similar rule for the VRR device.
The following example installs an IPv4 and an IPv6 filter on vlan10 to protect the default port 4789:
cumulus@switch:mgmt:~$ tc filter add dev vlan10 prio 1 protocol ip ingress flower ip_proto udp dst_port 4789 action drop
cumulus@switch:mgmt:~$ tc filter add dev vlan10 prio 2 protocol ipv6 ingress flower ip_proto udp dst_port 4789 action drop
The following example installs an IPv4 and an IPv6 filter on VRR device vlan10-v0 to protect port 4789:
cumulus@switch:mgmt:~$ tc filter add dev vlan10-v0 prio 1 protocol ip ingress flower ip_proto udp dst_port 4789 action drop
cumulus@switch:mgmt:~$ tc filter add dev vlan10-v0 prio 2 protocol ipv6 ingress flower ip_proto udp dst_port 4789 action drop
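If a VTEP has many VLAN interfaces, you might generate the filter commands instead of typing them. The following Python sketch only builds the command strings and does not run them. The function name and the device list are hypothetical, and the port defaults to 4789, the default given in the VXLAN UDP Port section; pass your own value if you changed the port.

```python
def tc_filter_commands(devices, port=4789):
    """Build one IPv4 and one IPv6 drop filter command per device."""
    commands = []
    for dev in devices:
        for prio, protocol in ((1, "ip"), (2, "ipv6")):
            commands.append(
                f"tc filter add dev {dev} prio {prio} protocol {protocol} "
                f"ingress flower ip_proto udp dst_port {port} action drop"
            )
    return commands


# One VLAN interface and its VRR device, as in the examples above.
commands = tc_filter_commands(["vlan10", "vlan10-v0"])
for command in commands:
    print(command)
```

Review the generated commands before you run them on the switch.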
Related Information
For information about VXLAN devices and static VXLAN tunnels, see Static VXLAN Tunnels.
For information about VXLAN devices and EVPN, see EVPN.
VXLAN Routing
VXLAN routing, sometimes referred to as inter-VXLAN routing, provides IP routing between VXLAN VNIs in overlay networks. Cumulus Linux routes traffic using the inner header or the overlay tenant IP address.
Because VXLAN routing is fundamentally routing, you typically deploy it with a control plane, such as Ethernet Virtual Private Network (EVPN). You can also set up static routing for MAC distribution and BUM handling.
For a detailed description of different VXLAN routing models and configuration examples, refer to EVPN.
VXLAN routing supports full layer 3 multi-tenancy; all routing occurs in the context of a VRF. Also, VXLAN routing supports dual-attached hosts where the associated VTEPs function in active-active mode.
Static VXLAN Tunnels
Static VXLAN tunnels serve to connect two VTEPs in a given environment. Static VXLAN tunnels are the simplest deployment mechanism for small scale environments and are interoperable with other vendors that adhere to VXLAN standards. Because you map which VTEPs are in a particular VNI, you can avoid the tedious process of defining connections to every VLAN on every other VTEP on every other rack.
Cumulus Linux supports more than one VXLAN ID per VLAN-aware bridge but does not support more than one VXLAN ID per traditional bridge.
Configure Static VXLAN Tunnels
To configure static VXLAN tunnels, you create VXLAN devices. Cumulus Linux supports:
Traditional VXLAN devices, where you configure unique VXLAN devices and add each device to the bridge.
Single VXLAN devices, where all VXLAN tunnels with the same settings (local tunnel IP address and VXLAN remote IP addresses) can share the same VXLAN device and you only need to add the single VXLAN device to the bridge.
The configuration examples use the following topology. Each IP address corresponds to the loopback address of the switch.
Traditional VXLAN Device
The following traditional VXLAN device configuration:
Sets the loopback address on each leaf
Creates two unique VXLAN devices (vni10 and vni20)
Configures the local tunnel IP address to be the loopback address of the switch
Enables bridge learning on each VXLAN device
Creates the tunnels on each VXLAN device by specifying the loopback addresses of the other leafs
Adds both VXLAN devices (vni10 and vni20) to the bridge called bridge
Cumulus Linux does not provide NVUE commands for traditional VXLAN device configuration.
Edit the /etc/network/interfaces file, then run the ifreload -a command.
auto lo
iface lo inet loopback
address 10.10.10.1/32
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
This simulation starts with the example static VXLAN configuration. The demo is pre-configured using NVUE commands.
To validate the configuration, run the verification commands shown below.
The above NVUE commands specify a different flooding list for each VNI. If you want to set the same flooding list for all VNIs, you can use the nv set nve vxlan flooding head-end-replication command; for example:
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf01:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf01:~$ nv set nve vxlan mac-learning on
cumulus@leaf01:~$ nv set nve vxlan source address 10.10.10.1
cumulus@leaf01:~$ nv set nve vxlan flooding head-end-replication 10.10.10.2
cumulus@leaf01:~$ nv set nve vxlan flooding head-end-replication 10.10.10.3
cumulus@leaf01:~$ nv set nve vxlan flooding head-end-replication 10.10.10.4
cumulus@leaf01:~$ nv set interface swp1 bridge domain br_default access 10
cumulus@leaf01:~$ nv set interface swp2 bridge domain br_default access 20
cumulus@leaf01:~$ nv config apply
The above commands create this configuration in the /etc/network/interfaces file:
...
auto vxlan48
iface vxlan48
vxlan-remoteip-map 10=10.10.10.2 10=10.10.10.3 10=10.10.10.4 20=10.10.10.2 20=10.10.10.3 20=10.10.10.4
bridge-vlan-vni-map 10=10 20=20
bridge-learning on
...
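With a shared flooding list, every VNI on the device pairs with every remote VTEP, which is why the vxlan-remoteip-map value above has six entries. The following Python sketch rebuilds that value from the two lists. The function name is hypothetical, and the sketch checks only that each address parses as an IP address.

```python
import ipaddress


def remoteip_map(vnis, remote_vteps):
    """Pair every VNI with every remote VTEP, in vni=ip form."""
    for ip in remote_vteps:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
    return " ".join(f"{vni}={ip}" for vni in vnis for ip in remote_vteps)


value = remoteip_map([10, 20], ["10.10.10.2", "10.10.10.3", "10.10.10.4"])
print("vxlan-remoteip-map " + value)
```

For VNIs 10 and 20 and the three remote loopback addresses, the output matches the vxlan-remoteip-map line in the file above.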
Verify the Configuration
After you configure all the leafs, run the following command to check for replication entries. The command output is different for traditional and single VXLAN devices.
For traditional VXLAN devices:
cumulus@leaf01:~$ sudo bridge fdb show | grep 00:00:00:00:00:00
00:00:00:00:00:00 dev vni10 dst 10.10.10.3 self permanent
00:00:00:00:00:00 dev vni10 dst 10.10.10.2 self permanent
00:00:00:00:00:00 dev vni20 dst 10.10.10.4 self permanent
For single VXLAN devices:
cumulus@leaf01:mgmt:~$ sudo bridge fdb show | grep 00:00:00:00:00:00
00:00:00:00:00:00 dev vxlan48 dst 10.10.10.2 src_vni 10 self permanent
00:00:00:00:00:00 dev vxlan48 dst 10.10.10.3 src_vni 10 self permanent
00:00:00:00:00:00 dev vxlan48 dst 10.10.10.4 src_vni 20 self permanent
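To compare the replication entries against the flooding list you intended, you can group the output by device and VNI. The following Python sketch parses text in the format shown above. It is an assumption that real output keeps the dev, dst, and src_vni keywords in this form; the function names are hypothetical.

```python
from collections import defaultdict

ZERO_MAC = "00:00:00:00:00:00"


def value_after(fields, keyword):
    """Return the token that follows keyword, or None if it is absent."""
    if keyword in fields:
        index = fields.index(keyword)
        if index + 1 < len(fields):
            return fields[index + 1]
    return None


def replication_entries(fdb_output):
    """Group remote VTEPs by (device, src_vni); src_vni is None when absent."""
    entries = defaultdict(list)
    for line in fdb_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != ZERO_MAC:
            continue
        device = value_after(fields, "dev")
        remote = value_after(fields, "dst")
        if device is None or remote is None:
            continue
        entries[(device, value_after(fields, "src_vni"))].append(remote)
    return dict(entries)


# Sample text copied from the single VXLAN device output above.
sample = """\
00:00:00:00:00:00 dev vxlan48 dst 10.10.10.2 src_vni 10 self permanent
00:00:00:00:00:00 dev vxlan48 dst 10.10.10.3 src_vni 10 self permanent
00:00:00:00:00:00 dev vxlan48 dst 10.10.10.4 src_vni 20 self permanent
"""
entries = replication_entries(sample)
print(entries)
```

For traditional VXLAN devices, the lines carry no src_vni field, so the second element of each key is None and the device name identifies the VNI.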
Cumulus Linux disables bridge learning and enables ARP suppression by default on VXLAN interfaces. You can change the default behavior to set bridge learning on and ARP suppression off for all VNIs by creating a policy file called bridge.json in the /etc/network/ifupdown2/policy.d/ directory. For example:
After you create the file, run ifreload -a to load the new configuration.
VXLAN Active-active Mode
VXLAN active-active mode enables a pair of MLAG switches to act as a single VTEP, providing active-active VXLAN termination for bare metal as well as virtualized workloads.
To use VXLAN active-active mode, you need to configure:
To configure VXLAN active-active mode, you must provision each switch in an MLAG pair with a virtual IP address for VXLAN data-path termination. The VXLAN termination address is an anycast IP address that you configure under the loopback interface. With MLAG peering, both switches use the anycast IP address for VXLAN encapsulation and decapsulation. This enables remote VTEPs to learn the host MAC addresses attached to the MLAG switches against one logical VTEP, even though the switches independently encapsulate and decapsulate layer 2 traffic originating from the host.
MLAG dynamically adds and removes the anycast IP address as the loopback interface address as follows:
When the switches boot up, all VXLAN interfaces are in a PROTO_DOWN state. The anycast IP address is not in use.
MLAG peering takes place and a successful VXLAN interface consistency check between the switches occurs.
The clagd daemon adds the anycast address to the loopback interface as a second address. It then changes the local IP address of the VXLAN interface from a unique address to the anycast IP address and puts the interface in an UP state.
The active-active configuration for a given VXLAN interface must be consistent between both switches in the MLAG pair; MLAG ensures that the configuration is consistent before bringing up the VXLAN interfaces.
The anycast virtual IP address for VXLAN termination must be the same on both switches in the MLAG pair.
You must configure a VXLAN interface with the same VXLAN ID, which must be administratively up on both switches in the MLAG pair. Run the clagctl command to check if any VXLAN interfaces are in a PROTO_DOWN state.
If you use VXLAN active-active with EVPN symmetric mode, you must set the anycast MAC address on both switches in the MLAG pair; see Advertise Primary IP Address.
To configure the anycast IP address:
Run the nv set nve vxlan mlag shared-address command.
Add the clagd-vxlan-anycast-ip parameter under the loopback interface in the /etc/network/interfaces file:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
clagd-vxlan-anycast-ip 10.0.1.12
...
cumulus@leaf02:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.2/32
clagd-vxlan-anycast-ip 10.0.1.12
...
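Because the anycast IP address must match on both MLAG peers, it can help to compare the two files before you reload the configuration. The following is a sketch only, assuming you have copied each switch's /etc/network/interfaces file to one machine. The function names anycast_ip and same_anycast_ip are hypothetical.

```shell
#!/bin/sh
# anycast_ip: hypothetical helper. Prints the clagd-vxlan-anycast-ip value
# found in an /etc/network/interfaces style file read from stdin.
anycast_ip() {
    awk '$1 == "clagd-vxlan-anycast-ip" { print $2; exit }'
}

# same_anycast_ip FILE1 FILE2: succeeds only if both files set the same,
# non-empty anycast address.
same_anycast_ip() {
    a=$(anycast_ip < "$1")
    b=$(anycast_ip < "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Assumed usage, with example file names:
#   same_anycast_ip leaf01.interfaces leaf02.interfaces || echo "mismatch"
```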
When you use EVPN with MLAG, EVPN might install local MAC addresses or neighbor entries as remote entries. To prevent EVPN from taking ownership of local MAC addresses or neighbor entries from MLAG, you can associate all local layer 2 VNIs with a unique site ID, which represents an MLAG pair. See Configure a Site ID for MLAG.
Troubleshooting
This section describes VXLAN active-active failure conditions and provides troubleshooting commands.
Failure Conditions
Failure Condition | Behavior
The peer link goes down. | The primary MLAG switch continues to keep all VXLAN interfaces up with the anycast IP address while the secondary switch brings down all VXLAN interfaces and places them in a PROTO_DOWN state. The secondary MLAG switch removes the anycast IP address from the loopback interface.
One of the switches goes down. | The other operational switch continues to use the anycast IP address.
clagd stops. | All VXLAN interfaces go in a PROTO_DOWN state. The switch removes the anycast IP address from the loopback interface and the local IP addresses of the VXLAN interfaces change from the anycast IP address to unique non-virtual IP addresses.
MLAG peering does not establish between the switches. | clagd brings up all the VXLAN interfaces after the reload timer expires with the configured anycast IP address. This allows the VXLAN interface to be up and running on both switches even though peering is not established.
The peer link goes down but the peer switch is up (the backup link is active). | All VXLAN interfaces go into a PROTO_DOWN state on the secondary switch.
The anycast IP address is different on the MLAG peers. | The VXLAN interface goes into a PROTO_DOWN state on the secondary switch.
Troubleshooting Commands
To show the MLAG configuration on the switch, run the NVUE nv show mlag command:
cumulus@leaf01:mgmt:~$ nv show mlag
operational applied description
-------------- ----------------------- ----------------- ------------------------------------------------------
enable on Turn the feature 'on' or 'off'. The default is 'off'.
debug off Enable MLAG debugging
init-delay 180 The delay, in seconds, before bonds are brought up.
mac-address 44:38:39:be:ef:aa 44:38:39:BE:EF:AA Override anycast-mac and anycast-id
peer-ip fe80::4638:39ff:fe00:5a linklocal Peer Ip Address
priority 32768 32768 Mlag Priority
[backup] 10.10.10.2 10.10.10.2 Set of MLAG backups
anycast-ip 10.0.1.12 Vxlan Anycast Ip address
backup-active True Mlag Backup Status
backup-reason Mlag Backup Reason
local-id 44:38:39:00:00:59 Mlag Local Unique Id
local-role primary Mlag Local Role
peer-alive True Mlag Peer Alive Status
peer-id 44:38:39:00:00:5a Mlag Peer Unique Id
peer-interface peerlink.4094 Mlag Peerlink Interface
peer-priority 32768 Mlag Peer Priority
peer-role secondary Mlag Peer Role
To show the MLAG neighbor information on the switch, run the NVUE nv show mlag neighbor command:
To show MLAG behavior and any inconsistencies between an MLAG pair, run the clagctl command.
In the following example, no conflicts exist for this MLAG interface and the VXLAN is up and running (there is no Proto-Down). The VXLAN anycast IP address shared by the MLAG pair for VTEP termination is in use and is 10.0.1.12.
cumulus@leaf01$ clagctl
The peer is alive
Our Priority, ID, and Role: 32768 44:38:39:00:00:59 primary
Peer Priority, ID, and Role: 32768 44:38:39:00:00:5a secondary
Peer Interface and IP: peerlink.4094 fe80::4638:39ff:fe00:5a (linklocal)
VxLAN Anycast IP: 10.0.1.12
Backup IP: 10.10.10.2 (active)
System MAC: 44:38:39:be:ef:aa
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
bond1 - 1 - -
vxlan48 vxlan48 - - -
In the following example, the primary switch has the wrong VXLAN anycast IP address configured. When you run the clagctl command on the secondary switch, the Proto-Down Reason shows anycast-ip-mismatch on bond1 and vxlan-single,anycast-ip-mismatch on vxlan48.
cumulus@leaf04:mgmt:~$ clagctl
The peer is alive
Our Priority, ID, and Role: 32768 44:38:39:00:00:5e secondary
Peer Priority, ID, and Role: 32768 44:38:39:00:00:5d primary
Peer Interface and IP: peerlink.4094 fe80::4638:39ff:fe00:5d (linklocal)
VxLAN Anycast IP: 10.0.1.34
Backup IP: 10.10.10.3 (active)
System MAC: 44:38:39:be:ef:bb
CLAG Interfaces
Our Interface Peer Interface CLAG Id Conflicts Proto-Down Reason
---------------- ---------------- ------- -------------------- -----------------
bond1 - 1 - anycast-ip-mismatch
vxlan48 - - - vxlan-single,anycast-ip-mismatch
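To pick out only the interfaces that are in a Proto-Down state, you can filter the clagctl output. The following is a rough sketch, assuming the CLAG Interfaces table has the five columns shown above and that no column value contains whitespace. The function name proto_down_ifaces is hypothetical.

```shell
#!/bin/sh
# proto_down_ifaces: hypothetical helper. Reads clagctl output on stdin and
# prints "<interface> <reason>" for each row after the dashed separator line
# whose last column (Proto-Down Reason) is not "-".
# Assumption: column values never contain whitespace.
proto_down_ifaces() {
    awk 'in_table && NF >= 5 && $NF != "-" { print $1, $NF }
         /^-+[ -]*$/ { in_table = 1 }'
}

# Assumed usage on a switch: clagctl | proto_down_ifaces
```

Empty output means no interface in the table reports a Proto-Down reason.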
Configuration Example
The commands in this example configure:
MLAG between leaf01 and leaf02, and between leaf03 and leaf04.
BGP unnumbered on all leafs and spines.
EVPN as the control plane for VXLAN between BGP neighbors.
A single VXLAN device (vxlan48) on each leaf. VLAN 10 maps to VNI 10 and VLAN 20 to VNI 20. The VXLAN device is part of the default bridge br_default.
The anycast IP address 10.0.1.12 on leaf01 and leaf02, and 10.0.1.34 on leaf03 and leaf04.
Layer 2 bonds that link server01 to leaf01 and leaf02, and server04 to leaf03 and leaf04. The example shows the server01 and server04 /etc/network/interfaces file configuration.
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1,swp49-52
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond1 bridge domain br_default
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv set interface vlan10
cumulus@leaf01:~$ nv set interface vlan20
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf01:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf01:~$ nv set nve vxlan mlag shared-address 10.0.1.12
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp neighbor peerlink.4094 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.1/32
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf01:~$ nv set evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp52 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv set vrf default router bgp neighbor peerlink.4094 address-family l2vpn-evpn enable on
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp1,swp49-52
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:~$ nv set interface bond1 bridge domain br_default
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv set interface vlan10
cumulus@leaf02:~$ nv set interface vlan20
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf02:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf02:~$ nv set nve vxlan mlag shared-address 10.0.1.12
cumulus@leaf02:~$ nv set router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf02:~$ nv set vrf default router bgp neighbor peerlink.4094 remote-as external
cumulus@leaf02:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.2/32
cumulus@leaf02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf02:~$ nv set evpn enable on
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp51 address-family l2vpn-evpn enable on
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp52 address-family l2vpn-evpn enable on
cumulus@leaf02:~$ nv set vrf default router bgp neighbor peerlink.4094 address-family l2vpn-evpn enable on
cumulus@leaf02:~$ nv config apply
cumulus@leaf03:~$ nv set interface lo ip address 10.10.10.3/32
cumulus@leaf03:~$ nv set interface swp1,swp49-52
cumulus@leaf03:~$ nv set interface bond1 bond member swp1
cumulus@leaf03:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf03:~$ nv set interface bond1 bridge domain br_default
cumulus@leaf03:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf03:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf03:~$ nv set mlag backup 10.10.10.4
cumulus@leaf03:~$ nv set mlag peer-ip linklocal
cumulus@leaf03:~$ nv set interface vlan10
cumulus@leaf03:~$ nv set interface vlan20
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf03:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf03:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf03:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf03:~$ nv set router bgp autonomous-system 65103
cumulus@leaf03:~$ nv set router bgp router-id 10.10.10.3
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf03:~$ nv set vrf default router bgp neighbor peerlink.4094 remote-as external
cumulus@leaf03:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.3/32
cumulus@leaf03:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf03:~$ nv set evpn enable on
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp51 address-family l2vpn-evpn enable on
cumulus@leaf03:~$ nv set vrf default router bgp neighbor swp52 address-family l2vpn-evpn enable on
cumulus@leaf03:~$ nv set vrf default router bgp neighbor peerlink.4094 address-family l2vpn-evpn enable on
cumulus@leaf03:~$ nv config apply
cumulus@leaf04:~$ nv set interface lo ip address 10.10.10.4/32
cumulus@leaf04:~$ nv set interface swp1,swp49-52
cumulus@leaf04:~$ nv set interface bond1 bond member swp1
cumulus@leaf04:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf04:~$ nv set interface bond1 bridge domain br_default
cumulus@leaf04:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf04:~$ nv set mlag mac-address 44:38:39:BE:EF:BB
cumulus@leaf04:~$ nv set mlag backup 10.10.10.3
cumulus@leaf04:~$ nv set mlag peer-ip linklocal
cumulus@leaf04:~$ nv set interface vlan10
cumulus@leaf04:~$ nv set interface vlan20
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10,20
cumulus@leaf04:~$ nv set bridge domain br_default vlan 10 vni 10
cumulus@leaf04:~$ nv set bridge domain br_default vlan 20 vni 20
cumulus@leaf04:~$ nv set nve vxlan mlag shared-address 10.0.1.34
cumulus@leaf04:~$ nv set router bgp autonomous-system 65104
cumulus@leaf04:~$ nv set router bgp router-id 10.10.10.4
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf04:~$ nv set vrf default router bgp neighbor peerlink.4094 remote-as external
cumulus@leaf04:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.4/32
cumulus@leaf04:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf04:~$ nv set evpn enable on
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp51 address-family l2vpn-evpn enable on
cumulus@leaf04:~$ nv set vrf default router bgp neighbor swp52 address-family l2vpn-evpn enable on
cumulus@leaf04:~$ nv set vrf default router bgp neighbor peerlink.4094 address-family l2vpn-evpn enable on
cumulus@leaf04:~$ nv config apply
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1-4
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp2 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp3 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp4 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp2 address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp3 address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp4 address-family l2vpn-evpn enable on
cumulus@spine01:~$ nv config apply
cumulus@spine02:~$ nv set interface lo ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp1-4
cumulus@spine02:~$ nv set router bgp autonomous-system 65199
cumulus@spine02:~$ nv set router bgp router-id 10.10.10.102
cumulus@spine02:~$ nv set vrf default router bgp peer-group underlay remote-as external
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp2 remote-as external
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp3 remote-as external
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp4 remote-as external
cumulus@spine02:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp1 address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp2 address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp3 address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv set vrf default router bgp neighbor swp4 address-family l2vpn-evpn enable on
cumulus@spine02:~$ nv config apply
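In the leaf commands above, only a few values change from switch to switch: the loopback address and the MLAG backup address are unique to each leaf, while the MLAG MAC address and the VXLAN shared address are the same on both switches in a pair. The following sketch prints, but does not run, those commands for one leaf. The function name mlag_vtep_cmds and its argument order are hypothetical.

```shell
#!/bin/sh
# mlag_vtep_cmds LOOPBACK BACKUP_IP SYS_MAC ANYCAST_IP: hypothetical helper.
# Prints (does not execute) the NVUE commands from the example above that
# differ between the leafs.
mlag_vtep_cmds() {
    printf 'nv set interface lo ip address %s/32\n' "$1"
    printf 'nv set mlag backup %s\n' "$2"
    printf 'nv set mlag mac-address %s\n' "$3"
    printf 'nv set nve vxlan mlag shared-address %s\n' "$4"
}

# Values for leaf01 and leaf02, taken from the example above:
#   mlag_vtep_cmds 10.10.10.1 10.10.10.2 44:38:39:BE:EF:AA 10.0.1.12
#   mlag_vtep_cmds 10.10.10.2 10.10.10.1 44:38:39:BE:EF:AA 10.0.1.12
```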
cumulus@leaf01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
clagd-vxlan-anycast-ip 10.0.1.12
vxlan-local-tunnelip 10.10.10.1
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.2
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 180
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@leaf02:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.2/32
clagd-vxlan-anycast-ip 10.0.1.12
vxlan-local-tunnelip 10.10.10.2
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.1
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 180
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 peerlink vxlan48
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@leaf03:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.3/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.3
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.4
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 180
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:bb
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:bb
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 peerlink vxlan48
hwaddress 44:38:39:22:01:bb
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@leaf04:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.4/32
clagd-vxlan-anycast-ip 10.0.1.34
vxlan-local-tunnelip 10.10.10.4
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.3
clagd-sys-mac 44:38:39:BE:EF:BB
clagd-args --initDelay 180
auto vlan10
iface vlan10
hwaddress 44:38:39:22:01:c1
vlan-raw-device br_default
vlan-id 10
auto vlan20
iface vlan20
hwaddress 44:38:39:22:01:c1
vlan-raw-device br_default
vlan-id 20
auto vxlan48
iface vxlan48
bridge-vlan-vni-map 10=10 20=20
bridge-learning off
auto br_default
iface br_default
bridge-ports bond1 peerlink vxlan48
hwaddress 44:38:39:22:01:c1
bridge-vlan-aware yes
bridge-vids 10 20
bridge-pvid 1
cumulus@spine01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
cumulus@spine02:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
auto lo
iface lo inet loopback
auto lo
iface lo inet static
address 10.0.0.31/32
To validate the configuration, run the commands shown in the Troubleshooting section above.
For a full EVPN symmetric active-active configuration example, see Configuration Examples.
Bridge Layer 2 Protocol Tunneling
A VXLAN connects layer 2 domains across a layer 3 fabric; however, layer 2 protocol packets, such as LLDP, LACP, STP, and CDP, stop at the ingress VTEP. If you want the VXLAN to behave more like a wire or hub, where the switch tunnels protocol packets instead of terminating them locally, you can enable bridge layer 2 protocol tunneling.
Configure Bridge Layer 2 Protocol Tunneling
To configure bridge layer 2 protocol tunneling for all protocols:
Cumulus Linux does not provide NVUE commands for this setting.
Add bridge-l2protocol-tunnel all to the interface stanza and the VNI stanza of the /etc/network/interfaces file:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1
bridge-access 10
bridge-l2protocol-tunnel all
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
...
auto vni10
iface vni10
bridge-access 10
bridge-l2protocol-tunnel all
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 10
vxlan-local-tunnelip 10.10.10.1
To configure bridge layer 2 protocol tunneling for a specific protocol, such as LACP:
Cumulus Linux does not provide NVUE commands for this configuration.
Add bridge-l2protocol-tunnel <protocol> to the interface stanza and the VNI stanza of the /etc/network/interfaces file:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1
bridge-access 10
bridge-l2protocol-tunnel lacp
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
...
auto vni10
iface vni10
bridge-access 10
bridge-l2protocol-tunnel lacp
bridge-learning off
mstpctl-bpduguard yes
mstpctl-portbpdufilter yes
vxlan-id 10
vxlan-local-tunnelip 10.10.10.1
You must enable layer 2 protocol tunneling on the VXLAN link in addition to the interface so that the packets get bridged and forwarded correctly.
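One way to confirm that both the host-facing port and the VNI carry the setting is to list every stanza that sets it. The following is a minimal sketch, assuming each stanza starts with an iface line. The function name l2pt_stanzas is hypothetical.

```shell
#!/bin/sh
# l2pt_stanzas: hypothetical helper. Reads an /etc/network/interfaces style
# file on stdin and prints "<interface> <protocols>" for every iface stanza
# that sets bridge-l2protocol-tunnel.
l2pt_stanzas() {
    awk '$1 == "iface" { name = $2 }
         $1 == "bridge-l2protocol-tunnel" {
             $1 = ""; sub(/^ +/, ""); print name, $0
         }'
}

# Assumed usage on a switch: l2pt_stanzas < /etc/network/interfaces
```

For the LACP configuration above, you would expect one line for swp1 and one line for the VNI, both showing lacp.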
LLDP Example
Here is another example configuration for Link Layer Discovery Protocol. You can verify the configuration with lldpcli.
cumulus@switch:~$ sudo lldpcli show neighbors
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: swp23, via LLDP, RID: 13, Time: 0 day, 00:58:20
Chassis:
ChassisID: mac e4:1d:2d:f7:d5:52
SysName: H1
MgmtIP: 10.0.2.207
MgmtIP: fe80::e61d:2dff:fef7:d552
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp14
PortDesc: swp14
TTL: 120
PMD autoneg: support: yes, enabled: yes
Adv: 1000Base-T, HD: no, FD: yes
MAU oper type: 40GbaseCR4 - 40GBASE-R PCS/PMA over 4 lane shielded copper balanced cable
...
LACP Example
H2 bond0:
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer 3+4(1)
802.3ad: info
LACP rate: fast
Min links: 1
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: cc:37:ab:e7:b5:7e
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Slave Interface: eth0
...
details partner lacp pdu:
system priority: 65535
system MAC address: 44:38:39:00:a4:95
...
Slave Interface: eth1
...
details partner lacp pdu:
system priority: 65535
system MAC address: 44:38:39:00:a4:95
Pseudowire Example
In this example:
Only two VTEPs are in the VXLAN. VTEP1 and VTEP2 point to each other as the only remote VTEP.
The bridge on each VTEP is in 802.1ad mode.
The host interface is an 802.1Q VLAN trunk.
The setting for bridge-l2protocol-tunnel is all.
The VTEP host-facing port is in access mode and the PVID maps to the VNI.
Considerations
Use caution when enabling bridge layer 2 protocol tunneling:
Layer 2 protocol tunneling is not a full-featured pseudowire solution; end-to-end link status tracking or feedback does not exist.
Layer 2 protocols typically run on a link-local scope. Running the protocols through a tunnel across a layer 3 fabric incurs higher latency, which can require you to tune protocol timers.
The lack of end-to-end link or tunnel status feedback and the higher protocol timeout values increase protocol convergence time when there are changes.
If the remote endpoint is a Cisco endpoint using LACP, you must configure etherchannel misconfig guard on the Cisco device.
VXLAN Tunnel DSCP Operations
Cumulus Linux provides configuration options to control DSCP operations during VXLAN encapsulation and decapsulation, specifically for solutions that require end-to-end quality of service, such as RDMA over Converged Ethernet.
The configuration options propagate ECN between the underlay and overlay according to RFC 6040, which describes how to construct the ECN field of the IP header on both ingress to and egress from an IP-in-IP tunnel.
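As background, DSCP and ECN share one byte of the IP header (the IPv4 ToS byte or the IPv6 Traffic Class byte): DSCP occupies the upper six bits and ECN the lower two. The following sketch shows the arithmetic only; it does not reflect how the switch implements these operations. The function names tos_split and tos_join are hypothetical.

```shell
#!/bin/sh
# tos_split BYTE: hypothetical helper. Prints "<dscp> <ecn>" for a ToS or
# Traffic Class byte value (0-255).
tos_split() {
    echo "$(( $1 >> 2 )) $(( $1 & 3 ))"
}

# tos_join DSCP ECN: builds the byte from a DSCP value (0-63) and an ECN
# value (0-3).
tos_join() {
    echo "$(( ($1 << 2) | $2 ))"
}

# Example: byte 107 carries DSCP 26 with ECN 3.
#   tos_split 107
#   tos_join 26 3
```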
Configure DSCP Operations
You can set the following DSCP operations:
The VXLAN encapsulation DSCP action. The action can be copy if the inner packet is IP, set to configure a specific value, or derive to derive the value from the switch priority. The default setting is derive.
The VXLAN decapsulation DSCP or COS action. The action can be copy to copy the DSCP value from the outer packet (underlay), preserve to keep the values configured in the inner packet (overlay), or derive to derive the value from the switch priority. The default setting is derive.
The following example sets the VXLAN encapsulation DSCP action to copy.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Show the DSCP Setting
To show the VXLAN encapsulation DSCP setting, run the nv show nve vxlan encapsulation dscp command:
You can only set the VXLAN encapsulation and decapsulation DSCP actions globally. Cumulus Linux does not support per-VXLAN or per-tunnel settings.
QinQ and VXLANs
QinQ is an amendment to the IEEE 802.1Q specification that enables you to insert multiple VLAN tags into a single Ethernet frame.
QinQ with VXLAN is typically used by a service provider who offers multi-tenant layer 2 connectivity between different customer data centers over a virtualized layer 3 provider network. The customer VLANs are transparent to the provider network.
Cumulus Linux supports the standard 802.1ad with a VLAN-aware bridge where you map a customer (S-tag) to a VNI and preserve the inner VLAN (C-tag) inside a VXLAN packet.
Cumulus Linux also supports a special case with a VLAN-unaware bridge where you use the (S-tag, C-tag) tuple for forwarding lookup and mapping to a VNI. Cumulus Linux removes both the S-tag and C-tag during VXLAN encapsulation; Cumulus Linux refers to this configuration as Double Tag Translation.
You must disable ARP and ND suppression on VXLAN bridges when using QinQ.
802.1ad with a VLAN-aware Bridge
In the standard 802.1ad QinQ model, the customer-facing interface is a QinQ access port and the outer S-tag is the PVID representing the customer. Cumulus Linux translates the S-tag to a VXLAN VNI. The inner C-tag is transparent to the provider. It is also possible that the provider has VLAN trunks connected to the same bridge, carrying traffic from different customers on the same port. In this case, the S-tag maps to a VNI. Cumulus Linux removes the S-tag during VXLAN encapsulation and adds it after decapsulation.
An example configuration in VLAN-aware bridge mode looks like this:
You configure two switches: one at the service provider edge that faces the customer (the switch on the left above), and one on the remote provider edge with a VLAN trunk (the switch on the right above).
All edges must support QinQ with VXLANs.
You cannot mix 802.1Q and 802.1ad subinterfaces on the same switch port.
When configuring bridges in traditional mode, all VLANs that are members of the same switch port must use the same vlan_protocol.
Configure the peerlink (peerlink.4094) between the MLAG pair for VLAN protocol 802.1ad.
You cannot use the peerlink as a backup datapath in case one of the MLAG peers loses all uplinks.
When the bridge VLAN protocol is 802.1ad and is VXLAN-enabled, all bridge ports must be either access ports (except for the MLAG peerlink) or VLAN trunks.
Remote Provider Edge Switch
For the switch facing the remote provider cloud:
Configure the bridge with vlan_protocol set to 802.1ad.
The VNI maps back to S-tag (customer).
A trunk port connected to the public cloud is the QinQ trunk and packets are double tagged, where the S-tag is for the customer and the C-tag is for the service.
To configure the remote provider switch:
Cumulus Linux does not provide NVUE commands for this configuration.
Edit the /etc/network/interfaces file to add the following configuration:
This example shows a configuration for 802.1ad QinQ in traditional bridge mode on a leaf.
Example /etc/network/interfaces File
auto swp3.11
iface swp3.11
vlan-protocol 802.1ad
auto vxlan1000101
iface vxlan1000101
vxlan-id 1000101
vxlan-local-tunnelip 10.0.0.13
auto br11
iface br11
bridge-ports swp3.11 vxlan1000101
Double Tag Translation
Double tag translation includes a bridge with double-tagged member interfaces, where a combination of the C-tag and S-tag maps to a VNI. You create the configuration only at the edge facing the public cloud. The VXLAN configuration at the customer-facing edge does not need to change.
The double tag is always a cloud connection. The customer-facing edge is either single-tagged or untagged. At the public cloud handoff point, the VNI maps to double VLAN tags, with the S-tag indicating the customer and the C-tag indicating the service.
The configuration in Cumulus Linux uses the outer tag for the customer and the inner tag for the service.
You can use double tag translation:
On Spectrum-2 and Spectrum-3 switches in a VXLAN configuration on native interfaces only. You cannot configure double tag translation on bonds.
Double tag translation uses:
ACL resources internally, which can increase ACL resource utilization. To see the number of ACL entries used, run the sudo cat /cumulus/switchd/run/acl_info/iacl_resource command.
Internal VLANs for each traditional-mode bridge. The reserved range holds 275 VLANs by default (3725-3999). To change the range, edit the /etc/cumulus/switchd.conf file, uncomment the #resv_vlan_range = 3725-3999 line, and specify the range you want to use.
To configure a double-tagged interface, stack the VLANs as <port>.<outer tag>.<inner tag>. For example, swp1.100.10, where the outer tag is VLAN 100, which represents the customer, and the inner tag is VLAN 10, which represents the service.
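The stacked naming convention can be illustrated with a minimal Python sketch. The helper name parse_double_tagged is hypothetical and is not part of Cumulus Linux; the 1 to 4094 VLAN ID range check is an assumption based on the IEEE 802.1Q VLAN ID space.

```python
def parse_double_tagged(name):
    """Split '<port>.<outer tag>.<inner tag>' into (port, outer, inner)."""
    parts = name.split(".")
    if len(parts) != 3:
        raise ValueError(f"expected <port>.<outer tag>.<inner tag>, got {name!r}")
    port, outer, inner = parts[0], int(parts[1]), int(parts[2])
    for vlan_id in (outer, inner):
        # Assumption: valid IEEE 802.1Q VLAN IDs are 1 through 4094.
        if not 1 <= vlan_id <= 4094:
            raise ValueError(f"VLAN ID out of range: {vlan_id}")
    return port, outer, inner


# The outer tag identifies the customer and the inner tag identifies the service.
port, customer_vlan, service_vlan = parse_double_tagged("swp1.100.10")
print(f"port={port} customer (outer)={customer_vlan} service (inner)={service_vlan}")
```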
An example configuration:
NVUE does not support double tag translation.
To configure the switch for double tag translation using the above example, edit the /etc/network/interfaces file in a text editor and add the following:
auto swp3.100.10
iface swp3.100.10
    mstpctl-portbpdufilter yes
    mstpctl-bpduguard yes

auto vni1000
iface vni1000
    vxlan-local-tunnelip 10.0.0.1
    mstpctl-portbpdufilter yes
    mstpctl-bpduguard yes
    vxlan-id 1000

auto custA-10-azr
iface custA-10-azr
    bridge-ports swp3.100.10 vni1000
    bridge-vlan-aware no
To check the configuration, run the brctl show command:
cumulus@switch:~$ sudo brctl show
bridge name     bridge id           STP enabled   interfaces
custA-10-azr    8000.00020000004b   yes           swp3.100.10
                                                  vni1000
custB-20-azr    8000.00020000004b   yes           swp3.200.20
                                                  vni3000
Considerations
The Linux kernel limits interface names to 15 characters in length, which can be a problem for QinQ interfaces. To work around this issue, create two VLANs as nested VLAN raw devices, one for the outer tag and one for the inner tag. For example, you cannot create an interface called swp50s0.1001.101 because it contains 16 characters. Instead, edit the /etc/network/interfaces file to create VLANs with IDs 1001 and 101:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto vlan1001
iface vlan1001
    vlan-id 1001
    vlan-raw-device swp50s0

auto vlan1001-101
iface vlan1001-101
    vlan-id 101
    vlan-raw-device vlan1001

auto bridge101
iface bridge101
    bridge-ports vlan1001-101 vxlan1000101
...
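As a rough illustration of the workaround, the following Python sketch checks a name against the 15-character limit and generates the two nested VLAN stanzas. The function names and the vlan<outer> and vlan<outer>-<inner> naming pattern are taken from the example above or invented for illustration; they are not a Cumulus Linux tool.

```python
IFNAME_MAX = 15  # kernel limit on interface name length, as stated above


def fits_kernel_limit(name):
    """Return True if the interface name is short enough for the kernel."""
    return len(name) <= IFNAME_MAX


def nested_vlan_stanzas(port, outer, inner):
    """Build stanzas for an outer VLAN device and an inner VLAN device stacked on it."""
    outer_if = f"vlan{outer}"
    inner_if = f"vlan{outer}-{inner}"
    for name in (outer_if, inner_if):
        if not fits_kernel_limit(name):
            raise ValueError(f"{name} is longer than {IFNAME_MAX} characters")
    return (
        f"auto {outer_if}\n"
        f"iface {outer_if}\n"
        f"    vlan-id {outer}\n"
        f"    vlan-raw-device {port}\n"
        "\n"
        f"auto {inner_if}\n"
        f"iface {inner_if}\n"
        f"    vlan-id {inner}\n"
        f"    vlan-raw-device {outer_if}\n"
    )


stacked = "swp50s0.1001.101"
print(len(stacked), fits_kernel_limit(stacked))
print(nested_vlan_stanzas("swp50s0", 1001, 101))
```

Review the generated text before you add it to the /etc/network/interfaces file.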
Layer 3
This section describes layer 3 configuration. Read this section to understand routing protocols and learn how to configure routing on the Cumulus Linux switch:
Network routing is the process of selecting a path across one or more networks. When the switch receives a packet, it reads the packet headers to find out its intended destination. It then determines where to route the packet based on information in its routing tables, which can be static or dynamic.
Cumulus Linux supports both static routing, where you enter routes and specify the next hop manually, and dynamic routing, such as BGP and OSPF, where you configure a routing protocol on your switch and the routing protocol learns about other routers automatically.
You can use static routing if you do not require the complexity of a dynamic routing protocol (such as BGP or OSPF), or if you have routes that do not change frequently and for which the destination is only one or two paths away.
With static routing, you configure the switch manually to send traffic with a specific destination prefix to a specific next hop. When the switch receives a packet, it looks up the destination IP address in the routing table and forwards the packet accordingly.
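The lookup step can be pictured with a small Python sketch of longest prefix match, the standard rule for choosing among overlapping routes. This is a conceptual model only, not how switchd or the kernel implements forwarding; the routing table contents and the second next hop 10.0.1.2 are hypothetical.

```python
import ipaddress

# Hypothetical static routes: destination prefix -> next hop.
ROUTES = {
    "0.0.0.0/0": "10.0.1.2",
    "10.10.10.101/32": "10.0.1.0",
}


def lookup(routes, destination):
    """Return the next hop of the most specific route that contains the destination."""
    dest = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matching:
        return None  # no route to the destination
    # The longest prefix (largest prefix length) wins.
    return max(matching, key=lambda item: item[0].prefixlen)[1]


print(lookup(ROUTES, "10.10.10.101"))
print(lookup(ROUTES, "192.0.2.50"))
```

The first lookup matches both entries and selects the /32 route; the second falls through to the default route.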
Configure a Static Route
Cumulus Linux adds static routes to the FRR routing table and then to the kernel routing table.
The following example commands configure Cumulus Linux to send traffic with the destination prefix 10.10.10.101/32 out swp51 (10.0.1.1/31) to the next hop 10.0.1.0.
cumulus@leaf01:~$ nv set interface swp51 ip address 10.0.1.1/31
cumulus@leaf01:~$ nv set vrf default router static 10.10.10.101/32 via 10.0.1.0
cumulus@leaf01:~$ nv config apply
Edit the /etc/network/interfaces file to configure an IP address for the interface on the switch that sends out traffic. For example:
The vtysh commands save the static route configuration in the /etc/frr/frr.conf file. For example:
...
!
ip route 10.10.10.101/32 10.0.1.0
!
...
The following example commands configure Cumulus Linux to send traffic with the destination prefix 10.10.10.61/32 out swp3 (10.0.0.32/31) to the next hop 10.0.0.33 in vrf BLUE.
cumulus@border01:~$ nv set interface swp3 ip address 10.0.0.32/31
cumulus@border01:~$ nv set interface swp3 ip vrf BLUE
cumulus@border01:~$ nv set vrf BLUE router static 10.10.10.61/32 via 10.0.0.33
cumulus@border01:~$ nv config apply
Edit the /etc/network/interfaces file to configure an IP address for the interface on the switch that sends out traffic. For example:
cumulus@border01:~$ sudo nano /etc/network/interfaces
...
auto swp3
iface swp3
    address 10.0.0.32/31
    vrf BLUE
...
Run vtysh commands to configure the static route (the destination prefix and next hop). For example:
cumulus@border01:~$ sudo vtysh
border01# configure terminal
border01(config)# vrf BLUE
border01(config-vrf)# ip route 10.10.10.61/32 10.0.0.33
border01(config-vrf)# end
border01# write memory
border01# exit
cumulus@border01:mgmt:~$
The vtysh commands save the static route configuration in the /etc/frr/frr.conf file. For example:
...
vrf BLUE
 ip route 10.10.10.61/32 10.0.0.33
...
To delete a static route, run the vtysh no ip route command. For example:
cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# no ip route 10.10.10.101/32 10.0.1.0
leaf01(config)# exit
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$
To view static routes, run the vtysh show ip route command. For example:
cumulus@leaf01:mgmt:~$ sudo vtysh
leaf01# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route
S>* 10.10.10.101/32 [1/0] via 10.0.1.0, swp51, weight 1, 00:02:07
You can also create a static route by adding the route to a switch port configuration. For example:
Cumulus Linux does not provide NVUE commands for this configuration.
Edit the /etc/network/interfaces file and add the following post-up and post-down lines to the interface stanza:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto swp51
iface swp51
    address 10.0.1.1/31
    post-up ip route add 10.10.10.101/32 via 10.0.1.0
    post-down ip route del 10.10.10.101/32 via 10.0.1.0
The ip route command allows you to manipulate the kernel routing table directly from the Linux shell. See man ip(8) for details. FRR monitors the kernel routing table changes and updates its own routing table accordingly.
Configure a Gateway or Default Route
On each switch, consider creating a gateway or default route for traffic destined outside the switch’s subnet or local network. All such traffic passes through the gateway, which is a system on the same network that routes packets to their destination beyond the local network.
The following example configures the default route 0.0.0.0/0, which matches any destination IP address, so the switch sends traffic that has no more specific route to the gateway. The gateway is another switch with the IP address 10.0.1.0.
cumulus@leaf01:~$ nv set vrf default router static 0.0.0.0/0 via 10.0.1.0
cumulus@leaf01:~$ nv config apply
Instead of 0.0.0.0/0, you can specify default or default6.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
!
ip route 0.0.0.0/0 10.0.1.0
!
...
The default route created by the gateway parameter in ifupdown2 does not install in FRR and does not redistribute into other routing protocols. See ifupdown2 and the gateway Parameter for more information.
Considerations
Deleting Routes through the Linux Shell
To avoid incorrect routing, do not use the Linux shell to delete static routes that you added with vtysh commands. Delete the routes with the vtysh commands.
IPv4 and IPv6 Neighbor Cache Aging Timer
Cumulus Linux does not support different neighbor cache aging timer settings for IPv4 and IPv6.
The net.ipv4.neigh.default.base_reachable_time_ms and net.ipv6.neigh.default.base_reachable_time_ms settings in the /etc/sysctl.d/neigh.conf file must have the same value:
Cumulus Linux advertises the maximum number of route table entries supported on the switch, including:
Layer 3 IPv4 LPM entries that have a mask less than /32
Layer 3 IPv6 LPM entries that have a mask of /64 or less
Layer 3 IPv6 LPM entries that have a mask greater than /64
Layer 3 IPv4 neighbor (or host) entries that are the next hops seen in ip neighbor
Layer 3 IPv6 neighbor entries that are the next hops seen in ip -6 neighbor
ECMP next hops, which are IP address entries in the routing table that specify the next closest or most optimal router in its routing path
MAC addresses
To determine the current table sizes on a switch, use cl-resource-query.
Supported Route Entries
Cumulus Linux provides several generalized profiles, described below. These profiles work only with layer 2 and layer 3 unicast forwarding.
The following tables list the number of MAC addresses, layer 3 neighbors, and LPM routes validated for each forwarding table profile. If you do not specify any profiles as described below, the switch uses the default values.
The values provided in the profiles below are the maximum values that Cumulus Linux software allocates; the theoretical hardware limits might be higher. These limits refer to values that have been validated as part of the unidimensional scale validation. If you try to achieve maximum scalability with multiple features enabled, results might differ from the values listed in this guide.
Spectrum 1
| Profile | MAC Addresses | Layer 3 Neighbors | LPM |
|---|---|---|---|
| default | 40k | 32k (IPv4) and 16k (IPv6) | 64k (IPv4) and 28k (IPv6-long) |
| l2-heavy | 88k | 48k (IPv4) and 40k (IPv6) | 8k (IPv4) and 8k (IPv6-long) |
| l2-heavy-1 | 180k | 8k (IPv4) and 8k (IPv6) | 8k (IPv4) and 8k (IPv6-long) |
| v4-lpm-heavy | 8k | 8k (IPv4) and 16k (IPv6) | 80k (IPv4) and 16k (IPv6-long) |
| v4-lpm-heavy-1 | 8k | 8k (IPv4) and 2k (IPv6) | 176k (IPv4) and 2k (IPv6-long) |
| v6-lpm-heavy | 40k | 8k (IPv4) and 40k (IPv6) | 8k (IPv4), 32k (IPv6-long) and 32k (IPv6/64) |
| lpm-balanced | 8k | 8k (IPv4) and 8k (IPv6) | 60k (IPv4), 60k (IPv6-long) and 120k (IPv6/64) |
Spectrum-2 and Spectrum-3
| Profile | MAC Addresses | Layer 3 Neighbors | LPM |
|---|---|---|---|
| default | 50k | 41k (IPv4) and 20k (IPv6) | 82k (IPv4), 74k (IPv6-long), 1k (IPv4-Mcast) |
| l2-heavy | 115k | 74k (IPv4) and 37k (IPv6) | 16k (IPv4), 24k (IPv6-long), 1k (IPv4-Mcast) |
| l2-heavy-1 | 239k | 16k (IPv4) and 12k (IPv6) | 16k (IPv4), 16k (IPv6-long), 1k (IPv4-Mcast) |
| l2-heavy-3 | 107k | 90k (IPv4) and 80k (IPv6) | 25k (IPv4), 10k (IPv6-long), 1k (IPv4-Mcast) |
| v4-lpm-heavy | 16k | 41k (IPv4) and 24k (IPv6) | 124k (IPv4), 24k (IPv6-long), 1k (IPv4-Mcast) |
| v4-lpm-heavy-1 | 16k | 16k (IPv4) and 4k (IPv6) | 256k (IPv4), 8k (IPv6-long), 1k (IPv4-Mcast) |
| v6-lpm-heavy | 16k | 16k (IPv4) and 62k (IPv6) | 16k (IPv4), 99k (IPv6-long), 1k (IPv4-Mcast) |
| v6-lpm-heavy-1 | 5k | 4k (IPv4) and 4k (IPv6) | 90k (IPv4), 235k (IPv6-long), 1k (IPv4-Mcast) |
| lpm-balanced | 16k | 16k (IPv4) and 12k (IPv6) | 124k (IPv4), 124k (IPv6-long), 1k (IPv4-Mcast) |
| ipmc-heavy | 57k | 41k (IPv4) and 20k (IPv6) | 82k (IPv4), 66k (IPv6-long), 8k (IPv4-Mcast) |
| ipmc-max | 41k | 41k (IPv4) and 20k (IPv6) | 74k (IPv4), 66k (IPv6-long), 15k (IPv4-Mcast) |
The IPv6 number corresponds to the /64 IPv6 prefix. The /128 IPv6 prefix number is half of the /64 IPv6 prefix number.
For the ipmc-max profile, the cl-resource-query command output displays 33K instead of 15K as the maximum number of IPv4 multicast routes in switchd. 15K is the supported and validated value. You can use the higher value of 33K to test higher multicast scale in non-production environments.
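When you choose a profile, it can help to compare your expected scale against the validated values. The following Python sketch is one hypothetical way to do that for Spectrum 1, using only the MAC address and IPv4 LPM columns from the Spectrum 1 table above. The dictionary and function names are invented for illustration, and the figures are kept in units of k exactly as the table lists them.

```python
# (MAC addresses, IPv4 LPM routes) in units of k, copied from the Spectrum 1 table.
SPECTRUM1_PROFILES = {
    "default": (40, 64),
    "l2-heavy": (88, 8),
    "l2-heavy-1": (180, 8),
    "v4-lpm-heavy": (8, 80),
    "v4-lpm-heavy-1": (8, 176),
    "v6-lpm-heavy": (40, 8),
    "lpm-balanced": (8, 60),
}


def profiles_that_fit(mac_k, ipv4_lpm_k):
    """List the profiles whose validated values cover both requirements."""
    return [
        name
        for name, (mac, lpm) in SPECTRUM1_PROFILES.items()
        if mac >= mac_k and lpm >= ipv4_lpm_k
    ]


# Example: roughly 50k MAC addresses and 8k IPv4 routes.
print(profiles_that_fit(50, 8))
```

This comparison ignores layer 3 neighbor and IPv6 scale; extend the tuple with the other columns if those matter in your network.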
Change Forwarding Resource Profiles
You can set the profile that best suits your network architecture.
Run the nv set system forwarding profile <profile-name> command to specify the profile you want to use.
The following example command sets the l2-heavy profile:
cumulus@switch:~$ nv set system forwarding profile l2-heavy
cumulus@switch:~$ nv config apply
Instead of the above command, you can run the nv set system forwarding profile default command to set the profile back to the default.
Specify the profile you want to use with the forwarding_table.profile variable in the /etc/cumulus/datapath/traffic.conf file. The following example specifies l2-heavy:
After you specify a different profile, restart switchd for the change to take effect.
To show the different forwarding profiles that your switch supports and the MAC address, layer 3 neighbor, and LPM scale availability for each forwarding profile, run the nv show system forwarding profile-option command.
TCAM Profiles - Spectrum 1
Specify the profile you want to use with the tcam_resource.profile variable in the /etc/mlx/datapath/tcam_profile.conf file. The following example specifies ipmc-max:
After you specify a different profile, restart switchd for the change to take effect.
When you enable nonatomic updates (acl.non_atomic_update_mode is TRUE in the /etc/cumulus/switchd.conf file), the maximum number of mroute and ACL entries for each profile are:
| Profile | Mroute Entries | ACL Entries |
|---|---|---|
| default | 1000 | 500 (IPv6) or 1000 (IPv4) |
| ipmc-heavy | 8500 | 1000 (IPv6) or 1500 (IPv4) |
| acl-heavy | 450 | 2000 (IPv6) or 3500 (IPv4) |
| ipmc-max | 13000 | 1000 (IPv6) or 2000 (IPv4) |
When you disable nonatomic updates (acl.non_atomic_update_mode is FALSE in the /etc/cumulus/switchd.conf file), the maximum number of mroute and ACL entries for each profile are:
| Profile | Mroute Entries | ACL Entries |
|---|---|---|
| default | 1000 | 250 (IPv6) or 500 (IPv4) |
| ipmc-heavy | 8500 | 500 (IPv6) or 750 (IPv4) |
| acl-heavy | 450 | 1000 (IPv6) or 1750 (IPv4) |
| ipmc-max | 13000 | 500 (IPv6) or 1000 (IPv4) |
Route Filtering and Redistribution
Route filtering lets you exclude routes that neighbors advertise or receive. You can use route filtering to manipulate traffic flows, reduce memory utilization, and improve security.
This section discusses the following route filtering methods:
Prefix lists
Route maps
Route redistribution
Route map and prefix list names must start with a letter and can contain letters, digits, underscores and dashes. For example, you can name a route map MAP10 or ROUTE-MAP_10 but you cannot name a route map 10 or 10_ROUTE-MAP.
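The naming rule can be expressed as a regular expression. The following Python sketch is an illustration of the rule as stated here, not the validation code that NVUE or FRR actually runs; the function name is hypothetical.

```python
import re

# First character must be a letter; the rest can be letters, digits, underscores, or dashes.
POLICY_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_policy_name(name):
    """Return True if the name follows the route map and prefix list naming rule."""
    return POLICY_NAME.fullmatch(name) is not None


for candidate in ("MAP10", "ROUTE-MAP_10", "10", "10_ROUTE-MAP"):
    print(candidate, is_valid_policy_name(candidate))
```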
Prefix Lists
Prefix lists are access lists for route advertisements that match routes instead of traffic. Prefix lists are typically used with route maps and other filtering methods. A prefix list can match the prefix (the network itself) and the prefix length (the length of the subnet mask).
The following example commands configure a prefix list that permits all prefixes in the range 10.0.0.0/16 with a subnet mask less than or equal to /30. For networks 10.0.0.0/24, 10.10.10.0/24, and 10.0.0.10/32, only 10.0.0.0/24 matches (10.10.10.0/24 has a different prefix and 10.0.0.10/32 has a greater subnet mask).
cumulus@switch:~$ nv set router policy prefix-list prefixlist1 rule 1 match 10.0.0.0/16 max-prefix-len 30
cumulus@switch:~$ nv set router policy prefix-list prefixlist1 rule 1 action permit
cumulus@switch:~$ nv config apply
For IPv6, you need to run an additional command to set the prefix list type to IPv6. For example:
cumulus@switch:~$ nv set router policy prefix-list prefixlistipv6 type ipv6
cumulus@switch:~$ nv set router policy prefix-list prefixlistipv6 rule 1 match 2001:100::1/64
cumulus@switch:~$ nv set router policy prefix-list prefixlistipv6 rule 1 action permit
cumulus@switch:~$ nv config apply
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
cumulus@switch:~$ sudo cat /etc/frr/frr.conf
...
ip prefix-list prefixlist1 seq 1 permit 10.0.0.0/16 le 30
route-map MAP1 permit 10
 match ip address prefix-list prefixlist1
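To see why only 10.0.0.0/24 matches the prefixlist1 rule, the following Python sketch models the check with the standard ipaddress module. It is a simplified model of a single permit rule with a maximum prefix length, not the FRR implementation; the function name is hypothetical.

```python
import ipaddress


def rule_matches(candidate, base="10.0.0.0/16", max_len=30):
    """Model 'permit <base> le <max_len>': inside the base prefix and not longer than max_len."""
    net = ipaddress.ip_network(candidate)
    base_net = ipaddress.ip_network(base)
    # subnet_of() is True only if the candidate lies inside the base prefix,
    # which also implies its prefix length is at least that of the base.
    return net.subnet_of(base_net) and net.prefixlen <= max_len


for candidate in ("10.0.0.0/24", "10.10.10.0/24", "10.0.0.10/32"):
    print(candidate, rule_matches(candidate))
```

10.10.10.0/24 fails the containment test and 10.0.0.10/32 fails the length test, which agrees with the description of the prefix list.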
Route Maps
Route maps are routing policies that Cumulus Linux considers before the router examines the forwarding table. Each statement in a route map has a sequence number, and includes a series of match and set statements. The route map parses from the lowest sequence number to the highest, and stops when there is a match.
Cumulus Linux supports several match and set statements. For example, you can match on an interface, prefix length, next hop or BGP AS path list. You can set the BGP metric, local-preference on routes, source IP, or the tag on the matched route. For a list of supported match and set statements, see Match and Set Statements below.
Configure a Route Map
To configure a route map:
Specify one or more conditions that must match and, optionally, one or more set actions to set or modify attributes of the route. If a route map does not specify any matching conditions, it always matches.
Specify the matching policy: permit (if the entry matches, carry out the set actions) or deny (if the entry matches, deny the route).
The following example commands configure a route map that sets the BGP metric to 50 for interface swp51:
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 match interface swp51
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 set metric 50
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 action permit
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo vtysh
switch# configure terminal
switch(config)# route-map routemap1 permit 10
switch(config-route-map)# match interface swp51
switch(config-route-map)# set metric 50
switch(config-route-map)# end
switch# write memory
switch# exit
cumulus@switch:~$
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
cumulus@switch:~$ sudo cat /etc/frr/frr.conf
...
route-map routemap1 permit 10
 match interface swp51
 set metric 50
The following example commands configure a route map to match the prefixes defined in prefixlist1 and set the next hop to 10.10.10.5:
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 match ip-prefix-list prefixlist1
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 set ip-nexthop 10.10.10.5
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 action permit
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo vtysh
switch# configure terminal
switch(config)# route-map routemap1 permit 10
switch(config-route-map)# match ip route-source prefix-list prefixlist1
switch(config-route-map)# set ip next-hop 10.10.10.5
switch(config-route-map)# end
switch# write memory
switch# exit
cumulus@switch:~$
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
cumulus@switch:~$ sudo cat /etc/frr/frr.conf
...
route-map routemap1 permit 10
 match ip route-source prefix-list prefixlist1
 set ip next-hop 10.10.10.5
The following example commands configure a route map to set the local-preference on routes to 400:
cumulus@switch:~$ nv set router policy route-map routemap2 rule 10 set local-preference 400
cumulus@switch:~$ nv set router policy route-map routemap2 rule 10 action permit
cumulus@switch:~$ nv config apply
Match and Set Statements
Cumulus Linux supports the following match and set statements.
You can use the following list of supported match and set statements with NVUE commands. For a list of the match and set statements that vtysh supports, see the FRRouting User Guide.
| Match | Description |
|---|---|
| as-path-list | Matches the specified AS path list. |
| interface | Matches the specified interface. |
| ip-prefix-len | Matches the specified prefix length. |
| origin | Matches the specified BGP origin. You can specify egp, igp, or incomplete. |
| type | Matches the specified route type, such as IPv4 or IPv6. |
| community-list | Matches the specified community list. |
| ip-nexthop | Matches the specified next hop. |
| ip-prefix-list | Matches the specified prefix list. |
| peer | Matches the specified BGP neighbor. |
| evpn-default-route | Matches the EVPN default route. You can specify on or off. |
| ip-nexthop-len | Matches the specified next hop prefix length. |
| large-community-list | Matches the specified large community list. |
| source-protocol | Matches the specified source protocol, such as BGP, OSPF or static. NVUE does not support source protocol match. |
| evpn-route-type | Matches the specified EVPN route type. You can specify macip, imet, or prefix. |
| ip-nexthop-list | Matches the specified next hop list. |
| local-preference | Matches the specified local preference. You can specify a value between 0 and 4294967295. |
| source-vrf | Matches the specified source VRF. |
| evpn-vni | Matches the specified EVPN VNI. |
| ip-nexthop-type | Matches the specified next hop type, such as blackhole. |
| metric | Matches the specified BGP metric. |
| tag | Matches the specified tag value associated with the route. You can specify a value between 1 and 4294967295. |
The source-protocol match statement is supported in zebra and BGP. Cumulus Linux does not support the match source-protocol statement in route maps configured for other routing protocols, such as OSPF.
| Set | Description |
|---|---|
| aggregator-as | Sets the aggregator AS. |
| ext-community-rt | Sets the BGP extended community RT. |
| originator-id | Sets the originator ID so that BGP chooses the preferred path. |
| as-path-exclude | Sets BGP AS path exclude attribute to avoid considering the AS path during best path route selection. |
| ext-community-soo | Sets the BGP extended community Site of Origin (SoO). |
| large-community | Sets the BGP large community. |
| source-ip | Sets the source IP address. |
| as-path-prepend | Sets the BGP AS path prepend attribute. |
| forwarding-address | Sets the route forwarding address. |
| large-community-delete-list | Sets the BGP large community delete list. |
| tag | Sets a tag on the matched route. You can specify a value between 1 and 4294967295. |
| atomic-aggregate | Sets the Atomic Aggregate attribute to inform BGP peers that the local router is using a less specific (aggregated) route to a destination. |
| ip-nexthop | Sets the BGP next hop. |
| local-preference | Sets the BGP local preference. |
| weight | Sets the route’s weight. |
| community | Sets the BGP community attribute. |
| ipv6-nexthop-global | Sets the IPv6 next hop global attribute. |
| metric | Sets the BGP attribute MED to a specific value. You can specify metric-minus to subtract the specified value from the MED, metric-plus to add the specified value to the MED, rtt to set the MED to the round trip time, rtt-minus to subtract the round trip time from the MED, or rtt-plus to add the round trip time to the MED. |
| community-delete-list | Sets the BGP community delete list. |
| ipv6-nexthop-local | Sets the IPv6 next hop local attribute. |
| metric-type | Sets the metric type. You can specify type-1 or type-2. |
| ext-community-bw | Sets the BGP extended community link bandwidth. |
| ipv6-nexthop-prefer-global | Sets IPv6 inbound routes to use the global address when both a global and link-local next hop are available. |
| origin | Sets the BGP route origin, such as egp or igp. |
Apply a Route Map
To apply the route map, you specify the routing protocol and the route map name.
The following example commands apply the route map called routemap2 to BGP neighbor swp51:
cumulus@switch:~$ sudo vtysh
switch# configure terminal
switch(config)# router bgp 65101
switch(config-router)# address-family ipv4 unicast
switch(config-router-af)# neighbor swp51 route-map routemap2 in
switch(config-router-af)# end
switch# wr mem
Note: this version of vtysh never writes vtysh.conf
Building Configuration...
Integrated configuration saved to /etc/frr/frr.conf
[OK]
switch# exit
cumulus@switch:mgmt:~$
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
cumulus@switch:~$ sudo cat /etc/frr/frr.conf
...
neighbor swp51 route-map routemap2 in
The following example filters routes from Zebra (RIB) into the Linux kernel (FIB). The commands apply the route map called routemap1 to BGP routes in the RIB:
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
cumulus@switch:~$ sudo cat /etc/frr/frr.conf
...
ip protocol bgp route-map routemap1
For BGP, you can also apply a route map on route updates from BGP to the RIB. You can match on prefix, next hop, communities, and so on. You can set the metric and next hop only. Route maps do not affect the BGP internal RIB. You can use both IPv4 and IPv6 address families. Route maps work on multi-paths; however, BGP bases the metric setting on the best path only.
To apply a route map to filter route updates from BGP into the RIB:
To apply an outbound route map to a route reflector client, you must run the NVUE nv set vrf <vrf> router bgp route-reflection outbound-policy on command or the vtysh neighbor <neighbor> route-map SET_IBGP_ORIG out command under the address family, before you apply the route map.
Route Map Description
To provide a description for a route map, run the NVUE nv set router policy route-map <route-map> rule <rule> description command.
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 match interface swp51
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 set metric 50
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 action permit
cumulus@switch:~$ nv set router policy route-map routemap1 rule 10 description set-metric-swp51
cumulus@switch:~$ nv config apply
Clear Matches Against a Route Map
To clear the number of matches shown against a route map, run the nv action clear router policy route-map <route-map> command.
The following example clears the number of matches shown against the route map called ROUTEMAP1.
Route redistribution allows a network to use a routing protocol to route traffic dynamically based on the information learned from a different routing protocol or from static routes. Route redistribution helps increase accessibility within networks.
The following example commands redistribute routing information from OSPF routes into BGP:
The following example sets communities based on prefix-lists:
...
router bgp 65101
 bgp router-id 10.10.10.1
 neighbor underlay interface remote-as external
 !
 address-family ipv6 unicast
  neighbor underlay activate
  neighbor underlay route-map MARK-PREFIXES out
 exit-address-family
!
ipv6 prefix-list LOW-PRIO seq 5 permit 2001:db8:dead::/56 le 64
ipv6 prefix-list MID-PRIO seq 5 permit 2001:db8:beef::/56 le 64
ipv6 prefix-list HI-PRIO seq 5 permit 2001:db8:cafe::/56 le 64
!
route-map MARK-PREFIXES permit 10
 match ipv6 address prefix-list LOW-PRIO
 set community 123:200
!
route-map MARK-PREFIXES permit 20
 match ipv6 address prefix-list MID-PRIO
 set community 123:500
!
route-map MARK-PREFIXES permit 30
 match ipv6 address prefix-list HI-PRIO
 set community 123:1000
!
The following example filters routes so that the switch does not advertise them to the peer:
router bgp 65101
 bgp router-id 10.10.10.1
 neighbor underlay interface remote-as external
 !
 address-family ipv4 unicast
  neighbor underlay route-map POLICY-OUT out
 exit-address-family
!
ip prefix-list BLOCK-RFC1918 seq 5 permit 10.0.0.0/8 le 24
ip prefix-list BLOCK-RFC1918 seq 10 permit 172.16.0.0/12 le 24
ip prefix-list BLOCK-RFC1918 seq 15 permit 192.168.0.0/16 le 24
ip prefix-list ADD-COMM-OUT seq 5 permit 100.64.0.0/10 le 24
ip prefix-list ADD-COMM-OUT seq 10 permit 192.0.2.0/24
!
route-map POLICY-OUT deny 10
 match ip address prefix-list BLOCK-RFC1918
!
route-map POLICY-OUT permit 20
 match ip address prefix-list ADD-COMM-OUT
 set community 123:1000
!
route-map POLICY-OUT permit 30
The following example sets mutual redistribution between OSPF and BGP (filters by route tags):
...
router ospf
 redistribute bgp route-map BGP-INTO-OSPF
!
router bgp 65101
 bgp router-id 10.10.10.1
 neighbor underlay interface remote-as external
 !
 address-family ipv4 unicast
  redistribute ospf route-map OSPF-INTO-BGP
 exit-address-family
!
route-map OSPF-INTO-BGP deny 10
 match tag 4271
!
route-map OSPF-INTO-BGP permit 20
 set tag 2328
!
route-map BGP-INTO-OSPF deny 10
 match tag 2328
!
route-map BGP-INTO-OSPF permit 20
 set tag 4271
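The tags in the mutual redistribution example stop a route from re-entering the protocol it came from. The following Python sketch models the two route maps as plain functions to show the effect. It is a conceptual model with hypothetical function names and a hypothetical route, not FRR code.

```python
def ospf_into_bgp(route):
    """Model of route map OSPF-INTO-BGP: deny tag 4271, otherwise set tag 2328."""
    if route["tag"] == 4271:
        return None  # the route came from BGP originally, so do not send it back
    return {**route, "tag": 2328}


def bgp_into_ospf(route):
    """Model of route map BGP-INTO-OSPF: deny tag 2328, otherwise set tag 4271."""
    if route["tag"] == 2328:
        return None  # the route came from OSPF originally, so do not send it back
    return {**route, "tag": 4271}


# A hypothetical route that originates in OSPF without a tag.
ospf_route = {"prefix": "192.0.2.0/24", "tag": 0}
in_bgp = ospf_into_bgp(ospf_route)
print(in_bgp)                 # redistributed into BGP and tagged 2328
print(bgp_into_ospf(in_bgp))  # None: blocked from returning to OSPF
```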
The following example filters and modifies redistributed routes:
...
router ospf
 redistribute bgp route-map EXTERNAL-2-1K
!
route-map EXTERNAL-2-1K permit 10
 set metric 1000
 set metric-type type-1
Considerations
When you configure a route map to match a prefix list, community list, or aspath list, the permit or deny actions in the list determine the criteria to evaluate in each route map sequence; for example:
If you match a list in a route map permit sequence, Cumulus Linux matches the permitted routes in the list for that route map sequence and the policy permits them. Denied routes in the list do not match and Cumulus Linux evaluates them in later route map sequences.
If you match a list in a route map deny sequence, Cumulus Linux matches the permitted routes in the list for that route map sequence and the policy denies them. Denied routes in the list do not match and Cumulus Linux evaluates them in later route map sequences.
NVIDIA recommends you always configure a community list as permit, and permit or deny routes using route map sequences.
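The interaction between list actions and route map sequence actions can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: each list entry matches any route inside its prefix, a route that matches no list entry does not match the list, a sequence without a list always matches, and a route that matches no sequence is denied. The list contents and function names are hypothetical.

```python
import ipaddress


def list_matches(prefix_list, route):
    """The first list entry that contains the route decides; only 'permit' counts as a match."""
    net = ipaddress.ip_network(route)
    for action, entry in prefix_list:
        if net.subnet_of(ipaddress.ip_network(entry)):
            return action == "permit"
    return False  # no entry matched


def evaluate(route_map, route):
    """Walk the sequences in order and return the action of the first one that matches."""
    for seq, action, prefix_list in sorted(route_map, key=lambda r: r[0]):
        if prefix_list is None or list_matches(prefix_list, route):
            return action
    return "deny"  # no sequence matched


LIST_A = [("deny", "10.1.0.0/16"), ("permit", "10.0.0.0/8")]
ROUTE_MAP = [
    (10, "deny", LIST_A),   # deny sequence that matches LIST_A
    (20, "permit", None),   # permit sequence with no match condition
]

print(evaluate(ROUTE_MAP, "10.2.0.0/24"))  # permitted by the list, so sequence 10 denies it
print(evaluate(ROUTE_MAP, "10.1.5.0/24"))  # denied by the list, so it falls through to sequence 20
```

The second route shows the point made above: a deny entry in the list does not deny the route, it only prevents the match in that sequence.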
Policy-based Routing
Typical routing systems and protocols forward traffic based on the destination address in the packet, which they look up in a routing table. However, sometimes the traffic on your network requires a more hands-on approach. Sometimes, you need to forward a packet based on the source address, the packet size, or other information in the packet header.
PBR lets you make routing decisions based on filters that change the routing behavior of specific traffic so that you can override the routing table and influence where the traffic goes. For example, you can use PBR to reach the best bandwidth utilization for business-critical applications, isolate traffic for inspection or analysis, or manually load balance outbound traffic.
Cumulus Linux applies PBR to incoming packets. All packets received on a PBR-enabled interface pass through enhanced packet filters that determine rules and specify where to forward the packets.
You can create a maximum of 255 PBR match rules and 256 next hop groups (this is the ECMP limit).
You can apply only one PBR policy per input interface.
You can match on source and destination IP address, or match on DSCP or ECN values within a packet.
PBR is not supported for VXLAN tunneling.
PBR is not supported on management interfaces, such as eth0.
A PBR rule cannot contain both IPv4 and IPv6 addresses.
Configure PBR
A PBR policy contains one or more policy maps. Each policy map:
Has a unique map name and sequence (rule) number. The rule number determines the relative order of the map within the policy.
Contains either a match source IP rule, a match destination IP rule, or both, together with a set rule; or a match DSCP or ECN rule together with a set rule.
To match on a source and destination address, a policy map can contain both match source and match destination IP rules.
A set rule determines the PBR next hop for the policy.
To use PBR in Cumulus Linux, you define a PBR policy and apply it to the ingress interface (the interface must already have an IP address assigned). Cumulus Linux matches traffic against the match rules in sequential order and forwards the traffic according to the set rule in the first match. Traffic that does not match any rule passes on to the normal destination based routing mechanism.
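The first-match behavior can be sketched in Python as follows. This is a conceptual model only, with hypothetical rule dictionaries and function name; it covers source and destination IP matches and leaves out DSCP and ECN.

```python
import ipaddress

# Hypothetical policy: one rule that matches on both source and destination.
RULES = [
    {"seq": 1, "source": "10.1.4.1/24", "destination": "10.1.2.0/24", "group": "group1"},
]


def pbr_next_hop_group(rules, src, dst):
    """Return the next hop group of the first matching rule, or None to use normal routing."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for rule in sorted(rules, key=lambda r: r["seq"]):
        source = rule.get("source")
        destination = rule.get("destination")
        # strict=False accepts a prefix written with host bits set, such as 10.1.4.1/24.
        src_ok = source is None or src_ip in ipaddress.ip_network(source, strict=False)
        dst_ok = destination is None or dst_ip in ipaddress.ip_network(destination, strict=False)
        if src_ok and dst_ok:
            return rule["group"]
    return None  # no rule matched: destination-based routing applies


print(pbr_next_hop_group(RULES, "10.1.4.7", "10.1.2.9"))
print(pbr_next_hop_group(RULES, "10.9.9.9", "10.1.2.9"))
```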
To configure a PBR policy:
Configure the policy map.
The example commands below configure a policy map called map1 with rule number 1 that matches on destination address 10.1.2.0/24 and source address 10.1.4.1/24.
If the IP address in the rule is 0.0.0.0/0 or ::/0, any IP address is a match. You cannot mix IPv4 and IPv6 addresses in a rule.
cumulus@switch:~$ nv set router pbr map map1 rule 1 match destination-ip 10.1.2.0/24
cumulus@switch:~$ nv set router pbr map map1 rule 1 match source-ip 10.1.4.1/24
Instead of matching on IP address, you can match packets according to the DSCP or ECN field in the IP header. The DSCP value can be an integer between 0 and 63 or the DSCP codepoint name. The ECN value can be an integer between 0 and 3. The following example command configures a policy map called map1 with rule number 1 that matches on packets with the DSCP value 10:
cumulus@switch:~$ nv set router pbr map map1 rule 1 match dscp 10
The following example command configures a policy map called map1 with rule number 1 that matches on packets with the ECN value 2:
cumulus@switch:~$ nv set router pbr map map1 rule 1 match ecn 2
Apply a next hop group to the policy map. First configure the next hop group, then apply the group to the policy map. The example commands below create a next hop group called group1 that contains the next hop 192.168.0.21 on output interface swp1 and VRF RED and the next hop 192.168.0.22, then apply the next hop group group1 to the map1 policy map.
The output interface and VRF are optional. However, you must specify the VRF if the next hop is not in the default VRF.
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21 interface swp1
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21 vrf RED
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.22
cumulus@switch:~$ nv set router pbr map map1 rule 1 action nexthop-group group1
If you want the rule to use a specific VRF table as its lookup, set the VRF. If you do not set a VRF, the rule uses the VRF table the interface is in as its lookup. The example command below sets the rule to use the dmz VRF table.
You can set the VRF in a virtual environment only. Cumulus Linux on an NVIDIA switch does not support setting the VRF.
Restarting FRR restarts all the routing protocol daemons that are enabled and running.
Configure the policy map.
The example commands below configure a policy map called map1 with sequence number 1, that matches on destination address 10.1.2.0/24 and source address 10.1.4.1/24.
If the IP address in the rule is 0.0.0.0/0 or ::/0, any IP address is a match. You cannot mix IPv4 and IPv6 addresses in a rule.
Instead of matching on IP address, you can match packets according to the DSCP or ECN field in the IP header. The DSCP value can be an integer between 0 and 63 or the DSCP codepoint name. The ECN value can be an integer between 0 and 3. The following example command configures a policy map called map1 with sequence number 1 that matches on packets with the DSCP value 10:
Apply a next hop group to the policy map. First configure the next hop group, then apply the group to the policy map. The example commands below create a next hop group called group1 that contains the next hop 192.168.0.21 on output interface swp1 and VRF RED, and the next hop 192.168.0.22, then apply the next hop group group1 to the map1 policy map.
The output interface and VRF are optional. However, you must specify the VRF if the next hop is not in the default VRF.
If you want the rule to use a specific VRF table as its lookup, set the VRF. If you do not set a VRF, the rule uses the VRF table the interface is in as its lookup. The example command below sets the rule to use the dmz VRF table.
You can set the VRF in a virtual environment only. Cumulus Linux on an NVIDIA switch does not support setting the VRF.
Instead of a next hop group, you can apply a next hop to the policy map. The example command below applies the next hop 192.168.0.31 on the output interface swp2 and VRF RED to the map1 policy map. The next hop must be an IP address. The output interface and VRF are optional; however, you must specify the VRF you want to use for resolution if the next hop is not in the default VRF.
switch(config-pbr-map)# set nexthop 192.168.0.31 swp2 nexthop-vrf RED
switch(config-pbr-map)# exit
switch(config)#
Assign the PBR policy to an ingress interface. The example command below assigns the PBR policy map1 to interface swp51:
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
nexthop-group group1
nexthop 192.168.0.21 nexthop-vrf RED swp1
nexthop 192.168.0.22
pbr-map map1 seq 1
match dst-ip 10.1.2.0/24
match src-ip 10.1.4.1/24
set nexthop-group group1
interface swp51
pbr-policy map1
...
You can only set one policy per interface.
Modify PBR Rules
When you want to change or extend an existing PBR rule, you must first delete the conditions in the rule, then add the rule back with the modification or addition.
Modify an existing match/set condition
The example below shows an existing configuration.
cumulus@switch:~$ nv set router pbr map pbr-policy rule 4 match source-ip 10.1.4.1/24
cumulus@switch:~$ nv set router pbr map pbr-policy rule 4 match destination-ip 10.1.2.0/24
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21
cumulus@switch:~$ nv set router pbr map pbr-policy rule 4 action nexthop-group group1
To change the source IP match from 10.1.4.1/24 to 10.1.4.2/24, you must delete the existing sequence by explicitly specifying the match/set condition. For example:
cumulus@switch:~$ nv unset router pbr map pbr-policy rule 4 match source-ip
cumulus@switch:~$ nv unset router pbr map pbr-policy rule 4 match destination-ip
cumulus@switch:~$ nv unset router nexthop group group1 via 192.168.0.21
Add the new rule with the following commands:
cumulus@switch:~$ nv set router pbr map pbr-policy rule 4 match source-ip 10.1.4.2/24
cumulus@switch:~$ nv set router pbr map pbr-policy rule 4 match destination-ip 10.1.2.0/24
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21
cumulus@switch:~$ nv config apply
Run the vtysh show pbr map command to verify that the rule has the updated source IP match:
cumulus@switch:~$ nv set router pbr map pbr-policy rule 3 match source-ip 10.1.4.1/24
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21
To add a destination IP match to the rule, you must delete the existing rule sequence:
cumulus@switch:~$ nv unset router pbr map pbr-policy rule 3 match source-ip
cumulus@switch:~$ nv unset router nexthop group group1 via 192.168.0.21
cumulus@switch:~$ nv config apply
Add back the source IP match and next hop condition, and add the new destination IP match (dst-ip 10.1.2.0/24):
cumulus@switch:~$ nv set router pbr map pbr-policy rule 3 match source-ip 10.1.4.1/24
cumulus@switch:~$ nv set router pbr map pbr-policy rule 3 match destination-ip 10.1.2.0/24
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21
cumulus@switch:~$ nv config apply
Run the vtysh show pbr map command to verify the update:
To remove a PBR map and the corresponding next hop group, you must first delete the PBR map and run nv config apply, then remove the corresponding next hop group; for example:
The following examples show how to delete a PBR rule:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# no pbr-map map1 seq 1
switch(config)# end
switch# write memory
switch# exit
If a PBR rule has multiple conditions (for example, a source IP match and a destination IP match), but you only want to delete one condition, you have to delete all conditions first, then re-add the ones you want to keep.
The example below shows an existing configuration that has a source IP match and a destination IP match.
cumulus@switch:~$ nv set router pbr map pbr-policy rule 6 match source-ip 10.1.4.1/24
cumulus@switch:~$ nv set router pbr map pbr-policy rule 6 match destination-ip 10.1.2.0/24
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21
To remove the destination IP match, you must first delete all existing conditions defined under this sequence:
cumulus@switch:~$ nv unset router pbr map pbr-policy rule 6 match source-ip
cumulus@switch:~$ nv unset router pbr map pbr-policy rule 6 match destination-ip
cumulus@switch:~$ nv unset router nexthop group group1 via 192.168.0.21
cumulus@switch:~$ nv config apply
Then, add back the conditions you want to keep:
cumulus@switch:~$ nv set router pbr map pbr-policy rule 6 match source-ip 10.1.4.1/24
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.21
cumulus@switch:~$ nv config apply
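The delete-then-re-add pattern used throughout this section can be sketched as a small Python helper that prints the NVUE commands in order. The helper name `modify_rule_commands` and its parameters are hypothetical; the command strings follow the forms shown in the examples above, and the sketch only generates text (it does not run anything on a switch).

```python
def modify_rule_commands(pbr_map, rule, group, via, source_ip, destination_ip):
    """Return the NVUE commands that replace the conditions of one PBR rule."""
    match = f"router pbr map {pbr_map} rule {rule} match"
    nexthop = f"router nexthop group {group} via {via}"
    return [
        # Step 1: remove every existing condition, then apply.
        f"nv unset {match} source-ip",
        f"nv unset {match} destination-ip",
        f"nv unset {nexthop}",
        "nv config apply",
        # Step 2: add the conditions back with the new values, then apply.
        f"nv set {match} source-ip {source_ip}",
        f"nv set {match} destination-ip {destination_ip}",
        f"nv set {nexthop}",
        "nv config apply",
    ]


commands = modify_rule_commands(
    "pbr-policy", 4, "group1", "192.168.0.21", "10.1.4.2/24", "10.1.2.0/24"
)
for command in commands:
    print(command)
```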
Troubleshooting
To see the policies applied to all interfaces on the switch, run the NVUE nv show router pbr -o json command:
To see the policies applied to a specific interface on the switch, run the NVUE nv show interface <interface> router pbr command or the vtysh show pbr interface <interface> command.
To see information about all policies, including mapped table and rule numbers, run the NVUE nv show router pbr map command or the vtysh show pbr map command. If the rule is not set, you see a reason why.
To see information about a specific next hop group, run the vtysh show pbr nexthop-group group1 command.
Each next hop and next hop group uses a new Linux routing table ID.
To show the reserved routing table range, run the NVUE nv show system global reserved routing-table pbr command.
cumulus@switch:~$ nv show system global reserved routing-table pbr
       operational  applied
-----  -----------  ----------
begin  10000        10000
end    4294966272   4294966272
Example Configuration
In the following example, the PBR-enabled switch has a PBR policy to route all traffic from the Internet to a server that performs anti-DDOS. After cleaning, the traffic returns to the PBR-enabled switch and then passes on to the regular destination-based routing mechanism.
cumulus@switch:~$ nv set router pbr map map1 rule 1 match source-ip 0.0.0.0/0
cumulus@switch:~$ nv set router nexthop group group1 via 192.168.0.32
cumulus@switch:~$ nv set router pbr map map1 rule 1 action nexthop-group group1
cumulus@switch:~$ nv set interface swp51 router pbr map map1
cumulus@switch:~$ nv config apply
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
interface swp51
pbr-policy map1
nexthop-group group1
nexthop 192.168.0.32
pbr-map map1 seq 1
match src-ip 0.0.0.0/0
set nexthop-group group1
...
Equal Cost Multipath Load Sharing - Hardware ECMP
Cumulus Linux enables hardware ECMP by default. Load sharing occurs automatically for all routes with multiple next hops installed. ECMP load sharing supports both IPv4 and IPv6 routes.
How Does ECMP Work?
ECMP operates only on equal cost routes in the Linux routing table. In the following example, the 10.10.10.3/32 route has four possible next hops installed in the routing table:
cumulus@leaf01:mgmt:~$ net show route 10.10.10.3/32
RIB entry for 10.10.10.3/32
===========================
Routing entry for 10.10.10.3/32
Known via "bgp", distance 20, metric 0, best
Last update 10:04:41 ago
* fe80::4ab0:2dff:fe60:910e, via swp54, weight 1
* fe80::4ab0:2dff:fea7:7852, via swp53, weight 1
* fe80::4ab0:2dff:fec8:8fb9, via swp52, weight 1
* fe80::4ab0:2dff:feff:e147, via swp51, weight 1
FIB entry for 10.10.10.3/32
===========================
10.10.10.3 nhid 108 proto bgp metric 20
For Cumulus Linux to consider routes equal, the routes must:
Originate from the same routing protocol. Routes from different sources are not considered equal. For example, a static route and an OSPF route are not considered for ECMP load sharing.
Have equal cost. If two routes from the same protocol are unequal, only the best route installs in the routing table.
When multiple routes are in the routing table, a hash determines which path a packet follows. To prevent out-of-order packets, ECMP hashes on a per-flow basis; all packets with the same source and destination IP addresses and the same source and destination ports always hash to the same next hop. ECMP hashing does not keep a record of packets that hash to each next hop and does not guarantee that traffic to each next hop is equal.
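As a rough illustration of per-flow hashing, the Python sketch below maps a flow tuple to one of four next hops. CRC32 stands in for the hardware hash, which this sketch does not model; the function name `pick_next_hop` and the sample flows are assumptions made for the example.

```python
import zlib
from collections import Counter


def pick_next_hop(flow, next_hops):
    """Map a flow tuple to one next hop with a stable hash (CRC32 stand-in)."""
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


next_hops = ["swp51", "swp52", "swp53", "swp54"]
flow = ("10.1.4.7", "10.10.10.3", 49152, 443)  # src IP, dst IP, src port, dst port

# Every packet of the same flow selects the same next hop.
first = pick_next_hop(flow, next_hops)
second = pick_next_hop(flow, next_hops)

# Many flows spread across the next hops, but not necessarily evenly.
flows = [("10.1.4.7", "10.10.10.3", 49152 + i, 443) for i in range(1000)]
load = Counter(pick_next_hop(f, next_hops) for f in flows)
print(first == second, dict(load))
```

The per-next-hop counts in `load` depend only on the flow headers, which is why the text above notes that the split is not guaranteed to be equal.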
Cumulus Linux enables the BGP maximum-paths setting by default and installs multiple routes. Refer to BGP and ECMP.
Next Hop Groups
ECMP routes resolve to next hop groups, which identify one or more next hops. To view next hop information, run the NVUE nv show router nexthop rib or nv show router nexthop rib <id> commands, or the ip nexthop show or ip nexthop show <id> kernel commands.
cumulus@leaf01:mgmt:~$ nv show router nexthop rib
Nexthop-group  address-family  installed  interface-index  ref-count  type   valid  vrf      Summary
-------------  --------------  ---------  ---------------  ---------  -----  -----  -------  ------------------
...
75             ipv4            on         74               2          zebra  on     default
76             ipv4            on         74               2          zebra  on     default
77             unspecified     on                          2          zebra  on     default  Nexthop-group: 78
                                                                                             Nexthop-group: 79
                                                                                             Nexthop-group: 78
                                                                                             Nexthop-group: 79
78             ipv4            on         67               3          zebra  on     default
79             ipv4            on         67               3          zebra  on     default
90             ipv6            on         55               8          zebra  on     default
96             ipv6            on         54               8          zebra  on     default
108            unspecified     on                          6          zebra  on     default  Nexthop-group: 109
                                                                                             Nexthop-group: 65
                                                                                             Nexthop-group: 90
                                                                                             Nexthop-group: 96
                                                                                             Nexthop-group: 109
                                                                                             Nexthop-group: 65
                                                                                             Nexthop-group: 90
                                                                                             Nexthop-group: 96
...
The following example shows information for next hop group 108:
nv set system forwarding ecmp-hash inner-ip-protocol on
nv set system forwarding ecmp-hash inner-ip-protocol off
hash_config.inner_ip_prot
Inner source IP address
off
nv set system forwarding ecmp-hash inner-source-ip on
nv set system forwarding ecmp-hash inner-source-ip off
hash_config.inner_sip
Inner destination IP address
off
nv set system forwarding ecmp-hash inner-destination-ip on
nv set system forwarding ecmp-hash inner-destination-ip off
hash_config.inner_dip
Inner source port
off
nv set system forwarding ecmp-hash inner-source-port on
nv set system forwarding ecmp-hash inner-source-port off
hash_config.inner_sport
Inner destination port
off
nv set system forwarding ecmp-hash inner-destination-port on
nv set system forwarding ecmp-hash inner-destination-port off
hash_config.inner_dport
Inner IPv6 flow label
off
nv set system forwarding ecmp-hash inner-ipv6-label on
nv set system forwarding ecmp-hash inner-ipv6-label off
hash_config.inner_ip6_label
The following example commands omit the source port and destination port from the hash calculation:
cumulus@switch:~$ nv set system forwarding ecmp-hash source-port off
cumulus@switch:~$ nv set system forwarding ecmp-hash destination-port off
cumulus@switch:~$ nv config apply
Use the instructions below when NVUE is not enabled. If you are using NVUE to configure your switch, the NVUE commands change the settings in /etc/cumulus/datapath/nvue_traffic.conf, which takes precedence over the settings in /etc/cumulus/datapath/traffic.conf.
Edit the /etc/cumulus/datapath/traffic.conf file:
Uncomment the hash_config.enable = true option.
Set the hash_config.sport and hash_config.dport options to false.
cumulus@switch:~$ sudo nano /etc/cumulus/datapath/traffic.conf
...
# HASH config for ECMP to enable custom fields
# Fields will be applicable for ECMP hash
# calculation
#Note : Currently supported only for MLX platform
# Uncomment to enable custom fields configured below
hash_config.enable = true
#hash Fields available ( assign true to enable)
#ip protocol
hash_config.ip_prot = true
#source ip
hash_config.sip = true
#destination ip
hash_config.dip = true
#source port
hash_config.sport = false
#destination port
hash_config.dport = false
...
Run the echo 1 > /cumulus/switchd/ctrl/hash_config_reload command. This command does not cause any traffic interruptions.
Cumulus Linux enables symmetric hashing by default. Make sure that the settings for the source IP and destination IP fields match, and that the settings for the source port and destination port fields match; otherwise Cumulus Linux disables symmetric hashing automatically. If necessary, you can disable symmetric hashing manually in the /etc/cumulus/datapath/traffic.conf file by setting symmetric_hash_enable = FALSE.
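The requirement that paired fields match can be illustrated with a short Python sketch. It is a conceptual model only and makes no claim about how the ASIC computes a symmetric hash: the `flow_key` and `symmetric_hash` helpers and the field names `sip`, `dip`, `sport`, and `dport` are assumptions for the example. Sorting each source/destination pair makes the key identical for both directions of a flow, which only works when both members of a pair are included.

```python
import zlib


def flow_key(packet, fields):
    """Build an order-independent key from the enabled hash fields."""
    parts = []
    for first, second in (("sip", "dip"), ("sport", "dport")):
        values = [str(packet[name]) for name in (first, second) if name in fields]
        parts.append(",".join(sorted(values)))
    return "|".join(parts)


def symmetric_hash(packet, fields):
    """Hash the order-independent key (CRC32 stands in for the real hash)."""
    return zlib.crc32(flow_key(packet, fields).encode())


forward = {"sip": "10.1.4.7", "dip": "10.10.10.3", "sport": 49152, "dport": 443}
reverse = {"sip": "10.10.10.3", "dip": "10.1.4.7", "sport": 443, "dport": 49152}

matched = {"sip", "dip", "sport", "dport"}
unmatched = {"sip", "sport", "dport"}  # destination IP left out

# With matching pairs, both directions of the flow produce the same hash.
same_bucket = symmetric_hash(forward, matched) == symmetric_hash(reverse, matched)
# With only one member of the IP pair, the two directions produce different keys.
keys_differ = flow_key(forward, unmatched) != flow_key(reverse, unmatched)
print(same_bucket, keys_differ)
```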
GTP Hashing
GTP carries mobile data within the core of the mobile operator’s network. Traffic in the 5G Mobility core cluster, from cell sites to compute nodes, has the same source and destination IP address. The only way to identify individual flows is with the GTP TEID. Enabling GTP hashing adds the TEID as a hash parameter and helps the Cumulus Linux switches in the network to distribute mobile data traffic evenly across ECMP routes.
Cumulus Linux supports TEID-based ECMP hashing for:
GTP TEID-based ECMP hashing is only applicable if the outer header egressing the port is GTP encapsulated and if the ingress packet is either a GTP-U packet or a VXLAN encapsulated GTP-U packet.
Cumulus Linux supports GTP Hashing on NVIDIA Spectrum-2 and later.
cumulus@switch:~$ nv set system forwarding ecmp-hash gtp-teid on
cumulus@switch:~$ nv config apply
To disable TEID-based ECMP hashing, run the nv set system forwarding ecmp-hash gtp-teid off command.
Use the instructions below when NVUE is not enabled. If you are using NVUE to configure your switch, the NVUE commands change the settings in /etc/cumulus/datapath/nvue_traffic.conf, which takes precedence over the settings in /etc/cumulus/datapath/traffic.conf.
Edit the /etc/cumulus/datapath/traffic.conf file and change the hash_config.gtp_teid parameter to true:
To disable TEID-based ECMP hashing, set the hash_config.gtp_teid parameter to false, then reload the configuration.
To show that TEID-based ECMP hashing is on, run the command:
cumulus@switch:~$ nv show system forwarding ecmp-hash
                   applied  description
-----------------  -------  -----------------------------------
destination-ip     on       Destination IPv4/IPv6 Address
destination-port   on       TCP/UDP destination port
gtp-teid           on       GTP-U TEID
...
Unique Hash Seed
You can configure a unique hash seed for each switch to prevent hash polarization, a type of network congestion that occurs when multiple data flows try to reach a switch using the same switch ports.
You can set a hash seed value between 0 and 4294967295. If you do not specify a value, switchd creates a randomly generated seed.
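To see why a per-switch seed helps, the Python sketch below models two switches in series that use the same hash function. SHA-256 stands in for the hardware hash, and the `pick` function, the port count, and the seed values 50 and 51 are assumptions chosen for illustration. When both switches use the same seed, every flow that the first switch sends out of port 0 also selects port 0 on the second switch; a different seed spreads those flows again.

```python
import hashlib


def pick(flow, num_ports, seed):
    """Select a port index from a seeded hash of the flow (SHA-256 stand-in)."""
    digest = hashlib.sha256(f"{seed}|{flow}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_ports


flows = [("10.1.4.7", "10.10.10.3", 1024 + i, 443) for i in range(1000)]

# Flows that the first switch (seed 50) sends out of port 0 reach the second switch.
arriving = [f for f in flows if pick(f, 4, seed=50) == 0]

# Second switch with the same seed: every arriving flow lands on one port.
same_seed_ports = {pick(f, 4, seed=50) for f in arriving}
# Second switch with its own seed: the arriving flows use several ports.
other_seed_ports = {pick(f, 4, seed=51) for f in arriving}

print(sorted(same_seed_ports), sorted(other_seed_ports))
```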
To configure the hash seed:
cumulus@switch:~$ nv set system forwarding hash-seed 50
cumulus@switch:~$ nv config apply
If you do not enable NVUE, use the instructions below. If you are using NVUE to configure your switch, the NVUE commands change the settings in /etc/cumulus/datapath/nvue_traffic.conf, which takes precedence over the settings in /etc/cumulus/datapath/traffic.conf.
Edit the /etc/cumulus/datapath/traffic.conf file to change the ecmp_hash_seed parameter, then restart switchd.
cumulus@switch:~$ sudo nano /etc/cumulus/datapath/traffic.conf
...
#Specify the hash seed for Equal cost multipath entries
# and for custom ecmp and lag hash
# Default value: random
# Value Range: {0..4294967295}
ecmp_hash_seed = 50
...
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Resilient Hashing
In Cumulus Linux, when a next hop fails or you remove the next hop from an ECMP pool, the hashing or hash bucket assignment can change. Resilient hashing is an alternate way to manage ECMP groups. Cumulus Linux assigns next hops to buckets using their hashing header fields and uses the resulting hash to index into the table of 2^n hash buckets. Because all packets in a given flow have the same header hash value, they all use the same flow bucket.
Resilient hashing supports both IPv4 and IPv6 routes.
Resilient hashing prevents disruption when you remove next hops but does not prevent disruption when you add next hops.
The NVIDIA Spectrum ASIC assigns packets to hash buckets and assigns hash buckets to next hops. The ASIC also runs a background thread that monitors buckets and can migrate buckets between next hops to rebalance the load.
When you remove a next hop, Cumulus Linux distributes the assigned buckets to the remaining next hops.
When you add a next hop, Cumulus Linux assigns no buckets to the new next hop until the background thread rebalances the load.
The load rebalances when the active flow timer expires only if there are inactive hash buckets available; the new next hop can remain unpopulated until the period set in active flow timer expires.
When the unbalanced timer expires and the load is not balanced, the thread migrates buckets to different next hops to rebalance the load.
Any flow can migrate to any next hop, depending on flow activity and load balance conditions. Over time, the flow can get pinned, which is the default setting and behavior.
When you enable resilient hashing, Cumulus Linux assigns next hops in round robin fashion to a fixed number of buckets. In this example, there are 12 buckets and four next hops.
Unlike default ECMP hashing, when you need to remove a next hop, the number of hash buckets does not change.
With 12 buckets and four next hops, instead of reducing the number of buckets, which impacts flows to known good hosts, the remaining next hops replace the failed next hop.
After you remove the failed next hop, the remaining next hops replace it. This prevents impact to any flows that hash to working next hops.
Resilient hashing does not prevent possible impact to existing flows when you add new next hops. Because there are a fixed number of buckets, a new next hop requires reassigning next hops to buckets.
As a result, some flows hash to new next hops, which can impact anycast deployments.
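The 12-bucket, four-next-hop example above can be sketched in Python as follows. This is a simplified model of the idea, not the ASIC's algorithm: the function names and the next hop labels A through D are hypothetical, and the replacement order for the failed next hop's buckets is an assumption.

```python
def assign_round_robin(next_hops, num_buckets):
    """Fill a fixed number of buckets with next hops in round-robin order."""
    return [next_hops[i % len(next_hops)] for i in range(num_buckets)]


def remove_next_hop(buckets, failed):
    """Reassign only the failed next hop's buckets; leave the rest untouched."""
    remaining = list(dict.fromkeys(hop for hop in buckets if hop != failed))
    result, replaced = [], 0
    for hop in buckets:
        if hop == failed:
            result.append(remaining[replaced % len(remaining)])
            replaced += 1
        else:
            result.append(hop)
    return result


before = assign_round_robin(["A", "B", "C", "D"], 12)
after = remove_next_hop(before, "C")

# Indexes of the buckets whose next hop changed: only those that pointed at C.
moved = [i for i, (old, new) in enumerate(zip(before, after)) if old != new]
print(before)
print(after)
print(moved)
```

The bucket count stays at 12, so flows that hash to buckets owned by A, B, or D keep their next hop; only the three buckets that pointed at C are reassigned.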
Cumulus Linux does not enable resilient hashing by default. When you enable resilient hashing, all ECMP groups share 65,536 buckets. An ECMP group is a list of unique next hops that multiple ECMP routes reference.
An ECMP route counts as a single route with multiple next hops.
All ECMP routes must use the same number of buckets (you cannot configure the number of buckets per ECMP route).
A larger number of ECMP buckets reduces the impact of adding new next hops to an ECMP route. However, the system supports fewer ECMP routes. If you install the maximum number of ECMP routes, new ECMP routes log an error and do not install.
You can configure route and MAC address hardware resources depending on ECMP bucket size changes. See NVIDIA Spectrum routing resources.
To enable resilient hashing:
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/cumulus/datapath/traffic.conf file to uncomment and set the resilient_hash_enable parameter to TRUE.
You can also set the resilient_hash_entries_ecmp parameter to the number of hash buckets to use for all ECMP routes. On Spectrum switches, you can set the number of buckets to 64, 512, 1024, 2048, or 4096. On NVIDIA Spectrum-2 and later, you can set the number of buckets to 64, 128, 256, 512, 1024, 2048, or 4096. The default value is 64.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Resilient hashing in hardware does not work with next hop groups; the switch remaps flows to new next hops when the set of next hops changes. To work around this issue, configure zebra not to install next hop IDs in the kernel with the following vtysh command:
cumulus@switch:~$ sudo vtysh
switch# configure terminal
switch(config)# zebra nexthop proto only
switch(config)# exit
switch# write memory
switch# exit
cumulus@switch:~$
Considerations
When the router adds or removes ECMP paths, or when the next hop IP address, interface, or tunnel changes, the next hop information for an IPv6 prefix can change. FRR deletes the existing route to that prefix from the kernel, then adds a new route with all the relevant new information. In certain situations, Cumulus Linux does not maintain resilient hashing for IPv6 flows.
To work around this issue, you can enable IPv6 route replacement.
For certain configurations, IPv6 route replacement can lead to incorrect forwarding decisions and lost traffic. For example, it is possible for a destination to have next hops with a gateway value with the outbound interface or just the outbound interface itself, without a gateway address. If both types of next hops for the same destination exist, route replacement does not operate correctly; Cumulus Linux adds an additional route entry and next hop but does not delete the previous route entry and next hop.
To enable the IPv6 route replacement option:
In the /etc/frr/daemons file, add the configuration option --v6-rr-semantics to the zebra daemon definition. For example:
cumulus@switch:~$ sudo nano /etc/frr/daemons
...
vtysh_enable=yes
zebra_options=" -M cumulus_mlag -M snmp -A 127.0.0.1 --v6-rr-semantics -s 90000000"
bgpd_options=" -M snmp -A 127.0.0.1"
ospfd_options=" -M snmp -A 127.0.0.1"
...
Restarting FRR restarts all the routing protocol daemons that are enabled and running.
To verify that IPv6 route replacement is on, run the systemctl status frr command:
cumulus@switch:~$ systemctl status frr
● frr.service - FRRouting
Loaded: loaded (/lib/systemd/system/frr.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2020-02-03 20:02:33 UTC; 3min 8s ago
Docs: https://frrouting.readthedocs.io/en/latest/setup.html
Process: 4675 ExecStart=/usr/lib/frr/frrinit.sh start (code=exited, status=0/SUCCESS)
Memory: 14.4M
CGroup: /system.slice/frr.service
├─4685 /usr/lib/frr/watchfrr -d zebra bgpd staticd
├─4701 /usr/lib/frr/zebra -d -M snmp -A 127.0.0.1 --v6-rr-semantics -s 90000000
├─4705 /usr/lib/frr/bgpd -d -M snmp -A 127.0.0.1
└─4711 /usr/lib/frr/staticd -d -A 127.0.0.1
cl-ecmpcalc
Run the cl-ecmpcalc command to determine a hardware hash result. For example, you can see which path a flow takes through a network. You must provide all fields in the hash, including the ingress interface, layer 3 source IP, layer 3 destination IP, layer 4 source port, and layer 4 destination port.
cl-ecmpcalc only supports input interfaces that convert to a single physical port in the port tab file, such as the physical switch ports (swp). You cannot specify virtual interfaces like bridges, bonds, or subinterfaces.
Adaptive routing is a beta feature and open to customer feedback. This feature is not currently intended to run in production and is not supported through NVIDIA networking support.
Adaptive routing is a load balancing mechanism that improves network utilization by selecting routes dynamically based on the immediate network state, such as switch queue length and port utilization.
The benefits of using adaptive routing include:
The switch can forward RoCE traffic over all the available ECMP member ports to maximize the total traffic throughput.
For leaf to spine traffic flows, the switch distributes incoming traffic equally between the available spines, which helps to minimize latency and congestion on network resources.
If the cumulative rate of one or more RoCE traffic streams exceeds the link bandwidth of the individual uplink port, adaptive routing can distribute the traffic dynamically between multiple uplink ports; the available bandwidth for RoCE traffic is not limited to the link bandwidth of the individual uplink port.
If the link bandwidth of the individual uplink ports is lower than that of the ingress port, RoCE traffic can flow through; the switch distributes the traffic between the available ECMP member ports without affecting the existing traffic.
Cumulus Linux only supports adaptive routing with:
Physical uplink (layer 3) ports; you cannot configure adaptive routing on subinterfaces, SVIs, bonds, or ports that are part of a bond.
Interfaces in the default VRF
Adaptive routing does not make use of resilient hashing.
You cannot use adaptive routing with EVPN or VXLAN.
With adaptive routing, packets route to the less loaded path on a per packet basis to best utilize the fabric resources and avoid congestion for the specific time duration. This mode is more time effective and restricts the port selection change decision to a predefined time.
The change decision for port selection is set to one microsecond; you cannot change it.
You must configure adaptive routing on all ports that are part of the same ECMP route. Make sure the ports are physical uplink ports.
To enable adaptive routing:
For each port on which you want to enable adaptive routing, run the nv set interface <interface> router adaptive-routing enable on command.
cumulus@switch:~$ nv set interface swp51 router adaptive-routing enable on
cumulus@switch:~$ nv config apply
When you run the above command, NVUE:
Enables the adaptive routing feature.
Sets adaptive routing on the specified port.
Sets the link utilization threshold percentage to the default value of 70. Adaptive routing considers the port congested based on the link utilization threshold.
To change the link utilization threshold percentage, run the nv set interface <interface> router adaptive-routing link-utilization-threshold command. You can set a value between 1 and 100.
To disable adaptive routing globally, run the nv set router adaptive-routing enable off command. To disable adaptive routing on a port, run the nv set interface <interface> router adaptive-routing enable off command.
When you enable adaptive routing, NVUE restarts the switchd service, which causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Edit the /etc/cumulus/switchd.d/adaptive_routing.conf file:
Set the global adaptive_routing.enable parameter to TRUE.
For each port on which you want to enable adaptive routing, set the interface.<port>.adaptive_routing.enable parameter to TRUE.
For each port on which you want to enable adaptive routing, set the interface.<port>.adaptive_routing.link_util_thresh parameter to configure the link utilization threshold percentage (optional). Adaptive routing considers the port congested based on the link utilization threshold. You can set a value between 1 and 100. The default value is 70.
cumulus@switch:~$ sudo nano /etc/cumulus/switchd.d/adaptive_routing.conf
## Global adaptive-routing enable/disable setting
adaptive_routing.enable = TRUE
## Supported AR profile modes : STICKY_FREE
#adaptive_routing.profile0.mode = STICKY_FREE
## Maximum value for buffer-congestion threshold is 16777216. Unit is in cells
#adaptive_routing.congestion_threshold.low = 100
#adaptive_routing.congestion_threshold.medium = 1000
#adaptive_routing.congestion_threshold.high = 10000
## Per-port configuration for adaptive-routing
interface.swp51.adaptive_routing.enable = TRUE
interface.swp51.adaptive_routing.link_util_thresh = 70
...
The /etc/cumulus/switchd.d/adaptive_routing.conf file contains the adaptive routing profile mode (STICKY_FREE) and default buffer congestion threshold settings; do not change these settings.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
To disable adaptive routing globally, set the adaptive_routing.enable parameter to FALSE in the /etc/cumulus/switchd.d/adaptive_routing.conf file.
To disable adaptive routing on a port, set the interface.<port>.adaptive_routing.enable parameter to FALSE in the /etc/cumulus/switchd.d/adaptive_routing.conf file.
To verify that adaptive routing is on, run the nv show router adaptive-routing command:
cumulus@leaf01:mgmt:~$ nv show router adaptive-routing
        applied  description
------  -------  ------------------------------------------------------
enable  on       Turn the feature 'on' or 'off'. The default is 'off'.
To show adaptive routing configuration for an interface, run the nv show interface <interface> router adaptive-routing command:
cumulus@leaf01:mgmt:~$ nv show interface swp51 router adaptive-routing
                            applied  description
--------------------------  -------  ------------------------------------------------------
enable                      on       Turn the feature 'on' or 'off'. The default is 'off'.
link-utilization-threshold  100      Link utilization threshold percentage
Unequal Cost Multipath with BGP Link Bandwidth
You use UCMP in data center networks that rely on anycast routing to provide network-based load balancing. Cumulus Linux supports UCMP by using the BGP link bandwidth extended community to load balance traffic towards anycast services for IPv4 and IPv6 routes in a layer 3 deployment and for prefix (type-5) routes in an EVPN deployment.
UCMP Routing
In ECMP, the route to a destination has multiple next hops and traffic distributes across them equally. Flow-based hashing ensures that all traffic associated with a particular flow uses the same next hop and the same path across the network.
In UCMP, along with the ECMP flow-based hash, Cumulus Linux associates a weight with each next hop and distributes traffic across the next hops in proportion to their weight. The BGP link bandwidth extended community carries information about the anycast server distribution through the network, which maps to the weight of the corresponding next hop. The mapping factors the bandwidth value of a particular path against the total bandwidth values of all possible paths, mapped to the range 1 to 100. The BGP best path selection algorithm and the multipath computation algorithm that determines which paths you can use for load balancing do not change.
UCMP Example
The above example shows how traffic towards 192.168.10.1/32 is load balanced when you use UCMP routing:
leaf01 and leaf02 each have two ECMP paths to 192.168.10.1/32 (through server01 and server03), whereas leaf03 and leaf04 each have a single path to server04.
leaf01, leaf02, leaf03, and leaf04 generate a BGP link bandwidth based on the number of BGP multipaths for a prefix.
When announcing the prefix to the spines, leaf01 and leaf02 generate a link bandwidth of two while leaf03 and leaf04 generate a link bandwidth of one.
Each spine advertises the 192.168.10.1/32 prefix to the border leafs with an accumulated bandwidth of 6. This combines the value of 2 from leaf01, 2 from leaf02, 1 from leaf03 and 1 from leaf04.
Now, each spine has four UCMP routes:
through leaf01 with weight 2
through leaf02 with weight 2
through leaf03 with weight 1
through leaf04 with weight 1
The border leafs also have four UCMP routes:
through spine01 with weight 6
through spine02 with weight 6
through spine03 with weight 6
through spine04 with weight 6
The border leafs balance traffic equally because the weights toward all four spines are equal. Only the spines share load unequally, based on the weight values.
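The arithmetic in this example can be reproduced with a short Python sketch. It only restates the numbers from the topology above; the variable names are illustrative and do not correspond to any Cumulus Linux or FRR data structure.

```python
# Number of BGP multipaths each leaf has for 192.168.10.1/32 (from the example).
leaf_multipaths = {"leaf01": 2, "leaf02": 2, "leaf03": 1, "leaf04": 1}

# Each leaf generates a link bandwidth based on its multipath count,
# so the weight a spine sees for a leaf equals that count.
spine_weights = dict(leaf_multipaths)

# A spine advertises the prefix with the accumulated bandwidth of its paths.
accumulated = sum(spine_weights.values())

# Share of a spine's traffic for the prefix that goes to each leaf.
spine_share = {leaf: w / accumulated for leaf, w in spine_weights.items()}

# Every spine advertises the same accumulated value, so a border leaf
# sees four equal weights and splits traffic evenly.
border_weights = {f"spine0{i}": accumulated for i in range(1, 5)}
border_total = sum(border_weights.values())
border_share = {spine: w / border_total for spine, w in border_weights.items()}

print(accumulated)
print(spine_share)
print(border_share)
```

With these inputs, leaf01 and leaf02 each receive one third of a spine's traffic for the prefix, leaf03 and leaf04 each receive one sixth, and each spine receives one quarter of a border leaf's traffic.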
Configure UCMP
Set the BGP link bandwidth extended community in a route map against all prefixes, a specific prefix, or a set of prefixes using the match clause of the route map. Apply the route map on the first device to receive the prefix, against the BGP neighbor that generated the prefix.
The BGP link bandwidth extended community uses bytes per second. To convert the number of ECMP paths, Cumulus Linux uses a reference bandwidth of 1024 Kbps. For example, if there are four ECMP paths to an anycast IP, the encoded bandwidth in the extended community is 512,000 bytes per second. The actual value is not important, as long as all routers originating the link bandwidth convert the number of ECMP paths in the same way.
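As a worked check of the conversion, the following Python sketch multiplies the number of ECMP paths by the 1024 Kbps reference bandwidth and converts the result to bytes per second. It assumes 1 Kbps equals 1000 bits per second, which is the interpretation that reproduces the 512,000 figure above; the function name is hypothetical.

```python
REFERENCE_BANDWIDTH_KBPS = 1024  # reference bandwidth stated in this section


def encode_link_bandwidth(num_multipaths: int) -> int:
    """Return the link bandwidth in bytes per second for a path count."""
    # Assumption: 1 Kbps = 1000 bits per second.
    bits_per_second = num_multipaths * REFERENCE_BANDWIDTH_KBPS * 1000
    return bits_per_second // 8  # 8 bits per byte


for paths in (1, 2, 4):
    print(paths, encode_link_bandwidth(paths))
```

Four paths yield 512000, matching the example. Because only the ratio between values matters, any consistent reference bandwidth produces the same traffic split.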
Cumulus Linux accepts the bandwidth extended community by default. You do not need to configure transit devices where UCMP routes are not originated.
Cumulus Linux does not provide NVUE commands for UCMP configuration.
The bandwidth used in the extended community has no impact on or relation to port bandwidth.
You can only apply the route weight information on the outbound direction to a peer; you cannot apply route weight information on the inbound direction from peers advertising routes to the switch.
Set the BGP Link Bandwidth Extended Community Against All Prefixes
The following example sets the BGP link bandwidth extended community against all prefixes.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
address-family ipv4 unicast
neighbor 10.1.1.1 route-map ucmp-route-map out
!
route-map ucmp-route-map permit 10
set extcommunity bandwidth num-multipaths
...
Set the BGP Link Bandwidth Extended Community Against Certain Prefixes
The following example sets the BGP link bandwidth extended community for anycast servers in the 192.168.0.0/16 IP address range.
cumulus@switch:~$ nv set router policy prefix-list anycast_ip type ipv4
cumulus@switch:~$ nv set router policy prefix-list anycast_ip rule 1 match 192.168.0.0/16 max-prefix-len 30
cumulus@switch:~$ nv set router policy prefix-list anycast_ip rule 1 action permit
cumulus@switch:~$ nv set router policy route-map ucmp-route-map rule 1 action permit
cumulus@switch:~$ nv set router policy route-map ucmp-route-map rule 1 match ip-prefix-list anycast_ip
cumulus@switch:~$ nv set router policy route-map ucmp-route-map rule 1 set ext-community-bw multipaths
cumulus@switch:~$ nv set vrf default router bgp neighbor swp51 address-family ipv4-unicast policy outbound prefix-list anycast_ip
cumulus@switch:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
...
leaf01# configure terminal
leaf01(config)# ip prefix-list anycast_ip seq 10 permit 192.168.0.0/16 le 32
leaf01(config)# route-map ucmp-route-map permit 10
leaf01(config-route-map)# match ip address prefix-list anycast_ip
leaf01(config-route-map)# set extcommunity bandwidth num-multipaths
leaf01(config-route-map)# router bgp 65011
leaf01(config-router)# address-family ipv4 unicast
leaf01(config-router-af)# neighbor swp51 prefix-list anycast_ip out
leaf01(config-router-af)# end
leaf01# write memory
leaf01# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
address-family ipv4 unicast
neighbor 10.1.1.1 route-map ucmp-route-map out
!
ip prefix-list anycast-ip permit 192.168.0.0/16 le 32
route-map ucmp-route-map permit 10
match ip address prefix-list anycast-ip
set extcommunity bandwidth num-multipaths
...
EVPN Configuration
For EVPN configuration, make sure that you activate the commands under the EVPN address family. The following shows an example EVPN configuration that sets the BGP link bandwidth extended community against all prefixes.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
address-family l2vpn evpn
advertise ipv4 unicast route-map ucmp-route-map
exit-address-family
!
ip prefix-list anycast-ip permit 192.168.0.0/16 le 32
route-map ucmp-route-map permit 10
match ip address prefix-list anycast-ip
set extcommunity bandwidth num-multipaths
...
Control UCMP on the Receiving Switch
To control UCMP on the receiving switch, you can:
Set default values for UCMP routes.
Disable the advertisement of all BGP extended communities on specific peerings.
Set Default Values for UCMP Routes
By default, if some of the multipaths do not have link bandwidth, Cumulus Linux ignores the bestpath bandwidth value in any of the multipaths and performs ECMP. However, you can set one of the following options instead:
Ignore link bandwidth and perform ECMP.
Skip paths without link bandwidth and perform UCMP among the others (if at least some paths have link bandwidth).
Assign a low default weight (value 1) to paths that do not have link bandwidth.
Change this setting per BGP instance for both IPv4 and IPv6 unicast routes in the BGP instance. For EVPN, set the options on the tenant VRF.
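The following Python sketch models how the default behavior and the three options treat multipaths that lack link bandwidth. It is an interpretation of the descriptions above, not the FRR implementation; the function name, the mode strings, and the use of None for a missing bandwidth are assumptions for illustration.

```python
from typing import Dict, Optional


def next_hop_weights(paths: Dict[str, Optional[int]],
                     mode: Optional[str] = None) -> Dict[str, int]:
    """Return a weight per next hop; None means no link bandwidth received."""
    with_bw = {nh: bw for nh, bw in paths.items() if bw is not None}
    ecmp = {nh: 1 for nh in paths}  # equal weights, that is, plain ECMP

    def scaled(bandwidths: Dict[str, int]) -> Dict[str, int]:
        # Scale each bandwidth against the total, into the range 1 to 100.
        total = sum(bandwidths.values())
        return {nh: max(1, bw * 100 // total) for nh, bw in bandwidths.items()}

    if mode == "ignore" or not with_bw:
        return ecmp
    if len(with_bw) == len(paths):
        return scaled(with_bw)  # every path has link bandwidth: UCMP
    if mode == "skip-missing":
        return scaled(with_bw)  # paths without link bandwidth are dropped
    if mode == "default-weight-for-missing":
        weights = scaled(with_bw)
        weights.update({nh: 1 for nh in paths if nh not in with_bw})
        return weights
    return ecmp  # default: some paths lack link bandwidth, so use ECMP


example = {"swp1": 2000, "swp2": 1000, "swp3": None}
for mode in (None, "ignore", "skip-missing", "default-weight-for-missing"):
    print(mode, next_hop_weights(example, mode))
```

In this sketch, only skip-missing removes a next hop from the result; the other modes keep all three next hops and differ only in the weights they assign.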
Run the NVUE nv set vrf <vrf> router bgp path-selection multipath bandwidth ignore, nv set vrf <vrf> router bgp path-selection multipath bandwidth skip-missing, or nv set vrf <vrf> router bgp path-selection multipath bandwidth default-weight-for-missing command.
The following example sets link bandwidth processing to skip paths without link bandwidth and perform UCMP among the other paths:
The BGP link bandwidth extended community is passed on automatically with the prefix to eBGP peers. If you do not want to pass on the BGP link bandwidth extended community outside of a particular domain, you can disable the advertisement of all BGP extended communities on specific peerings.
You cannot disable just the BGP link bandwidth extended community from advertising to a neighbor; you either send all BGP extended communities, or none.
The following example disables all BGP extended communities on a peer:
cumulus@switch:~$ nv set vrf default router bgp neighbor swp51 address-family ipv4-unicast community-advertise extended off
cumulus@switch:~$ nv config apply
To show the extended community in a received or local route, run the vtysh show bgp command or the net show bgp command.
The following example shows that the switch receives an IPv4 unicast route with the BGP link bandwidth attribute from two peers. The link bandwidth extended community is in bytes per second and shows in megabits per second: Extended Community: LB:65002:131072000 (1000.000 Mbps) and Extended Community: LB:65001:65536000 (500.000 Mbps).
cumulus@switch:~$ sudo vtysh
...
switch# show ip bgp ipv4 unicast 192.168.10.1/32
BGP routing table entry for 192.168.10.1/32
Paths: (2 available, best #2, table default)
Advertised to non peer-group peers:
l1(swp1) l2(swp2) l3(swp3) l4(swp4)
65002
fe80::202:ff:fe00:1b from l2(swp2) (10.0.0.2)
(fe80::202:ff:fe00:1b) (used)
Origin IGP, metric 0, valid, external, multipath, bestpath-from-AS 65002
Extended Community: LB:65002:131072000 (1000.000 Mbps)
Last update: Thu Feb 20 18:34:16 2020
65001
fe80::202:ff:fe00:15 from l1(swp1) (110.0.0.1)
(fe80::202:ff:fe00:15) (used)
Origin IGP, metric 0, valid, external, multipath, bestpath-from-AS 65001, best (Older Path)
Extended Community: LB:65001:65536000 (500.000 Mbps)
Last update: Thu Feb 20 18:22:34 2020
The bandwidth value used by UCMP is only to determine the percentage of load to a given next hop and has no impact on actual link or flow bandwidth.
To show EVPN type-5 routes, run the net show bgp l2vpn evpn route type prefix command or the vtysh show bgp l2vpn evpn route type prefix command.
The bandwidth shows both in bytes per second (unsigned 32 bits) and in Gbps, Mbps, or Kbps. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show bgp l2vpn evpn route type prefix
BGP table version is 1, local router ID is 10.0.0.11
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
...
*> [5]:[0]:[32]:[192.168.10.1]
10.0.0.5 0 65100 65050 65200 i
RT:65050:104001 LB:65050:134217728 (1.000 Gbps) ET:8 Rmac:36:4f:15:ea:81:90
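The human-readable values in the sample outputs in this section are consistent with converting bytes per second to bits and dividing by binary multiples (1024-based units). The following Python sketch reproduces that arithmetic. The use of binary multiples is inferred from the samples only, and the function name and the bps fallback are hypothetical.

```python
def format_link_bandwidth(bytes_per_second: int) -> str:
    """Render a link bandwidth value the way the sample outputs show it."""
    bits = bytes_per_second * 8
    # Assumption inferred from the samples: units are powers of 1024.
    for unit, size in (("Gbps", 1024 ** 3), ("Mbps", 1024 ** 2), ("Kbps", 1024)):
        if bits >= size:
            return f"{bits / size:.3f} {unit}"
    return f"{bits} bps"


for value in (131072000, 65536000, 134217728):
    print(value, format_link_bandwidth(value))
```

The three values are taken from the LB fields shown in this section and render as 1000.000 Mbps, 500.000 Mbps, and 1.000 Gbps.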
To see weights associated with next hops for a route with multiple paths, run the net show route command or the vtysh show ip route command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show ip route 192.168.10.1/32
Routing entry for 192.168.10.1/32
Known via "bgp", distance 20, metric 0, best
Last update 00:00:32 ago
* fe80::202:ff:fe00:1b, via swp2, weight 66
* fe80::202:ff:fe00:15, via swp1, weight 33
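The weights 66 and 33 in this output are consistent with scaling each path's link bandwidth against the total of all paths and truncating to an integer between 1 and 100. The following Python sketch illustrates that arithmetic using the two LB values from the earlier show ip bgp output. The function name, the truncation, and the floor of 1 are assumptions for illustration, not the FRR implementation.

```python
from typing import Dict


def ucmp_weights(bandwidths: Dict[str, int]) -> Dict[str, int]:
    """Map per-path link bandwidth values to weights in the range 1 to 100."""
    total = sum(bandwidths.values())
    if total <= 0:
        raise ValueError("at least one path needs a positive link bandwidth")
    # Integer division truncates; max() keeps every path at weight 1 or more.
    return {nh: max(1, bw * 100 // total) for nh, bw in bandwidths.items()}


paths = {"swp2": 131072000, "swp1": 65536000}
print(ucmp_weights(paths))
```

Here swp2 carries two thirds of the total bandwidth and swp1 one third, which gives the weights 66 and 33.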
Considerations
UCMP with BGP link bandwidth is only available for BGP-learned routes.
Redistribute neighbor provides a way for IP subnets to span racks without forcing the end hosts to run a routing protocol by announcing individual host /32 routes in the routed fabric. Other hosts on the fabric can use this new path to access the hosts in the fabric. If ECMP is available, traffic can load balance across the available paths natively.
Hosts use ARP to resolve MAC addresses when sending to an IPv4 address. Each host then builds an ARP cache table of known MAC address and IPv4 address tuples as it receives or responds to ARP requests.
For a leaf switch, where hosts within the rack use the default gateway, the ARP cache table contains a list of all hosts that ARP for their default gateway. In most cases, this table contains all the layer 3 information necessary. Redistribute neighbor formats and synchronizes this table into the routing protocol.
The current implementation of redistribute neighbor:
Is not supported with EVPN. Enabling both redistribute neighbor and EVPN leads to unreachable IPv4 ARP and IPv6 neighbor entries.
Target Use Cases and Best Practices
You use redistribute neighbor in these configurations:
Virtualized clusters
Hosts with service IP addresses that migrate between racks
Hosts that are dual connected to two leaf nodes without using proprietary protocols such as MLAG
Anycast services that need dynamic advertisement from multiple hosts
Follow these guidelines:
You can connect a host to one or more leafs. Each leaf advertises the /32 it sees in its neighbor table.
Make sure that a host-bound bridge or VLAN is local to each switch.
Connect the leafs with redistribute neighbor directly to the hosts.
Make sure that IP addresses do not overlap, as the host IP addresses are directly advertised into the routed fabric.
Run redistribute neighbor on Linux-based hosts. NVIDIA does not test other host operating systems.
How Does Redistribute Neighbor Work?
Redistribute neighbor works as follows:
The leaf or ToR switch learns about connected hosts when the host sends an ARP request or ARP reply.
An entry for the host is added to the kernel neighbor table of each leaf.
The redistribute neighbor daemon (rdnbrd) monitors the kernel neighbor table and creates a /32 route for each neighbor entry. This /32 route is in kernel table 10.
A route map controls which routes to import from table 10.
FRR imports these routes as table routes.
You configure BGP or OSPF to redistribute the table 10 routes.
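The steps above can be sketched in Python as a small model. This is not how rdnbrd is implemented; it only mirrors the data flow from neighbor entries to /32 routes in table 10, followed by a filter that stands in for the route map. The function names, the second neighbor entry, and the 10.0.1.0/24 filter are hypothetical.

```python
import ipaddress

ROUTE_TABLE = 10  # kernel table that holds the host routes in this section


def kernel_table_routes(neighbors):
    """Create one /32 route per (ip, interface) neighbor entry."""
    routes = []
    for ip, dev in neighbors:
        prefix = ipaddress.ip_network(f"{ip}/32")
        routes.append((prefix, dev, ROUTE_TABLE))
    return routes


def import_routes(routes, permit):
    """Keep only the routes that the permit callable (route map stand-in) allows."""
    return [(str(prefix), dev, table) for prefix, dev, table in routes if permit(prefix)]


# 10.0.1.102 on swp2 and 172.16.0.5 on swp3 are made-up entries for illustration.
neighbor_table = [("10.0.1.101", "swp1"), ("10.0.1.102", "swp2"), ("172.16.0.5", "swp3")]
allowed = ipaddress.ip_network("10.0.1.0/24")

table_10 = kernel_table_routes(neighbor_table)
for prefix, dev, table in import_routes(table_10, lambda p: p.subnet_of(allowed)):
    print(f"{prefix} dev {dev} table {table}")
```

In this model, the 172.16.0.5 entry reaches table 10 but is not imported, because the filter permits only hosts in 10.0.1.0/24.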
Example Configuration
The following example configuration uses the following topology.
Configure the Leafs
Cumulus Linux does not provide NVUE commands for redistribute neighbor configuration.
Edit the /etc/network/interfaces file to configure the same IP address with a /32 prefix on both interfaces that face the host. In this example, swp1 and swp2 face server01 and server02:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.0.0.1/32
auto swp1
iface swp1
address 10.0.0.1/32
auto swp2
iface swp2
address 10.0.0.1/32
...
Enable the daemon to start at boot up, then start the daemon:
This document describes dual-connected Linux hosts with static IP addresses.
Configure a host with the same /32 IP address on its loopback and uplinks so that both leafs advertise the same /32 regardless of the interface. Cumulus Linux relies on ECMP to load balance across the interfaces southbound, and an equal cost static route (see the configuration below) to load balance northbound.
The loopback hosts the primary service IP address to which you can bind services.
Configure the loopback and physical interfaces. In the example topology above:
server01 connects to leaf01 through eth1 and to leaf02 through eth2.
lo, eth1, and eth2 use the loopback IP address.
The post-up arping command forces the host to ARP as soon as its interface comes up. This allows the leaf to learn about the host as soon as possible.
The post-up ip route commands install a default route through one or both leafs if both swp1 and swp2 are up.
Configuration
# The loopback network interface
auto lo
iface lo inet loopback
auto lo:1
iface lo:1
address 10.1.0.101/32
auto eth1
iface eth1
address 10.1.0.101/32
post-up for i in {1..3}; do arping -q -c 1 -w 0 -i eth1 10.0.0.11; sleep 1; done
post-up ip route add 0.0.0.0/0 nexthop via 10.0.0.11 dev eth1 onlink nexthop via 10.0.0.12 dev eth2 onlink || true
auto eth2
iface eth2
address 10.1.0.101/32
post-up for i in {1..3}; do arping -q -c 1 -w 0 -i eth2 10.0.0.12; sleep 1; done
post-up ip route add 0.0.0.0/0 nexthop via 10.0.0.11 dev eth1 onlink nexthop via 10.0.0.12 dev eth2 onlink || true
...
Install ifplugd
Install and use ifplugd, which modifies the behavior of the Linux routing table when an interface undergoes a link transition (carrier up/down). By default, the Linux kernel keeps routes up even when the physical interface is unavailable (NO-CARRIER).
After you install ifplugd, edit /etc/default/ifplugd as follows, where eth1 and eth2 are the interface names that your host uses to connect to the leafs.
For complete instructions to install ifplugd on Ubuntu, follow this guide.
Troubleshooting
Check if rdnbrd is Running
rdnbrd is the redistribute neighbor daemon. To check if the daemon is running, run the systemctl status rdnbrd.service command:
cumulus@leaf01:~$ systemctl status rdnbrd.service
* rdnbrd.service - Cumulus Linux Redistribute Neighbor Service
Loaded: loaded (/lib/systemd/system/rdnbrd.service; enabled)
Active: active (running) since Wed 2016-05-04 18:29:03 UTC; 1h 13min ago
Main PID: 1501 (python)
CGroup: /system.slice/rdnbrd.service
`-1501 /usr/bin/python /usr/sbin/rdnbrd -d
Change rdnbrd Configuration
To change the default configuration of rdnbrd, edit the /etc/rdnbrd.conf file, then run systemctl restart rdnbrd.service:
cumulus@leaf01:~$ sudo nano /etc/rdnbrd.conf
# syslog logging level CRITICAL, ERROR, WARNING, INFO, or DEBUG
loglevel = INFO
# TX an ARP request to known hosts every keepalive seconds
keepalive = 1
# If a host does not send an ARP reply for holdtime consider the host down
holdtime = 3
# Install /32 routes for each host into this table
route_table = 10
# Uncomment to enable ARP debugs on specific interfaces.
# Note that ARP debugs can be very chatty.
# debug_arp = swp1 swp2 swp3 br1
# If we already know the MAC for a host, unicast the ARP request. This is
# unusual for ARP (why ARP if you know the destination MAC) but we will be
# using ARP as a keepalive mechanism and do not want to broadcast so many ARPs
# if we do not have to. If a host cannot handle a unicasted ARP request, set
# the following option to False.
#
# Unicasting ARP requests is common practice (in some scenarios) for other
# networking operating systems so it is unlikely that you will need to set
# this to False.
unicast_arp_requests = True
cumulus@leaf01:~$ sudo systemctl restart rdnbrd.service
Set the Routing Table ID
The Linux kernel supports multiple routing tables and can use table IDs 0 through 255; however, it reserves tables 0, 253, 254, and 255, and uses 1 first. Therefore, rdnbrd only allows you to specify a value between 2 and 252. Cumulus Linux uses table ID 10; however, you can set the ID to any value between 2 and 252. You can see all the tables specified here:
cumulus@leaf01:~$ cat /etc/iproute2/rt_tables
#
# reserved values
#
255 local
254 main
253 default
0 unspec
#
# local
#
#1 inr.ruhep
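The allowed range for the route_table setting can be expressed as a simple check. The following Python sketch encodes the rule stated above; the function name is hypothetical, and the check is not taken from the rdnbrd source.

```python
RESERVED_TABLE_IDS = {0, 253, 254, 255}  # unspec, default, main, local
FIRST_USED_TABLE_ID = 1                  # the kernel uses table 1 first


def is_valid_rdnbrd_table(table_id: int) -> bool:
    """Return True if table_id is in the range that rdnbrd accepts (2 to 252)."""
    return 2 <= table_id <= 252


for candidate in (1, 10, 252, 254):
    print(candidate, is_valid_rdnbrd_table(candidate))
```

Table ID 10, the value used in this section, passes the check; the reserved IDs and table 1 do not.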
For BGP, run the vtysh show ip bgp neighbor <interface> advertised-routes command. For example:
cumulus@leaf01:~$ sudo vtysh -c 'show ip bgp neighbor swp51 advertised-routes'
BGP table version is 5, local router ID is 10.0.0.11
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*> 10.0.0.11/32 0.0.0.0 0 32768 i
*> 10.0.0.12/32 :: 0 65020 65012 i
*> 10.0.0.21/32 :: 0 65020 i
*> 10.0.0.22/32 :: 0 65020 i
Total number of prefixes 4
Verify the Kernel Routing Table
Use the following workflow to verify that the kernel routing table populates correctly and that routes import and advertise correctly:
Verify that ARP neighbor entries populate into kernel routing table 10.
cumulus@leaf01:~$ ip route show table 10
10.0.1.101 dev swp1 scope link
If these routes do not generate, verify that the rdnbrd daemon is running and check that the /etc/rdnbrd.conf file includes the correct table number.
Verify that routes import into FRR from the kernel routing table 10.
cumulus@leaf01:~$ sudo vtysh
leaf01# show ip route table
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, A - Babel, T - Table,
> - selected route, * - FIB route
T[10]>* 10.0.1.101/32 [19/0] is directly connected, swp1, 01:25:29
Both the > and * must be present so that table 10 routes install as preferred into the routing table. If the routes do not install, verify the imported distance of the locally imported kernel routes with the ip import 10 distance X command, where X is lower than the administrative distance of the routing protocol (the output above shows a distance of 19). A lower administrative distance is preferred, so if the imported distance is too high, routes learned from the protocol can override the locally imported routes. Also, verify that the routes are in the kernel routing table.
Confirm that routes are in the BGP/OSPF database and that they advertise.
leaf01# show ip bgp
Considerations
Route Scale
Redistribute neighbor adds each ARP entry as a /32 host route into the routing table of all switches within a summarization domain. Make sure the number of hosts plus fabric routes is under the allocated hardware LPM table size of the switch according to the forwarding resource profile in use.
Uneven Traffic Distribution
On most older distributions, Linux load balances using only the source layer 3 address.
Silent Hosts Never Receive Traffic
Sometimes, freshly provisioned hosts that have yet to send traffic do not ARP for their default gateways. The post-up arping command in the /etc/network/interfaces file on the host takes care of this. If the host does not ARP, then rdnbrd on the leaf does not learn about the host.
FRRouting
Cumulus Linux uses FRR to provide the routing protocols for dynamic routing and supports the following routing protocols:
The FRR suite consists of various protocol-specific daemons and a protocol-independent daemon called zebra. Each protocol-specific daemon is responsible for running the relevant protocol and building the routing table based on the information exchanged.
It is not uncommon to have more than one protocol daemon running at the same time. For example, at the edge of an enterprise, protocols internal to an enterprise such as OSPF run alongside the protocols that connect an enterprise to the rest of the world such as BGP.
zebra is the daemon that resolves the routes provided by multiple protocols (including the static routes you specify) and programs these routes in the Linux kernel using netlink. The FRRouting documentation defines zebra as the IP routing manager for FRR that provides kernel routing table updates, interface lookups, and redistribution of routes between different routing protocols.
Configure FRR
The information in this section does not apply if you use NVUE to configure your switch. NVUE manages FRR daemons and configuration automatically. These instructions apply only if you manage FRR directly through Linux flat file configurations.
If you do not configure your system using NVUE, FRR does not start by default in Cumulus Linux. Before you run FRR, make sure you have enabled the relevant daemons that you intend to use (bgpd, ospfd, ospf6d, pimd, or pbrd) in the /etc/frr/daemons file.
After you enable the appropriate daemons, enable and start the FRR service:
All the routing protocol daemons (bgpd, ospfd, ospf6d, ripd, ripngd, isisd and pimd) are dependent on zebra. When you start FRR, systemd determines whether zebra is running; if zebra is not running, systemd starts zebra, then starts the dependent service, such as bgpd.
If you restart a service, its dependent services also restart. For example, running systemctl restart frr.service restarts any of the enabled routing protocol daemons that are running.
If you need to restore the FRR configuration to the default running configuration, delete the frr.conf file and restart the frr service.
Back up frr.conf (or any configuration files you want to remove) before proceeding.
Confirm that service integrated-vtysh-config is running.
Restarting FRR restarts all the routing protocol daemons that you enable and are running. NVIDIA recommends that you reboot the switch instead of restarting the FRR service to minimize traffic impact when redundant switches are present with MLAG.
Interface IP Addresses and VRFs
FRR inherits the IP addresses and any associated routing tables for the network interfaces from the /etc/network/interfaces file. This is the recommended way to define the addresses; do not create interfaces using FRR. For more information, see Configure IP Addresses and Virtual Routing and Forwarding - VRF.
vtysh Modal CLI
FRR provides a command-line interface (CLI) called vtysh for configuring and displaying protocol state. To start the CLI, run the sudo vtysh command:
cumulus@switch:~$ sudo vtysh
Hello, this is FRRouting (version 0.99.23.1+cl3u2).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
switch#
FRR provides different modes to the CLI and certain commands are only available within a specific mode. Configuration is available with the configure terminal command:
switch# configure terminal
switch(config)#
The prompt displays the current CLI mode. For example, when you run the interface-specific commands, the prompt changes to:
switch(config)# interface swp1
switch(config-if)#
When you run the routing protocol specific commands, the prompt changes:
? displays the list of available top-level commands:
switch(config-if)# ?
bandwidth Set bandwidth informational parameter
description Interface specific description
end End current mode and change to enable mode
exit Exit current mode and down to previous mode
ip IP Information
ipv6 IPv6 Information
isis IS-IS commands
link-detect Enable link detection on interface
list Print command list
mpls-te MPLS-TE specific commands
multicast Set multicast flag to interface
no Negate a command or set its defaults
ptm-enable Enable neighbor check with specified topology
quit Exit current mode and down to previous mode
shutdown Shutdown the selected interface
?-based completion is also available to see the parameters that a command takes:
switch(config-if)# bandwidth ?
<1-10000000> Bandwidth in kilobits
switch(config-if)# ip ?
address Set the IP address of an interface
irdp Alter ICMP Router discovery preference this interface
ospf OSPF interface commands
rip Routing Information Protocol
router IP router interface commands
To search for specific vtysh commands so that you can identify the correct syntax to use, run the sudo vtysh -c 'find <term>' command. For example, to show only commands that include mlag:
cumulus@leaf01:mgmt:~$ sudo vtysh -c 'find mlag'
(view) show ip pim [mlag] vrf all interface [detail|WORD] [json]
(view) show ip pim [vrf NAME] interface [mlag] [detail|WORD] [json]
(view) show ip pim [vrf NAME] mlag upstream [A.B.C.D [A.B.C.D]] [json]
(view) show ip pim mlag summary [json]
(view) show ip pim vrf all mlag upstream [json]
(view) show zebra mlag
(enable) [no$no] debug zebra mlag
(enable) debug pim mlag
(enable) no debug pim mlag
(enable) test zebra mlag <none$none|primary$primary|secondary$secondary>
(enable) show ip pim [mlag] vrf all interface [detail|WORD] [json]
(enable) show ip pim [vrf NAME] interface [mlag] [detail|WORD] [json]
(enable) show ip pim [vrf NAME] mlag upstream [A.B.C.D [A.B.C.D]] [json]
(enable) show ip pim mlag summary [json]
(enable) show ip pim vrf all mlag upstream [json]
(enable) show zebra mlag
(config) [no$no] debug zebra mlag
(config) debug pim mlag
(config) ip pim mlag INTERFACE role [primary|secondary] state [up|down] addr A.B.C.D
(config) no debug pim mlag
(config) no ip pim mlag
You can display the state at any level, including the top level. For example, to see the routing table as seen by zebra:
switch# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, T - Table,
> - selected route, * - FIB route
B>* 0.0.0.0/0 [20/0] via fe80::4638:39ff:fe00:c, swp29, 00:11:57
* via fe80::4638:39ff:fe00:52, swp30, 00:11:57
B>* 10.0.0.1/32 [20/0] via fe80::4638:39ff:fe00:c, swp29, 00:11:57
* via fe80::4638:39ff:fe00:52, swp30, 00:11:57
B>* 10.0.0.11/32 [20/0] via fe80::4638:39ff:fe00:5b, swp1, 00:11:57
B>* 10.0.0.12/32 [20/0] via fe80::4638:39ff:fe00:2e, swp2, 00:11:58
B>* 10.0.0.13/32 [20/0] via fe80::4638:39ff:fe00:57, swp3, 00:11:59
B>* 10.0.0.14/32 [20/0] via fe80::4638:39ff:fe00:43, swp4, 00:11:59
C>* 10.0.0.21/32 is directly connected, lo
B>* 10.0.0.51/32 [20/0] via fe80::4638:39ff:fe00:c, swp29, 00:11:57
* via fe80::4638:39ff:fe00:52, swp30, 00:11:57
B>* 172.16.1.0/24 [20/0] via fe80::4638:39ff:fe00:5b, swp1, 00:11:57
* via fe80::4638:39ff:fe00:2e, swp2, 00:11:57
B>* 172.16.3.0/24 [20/0] via fe80::4638:39ff:fe00:57, swp3, 00:11:59
* via fe80::4638:39ff:fe00:43, swp4, 00:11:59
To run the same command at a config level, prepend do:
switch(config-router)# do show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, T - Table,
> - selected route, * - FIB route
B>* 0.0.0.0/0 [20/0] via fe80::4638:39ff:fe00:c, swp29, 00:05:17
* via fe80::4638:39ff:fe00:52, swp30, 00:05:17
B>* 10.0.0.1/32 [20/0] via fe80::4638:39ff:fe00:c, swp29, 00:05:17
* via fe80::4638:39ff:fe00:52, swp30, 00:05:17
B>* 10.0.0.11/32 [20/0] via fe80::4638:39ff:fe00:5b, swp1, 00:05:17
B>* 10.0.0.12/32 [20/0] via fe80::4638:39ff:fe00:2e, swp2, 00:05:18
B>* 10.0.0.13/32 [20/0] via fe80::4638:39ff:fe00:57, swp3, 00:05:18
B>* 10.0.0.14/32 [20/0] via fe80::4638:39ff:fe00:43, swp4, 00:05:18
C>* 10.0.0.21/32 is directly connected, lo
B>* 10.0.0.51/32 [20/0] via fe80::4638:39ff:fe00:c, swp29, 00:05:17
* via fe80::4638:39ff:fe00:52, swp30, 00:05:17
B>* 172.16.1.0/24 [20/0] via fe80::4638:39ff:fe00:5b, swp1, 00:05:17
* via fe80::4638:39ff:fe00:2e, swp2, 00:05:17
B>* 172.16.3.0/24 [20/0] via fe80::4638:39ff:fe00:57, swp3, 00:05:18
* via fe80::4638:39ff:fe00:43, swp4, 00:05:18
To run single commands with vtysh, use the -c option:
cumulus@switch:~$ sudo vtysh -c 'sh ip route'
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, A - Babel,
> - selected route, * - FIB route
K>* 0.0.0.0/0 via 192.168.0.2, eth0
C>* 192.0.2.11/24 is directly connected, swp1
C>* 192.0.2.12/24 is directly connected, swp2
B>* 203.0.113.30/24 [200/0] via 192.0.2.2, swp1, 11:05:10
B>* 203.0.113.31/24 [200/0] via 192.0.2.2, swp1, 11:05:10
B>* 203.0.113.32/24 [200/0] via 192.0.2.2, swp1, 11:05:10
C>* 127.0.0.0/8 is directly connected, lo
C>* 192.168.0.0/24 is directly connected, eth0
If you try to configure a routing protocol that is not running, vtysh ignores those commands.
Next Hop Tracking
Routing daemons track the validity of next hops through notifications from the zebra daemon. For example, when an interface goes down, zebra removes the associated connected route and sends a next hop tracking (NHT) notification to bgpd. FRR then uninstalls the BGP routes whose next hop resolved over that connected route.
The zebra daemon does not consider next hops that resolve to a default route as valid by default. You can configure NHT to consider a longest prefix match lookup for next hop addresses resolving to the default route as a valid next hop. The following example configures the default route to be valid for NHT in the default VRF:
cumulus@leaf01:~$ nv set vrf default router nexthop-tracking ipv4 resolved-via-default on
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# ip nht resolve-via-default
leaf01(config)# end
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$
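The effect of this setting can be illustrated with a small Python model of next hop resolution. This is a simplified sketch and not zebra's implementation: it performs a longest prefix match over a list of route prefixes and treats a match on the default route as valid only when the hypothetical resolve_via_default flag is set. The function name and the sample routes are assumptions.

```python
import ipaddress


def resolve_next_hop(next_hop, routes, resolve_via_default=False):
    """Return the prefix that resolves next_hop, or None if it is not valid."""
    addr = ipaddress.ip_address(next_hop)
    networks = [ipaddress.ip_network(route) for route in routes]
    # Keep only routes of the same address family that cover the next hop.
    matches = [net for net in networks if net.version == addr.version and addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)  # longest prefix match
    if best.prefixlen == 0 and not resolve_via_default:
        return None  # only the default route matches, which is not valid by default
    return str(best)


routes = ["0.0.0.0/0", "10.0.0.0/24"]
print(resolve_next_hop("10.0.0.5", routes))
print(resolve_next_hop("192.0.2.1", routes))
print(resolve_next_hop("192.0.2.1", routes, resolve_via_default=True))
```

In this model, 10.0.0.5 resolves through 10.0.0.0/24 in both cases, while 192.0.2.1 resolves only when the default route is accepted as a valid match.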
You can apply a route map to NHT for specific routing daemons to permit or deny routes from consideration as valid next hops. The following example applies ROUTEMAP1 to BGP, preventing NHT from considering next hops resolving to 10.0.0.0/8 in the default VRF as valid:
cumulus@leaf01:~$ nv set router policy prefix-list PREFIX1 type ipv4
cumulus@leaf01:~$ nv set router policy prefix-list PREFIX1 rule 1 match 10.0.0.0/8
cumulus@leaf01:~$ nv set router policy prefix-list PREFIX1 rule 1 action permit
cumulus@leaf01:~$ nv set router policy route-map ROUTEMAP1 rule 1 match ip-prefix-list PREFIX1
cumulus@leaf01:~$ nv set router policy route-map ROUTEMAP1 rule 1 action deny
cumulus@leaf01:~$ nv set router policy route-map ROUTEMAP1 rule 2 action permit
cumulus@leaf01:~$ nv set vrf default router nexthop-tracking ipv4 route-map ROUTEMAP1 protocol bgp
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# ip prefix-list PREFIX1 seq 1 permit 10.0.0.0/8
leaf01(config)# route-map ROUTEMAP1 deny 1
leaf01(config-route-map)# match ip address prefix-list PREFIX1
leaf01(config-route-map)# route-map ROUTEMAP1 permit 2
leaf01(config-route-map)# ip nht bgp route-map ROUTEMAP1
leaf01(config)# end
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$
You can show tracked next hops with the following NVUE commands:
nv show vrf <vrf> router nexthop-tracking ipv4
nv show vrf <vrf> router nexthop-tracking ipv4 <ip-address>
nv show vrf <vrf> router nexthop-tracking ipv6
nv show vrf <vrf> router nexthop-tracking ipv6 <ip-address>
You can also run the vtysh show ip nht vrf <vrf> <ip-address> command.
Reload the FRR Configuration
The information in this section does not apply if you use NVUE to configure your switch. NVUE manages FRR daemons and configuration automatically. These instructions apply only if you manage FRR directly through Linux flat file configurations.
If you make a change to your routing configuration, you need to reload FRR so that your changes take effect. FRR reload applies only the modifications you make to your FRR configuration, synchronizing the running state with the configuration in /etc/frr/frr.conf. This is useful for optimizing FRR automation in your environment or for applying changes made at runtime.
To reload your FRR configuration after you modify /etc/frr/frr.conf, run:
Examine the running configuration and verify that it matches the configuration in /etc/frr/frr.conf.
If the running configuration is not what you expect, submit a support request and supply the following information:
The current running configuration (run show running-config and output the contents to a file)
The contents of /etc/frr/frr.conf
The contents of /var/log/frr/frr-reload.log
FRR Logging
The information in this section does not apply if you use NVUE to configure your switch. NVUE manages FRR daemons and configuration automatically. These instructions are only applicable for users managing FRR directly through Linux flat file configurations.
By default, Cumulus Linux configures FRR with syslog severity level 6 (informational). Log output writes to the /var/log/frr/frr.log file.
To write debug messages to the log file, you must run the log syslog debug command to configure FRR with syslog severity 7 (debug); otherwise, when you issue a debug command such as debug bgp neighbor-events, no output goes to /var/log/frr/frr.log. However, when you manually define a log target with the log file /var/log/frr/debug.log command, FRR automatically defaults to severity 7 (debug) logging and the output logs to /var/log/frr/debug.log.
Considerations
Duplicate Hostnames
The switch can have two hostnames in the FRR configuration. For example:
cumulus@spine01:~$ sudo vtysh...
spine01# configure terminal
spine01(config)# hostname spine01-1
spine01-1(config)# do sh run
Building configuration...
Current configuration:
!
frr version 7.0+cl4u3
frr defaults datacenter
hostname spine01
hostname spine01-1
...
If you configure the same numbered BGP neighbor with both the neighbor x.x.x.x and neighbor swp# interface commands, two neighbor entries are present for the same IP address in the configuration. To correct this issue, update the configuration and restart the FRR service.
TCP Sockets and BGP Peering Sessions
The FRR startup configuration includes a setting for the maximum number of open files allowed. For BGP, open files include TCP sockets that BGP connections use. Either BGP speaker can start a BGP peering almost simultaneously; therefore, you can have two TCP sockets for a single BGP peer. These two sockets exist until the BGP protocol determines which socket to use, then the other socket closes.
The default setting of 1024 open files supports up to 512 BGP peering sessions. If you expect your network deployment to have more BGP peering sessions, you need to update this setting.
NVIDIA recommends you set the value to at least twice the maximum number of BGP peering sessions you expect.
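The sizing rule above can be expressed as a small calculation. This is a minimal sketch that assumes only BGP TCP sockets count toward the limit (FRR also opens other files, so treat the result as a lower bound); min_open_files is a hypothetical helper, not part of Cumulus Linux.

```python
def min_open_files(expected_peers: int, default_limit: int = 1024) -> int:
    """Smallest open files value that follows the 'twice the peers' guidance."""
    if expected_peers < 0:
        raise ValueError("expected_peers must not be negative")
    # Each peering can briefly hold two TCP sockets, so budget two per peer.
    needed = 2 * expected_peers
    # Never suggest a value below the default setting.
    return max(needed, default_limit)


print(min_open_files(512))   # the default already covers 512 sessions
print(min_open_files(2048))  # a larger deployment needs a higher limit
```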
To update the open files setting:
Edit the /lib/systemd/system/frr.service file and change the LimitNOFILE parameter. The following example sets the LimitNOFILE parameter to 4096.
You can use gRPC Network Management Interface (gNMI) to collect system resource, interface, and counter information from Cumulus Linux and export it to your own gNMI client.
Configure the gNMI Agent
The netq-agent package includes the gNMI agent, which is disabled by default. To enable the gNMI agent:
The gNMI agent listens on port 9339. You can change the default port if another application already uses it. The /etc/netq/netq.yml file stores the configuration.
Use the following commands to adjust the settings:
Restart the NetQ Agent to incorporate the configuration changes:
cumulus@switch:~$ netq config restart agent
The gNMI agent relies on the data it collects from the NVUE service. For complete data collection with gNMI, you must enable the NVUE service. To check the status of the nvued service, run the sudo systemctl status nvued.service command:
cumulus@switch:mgmt:~$ sudo systemctl status nvued.service
● nvued.service - NVIDIA User Experience Daemon
Loaded: loaded (/lib/systemd/system/nvued.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2023-03-09 20:00:17 UTC; 6 days ago
NVIDIA recommends that you collect data with both the gNMI and NetQ agents. However, if you do not want to collect data with both agents or you are not streaming data to NetQ, you can disable the NetQ agent. Cumulus Linux then sends data only to the gNMI agent.
You cannot disable both the NetQ and gNMI agents. If you enable both agents on Cumulus Linux and a NetQ server is unreachable, the switch does not send the data to gNMI from the following models:
openconfig-interfaces
openconfig-if-ethernet
openconfig-if-ethernet-ext
openconfig-system
nvidia-if-ethernet-ext
WJH, openconfig-platform, and openconfig-lldp data continue streaming to gNMI in this state. If you are only using gNMI and a NetQ telemetry server does not exist, disable the NetQ agent by setting opta-enable to false.
gNMI clients can also use the following model for extended Ethernet counters:
nvidia-if-ethernet-ext
module nvidia-if-ethernet-counters-ext {
// xPath --> /interfaces/interface[name=*]/ethernet/counters/state/
namespace "http://nvidia.com/yang/nvidia-ethernet-counters";
prefix "nvidia-if-ethernet-counters-ext";
// import some basic types
import openconfig-interfaces { prefix oc-if; }
import openconfig-if-ethernet { prefix oc-eth; }
import openconfig-yang-types { prefix oc-yang; }
revision "2021-10-12" {
description
"Initial revision";
reference "1.0.0.";
}
grouping ethernet-counters-ext {
leaf alignment-error {
type oc-yang:counter64;
}
leaf in-acl-drops {
type oc-yang:counter64;
}
leaf in-buffer-drops {
type oc-yang:counter64;
}
leaf in-dot3-frame-errors {
type oc-yang:counter64;
}
leaf in-dot3-length-errors {
type oc-yang:counter64;
}
leaf in-l3-drops {
type oc-yang:counter64;
}
leaf in-pfc0-packets {
type oc-yang:counter64;
}
leaf in-pfc1-packets {
type oc-yang:counter64;
}
leaf in-pfc2-packets {
type oc-yang:counter64;
}
leaf in-pfc3-packets {
type oc-yang:counter64;
}
leaf in-pfc4-packets {
type oc-yang:counter64;
}
leaf in-pfc5-packets {
type oc-yang:counter64;
}
leaf in-pfc6-packets {
type oc-yang:counter64;
}
leaf in-pfc7-packets {
type oc-yang:counter64;
}
leaf out-non-q-drops {
type oc-yang:counter64;
}
leaf out-pfc0-packets {
type oc-yang:counter64;
}
leaf out-pfc1-packets {
type oc-yang:counter64;
}
leaf out-pfc2-packets {
type oc-yang:counter64;
}
leaf out-pfc3-packets {
type oc-yang:counter64;
}
leaf out-pfc4-packets {
type oc-yang:counter64;
}
leaf out-pfc5-packets {
type oc-yang:counter64;
}
leaf out-pfc6-packets {
type oc-yang:counter64;
}
leaf out-pfc7-packets {
type oc-yang:counter64;
}
leaf out-q0-wred-drops {
type oc-yang:counter64;
}
leaf out-q1-wred-drops {
type oc-yang:counter64;
}
leaf out-q2-wred-drops {
type oc-yang:counter64;
}
leaf out-q3-wred-drops {
type oc-yang:counter64;
}
leaf out-q4-wred-drops {
type oc-yang:counter64;
}
leaf out-q5-wred-drops {
type oc-yang:counter64;
}
leaf out-q6-wred-drops {
type oc-yang:counter64;
}
leaf out-q7-wred-drops {
type oc-yang:counter64;
}
leaf out-q8-wred-drops {
type oc-yang:counter64;
}
leaf out-q9-wred-drops {
type oc-yang:counter64;
}
leaf out-q-drops {
type oc-yang:counter64;
}
leaf out-q-length {
type oc-yang:counter64;
}
leaf out-wred-drops {
type oc-yang:counter64;
}
leaf symbol-errors {
type oc-yang:counter64;
}
leaf out-tx-fifo-full {
type oc-yang:counter64;
}
}
augment "/oc-if:interfaces/oc-if:interface/oc-eth:ethernet/" +
"oc-eth:state/oc-eth:counters" {
uses ethernet-counters-ext;
}
}
Collect WJH Data with gNMI
You can export WJH data from the NetQ agent to your own gNMI client.
The client must use the following YANG model as a reference:
nvidia-if-wjh-drop-aggregate
module nvidia-wjh {
// Entrypoint /oc-if:interfaces/oc-if:interface
//
// xPath L1 --> interfaces/interface[name=*]/wjh/aggregate/l1
// xPath L2 --> /interfaces/interface[name=*]/wjh/aggregate/l2/reasons/reason[id=*][severity=*]
// xPath Router --> /interfaces/interface[name=*]/wjh/aggregate/router/reasons/reason[id=*][severity=*]
// xPath Tunnel --> /interfaces/interface[name=*]/wjh/aggregate/tunnel/reasons/reason[id=*][severity=*]
// xPath Buffer --> /interfaces/interface[name=*]/wjh/aggregate/buffer/reasons/reason[id=*][severity=*]
// xPath ACL --> /interfaces/interface[name=*]/wjh/aggregate/acl/reasons/reason[id=*][severity=*]
import openconfig-interfaces { prefix oc-if; }
namespace "http://nvidia.com/yang/what-just-happened-config";
prefix "nvidia-wjh";
revision "2021-10-12" {
description
"Initial revision";
reference "1.0.0.";
}
augment "/oc-if:interfaces/oc-if:interface" {
uses interfaces-wjh;
}
grouping interfaces-wjh {
description "Top-level grouping for What-just happened data.";
container wjh {
container aggregate {
container l1 {
container state {
leaf drop {
type string;
description "Drop list based on wjh-drop-types module encoded in JSON";
}
}
}
container l2 {
uses reason-drops;
}
container router {
uses reason-drops;
}
container tunnel {
uses reason-drops;
}
container acl {
uses reason-drops;
}
container buffer {
uses reason-drops;
}
}
}
}
grouping reason-drops {
container reasons {
list reason {
key "id severity";
leaf id {
type leafref {
path "../state/id";
}
description "reason ID";
}
leaf severity {
type leafref {
path "../state/severity";
}
description "Reason severity";
}
container state {
leaf id {
type uint32;
description "Reason ID";
}
leaf name {
type string;
description "Reason name";
}
leaf severity {
type string;
mandatory "true";
description "Reason severity";
}
leaf drop {
type string;
description "Drop list based on wjh-drop-types module encoded in JSON";
}
}
}
}
}
}
module wjh-drop-types {
namespace "http://nvidia.com/yang/what-just-happened-config-types";
prefix "wjh-drop-types";
container l1-aggregated {
uses l1-drops;
}
container l2-aggregated {
uses l2-drops;
}
container router-aggregated {
uses router-drops;
}
container tunnel-aggregated {
uses tunnel-drops;
}
container acl-aggregated {
uses acl-drops;
}
container buffer-aggregated {
uses buffer-drops;
}
grouping reason-key {
leaf id {
type uint32;
mandatory "true";
description "reason ID";
}
leaf severity {
type string;
mandatory "true";
description "Severity";
}
}
grouping reason_info {
leaf reason {
type string;
mandatory "true";
description "Reason name";
}
leaf drop_type {
type string;
mandatory "true";
description "reason drop type";
}
leaf ingress_port {
type string;
mandatory "true";
description "Ingress port name";
}
leaf ingress_lag {
type string;
description "Ingress LAG name";
}
leaf egress_port {
type string;
description "Egress port name";
}
leaf agg_count {
type uint64;
description "Aggregation count";
}
leaf severity {
type string;
description "Severity";
}
leaf first_timestamp {
type uint64;
description "First timestamp";
}
leaf end_timestamp {
type uint64;
description "End timestamp";
}
}
grouping packet_info {
leaf smac {
type string;
description "Source MAC";
}
leaf dmac {
type string;
description "Destination MAC";
}
leaf sip {
type string;
description "Source IP";
}
leaf dip {
type string;
description "Destination IP";
}
leaf proto {
type uint32;
description "Protocol";
}
leaf sport {
type uint32;
description "Source port";
}
leaf dport {
type uint32;
description "Destination port";
}
}
grouping l1-drops {
description "What-just happened drops.";
leaf ingress_port {
type string;
description "Ingress port";
}
leaf is_port_up {
type boolean;
description "Is port up";
}
leaf port_down_reason {
type string;
description "Port down reason";
}
leaf description {
type string;
description "Description";
}
leaf state_change_count {
type uint64;
description "State change count";
}
leaf symbol_error_count {
type uint64;
description "Symbol error count";
}
leaf crc_error_count {
type uint64;
description "CRC error count";
}
leaf first_timestamp {
type uint64;
description "First timestamp";
}
leaf end_timestamp {
type uint64;
description "End timestamp";
}
leaf timestamp {
type uint64;
description "Timestamp";
}
}
grouping l2-drops {
description "What-just happened drops.";
uses reason_info;
uses packet_info;
}
grouping router-drops {
description "What-just happened drops.";
uses reason_info;
uses packet_info;
}
grouping tunnel-drops {
description "What-just happened drops.";
uses reason_info;
uses packet_info;
}
grouping acl-drops {
description "What-just happened drops.";
uses reason_info;
uses packet_info;
leaf acl_rule_id {
type uint64;
description "ACL rule ID";
}
leaf acl_bind_point {
type uint32;
description "ACL bind point";
}
leaf acl_name {
type string;
description "ACL name";
}
leaf acl_rule {
type string;
description "ACL rule";
}
}
grouping buffer-drops {
description "What-just happened drops.";
uses reason_info;
uses packet_info;
leaf traffic_class {
type uint32;
description "Traffic Class";
}
leaf original_occupancy {
type uint32;
description "Original occupancy";
}
leaf original_latency {
type uint64;
description "Original latency";
}
}
}
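The xPath comments at the top of the nvidia-wjh module follow a regular pattern, which a client could build programmatically. The following Python sketch is illustrative only: wjh_path is a hypothetical helper, and it assumes every path starts with a leading slash (the L1 comment in the module omits it).

```python
# Drop categories that carry a reasons/reason list in the nvidia-wjh module.
REASON_CATEGORIES = ("l2", "router", "tunnel", "buffer", "acl")


def wjh_path(category: str, interface: str = "*",
             reason_id: str = "*", severity: str = "*") -> str:
    """Build a WJH aggregate path for one drop category."""
    base = f"/interfaces/interface[name={interface}]/wjh/aggregate/{category}"
    if category == "l1":
        # L1 data has a single state container and no reason list.
        return base
    if category not in REASON_CATEGORIES:
        raise ValueError(f"unknown WJH category: {category}")
    return f"{base}/reasons/reason[id={reason_id}][severity={severity}]"


print(wjh_path("l1", interface="swp1"))
print(wjh_path("router", reason_id="302"))
```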
Supported Features
The gNMI Agent supports Capabilities and STREAM subscribe requests for WJH events.
WJH Drop Reasons
The data that NetQ sends to the gNMI agent is in the form of WJH drop reasons. The SDK generates the drop reasons and Cumulus Linux stores them in the /usr/etc/wjh_lib_conf.xml file. Use this file as a guide to filter for specific reason types (L1, ACL, and so on), reason IDs, or event severities.
L1 Drop Reasons
Reason ID | Reason | Description
10021 | Port admin down | Validate port configuration
10022 | Auto-negotiation failure | Set port speed manually, disable auto-negotiation
10023 | Logical mismatch with peer link | Check cable or transceiver
10024 | Link training failure | Check cable or transceiver
10025 | Peer is sending remote faults | Replace cable or transceiver
10026 | Bad signal integrity | Replace cable or transceiver
10027 | Cable or transceiver is not supported | Use supported cable or transceiver
10028 | Cable or transceiver is unplugged | Plug cable or transceiver
10029 | Calibration failure | Check cable or transceiver
10030 | Cable or transceiver bad status | Check cable or transceiver
10031 | Other reason | Other L1 drop reason
L2 Drop Reasons
Reason ID | Reason | Severity | Description
201 | MLAG port isolation | Notice | Expected behavior
202 | Destination MAC is reserved (DMAC=01-80-C2-00-00-0x) | Error | Bad packet received from the peer
203 | VLAN tagging mismatch | Error | Validate the VLAN tag configuration on both ends of the link
204 | Ingress VLAN filtering | Error | Validate the VLAN membership configuration on both ends of the link
205 | Ingress spanning tree filter | Notice | Expected behavior
206 | Unicast MAC table action discard | Error | Validate MAC table for this destination MAC
207 | Multicast egress port list is empty | Warning | Validate why IGMP join or multicast router port does not exist
208 | Port loopback filter | Error | Validate MAC table for this destination MAC
209 | Source MAC is multicast | Error | Bad packet received from peer
210 | Source MAC equals destination MAC | Error | Bad packet received from peer
Router Drop Reasons
Reason ID | Reason | Severity | Description
301 | Non-routable packet | Notice | Expected behavior
302 | Blackhole route | Warning | Validate routing table for this destination IP
303 | Unresolved neighbor or next hop | Warning | Validate ARP table for the neighbor or next hop
304 | Blackhole ARP or neighbor | Warning | Validate ARP table for the next hop
305 | IPv6 destination in multicast scope FFx0:/16 | Notice | Expected behavior - packet is not routable
306 | IPv6 destination in multicast scope FFx1:/16 | Notice | Expected behavior - packet is not routable
307 | Non-IP packet | Notice | Destination MAC is the router, packet is not routable
308 | Unicast destination IP but multicast destination MAC | Error | Bad packet received from the peer
309 | Destination IP is loopback address | Error | Bad packet received from the peer
310 | Source IP is multicast | Error | Bad packet received from the peer
311 | Source IP is in class E | Error | Bad packet received from the peer
312 | Source IP is loopback address | Error | Bad packet received from the peer
313 | Source IP is unspecified | Error | Bad packet received from the peer
314 | Checksum or IPver or IPv4 IHL too short | Error | Bad cable or bad packet received from the peer
315 | Multicast MAC mismatch | Error | Bad packet received from the peer
316 | Source IP equals destination IP | Error | Bad packet received from the peer
317 | IPv4 source IP is limited broadcast | Error | Bad packet received from the peer
318 | IPv4 destination IP is local network (destination=0.0.0.0/8) | Error | Bad packet received from the peer
320 | Ingress router interface is disabled | Warning | Validate your configuration
321 | Egress router interface is disabled | Warning | Validate your configuration
323 | IPv4 routing table (LPM) unicast miss | Warning | Validate routing table for this destination IP
324 | IPv6 routing table (LPM) unicast miss | Warning | Validate routing table for this destination IP
325 | Router interface loopback | Warning | Validate the interface configuration
326 | Packet size is larger than router interface MTU | Warning | Validate the router interface MTU configuration
327 | TTL value is too small | Warning | Actual path is longer than the TTL
Tunnel Drop Reasons
Reason ID | Reason | Severity | Description
402 | Overlay switch - Source MAC is multicast | Error | The peer sent a bad packet
403 | Overlay switch - Source MAC equals destination MAC | Error | The peer sent a bad packet
404 | Decapsulation error | Error | The peer sent a bad packet
ACL Drop Reasons
Reason ID | Reason | Severity | Description
601 | Ingress port ACL | Notice | Validate ACL configuration
602 | Ingress router ACL | Notice | Validate ACL configuration
603 | Egress router ACL | Notice | Validate ACL configuration
604 | Egress port ACL | Notice | Validate ACL configuration
Buffer Drop Reasons
Reason ID | Reason | Severity | Description
503 | Tail drop | Warning | Monitor network congestion
504 | WRED | Warning | Monitor network congestion
505 | Port TC congestion threshold crossed | Notice | Monitor network congestion
506 | Packet latency threshold crossed | Notice | Monitor network congestion
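When filtering WJH events on the client side, the reason ID alone indicates the drop category. The following Python sketch infers the ranges from the tables above; the grouping by hundreds is an observation about the listed IDs, not a documented rule, and drop_category is a hypothetical helper.

```python
# Category by leading digit of the three-digit reason IDs in the tables above.
_CATEGORY_BY_HUNDREDS = {2: "L2", 3: "Router", 4: "Tunnel", 5: "Buffer", 6: "ACL"}


def drop_category(reason_id: int) -> str:
    """Map a WJH reason ID to the table it appears in."""
    if 10021 <= reason_id <= 10031:
        return "L1"
    if 200 <= reason_id < 700:
        return _CATEGORY_BY_HUNDREDS[reason_id // 100]
    raise ValueError(f"reason ID {reason_id} is outside the listed ranges")


for rid in (10025, 204, 302, 404, 504, 601):
    print(rid, drop_category(rid))
```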
gNMI Client Requests
You can use your gNMI client on a host to request capabilities and the data to which the agent subscribes. The examples below use the gNMIc client.
The following example shows a gNMIc STREAM request for WJH data:
BGP is the routing protocol that runs the Internet. It manages how packets get routed from network to network by exchanging routing and reachability information.
BGP is an increasingly popular protocol for use in the data center as it lends itself well to the rich interconnections in a Clos topology. RFC 7938 provides further details about using BGP in the data center.
How Does BGP Work?
BGP directs packets between autonomous systems (AS), each of which is a set of routers under a common administration.
Each router maintains a routing table that controls how the switch forwards packets. The BGP process on the router generates information in the routing table based on information coming from other routers and from information in the RIB. The RIB stores routes and continually updates the routing table as changes occur.
Autonomous System
BGP treats each independently managed enterprise and service provider as an autonomous system, responsible for a set of network addresses. Each such autonomous system has a unique number called an ASN. A central authority (ICANN) hands out ASNs, but numbers between 64512 and 65535 are for private use. When you use BGP within the data center, you must use either this number space or the single ASN you own.
The ASN is central to how BGP builds a forwarding topology. A BGP route advertisement carries with it not only the ASN of the originator, but also the list of ASNs that this route advertisement passes through. When forwarding a route advertisement, a BGP speaker adds itself to this list. The AS path includes the list of ASNs. BGP uses the AS path to detect and avoid loops.
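The AS path behavior described above can be sketched in a few lines of Python. This is a simplified model for illustration, not FRR code; advertise and accept are hypothetical helpers, and the ASNs are example values.

```python
def advertise(local_asn: int, as_path: list) -> list:
    """Return the AS path a speaker sends after adding its own ASN in front."""
    return [local_asn] + list(as_path)


def accept(local_asn: int, as_path: list) -> bool:
    """Reject an advertisement whose AS path already contains the local ASN."""
    return local_asn not in as_path


# leaf01 (65101) originates a route; spine01 (65199) forwards it onward.
path_from_leaf = advertise(65101, [])
path_from_spine = advertise(65199, path_from_leaf)

print(path_from_spine)                  # path as seen by spine01's neighbors
print(accept(65102, path_from_spine))   # another leaf accepts the route
print(accept(65101, path_from_spine))   # the originator detects the loop
```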
In a two-tier leaf and spine environment, you can use auto BGP to generate 32-bit ASNs automatically so that you do not have to think about which numbers to configure. Auto BGP helps build optimal ASN configurations in your data center to avoid suboptimal routing and path hunting, which occurs when you assign the wrong spine ASNs. Auto BGP makes no changes to standard BGP behavior or configuration.
Auto BGP assigns private ASNs in the range 4200000000 through 4294967294. This is the private space that RFC 6996 defines. Each leaf has a random and unique value in the range 4200000001 through 4294967294. Each spine has the value 4200000000, the first number in the range. For information about configuring auto BGP, refer to Basic BGP Configuration.
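As a quick check of the ranges above, the following Python sketch classifies an ASN by the role that auto BGP would associate with it. It only encodes the ranges stated in this section; auto_bgp_role is a hypothetical helper and does not reproduce how Cumulus Linux generates the leaf value.

```python
SPINE_ASN = 4200000000
# Leaf values run from 4200000001 through 4294967294 inclusive.
LEAF_ASNS = range(4200000001, 4294967294 + 1)


def auto_bgp_role(asn: int):
    """Return 'spine', 'leaf', or None if the ASN is outside the auto BGP range."""
    if asn == SPINE_ASN:
        return "spine"
    if asn in LEAF_ASNS:
        return "leaf"
    return None


print(auto_bgp_role(4200000000))
print(auto_bgp_role(4259632651))
print(auto_bgp_role(65101))
```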
Use auto BGP in new deployments to avoid conflicting ASNs in an existing configuration.
It is not necessary to use auto BGP across all switches in your configuration. For example, you can use auto BGP to configure one switch but set ASNs manually to other switches.
Use auto BGP in two-tier spine and leaf networks. Using auto BGP in three-tier networks with super spines can result in incorrect ASN assignments.
The leaf keyword generates the ASN based on a hash of the switch MAC address. The ASN assigned can change after a switch replacement.
You can configure auto BGP with NVUE.
eBGP and iBGP
When you use BGP to peer between autonomous systems, the peering is eBGP. When you use BGP within an autonomous system, the peering is iBGP. eBGP peers have different ASNs while iBGP peers have the same ASN.
The heart of the protocol is the same when used as eBGP or iBGP but there is a key difference in the protocol behavior between eBGP and iBGP. To prevent loops, an iBGP speaker does not forward routing information learned from one iBGP peer to another iBGP peer. eBGP prevents loops using the AS_Path attribute.
You need to peer all iBGP speakers with each other in a full mesh. In a large network, this requirement can become unscalable. The most popular method to scale iBGP networks is to introduce a route reflector.
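To see why the full mesh requirement scales poorly, compare the number of sessions it needs with a design that uses one route reflector. This Python sketch is a simplified count that assumes a single route reflector with every other speaker as its client; real deployments typically add redundant reflectors.

```python
def full_mesh_sessions(speakers: int) -> int:
    """Every speaker peers with every other speaker: n * (n - 1) / 2 sessions."""
    return speakers * (speakers - 1) // 2


def single_reflector_sessions(speakers: int) -> int:
    """One speaker acts as route reflector; each other speaker peers only with it."""
    return max(speakers - 1, 0)


for n in (4, 50, 200):
    print(n, full_mesh_sessions(n), single_reflector_sessions(n))
```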
BGP Path Selection
BGP is a path-vector routing algorithm that does not rely on a single routing metric to determine the lowest cost route, unlike IGPs like OSPF.
The BGP path selection algorithm looks at multiple factors to determine which path is best. Cumulus Linux enables BGP multipath by default so that multiple equal cost routes install in the routing table but only a single route advertises to BGP peers.
The order of the BGP algorithm process is as follows:
Highest Weight: Weight is a value from 0 to 65535. BGP does not carry the weight in an update but uses it locally to influence the best path selection. Locally generated routes have a weight of 32768.
Highest Local Preference: Only iBGP neighbors exchange local preference. BGP assigns routes from eBGP peers a local preference of 0. Whereas weight makes route selections without sending additional information to peers, local preference influences routing to iBGP peers.
Locally Originated Routes: Any route that the local switch places into BGP is the selected best. This includes static routes, aggregate routes and redistributed routes.
Shortest AS Path: BGP selects the path with the fewest number of ASN hops.
Origin Check: BGP prefers routes with an IGP origin (routes you place into BGP with a network statement) over routes with an incomplete origin (routes you place into BGP through redistribution).
Lowest MED: BGP sends the Multi-Exit Discriminator (MED) to eBGP peers to indicate a preference on how traffic enters an AS. BGP exchanges the MED from an eBGP peer with iBGP peers but resets it to a value of 0 before advertising a prefix to another AS.
eBGP Routes: The switch uses the route from an eBGP peer over a route learned from an iBGP peer.
Lowest IGP Cost to the Next Hop: The route with the lowest IGP metric to reach the BGP next hop.
iBGP ECMP over eBGP ECMP: If you configure BGP multipath, the switch uses equal iBGP routes over equal eBGP routes (unless you also configure as-path multipath-relax).
Oldest Route: The switch uses the oldest route in the BGP table.
Lowest Router ID: The switch uses the route from the peer with the lowest Router ID attribute. If the route is from a route reflector, the switch uses the ORIGINATOR_ID attribute for comparison.
Shortest Route Reflector Cluster List: If a route passes through multiple route reflectors, the switch uses the route with the shortest route reflector cluster list.
Highest Peer IP Address: The switch uses the route from the peer with the highest IP address.
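The ordered comparison above can be modeled as a sort key. The following Python sketch covers only the first eight steps (weight through IGP cost) and omits the multipath, route age, router ID, cluster list, and peer address tie-breakers. It also compares MED across all paths, which is a simplification. The Path class, its field names, and its default values are placeholders chosen for the sketch, not FRR data structures or FRR defaults.

```python
from dataclasses import dataclass

# Lower rank is preferred: IGP origin beats EGP, which beats incomplete.
ORIGIN_RANK = {"igp": 0, "egp": 1, "incomplete": 2}


@dataclass
class Path:
    peer: str
    weight: int = 0
    local_pref: int = 0
    locally_originated: bool = False
    as_path: tuple = ()
    origin: str = "igp"
    med: int = 0
    ebgp: bool = True
    igp_cost: int = 0


def preference_key(path: Path) -> tuple:
    """Tuple that sorts the most preferred path first."""
    return (
        -path.weight,                 # highest weight
        -path.local_pref,             # highest local preference
        not path.locally_originated,  # locally originated routes first
        len(path.as_path),            # shortest AS path
        ORIGIN_RANK[path.origin],     # IGP origin over incomplete
        path.med,                     # lowest MED
        not path.ebgp,                # eBGP over iBGP
        path.igp_cost,                # lowest IGP cost to the next hop
    )


def best_path(paths):
    return min(paths, key=preference_key)


a = Path(peer="swp51", as_path=(65199, 65102))
b = Path(peer="swp52", as_path=(65199,))
c = Path(peer="swp53", weight=100, as_path=(65199, 65102, 65103))

print(best_path([a, b]).peer)     # shorter AS path wins
print(best_path([a, b, c]).peer)  # higher weight wins before AS path length
```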
To see the reason Cumulus Linux selects one path over another, run the vtysh show ip bgp command or the net show bgp command.
When you use BGP multipath, if multiple paths are equal, BGP still selects a single best path to advertise to peers. This path shows as best with the reason, although BGP can install multiple paths into the routing table.
BGP Unnumbered
Historically, peers connect over IPv4 and TCP port 179, and after they establish a session, they exchange prefixes. When a BGP peer advertises an IPv4 prefix, it must include an IPv4 next hop address, which is the address of the advertising router. This requires each BGP peer to have an IPv4 address, which in a large network can consume a lot of address space and can require a separate IP address for each peer-facing interface.
The BGP unnumbered standard in RFC 5549 uses extended next hop encoding (ENHE) and does not require that you advertise an IPv4 prefix together with an IPv4 next hop. You can configure BGP peering between your Cumulus Linux switches and exchange IPv4 prefixes without having to configure an IPv4 address on each switch; BGP uses unnumbered interfaces.
The next hop address for each prefix is an IPv6 link-local address, which BGP assigns automatically to each interface. Using the IPv6 link-local address as a next hop instead of an IPv4 unicast address, BGP unnumbered saves you from having to configure IPv4 addresses on each interface.
When you use BGP unnumbered, BGP learns the prefixes, calculates the routes and installs them in IPv4 AFI to IPv6 AFI format. ENHE in Cumulus Linux does not install routes into the kernel in IPv4 prefix to IPv6 next hop format. For link-local peerings that you enable using IPv6 neighbor discovery router advertisements, BGP converts an IPv6 next hop into an IPv4 link-local address. It then installs a static neighbor entry for this IPv4 link-local address with the MAC address that it derives from the link-local address of the other end.
If you assign an IPv4 /30 or /31 IP address to the interface, BGP uses IPv4 peering instead of IPv6 link-local peering.
BGP unnumbered only works with two switches at a time (with point-to-point links).
The IPv6 implementation on the peering device uses the MAC address as the interface ID when assigning the IPv6 link-local address, as suggested by RFC 4291.
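The relationship between a MAC address and its IPv6 link-local address can be illustrated with the modified EUI-64 format from RFC 4291. This Python sketch is for illustration only and is not taken from the Cumulus Linux implementation; the helper names and the sample MAC address are assumptions.

```python
import ipaddress


def mac_to_link_local(mac: str) -> ipaddress.IPv6Address:
    """Derive the fe80::/64 address whose interface ID is built from the MAC."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02  # invert the universal/local bit
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    return ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + interface_id)


def link_local_to_mac(address: str) -> str:
    """Recover the MAC address from a modified EUI-64 link-local address."""
    interface_id = ipaddress.IPv6Address(address).packed[8:]
    if interface_id[3:5] != b"\xff\xfe":
        raise ValueError("interface ID is not in modified EUI-64 format")
    octets = bytearray(interface_id[:3] + interface_id[5:])
    octets[0] ^= 0x02  # restore the original universal/local bit
    return ":".join(f"{octet:02x}" for octet in octets)


link_local = mac_to_link_local("44:38:39:00:00:5c")
print(link_local)
print(link_local_to_mac(str(link_local)))
```

The reverse direction shows why a peer can derive the neighbor MAC address from the link-local address alone when the interface ID follows this format.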
Every router or end host must have an IPv4 address to complete a traceroute of IPv4 addresses. In this case, the IPv4 address used is that of the loopback device. Even if you do not use extended next hop encoding (ENHE) in the data center, link addresses are not typically advertised because they take up valuable FIB resources and expose an additional attack vector that intruders can use to either break in or launch DDoS attacks. Assigning an IP address to the loopback device is essential.
This section describes how to configure BGP using either BGP numbered or BGP unnumbered. With BGP unnumbered, you can set up BGP peering between your Cumulus Linux switches and exchange IPv4 prefixes without having to configure an IPv4 address on each switch.
BGP unnumbered simplifies configuration. NVIDIA recommends you use BGP unnumbered for data center deployments.
BGP Numbered
To configure BGP numbered on a BGP node, you need to:
Assign an ASN to identify this BGP node. In a two-tier leaf and spine configuration, you can use auto BGP, where Cumulus Linux assigns an ASN automatically.
If necessary, specify a router ID. NVUE automatically assigns the loopback address of the switch to be the router ID. FRR automatically assigns the router ID to be the loopback address or the highest IPv4 address for the interface. If you do not have a loopback address configured or want to use a specific router ID, set the router ID globally or per VRF.
Specify where to distribute routing information by providing the IP address and ASN of the neighbor.
For BGP numbered, this is the IP address of the interface between the two peers; the interface must be a layer 3 access port.
The ASN can be a number, or internal for a neighbor in the same AS or external for a neighbor in a different AS.
Specify which prefixes to originate from this BGP node.
Identify the BGP node by assigning an ASN.
To assign an ASN manually:
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
To use auto BGP to assign an ASN automatically on the leaf:
cumulus@leaf01:~$ nv set router bgp autonomous-system leaf
The auto BGP leaf keyword is only used to configure the ASN. The configuration files and nv show commands display the AS number.
BGP automatically assigns the loopback address of the switch to be the router ID. If you do not have a loopback address configured or you do not want to use the loopback address as the router ID, you must assign the router ID either globally with the nv set router bgp router-id command or in a VRF with the nv set vrf <vrf> router bgp router-id command.
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
Specify the BGP neighbor to which you want to distribute routing information.
For BGP to advertise IPv6 prefixes, you need to run an additional command to activate the BGP neighbor under the IPv6 address family. Cumulus Linux enables the IPv4 address family by default; you do not need to run the activate command for IPv4 route exchange.
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 2001:db8:0002::0a00:0002 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp neighbor 2001:db8:0002::0a00:0002 address-family ipv6-unicast enable on
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
To use auto BGP to assign an ASN automatically on the spine:
cumulus@spine01:~$ nv set router bgp autonomous-system spine
The auto BGP spine keyword is only used to configure the ASN. The configuration files and nv show commands display the AS number.
BGP automatically assigns the loopback address of the switch to be the router ID. If you do not have a loopback address configured or you do not want to use the loopback address as the router ID, you must assign the router ID either globally with the nv set router bgp router-id command or in a VRF with the nv set vrf <vrf> router bgp router-id command.
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
Specify the BGP neighbor to which you want to distribute routing information.
For BGP to advertise IPv6 prefixes, you need to run an additional command to activate the BGP neighbor under the IPv6 address family. Cumulus Linux enables the IPv4 address family by default; you do not need to run the activate command for IPv4 route exchange.
cumulus@spine01:~$ nv set vrf default router bgp neighbor 2001:db8:0002::0a00:1 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor 2001:db8:0002::0a00:1 address-family ipv6-unicast enable on
Identify the BGP node by assigning an ASN and, if necessary, the router ID.
BGP automatically assigns the router ID using the loopback address or the highest IPv4 address for the interface. If you want to assign a specific IPv4 address for the router ID, add the router ID globally or per VRF.
For BGP to advertise IPv6 prefixes, you need to run an additional command to activate the BGP neighbor under the IPv6 address family. Cumulus Linux enables the IPv4 address family by default; you do not need to run the activate command for IPv4 route exchange.
When using auto BGP, there are no references to leaf or spine in the configurations. Auto BGP determines the ASN for the system and configures it using standard vtysh commands.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
The following example commands show a basic BGP unnumbered configuration for two switches, leaf01 and spine01, which are eBGP peers.
The only difference between a BGP unnumbered configuration and the BGP numbered configuration shown above is that the BGP neighbor is an interface (instead of an IP address). You do not need to configure an IP address on the interface between the two peers on each side.
For BGP to advertise IPv6 prefixes, you need to run an additional command to activate the BGP neighbor under the IPv6 address family. Cumulus Linux enables the IPv4 address family by default; you do not need to run the activate command for IPv4 route exchange.
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv6-unicast enable on
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv6-unicast network 2001:db8::1/128
cumulus@leaf01:~$ nv config apply
After you run nv config save, the NVUE commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.101/32
cumulus@spine01:~$ nv config apply
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv6-unicast enable on
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv6-unicast network 2001:db8::101/128
cumulus@spine01:~$ nv config apply
After you run nv config save, the NVUE commands create the following configuration snippet in the /etc/nvue.d/startup.yaml file:
To verify that the switch can see its BGP neighbors, run the net show bgp summary command or the vtysh show ip bgp summary command:
cumulus@leaf01:mgmt:~$ net show bgp summary
show bgp ipv4 unicast summary
=============================
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
BGP table version 3
RIB entries 5, using 1000 bytes of memory
Peers 1, using 23 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
spine01(swp51) 4 65199 54 55 0 0 0 00:02:29 1 3
Total number of neighbors 1
...
cumulus@spine01:mgmt:~$ net show bgp summary
show bgp ipv4 unicast summary
=============================
BGP router identifier 10.10.10.101, local AS number 65199 vrf-id 0
BGP table version 3
RIB entries 5, using 1000 bytes of memory
Peers 1, using 23 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
leaf01(swp1) 4 65101 73 73 0 0 0 00:03:25 2 3
Total number of neighbors 1
...
To verify that you can see the prefixes of the other neighbor in the routing table, run the net show route bgp command or the vtysh show ip route command.
cumulus@leaf01:mgmt:~$ net show route bgp
RIB entry for bgp
=================
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
B>* 10.10.10.101/32 [20/0] via fe80::4638:39ff:fe00:1, swp51, weight 1, 00:00:51
cumulus@spine01:mgmt:~$ net show route bgp
RIB entry for bgp
=================
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued, r - rejected, b - backup
t - trapped, o - offload failure
B>* 10.1.10.0/24 [20/0] via fe80::4638:39ff:fe00:2, swp1, weight 1, 00:00:11
B>* 10.10.10.1/32 [20/0] via fe80::4638:39ff:fe00:2, swp1, weight 1, 00:00:11
This section describes optional configuration. The steps provided in this section assume that you already configured basic BGP as described in Basic BGP Configuration.
Peer Groups
Instead of specifying properties of each individual peer, you can define one or more peer groups and associate all the attributes common to that peer session to a peer group. You need to attach a peer to a peer group one time; it then inherits all address families activated for that peer group.
If the peer you want to add to a group already exists in the BGP configuration, delete it first, then add it to the peer group.
The following example commands create a peer group called SPINE that includes two external peers.
If you unset a peer group, make sure that it is not applied to any neighbors. If the peer group is applied to neighbors, configure all parameters, such as the remote AS, directly on the neighbors before removing the peer group.
BGP Dynamic Neighbors
BGP dynamic neighbors provide BGP peering to remote neighbors within a specified range of IPv4 or IPv6 addresses for a BGP peer group. You can configure each range as a subnet IP address.
After you configure the dynamic neighbors, a BGP speaker can listen for, and form peer relationships with, any neighbor that is in the IP address range and maps to a peer group. You can also limit the number of dynamic peers. The default value is 100.
The following example commands configure BGP peering to remote neighbors within the address range 10.0.1.0/24 for the peer group SPINE and limit the number of dynamic peers to 5.
The peer group must already exist; otherwise, the configuration does not apply.
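The matching logic described above can be sketched in Python. This is only an illustration of the concept, not how FRR implements it; the function name accept_dynamic_peer and its parameters are hypothetical.

```python
import ipaddress

def accept_dynamic_peer(peer_ip, listen_ranges, current_dynamic_peers, limit=100):
    """Return the peer group for peer_ip, or None if the peer is not accepted.

    listen_ranges maps a subnet string to a peer group name (hypothetical layout).
    limit mirrors the default of 100 dynamic peers mentioned in the text.
    """
    if current_dynamic_peers >= limit:
        return None  # dynamic peer limit reached
    addr = ipaddress.ip_address(peer_ip)
    for subnet, group in listen_ranges.items():
        net = ipaddress.ip_network(subnet)
        # Only compare addresses of the same IP version.
        if addr.version == net.version and addr in net:
            return group
    return None

ranges = {"10.0.1.0/24": "SPINE"}
print(accept_dynamic_peer("10.0.1.7", ranges, current_dynamic_peers=2, limit=5))
print(accept_dynamic_peer("10.0.2.7", ranges, current_dynamic_peers=2, limit=5))
```

The first call falls inside the listen range and maps to the SPINE peer group; the second address is outside the range, so no peering forms.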
The eBGP multihop option lets you use BGP to exchange routes with an external peer that is more than one hop away.
The following example command configures Cumulus Linux to establish a connection between two eBGP peers that are not directly connected and sets the maximum number of hops used to reach an eBGP peer to 1.
You can use the TTL security hop count option to prevent attacks against eBGP, such as denial of service (DoS) attacks.
By default, BGP messages to eBGP neighbors have an IP time-to-live (TTL) of 1, which requires the peer to be directly connected; otherwise, the packets expire along the way. You can adjust the TTL with the eBGP multihop option. An attacker can adjust the TTL of packets so that they look like they originate from a directly connected peer.
The BGP TTL security hops option inverts the direction in which BGP counts the TTL. Instead of accepting only packets with a TTL of 1, Cumulus Linux accepts BGP messages with a TTL greater than or equal to 255 minus the specified hop count.
When you use TTL security, you do not need eBGP multihop.
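As a rough illustration, the following Python sketch takes the rule stated above (accept a TTL greater than or equal to 255 minus the hop count) at face value. The function names and the 1 to 254 hop range are assumptions made for the example, not taken from this guide.

```python
MAX_TTL = 255  # largest value the 8-bit IP TTL field can hold

def min_accepted_ttl(hops):
    """Lowest TTL accepted for a given hop count, per the rule in the text."""
    if not 1 <= hops <= 254:  # assumed valid range for this sketch
        raise ValueError("hops must be between 1 and 254")
    return MAX_TTL - hops

def accepts(packet_ttl, hops):
    """True if a BGP packet arriving with packet_ttl passes the check."""
    return packet_ttl >= min_accepted_ttl(hops)

print(min_accepted_ttl(200))
print(accepts(255, 1), accepts(250, 1))
```

Because the sender uses the maximum TTL and every router in the path decrements it, a remote attacker cannot make a packet arrive with a TTL higher than the path length allows.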
The following command example sets the TTL security hop count value to 200:
When you configure ttl-security hops on a peer group instead of a specific neighbor, FRR does not add it to either the running configuration or to the /etc/frr/frr.conf file. To work around this issue, add ttl-security hops to individual neighbors instead of the peer group.
Enabling ttl-security hops does not program the hardware with relevant information. Cumulus Linux forwards frames to the CPU and then drops them. Use NVUE commands to explicitly add the relevant ACL entry to hardware. For more information about ACLs, see Netfilter - ACLs.
MD5-enabled BGP Neighbors
You can authenticate your BGP peer connection to prevent interference with your routing tables.
To enable MD5 authentication for BGP peers, set the same password on each peer.
The following example commands set the password mypassword on BGP peers leaf01 and spine01:
You can confirm the configuration with the vtysh show ip bgp neighbor <neighbor> command or the net show bgp neighbor <neighbor> command.
The following example shows that Cumulus Linux establishes a session with the peer. The output shows Peer Authentication Enabled towards the end.
cumulus@spine01:~$ sudo vtysh
...
spine01# show ip bgp neighbor swp1
BGP neighbor on swp1: fe80::2294:15ff:fe02:7bbf, remote AS 65101, local AS 65199, external link
Hostname: leaf01
BGP version 4, remote router ID 10.10.10.1, local router ID 10.10.10.101
BGP state = Established, up for 00:00:39
Last read 00:00:00, Last write 00:00:00
Hold time is 9, keepalive interval is 3 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
AddPath:
IPv4 Unicast: RX advertised IPv4 Unicast and received
Route refresh: advertised and received(old & new)
Address Family IPv4 Unicast: advertised and received
Hostname Capability: advertised (name: spine01,domain name: n/a) received (name: leaf01,domain name: n/a)
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Message statistics:
Inq depth is 0
Outq depth is 0
Sent Rcvd
Opens: 2 2
Notifications: 0 2
Updates: 424 369
Keepalives: 633 633
Route Refresh: 0 0
Capability: 0 0
Total: 1059 1006
Minimum time between advertisement runs is 0 seconds
For address family: IPv4 Unicast
Update group 1, subgroup 1
Packet Queue length 0
Community attribute sent to this neighbor(all)
3 accepted prefixes
Connections established 2; dropped 1
Last reset 00:02:37, Notification received (Cease/Other Configuration Change)
Local host: fe80::7c41:fff:fe93:b711, Local port: 45586
Foreign host: fe80::2294:15ff:fe02:7bbf, Foreign port: 179
Nexthop: 10.10.10.101
Nexthop global: fe80::7c41:fff:fe93:b711
Nexthop local: fe80::7c41:fff:fe93:b711
BGP connection: shared network
BGP Connect Retry Timer in Seconds: 10
Peer Authentication Enabled
Read thread: on Write thread: on FD used: 27
Cumulus Linux does not enforce the MD5 password configured against a BGP listen-range peer group (used to accept and create dynamic BGP neighbors) and accepts connections from peers that do not specify a password.
Remove Private BGP ASNs
If you use private ASNs in the data center, routes advertised to neighbors contain your private ASNs. The examples below show how to remove the private ASNs from routes and how to replace the private ASNs with your public ASN.
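To make the two behaviors concrete, here is a simplified Python sketch. It assumes the private ASN ranges reserved by RFC 6996 (64512 to 65534 and 4200000000 to 4294967294) and ignores the finer rules that a real BGP implementation applies; the function names are hypothetical.

```python
# Private-use ASN ranges (RFC 6996): 16-bit and 32-bit.
PRIVATE_RANGES = ((64512, 65534), (4200000000, 4294967294))

def is_private(asn):
    return any(low <= asn <= high for low, high in PRIVATE_RANGES)

def remove_private(as_path):
    """Drop every private ASN from the AS path (simplified)."""
    return [asn for asn in as_path if not is_private(asn)]

def replace_private(as_path, public_asn):
    """Replace every private ASN with the given public ASN (simplified)."""
    return [public_asn if is_private(asn) else asn for asn in as_path]

path = [65101, 65199, 64496]  # 64496 is a documentation ASN, not a private one
print(remove_private(path))
print(replace_private(path, 64497))
```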
The following example command removes private ASNs from routes advertised to the neighbor on swp51 (an unnumbered interface):
With the above configuration, the output of the vtysh show ip bgp vrf RED summary command and the net show bgp vrf RED summary command shows the local ASN as 65532.
cumulus@border01:mgmt:~$ sudo vtysh
...
border01# show ip bgp vrf RED summary
ipv4 unicast summary
BGP router identifier 10.10.10.63, local AS number 65532 vrf-id 35
BGP table version 1
RIB entries 1, using 192 bytes of memory
Peers 1, using 21 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
fw1(swp3) 4 65199 2015 2015 0 0 0 01:40:36 1 1
Total number of neighbors 1
...
The vtysh show ip bgp summary command and the net show bgp summary command display the global table, where the local ASN 65132 peers with spine01.
cumulus@border01:mgmt:~$ sudo vtysh
...
border01# show ip bgp summary
ipv4 unicast summary
BGP router identifier 10.10.10.63, local AS number 65132 vrf-id 0
BGP table version 3
RIB entries 5, using 960 bytes of memory
Peers 1, using 43 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
spine01(swp51) 4 65199 2223 2223 0 0 0 01:50:18 1 3
Total number of neighbors 1
...
BGP allowas-in
To prevent loops, the switch automatically discards BGP network prefixes if it sees its own ASN in the AS path. However, you can configure Cumulus Linux to receive and process routes even if it detects its own ASN in the AS path (allowas-in).
To enable allowas-in:
cumulus@switch:~$ nv set vrf default router bgp neighbor swp51 address-family ipv4-unicast aspath allow-my-asn enable on
cumulus@switch:~$ nv config apply
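The loop check that allowas-in relaxes can be pictured with a short Python sketch. This is a conceptual model only; the function accept_route and its allow_my_asn parameter are hypothetical names chosen for the example.

```python
def accept_route(as_path, local_asn, allow_my_asn=False):
    """Decide whether to keep a received route.

    By default a route whose AS path already contains the local ASN is
    treated as a loop and discarded; allow_my_asn models allowas-in.
    """
    if local_asn in as_path:
        return allow_my_asn
    return True

print(accept_route([65199, 65101], local_asn=65101))
print(accept_route([65199, 65101], local_asn=65101, allow_my_asn=True))
```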
You can configure BGP to use a specific IP address when exchanging BGP updates with a neighbor. For example, in a numbered BGP configuration, you can set the source IP address to be the loopback address of the switch.
BGP supports equal-cost multipathing (ECMP). If a BGP node hears a certain prefix from multiple peers, it has the information necessary to program the routing table and forward traffic for that prefix through all these peers. BGP typically chooses one best path for each prefix and installs that route in the forwarding table.
Cumulus Linux enables the BGP multipath option by default and sets the maximum number of paths to 64 so that the switch can install multiple equal-cost BGP paths to the forwarding table and load balance traffic across multiple links. You can change the number of paths allowed, according to your needs.
The example commands change the maximum number of paths to 120. You can set a value between 1 and 256. A value of 1 disables the BGP multipath option.
When you enable BGP multipath, Cumulus Linux load balances BGP routes from the same AS. If the routes go across several different AS neighbors, even if the AS path length is the same, they are not load balanced. To load balance between multiple paths received from different AS neighbors:
cumulus@switch:~$ nv set vrf default router bgp path-selection multipath aspath-ignore on
cumulus@switch:~$ nv config apply
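The effect of the aspath-ignore option can be sketched as follows. This Python example is a simplification that assumes all other best-path attributes already tie; AS paths are lists with the neighbor AS first, and the function name is hypothetical.

```python
def multipath_eligible(best_path, candidate, aspath_ignore=False):
    """Return True if candidate can share load with best_path.

    Both arguments are AS paths (lists of ASNs, neighbor AS first).
    """
    if len(best_path) != len(candidate):
        return False  # AS path lengths must match in either mode
    if aspath_ignore:
        return True   # equal length is enough
    # Default: the routes must come from the same neighbor AS.
    return best_path[:1] == candidate[:1]

via_spine01 = [65199, 65103]
via_spine02 = [65198, 65103]
print(multipath_eligible(via_spine01, via_spine02))
print(multipath_eligible(via_spine01, via_spine02, aspath_ignore=True))
```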
When you disable the bestpath as-path multipath-relax option, EVPN type-5 routes do not use the updated configuration. Type-5 routes continue to use all available ECMP paths in the underlay fabric, regardless of ASN.
Advertise IPv4 Prefixes with IPv6 Next Hops
RFC 5549 defines how BGP advertises IPv4 prefixes with IPv6 next hops. The RFC does not make a distinction between whether the IPv6 peering and next hop values must be global unicast addresses (GUA) or link-local addresses. Cumulus Linux supports advertising IPv4 prefixes with IPv6 global unicast and link-local next hop addresses, with either unnumbered or numbered BGP.
When BGP peering uses IPv6 global addresses, and BGP advertises and installs IPv4 prefixes, Cumulus Linux uses IPv6 route advertisements to derive the MAC address of the peer so that FRR can create an IPv4 route with a link-local IPv4 next hop address (defined by RFC 3927). FRR configures these route advertisement settings automatically upon receiving an update from a BGP peer that uses IPv6 global addresses with an IPv4 prefix and an IPv6 next hop, and after it negotiates the extended next hop capability.
To enable advertisement of IPv4 prefixes with IPv6 next hops over global IPv6 peerings, add the extended-nexthop capability to the global IPv6 neighbor statements on each end of the BGP sessions.
cumulus@switch:~$ nv set vrf default router bgp neighbor 2001:db8:0002::0a00:0002 capabilities extended-nexthop on
cumulus@switch:~$ nv config apply
Ensure that you have activated the IPv6 peers under the IPv4 unicast address family; otherwise, all peers activate in the IPv4 unicast address family by default. If you configure no bgp default ipv4-unicast, you need to activate the IPv6 neighbor under the IPv4 unicast address family as shown below:
cumulus@switch:~$ nv set vrf default router bgp neighbor 2001:db8:0002::0a00:0002 capabilities extended-nexthop on
cumulus@switch:~$ nv set vrf default router bgp neighbor 2001:db8:0002::0a00:0002 address-family ipv4-unicast enable on
cumulus@switch:~$ nv config apply
To protect against an internal network connectivity disruption caused by BGP, you can control the number of route announcements (prefixes) you want to receive from a BGP neighbor.
The following example commands set the maximum number of prefixes allowed from the BGP neighbor on swp51 to 3000:
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 address-family ipv4-unicast prefix-limits inbound maximum 3000
cumulus@leaf01:~$ nv config apply
The summary-only option ensures that BGP suppresses longer-prefixes inside the aggregate address before sending updates:
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast aggregate-route 10.1.0.0/16 summary-only on
cumulus@leaf01:~$ nv config apply
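The following Python sketch illustrates what summary-only means for the set of advertised prefixes. It is a conceptual model that assumes the aggregate is generated only when at least one more specific route exists; the function name routes_to_advertise is hypothetical.

```python
import ipaddress

def routes_to_advertise(routes, aggregate, summary_only=True):
    """Return the prefixes advertised after aggregation (simplified)."""
    agg = ipaddress.ip_network(aggregate)
    nets = [ipaddress.ip_network(r) for r in routes]
    # More specific routes that fall inside the aggregate.
    contributing = [
        n for n in nets
        if n != agg and n.version == agg.version and n.subnet_of(agg)
    ]
    if not contributing:
        return [str(n) for n in nets]  # nothing to aggregate
    kept = [n for n in nets if n not in contributing] if summary_only else nets
    result = [str(n) for n in kept]
    if str(agg) not in result:
        result.append(str(agg))
    return result

table = ["10.1.10.0/24", "10.1.20.0/24", "10.10.10.1/32"]
print(routes_to_advertise(table, "10.1.0.0/16"))
print(routes_to_advertise(table, "10.1.0.0/16", summary_only=False))
```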
Suppress Route Advertisement
You can configure BGP to wait for a response from the RIB indicating that the routes installed in the RIB are also installed in the ASIC before sending updates to peers.
cumulus@leaf01:~$ nv set router bgp wait-for-install on
cumulus@leaf01:~$ nv config apply
ISSU suppresses route advertisement automatically when upgrading or troubleshooting an active switch so that there is minimal disruption to the network.
BGP add-path
Cumulus Linux supports both BGP add-path RX and BGP add-path TX.
BGP add-path RX
BGP add-path RX enables BGP to receive multiple paths for the same prefix. A path identifier ensures that additional paths do not override previously advertised paths. Cumulus Linux enables BGP add-path RX by default; you do not need to perform additional configuration.
To view the existing capabilities, run the vtysh show ip bgp neighbors command. You can see the existing capabilities in the subsection Add Path, below Neighbor capabilities.
The following example output shows that BGP can send and receive additional BGP paths, and that the BGP neighbor on swp51 supports both.
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip bgp neighbors
BGP neighbor on swp51: fe80::7c41:fff:fe93:b711, remote AS 65199, local AS 65101, external link
Hostname: spine01
BGP version 4, remote router ID 10.10.10.101, local router ID 10.10.10.1
BGP state = Established, up for 1d12h39m
Last read 00:00:03, Last write 00:00:01
Hold time is 9, keepalive interval is 3 seconds
Neighbor capabilities:
4 Byte AS: advertised and received
AddPath:
IPv4 Unicast: RX advertised IPv4 Unicast and received
Extended nexthop: advertised and received
Address families by peer:
IPv4 Unicast
Route refresh: advertised and received(old & new)
Address Family IPv4 Unicast: advertised and received
Hostname Capability: advertised (name: leaf01,domain name: n/a) received (name: spine01,domain name: n/a)
Graceful Restart Capability: advertised and received
...
To view the current additional paths, run the vtysh show ip bgp <prefix> command or the net show bgp <prefix> command. The example output shows that the TX node adds an additional path for receiving. Each path has a unique AddPath ID.
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ip bgp 10.10.10.9
BGP routing table entry for 10.10.10.9/32
Paths: (2 available, best #1, table Default-IP-Routing-Table)
Advertised to non peer-group peers:
spine01(swp51) spine02(swp52)
65020 65012
fe80::4638:39ff:fe00:5c from spine01(swp51) (10.10.10.12)
(fe80::4638:39ff:fe00:5c) (used)
Origin incomplete, localpref 100, valid, external, multipath, bestpath-from-AS 65020, best (Older Path)
AddPath ID: RX 0, TX 6
Last update: Wed Nov 16 22:47:00 2016
65020 65012
fe80::4638:39ff:fe00:2b from spine02(swp52) (10.10.10.12)
(fe80::4638:39ff:fe00:2b) (used)
Origin incomplete, localpref 100, valid, external, multipath
AddPath ID: RX 0, TX 3
Last update: Fri Oct 2 03:56:33 2020
BGP add-path TX
BGP add-path TX enables BGP to advertise more than just the best path for a prefix. Cumulus Linux includes two options:
addpath-tx-all-paths advertises all known paths to a neighbor.
addpath-tx-bestpath-per-AS advertises only the best path learned from each AS to a neighbor.
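The difference between the two options can be sketched in Python. The example below is deliberately simplified: it treats the shortest AS path as the best path for each neighbor AS, which is only one step of the real best-path algorithm, and the mode strings are hypothetical labels.

```python
def paths_to_advertise(paths, mode):
    """Pick which paths for a prefix go to a neighbor.

    paths: list of AS paths (lists of ASNs, neighbor AS first).
    mode:  "all-paths" or "bestpath-per-as" (labels used only in this sketch).
    """
    if mode == "all-paths":
        return list(paths)
    if mode == "bestpath-per-as":
        best = {}
        for path in paths:
            neighbor_as = path[0]
            # Simplification: shorter AS path wins; the first path seen wins ties.
            if neighbor_as not in best or len(path) < len(best[neighbor_as]):
                best[neighbor_as] = path
        return list(best.values())
    raise ValueError("unknown mode: " + mode)

known = [[65199, 65103], [65198, 65103], [65199, 65150, 65103]]
print(paths_to_advertise(known, "all-paths"))
print(paths_to_advertise(known, "bestpath-per-as"))
```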
The following example commands configure leaf01 to advertise the best path learned from each AS to the BGP neighbor on swp50:
The following example configuration shows how BGP add-path TX advertises the best path learned from each AS.
In this configuration:
Every leaf and every spine has a different ASN
eBGP is configured between:
leaf01 and spine01, spine02
leaf03 and spine01, spine02
leaf01 and leaf02 (leaf02 only has a single peer, which is leaf01)
leaf01 is configured to advertise the best path learned from each AS to BGP neighbor leaf02
leaf03 generates a loopback IP address (10.10.10.3/32) into BGP with a network statement
When you run the show ip bgp 10.10.10.3/32 command or the net show bgp 10.10.10.3/32 command on leaf02, the command output shows the leaf03 loopback IP address and two BGP paths, both from leaf01:
cumulus@leaf02:mgmt:~$ sudo vtysh
...
leaf02# show ip bgp 10.10.10.3/32
BGP routing table entry for 10.10.10.3/32
Paths: (2 available, best #2, table default)
Advertised to non peer-group peers:
leaf01(swp50)
65101 65199 65103
fe80::4638:39ff:fe00:13 from leaf01(swp50) (10.10.10.1)
(fe80::4638:39ff:fe00:13) (used)
Origin IGP, valid, external
AddPath ID: RX 4, TX-All 0 TX-Best-Per-AS 0
Last update: Thu Oct 15 18:31:46 2020
65101 65198 65103
fe80::4638:39ff:fe00:13 from leaf01(swp50) (10.10.10.1)
(fe80::4638:39ff:fe00:13) (used)
Origin IGP, valid, external, bestpath-from-AS 65101, best (Nothing left to compare)
AddPath ID: RX 3, TX-All 0 TX-Best-Per-AS 0
Last update: Thu Oct 15 18:31:46 2020
Conditional Advertisement
Routes are typically propagated even if a different path exists. The BGP conditional advertisement feature lets you advertise certain routes only if other routes either do or do not exist.
This feature is typically used in multihomed networks where BGP advertises some prefixes to one of the providers only if information from the other provider is not present. For example, a multihomed router can use conditional advertisement to choose which upstream provider learns about the routes it provides so that it can influence which provider handles traffic destined for the downstream router. This is useful for cost of service, latency, or other policy requirements that are not natively accounted for in BGP.
Conditional advertisement uses the non-exist-map or the exist-map and the advertise-map keywords to track routes by route prefix.
You configure the BGP neighbors to use the route maps.
Use caution when configuring conditional advertisement on a large number of BGP neighbors. Cumulus Linux scans the entire RIB table every 60 seconds by default; depending on the number of routes in the RIB, this can result in longer processing times. NVIDIA does not recommend that you configure conditional advertisement on more than 50 neighbors.
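The decision that conditional advertisement makes on each scan can be sketched in Python. This is a conceptual model that assumes exact prefix matches; the function should_advertise and its mode values are hypothetical names for the exist-map and non-exist-map cases.

```python
import ipaddress

def should_advertise(rib_prefixes, condition_prefixes, mode="exist"):
    """Return True if the routes in the advertise map should be sent.

    mode "exist":     advertise only if a condition prefix is in the RIB.
    mode "non-exist": advertise only if no condition prefix is in the RIB.
    """
    rib = {ipaddress.ip_network(p) for p in rib_prefixes}
    matched = any(ipaddress.ip_network(p) in rib for p in condition_prefixes)
    if mode == "exist":
        return matched
    if mode == "non-exist":
        return not matched
    raise ValueError("unknown mode: " + mode)

rib = ["10.0.0.0/24", "10.10.10.1/32"]
print(should_advertise(rib, ["10.0.0.0/24"], mode="exist"))
print(should_advertise(rib, ["10.0.0.0/24"], mode="non-exist"))
```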
The following example commands configure the switch to send a 10.0.0.0/8 summary route only if the 10.0.0.0/24 route exists in the routing table. The commands perform the following configuration:
Enable the conditional advertisement option.
Create a prefix list called EXIST with the route 10.0.0.0/24.
Create a route map called EXISTMAP that uses the prefix list EXIST. You must provide the route map match type (ipv4 or ipv6).
Create a prefix list called ADVERTISE with the route to advertise (10.0.0.0/8).
Create a route map called ADVERTISEMAP that uses the prefix list ADVERTISE. You must provide the route map match type (ipv4 or ipv6).
Configure BGP neighbor swp51 to use the route maps.
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 address-family ipv4-unicast conditional-advertise enable on
cumulus@leaf01:~$ nv set router policy prefix-list EXIST rule 10 match 10.0.0.0/24
cumulus@leaf01:~$ nv set router policy prefix-list EXIST rule 10 action permit
cumulus@leaf01:~$ nv set router policy route-map EXISTMAP rule 10 match type ipv4
cumulus@leaf01:~$ nv set router policy route-map EXISTMAP rule 10 action permit
cumulus@leaf01:~$ nv set router policy route-map EXISTMAP rule 10 match ip-prefix-list EXIST
cumulus@leaf01:~$ nv set router policy prefix-list ADVERTISE rule 10 action permit
cumulus@leaf01:~$ nv set router policy prefix-list ADVERTISE rule 10 match 10.0.0.0/8
cumulus@leaf01:~$ nv set router policy route-map ADVERTISEMAP rule 10 match type ipv4
cumulus@leaf01:~$ nv set router policy route-map ADVERTISEMAP rule 10 action permit
cumulus@leaf01:~$ nv set router policy route-map ADVERTISEMAP rule 10 match ip-prefix-list ADVERTISE
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 address-family ipv4-unicast conditional-advertise advertise-map ADVERTISEMAP
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 address-family ipv4-unicast conditional-advertise exist-map EXISTMAP
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
...
leaf01# configure terminal
leaf01(config)# ip prefix-list EXIST seq 10 permit 10.0.0.0/24
leaf01(config)# route-map EXISTMAP permit 10
leaf01(config-route-map)# match ip address prefix-list EXIST
leaf01(config-route-map)# exit
leaf01(config)# ip prefix-list ADVERTISE seq 10 permit 10.0.0.0/8
leaf01(config)# route-map ADVERTISEMAP permit 10
leaf01(config-route-map)# match ip address prefix-list ADVERTISE
leaf01(config-route-map)# exit
leaf01(config)# router bgp
leaf01(config-router)# neighbor swp51 advertise-map ADVERTISEMAP exist-map EXISTMAP
leaf01(config-router)# end
leaf01# write memory
leaf01# exit
The commands save the configuration in the /etc/frr/frr.conf file. For example:
cumulus@leaf01:~$ sudo cat /etc/frr/frr.conf
...
neighbor swp51 activate
neighbor swp51 advertise-map ADVERTISEMAP exist-map EXISTMAP
...
ip prefix-list ADVERTISE seq 10 permit 10.0.0.0/8
ip prefix-list EXIST seq 10 permit 10.0.0.0/24
route-map ADVERTISEMAP permit 10
match ip address prefix-list ADVERTISE
route-map EXISTMAP permit 10
match ip address prefix-list EXIST
Cumulus Linux scans the entire RIB table every 60 seconds. You can set the conditional advertisement timer to increase or decrease how often you want Cumulus Linux to scan the RIB table. You can set a value between 5 and 240 seconds.
A lower value (such as 5) increases the amount of processing needed. Use caution when configuring conditional advertisement on a large number of BGP neighbors.
By default, next hop tracking does not resolve next hops through the default route. If you want BGP to peer across the default route, run the vtysh ip nht resolve-via-default command.
The following example command configures BGP to peer across the default route from the default VRF.
The following example command configures BGP to peer across the default route from VRF BLUE:
cumulus@leaf01:~$ sudo vtysh
leaf01# configure terminal
leaf01(config)# vrf BLUE
leaf01(config-vrf)# ip nht resolve-via-default
leaf01(config-vrf)# end
leaf01# write memory
leaf01# exit
cumulus@leaf01:~$
BGP Timers
BGP includes several timers that you can configure.
Keepalive Interval and Hold Time
By default, BGP exchanges periodic keepalive messages to measure and ensure that a peer is still alive and functioning. If BGP does not receive a keepalive or update message from the peer within the hold time, it declares the peer down and withdraws all routes received from this peer from the local BGP table. By default, the keepalive interval is 3 seconds and the hold time is 9 seconds. To decrease CPU load when there are many neighbors, you can increase the values of these timers or disable the exchange of keepalives. When manually configuring new values, the keepalive interval can be less than or equal to one third of the hold time, but cannot be less than 1 second. Setting the keepalive and hold time values to 0 disables the exchange of keepalives.
The following example commands set the keepalive interval to 10 seconds and the hold time to 30 seconds.
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 timers keepalive 10
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 timers hold 30
cumulus@leaf01:~$ nv config apply
To set the timers back to the default values, run the nv unset vrf <vrf> router bgp neighbor <interface> timers keepalive and the nv unset vrf <vrf> router bgp neighbor <interface> timers hold commands.
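The timer constraints described in this section can be checked with a small Python sketch. It encodes only the rules stated above; the function name timers_valid is hypothetical and the switch performs its own validation.

```python
def timers_valid(keepalive, hold):
    """Check a keepalive and hold time pair (both in seconds)."""
    if keepalive == 0 and hold == 0:
        return True  # both zero: keepalive exchange is disabled
    if keepalive < 1:
        return False  # keepalive cannot be less than 1 second
    # Keepalive can be at most one third of the hold time.
    return keepalive * 3 <= hold

print(timers_valid(3, 9))    # the default values
print(timers_valid(10, 30))  # the example values
print(timers_valid(10, 20))
```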
By default, the BGP process attempts to connect to a peer after a failure (or on startup) every 10 seconds. You can change this value to suit your needs.
The following example commands set the reconnect value to 30 seconds:
After making a new best path decision for a prefix, BGP can insert a delay before advertising the new results to a peer. This delay rate-limits the number of changes advertised to downstream peers and lowers processing requirements by slowing down convergence. By default, this interval is 0 seconds for both eBGP and iBGP sessions, which allows for fast convergence. For more information about the advertisement interval, see this IETF draft.
The following example commands set the advertisement interval to 5 seconds:
iBGP rules state that BGP cannot send a route learned from an iBGP peer to another iBGP peer. In a data center spine and leaf network using iBGP, this prevents a spine from sending a route learned from a leaf to any other leaf. As a workaround, you can use a route reflector. When an iBGP speaker is a route reflector, it can send iBGP learned routes to other iBGP peers.
In the following example, spine01 is acting as a route reflector. The leaf switches, leaf01, leaf02 and leaf03 are route reflector clients. BGP sends any route that spine01 learns from a route reflector client to other route reflector clients.
To configure the BGP node as a route reflector for a BGP peer, set the neighbor route-reflector-client option. The following example sets spine01 shown in the illustration above to be a route reflector for leaf01 (on swp1), which is a route reflector client. You do not have to configure the client.
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 address-family ipv4-unicast route-reflector-client on
cumulus@spine01:~$ nv config apply
When you configure BGP for IPv6, you must run the route-reflector-client command after the activate command.
You can only configure a BGP node as a route reflector for an iBGP peer.
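The route reflection behavior described in this section can be modeled with a short Python sketch. It follows the general rules of RFC 4456 in simplified form (no cluster list or originator ID handling); the function name and the peer role labels are hypothetical.

```python
def reflect_targets(source, peers):
    """Return the peers that receive a route learned from source.

    peers maps a peer name to "client", "non-client" (plain iBGP peer),
    or "ebgp". Labels are used only in this sketch.
    """
    source_role = peers[source]
    targets = []
    for name, role in peers.items():
        if name == source:
            continue  # never send the route back to its sender
        if source_role == "non-client" and role == "non-client":
            continue  # the standard iBGP rule still applies here
        targets.append(name)
    return targets

peers = {"leaf01": "client", "leaf02": "client", "leaf03": "client"}
print(reflect_targets("leaf01", peers))
```

With all three leaf switches configured as clients, a route that spine01 learns from leaf01 goes to leaf02 and leaf03.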
Administrative Distance
Cumulus Linux uses the administrative distance to choose which routing protocol to use when two different protocols provide route information for the same destination. The smaller the distance, the more reliable the protocol. For example, if the switch receives a route from OSPF with an administrative distance of 110 and the same route from BGP with an administrative distance of 100, the switch chooses BGP.
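The selection rule reduces to picking the smallest distance, as this minimal Python sketch shows. The distances are the illustrative values from the paragraph above, and the function name is hypothetical.

```python
def select_route(candidates):
    """Return the protocol whose route wins.

    candidates maps a protocol name to its administrative distance
    for the same destination; the smallest distance wins.
    """
    return min(candidates, key=candidates.get)

print(select_route({"ospf": 110, "bgp": 100}))
```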
The following example commands set the administrative distance for external routes to 150 and internal routes to 110:
You can shut down all active BGP sessions with a neighbor and remove all associated routing information without removing its associated configuration. When shut down, the neighbor goes into an administratively idle state.
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 shutdown on
cumulus@leaf01:~$ nv config apply
To bring BGP sessions with the neighbor back up, run the NVUE nv set vrf default router bgp neighbor swp51 shutdown off command or the vtysh no neighbor swp51 shutdown command.
Graceful BGP Shutdown
To reduce packet loss during planned maintenance of a router or link, you can configure graceful BGP shutdown, which forces traffic to route around the BGP node:
cumulus@leaf01:~$ nv set router bgp graceful-shutdown on
cumulus@leaf01:~$ nv config apply
To disable graceful shutdown:
cumulus@leaf01:~$ nv set router bgp graceful-shutdown off
cumulus@leaf01:~$ nv config apply
When you enable graceful BGP shutdown, Cumulus Linux adds the graceful-shutdown community to all inbound and outbound routes from eBGP peers and sets the local-pref for those routes to 0 (refer to RFC 8326). To see the configuration, run the vtysh show ip bgp <route> command or the net show bgp <route> command. For example:
cumulus@leaf01:~$ sudo vtysh
leaf01# show ip bgp 10.10.10.0/24
BGP routing table entry for 10.10.10.0/24
Paths: (2 available, best #1, table Default-IP-Routing-Table)
Advertised to non peer-group peers:
bottom0(10.10.10.2)
30 20
10.10.10.2 (metric 10) from top1(10.10.10.2) (10.10.10.2)
Origin IGP, localpref 100, valid, internal, bestpath-from-AS 30, best
Community: 99:1
AddPath ID: RX 0, TX 52
Last update: Mon Sep 18 17:01:18 2017
20
10.10.10.3 from bottom0(10.10.10.32) (10.10.10.10)
Origin IGP, metric 0, localpref 0, valid, external, bestpath-from-AS 20
Community: 99:1 graceful-shutdown
AddPath ID: RX 0, TX 2
Last update: Mon Sep 18 17:01:18 2017
As an optional configuration, you can create a route map to prepend the AS so that the reduced preference propagates to other parts of the network through a longer AS path.
cumulus@spine01:~$ sudo vtysh
...
spine01# show ip bgp 10.10.10.1/32
BGP routing table entry for 10.10.10.1/32
Paths: (1 available, best #1, table default)
Advertised to non peer-group peers:
65101 65101
10.10.10.1 from leaf01(10.10.10.1) (10.10.10.1)
Origin incomplete, metric 0, localpref 0, valid, external, bestpath-from-AS 65101, best (First path received)
Community: graceful-shutdown
Last update: Sun Dec 20 03:04:53 2020
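The effect of graceful shutdown on path selection can be sketched as follows. This Python model is an illustration under stated assumptions: routes are plain dictionaries, the function names are hypothetical, and only the local preference comparison is modeled, not the full BGP best path algorithm.

```python
GRACEFUL_SHUTDOWN = "graceful-shutdown"


def apply_graceful_shutdown(route):
    """Return a copy of the route tagged for graceful shutdown."""
    tagged = dict(route)
    communities = list(route.get("communities", []))
    if GRACEFUL_SHUTDOWN not in communities:
        communities.append(GRACEFUL_SHUTDOWN)
    tagged["communities"] = communities
    tagged["local_pref"] = 0  # least preferred value
    return tagged


def prefer(a, b):
    """Return the route with the higher local preference (a wins a tie)."""
    return a if a["local_pref"] >= b["local_pref"] else b


via_maintenance = apply_graceful_shutdown(
    {"next_hop": "10.10.10.3", "communities": ["99:1"], "local_pref": 100}
)
via_alternate = {"next_hop": "10.10.10.2", "communities": ["99:1"], "local_pref": 100}
print(prefer(via_maintenance, via_alternate)["next_hop"])
```

Because the tagged path carries a local preference of 0, any alternate path with a higher local preference wins, which is what moves traffic away from the node under maintenance.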
Graceful BGP Restart
When BGP restarts on a switch, all BGP peers detect that the session goes down and comes back up. This session transition results in a routing flap on BGP peers that causes BGP to recompute routes, generate route updates, and add unnecessary churn to the forwarding tables. The routing flaps can create transient forwarding blackholes and loops, and also consume resources on the switches affected by the flap, which can affect overall network performance.
To minimize the negative effects that occur when BGP restarts, Cumulus Linux enables graceful BGP restart by default, which lets a BGP speaker signal to its peers that it can preserve its forwarding state and continue data forwarding during a restart. BGP graceful restart also enables a BGP speaker to continue to use routes announced by a peer even after the peer has gone down.
When BGP establishes a session, BGP peers use the BGP OPEN message to negotiate a graceful restart. If the BGP peer also supports graceful restart, it activates for that neighbor session. If the BGP session stops, the BGP peer (the restart helper) flags all routes associated with the device as stale but continues to forward packets to these routes for a certain period of time. The restarting device also continues to forward packets during the graceful restart. After the device comes back up and establishes BGP sessions again with its peers (restart helpers), it waits to learn all routes that these peers announce before selecting a cumulative path; after which, it updates its forwarding tables and re-announces the appropriate routes to its peers. These procedures ensure that if there are any routing changes while the BGP speaker is restarting, the network converges.
For warm boot to restart the switch with no interruption to traffic for existing route entries, you must enable BGP graceful restart in all BGP VRFs.
Restart Modes
Cumulus Linux supports graceful BGP restart full mode and helper-only mode for IPv4 and IPv6. The default setting is helper-only mode.
In full mode, the switch is in both a helper and restarter role.
In helper-only mode, the switch is in a helper role only, where routes originated and advertised from a BGP peer are not deleted.
You can configure graceful BGP restart globally, where all BGP peers inherit the graceful restart capability, or for a BGP peer or peer group (useful for misbehaving peers or when working with third party devices).
BGP goes through a graceful restart (as a restarting router) with a planned switch restart event that ISSU initiates. Any other time BGP restarts, such as when the BGP daemon restarts due to a software exception, or you restart the FRR service, BGP goes through a regular restart where the BGP session with peers terminates and Cumulus Linux removes the learned routes from the forwarding plane.
Changing graceful restart mode results in BGP session flaps.
The switch has graceful restart enabled in helper-only mode by default. To set graceful BGP restart to full mode globally on the switch:
cumulus@leaf01:~$ nv set router bgp graceful-restart mode full
cumulus@leaf01:~$ nv config apply
To set graceful BGP restart to full mode on the BGP peer connected on swp51:
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 graceful-restart mode full
cumulus@leaf01:~$ nv config apply
To set graceful BGP restart back to the default setting (helper-only mode), run the nv unset router bgp graceful-restart command or the nv set router bgp graceful-restart mode helper-only command.
In vtysh, to set graceful BGP restart back to the default setting (helper-only mode), run the no bgp graceful-restart command or the no neighbor <interface> graceful-restart command.
Disable Graceful Restart
If you disable graceful BGP restart, you cannot achieve a switch restart or switch software upgrade with minimal traffic loss in a BGP configuration. Refer to ISSU for more information.
To disable graceful BGP restart globally on the switch:
cumulus@leaf01:~$ nv set router bgp graceful-restart mode off
cumulus@leaf01:~$ nv config apply
You can configure the following graceful BGP restart timers.
restart-time: The number of seconds to wait for a graceful restart capable peer to re-establish BGP peering. You can set a value between 1 and 4095. The default is 120 seconds.
pathselect-defer-time: The number of seconds a restarting peer defers path-selection when waiting for the EOR marker from peers. You can set a value between 0 and 3600. The default is 360 seconds.
stalepath-time: The number of seconds to hold stale routes for a restarting peer. You can set a value between 1 and 4095. The default is 360 seconds.
The following example commands set the restart-time to 400 seconds, pathselect-defer-time (path-selection-deferral-time in NVUE) to 300 seconds, and stalepath-time (stale-routes-time in NVUE) to 400 seconds:
cumulus@leaf01:~$ nv set router bgp graceful-restart restart-time 400
cumulus@leaf01:~$ nv set router bgp graceful-restart path-selection-deferral-time 300
cumulus@leaf01:~$ nv set router bgp graceful-restart stale-routes-time 400
cumulus@leaf01:~$ nv config apply
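Before you push timer values from automation, you might check them against the documented ranges. The following Python sketch is one way to do that; the TIMERS table copies the ranges and defaults from the timer descriptions above, and the validate_timer helper is a hypothetical name, not an NVUE or FRR function.

```python
# Ranges and defaults as listed in the timer descriptions above (seconds).
TIMERS = {
    "restart-time": {"min": 1, "max": 4095, "default": 120},
    "pathselect-defer-time": {"min": 0, "max": 3600, "default": 360},
    "stalepath-time": {"min": 1, "max": 4095, "default": 360},
}


def validate_timer(name, seconds):
    """Return seconds if it is inside the documented range, else raise."""
    limits = TIMERS[name]
    if not limits["min"] <= seconds <= limits["max"]:
        raise ValueError(
            f"{name} must be between {limits['min']} and {limits['max']} seconds"
        )
    return seconds


for timer, value in [("restart-time", 400),
                     ("pathselect-defer-time", 300),
                     ("stalepath-time", 400)]:
    print(timer, validate_timer(timer, value))
```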
The vtysh graceful restart options are:
notification: Enables graceful BGP restart support for BGP NOTIFICATION messages.
preserve-fw-state: Sets the F-bit indication that the FIB is preserved when doing a graceful BGP restart.
restart-time: The number of seconds to wait for a graceful restart capable peer to re-establish BGP peering. You can set a value between 1 and 4095. The default is 120 seconds.
rib-stale-time: The stale route removal time in the RIB (in seconds). You can set a value between 1 and 3600.
select-defer-time: The number of seconds a restarting peer defers path-selection when waiting for the EOR marker from peers. You can set a value between 0 and 3600. The default is 360 seconds.
stalepath-time: The number of seconds to hold stale routes for a restarting peer. You can set a value between 1 and 4095. The default is 360 seconds.
To show graceful BGP restart information on a specific BGP peer, run the vtysh show ip bgp neighbor <neighbor> graceful-restart command or the net show bgp neighbor <neighbor> graceful-restart command.
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ip bgp neighbor swp51 graceful-restart
Codes: GR - Graceful Restart, * - Inheriting Global GR Config,
Restart - GR Mode-Restarting, Helper - GR Mode-Helper,
Disable - GR Mode-Disable.
BGP neighbor on swp51: fe80::4638:39ff:fe00:2, remote AS 65199, local AS 65101, external link
BGP state = Established, up for 00:15:54
Neighbor GR capabilities:
Graceful Restart Capability: advertised and received
Remote Restart timer is 120 seconds
Address families by peer:
none
Graceful restart information:
End-of-RIB send: IPv4 Unicast
End-of-RIB received: IPv4 Unicast
Local GR Mode: Helper*
Remote GR Mode: Helper
R bit: False
Timers:
Configured Restart Time(sec): 120
Received Restart Time(sec): 120
IPv4 Unicast:
F bit: False
End-of-RIB sent: Yes
End-of-RIB sent after update: Yes
End-of-RIB received: Yes
Timers:
Configured Stale Path Time(sec): 360
Enable Read-only Mode
Sometimes, as Cumulus Linux establishes BGP peers and receives updates, it installs prefixes in the RIB and advertises them to BGP peers before receiving and processing information from all the peers. Also, depending on the timing of the updates, Cumulus Linux sometimes installs prefixes, then withdraws and replaces them with new routing information. Read-only mode minimizes this BGP route churn in both the local RIB and with BGP peers.
Enable read-only mode to reduce CPU and network usage when restarting the BGP process. Because intermediate best paths are possible for the same prefix as peers establish and start receiving updates at different times, read-only mode is useful in topologies where BGP learns a prefix from a large number of peers and the network has a high number of prefixes.
While in read-only mode, BGP does not run best-path or generate any updates to its peers.
The following example commands enable read-only mode:
cumulus@leaf01:~$ nv set router bgp convergence-wait time 300
cumulus@leaf01:~$ nv set router bgp convergence-wait establish-wait-time 200
cumulus@leaf01:~$ nv config apply
To show the configured timers and information about the transitions when a convergence event occurs, run the vtysh show ip bgp summary command or the net show bgp summary command.
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ip bgp summary
ipv4 Unicast Summary
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
Read-only mode update-delay limit: 300 seconds
Establish wait: 200 seconds
BGP table version 0
RIB entries 3, using 576 bytes of memory
Peers 1, using 21 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt
spine01(swp51) 4 65199 30798 30802 0 0 0 1d01h09m 0 0
Total number of neighbors 1
...
The vtysh show ip bgp summary json command and the net show bgp summary json command show the last convergence event.
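If you collect the text output of show ip bgp summary in a script, you can pull out the read-only mode timers with a couple of regular expressions. This Python sketch assumes the output lines look like the example above; the function name and dictionary keys are hypothetical.

```python
import re


def read_only_timers(summary_text):
    """Extract the read-only mode timers (in seconds) from summary output."""
    timers = {}
    match = re.search(
        r"Read-only mode update-delay limit:\s*(\d+)\s*seconds", summary_text
    )
    if match:
        timers["update_delay"] = int(match.group(1))
    match = re.search(r"Establish wait:\s*(\d+)\s*seconds", summary_text)
    if match:
        timers["establish_wait"] = int(match.group(1))
    return timers


sample = """\
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
Read-only mode update-delay limit: 300 seconds
Establish wait: 200 seconds
"""
print(read_only_timers(sample))
```

An empty result suggests that read-only mode is not configured, or that the output format differs from the one assumed here.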
BGP Community Lists
You can use community lists to define a BGP community to tag one or more routes. You can then use the communities to apply a route policy on either egress or ingress.
The BGP community list can be either standard or expanded. The standard BGP community list is a pair of values (such as 100:100) that you can tag on a specific prefix and advertise to other neighbors or you can apply them on route ingress. Or, the standard BGP community list can be one of four BGP default communities:
internet: a BGP community that matches all routes
local-AS: a BGP community that restricts routes to your confederation’s sub-AS
no-advertise: a BGP community that is not advertised to anyone
no-export: a BGP community that is not advertised to the eBGP peer
An expanded BGP community list takes a regular expression of communities and matches the listed communities.
When the neighbor receives the prefix, it examines the community value and takes action accordingly, such as permitting or denying the community member in the routing policy.
Community list names must start with a letter and can contain letters, digits, underscores and dashes. For example, you can name a community list COMMUNITY1 or EXTENDED-COMMUNITY_10 but you cannot name a community list 10 or 10_COMMUNITY.
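The naming rule above can be expressed as a regular expression. The following Python sketch is an illustration only; the helper name is hypothetical, and NVUE performs its own validation when you apply the configuration.

```python
import re

# A letter first, then any mix of letters, digits, underscores, and dashes.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_community_list_name(name):
    """Return True if the name follows the community list naming rule."""
    return NAME_RE.fullmatch(name) is not None


for name in ["COMMUNITY1", "EXTENDED-COMMUNITY_10", "10", "10_COMMUNITY"]:
    print(name, is_valid_community_list_name(name))
```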
Here is an example of a standard community list filter:
cumulus@leaf01:~$ nv set router policy community-list COMMUNITY1 rule 10 action permit
cumulus@leaf01:~$ nv set router policy community-list COMMUNITY1 rule 10 community 100:100
cumulus@leaf01:~$ nv config apply
To use a special character, such as a period (.) in the regular expression for an expanded BGP community list, you must escape the character with a backslash (\). For example, nv set router policy community-list COMMUNITY1 rule 10 community "\.*_65000:2002_.*".
You can apply the community list to a route map to define the routing policy:
cumulus@leaf01:~$ nv set router policy route-map ROUTEMAP1 rule 10 match community-list COMMUNITY1
cumulus@leaf01:~$ nv set router policy route-map ROUTEMAP1 rule 10 action permit
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
...
leaf01# configure terminal
leaf01(config)# route-map ROUTEMAP1 permit 10
leaf01(config-route-map)# match community COMMUNITY1
leaf01(config-route-map)# end
leaf01# write memory
leaf01# exit
Cumulus Linux considers the full list of communities on a BGP route as a single string to evaluate. If you try to match $ (ends with), Cumulus Linux matches the last community value in the list of communities, not the individual community values within the list.
For example, if you use the regular expression ".*:(20)$", Cumulus Linux matches all the BGP routes with a list of communities ending in 20.
Routes with communities 45000:10 55000:40 65000:15000 123:20 match.
Routes with communities 45000:10 55000:20 65000:15000 do not match.
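You can reproduce the two cases above with a short Python sketch. Python's re module is not the regular expression engine that FRR uses, so treat this as an approximation that happens to behave the same way for this simple pattern; the matches helper is a hypothetical name.

```python
import re

# The whole community list is evaluated as one string, so $ anchors to the
# end of the last community in the list.
PATTERN = re.compile(r".*:(20)$")


def matches(communities):
    """Return True if the space-separated community string matches."""
    return PATTERN.search(communities) is not None


print(matches("45000:10 55000:40 65000:15000 123:20"))
print(matches("45000:10 55000:20 65000:15000"))
```

The second string contains 55000:20, but because it is not the last community in the list, the pattern does not match.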
Run the following commands to help you troubleshoot BGP.
Show BGP Configuration Summary
To show a summary of the BGP configuration on the switch, run the vtysh show ip bgp summary command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show ip bgp summary
ipv4 Unicast Summary
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
BGP table version 88
RIB entries 25, using 4800 bytes of memory
Peers 5, using 106 KiB of memory
Peer groups 1, using 64 bytes of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
spine01(swp51) 4 65199 31122 31194 0 0 0 1d01h44m 7
spine02(swp52) 4 65199 31060 31151 0 0 0 01:47:13 7
spine03(swp53) 4 65199 31150 31207 0 0 0 01:48:31 7
spine04(swp54) 4 65199 31042 31098 0 0 0 01:46:57 7
leaf02(peerlink.4094) 4 65101 30919 30913 0 0 0 01:47:43 12
Total number of neighbors 5
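When you troubleshoot many switches, it can help to scan the summary output for neighbors that are not established. In the State/PfxRcd column, an established session shows a prefix count and any other session shows a state name. The following Python sketch relies on that and on the column layout in the example above; the function name is hypothetical, and the spine02 line in the sample is invented to show a neighbor that is down.

```python
def unestablished_neighbors(summary_text):
    """Return neighbors whose State/PfxRcd column is not a prefix count."""
    down = []
    in_table = False
    for line in summary_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Neighbor":   # header row starts the table
            in_table = True
            continue
        if fields[0] == "Total":      # footer row ends the table
            in_table = False
            continue
        # Column 10 is State/PfxRcd: digits mean the session is established.
        if in_table and len(fields) >= 10 and not fields[9].isdigit():
            down.append(fields[0])
    return down


sample = """\
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
spine01(swp51) 4 65199 31122 31194 0 0 0 1d01h44m 7
spine02(swp52) 4 65199 31060 31151 0 0 0 00:00:12 Idle
Total number of neighbors 2
"""
print(unestablished_neighbors(sample))
```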
To view the routing table as defined by BGP, run the vtysh show ip bgp ipv4 unicast command or the net show bgp ipv4 unicast command. For example:
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip bgp ipv4 unicast
BGP table version is 88, local router ID is 10.10.10.1, vrf id 0
Default local pref 100, local AS 65101
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
* i10.0.1.1/32 peerlink.4094 0 100 0 ?
*> 0.0.0.0 0 32768 ?
*= 10.0.1.2/32 swp54 0 65199 65102 ?
*= swp52 0 65199 65102 ?
* i peerlink.4094 100 0 65199 65102 ?
*= swp53 0 65199 65102 ?
*> swp51 0 65199 65102 ?
*= 10.0.1.254/32 swp54 0 65199 65132 ?
*= swp52 0 65199 65132 ?
* i peerlink.4094 100 0 65199 65132 ?
*= swp53 0 65199 65132 ?
*> swp51 0 65199 65132 ?
*> 10.10.10.1/32 0.0.0.0 0 32768 ?
*>i10.10.10.2/32 peerlink.4094 0 100 0 ?
*= 10.10.10.3/32 swp54 0 65199 65102 ?
*= swp52 0 65199 65102 ?
* i peerlink.4094 100 0 65199 65102 ?
*= swp53 0 65199 65102 ?
*> swp51 0 65199 65102 ?
...
To show a more detailed breakdown of a specific neighbor, run the vtysh show ip bgp neighbor <neighbor> command or the NVUE nv show vrf <vrf> router bgp neighbor <neighbor> command:
cumulus@switch:~$ nv show vrf default router bgp neighbor swp51
operational applied
----------------------------- ------------------------- ----------
description none
enforce-first-as off
multihop-ttl auto
nexthop-connected-check on
passive-mode off
password none
address-family
ipv4-unicast
enable on
add-path-tx off
nexthop-setting auto
route-reflector-client off
route-server-client off
soft-reconfiguration off
aspath
private-as none
replace-peer-as off
allow-my-asn
enable off
attribute-mod
aspath off on
med off on
nexthop off on
...
To see details of a specific route, such as its source and destination, run the vtysh show ip bgp <route> command or the net show bgp <route> command.
cumulus@switch:~$ sudo vtysh
...
switch# show ip bgp 10.10.10.3/32
BGP routing table entry for 10.10.10.3/32
Paths: (5 available, best #5, table default)
Advertised to non peer-group peers:
spine01(swp51) spine02(swp52) spine03(swp53) spine04(swp54) leaf02(peerlink.4094)
65199 65102
fe80::8e24:2bff:fe79:7d46 from spine04(swp54) (10.10.10.104)
(fe80::8e24:2bff:fe79:7d46) (used)
Origin incomplete, valid, external, multipath
Last update: Wed Oct 7 13:13:13 2020
65199 65102
fe80::841:43ff:fe27:caf from spine02(swp52) (10.10.10.102)
(fe80::841:43ff:fe27:caf) (used)
Origin incomplete, valid, external, multipath
Last update: Wed Oct 7 13:13:14 2020
65199 65102
fe80::90b1:7aff:fe00:3121 from leaf02(peerlink.4094) (10.10.10.2)
Origin incomplete, localpref 100, valid, internal
Last update: Wed Oct 7 13:13:08 2020
65199 65102
fe80::48e7:fbff:fee9:5bcf from spine03(swp53) (10.10.10.103)
(fe80::48e7:fbff:fee9:5bcf) (used)
Origin incomplete, valid, external, multipath
Last update: Wed Oct 7 13:13:13 2020
65199 65102
fe80::7c41:fff:fe93:b711 from spine01(swp51) (10.10.10.101)
(fe80::7c41:fff:fe93:b711) (used)
Origin incomplete, valid, external, multipath, bestpath-from-AS 65199, best (Older Path)
Last update: Wed Oct 7 13:13:13 2020
Check BGP Timer Settings
To check BGP timers, such as the BGP keepalive interval, hold time, and advertisement interval, run the NVUE nv show vrf default router bgp neighbor <neighbor> timers command or the vtysh show ip bgp neighbor <peer> command. For example:
cumulus@leaf01:~$ nv show vrf default router bgp neighbor swp51 timers
operational applied
------------------- ----------- -------
connection-retry 10 auto
hold 9000 auto
keepalive 3000 auto
route-advertisement auto
BGP Update Groups
You can show information about update group events or information about a specific IPv4 or IPv6 update group.
To show information about update group events, run the vtysh show bgp update-group command or run these NVUE commands:
nv show vrf <vrf-id> router bgp address-family ipv4-unicast update-group for IPv4
nv show vrf <vrf-id> router bgp address-family ipv6-unicast update-group for IPv6
cumulus@leaf01:~$ nv show vrf default router bgp address-family ipv4-unicast update-group
Time created LocalAs change Prepend Flag Replace AS flag Minimum advertisement interval Routemap Update group Summary
- ------------ -------------- ------------ --------------- ------------------------------ -------- ------------ ------------
5 1674253324 0 5 sub-group: 5
To show information about a specific update group, such as the number of peer refresh events, prune events, and packet queue length, run the vtysh show bgp update-group <group-id> command or run these NVUE commands:
nv show vrf <vrf-id> router bgp address-family ipv4-unicast update-group <group-id> -o json for IPv4
nv show vrf <vrf-id> router bgp address-family ipv6-unicast update-group <group-id> -o json for IPv6
You can run NVUE commands to show route statistics for a BGP neighbor, such as the number of routes, and information about advertised and received routes.
To show the route count, run the following NVUE commands:
nv show vrf <vrf-id> router bgp neighbor <neighbor-id> address-family ipv4-unicast route-counters for IPv4.
nv show vrf <vrf-id> router bgp neighbor <neighbor-id> address-family ipv6-unicast route-counters for IPv6.
To show a summary of all the BGP IPv4 or IPv6 next hops, run the nv show vrf <vrf> router bgp nexthop ipv4 or nv show vrf <vrf> router bgp nexthop ipv6 command. The output shows the IGP metric, the number of paths pointing to a next hop, and the address or interface used to reach a next hop.
cumulus@leaf01:mgmt:~$ nv show vrf default router bgp nexthop ipv4
Nexthops
===========
PathCnt - Number of paths pointing to this Nexthop, ResolvedVia - Resolved via
address or interface, Interface - Resolved via interface
Address IGPMetric Valid PathCnt ResolvedVia Interface
----------- --------- ----- ------- ------------------------- -------------
10.0.1.34 0 on 160 fe80::4ab0:2dff:fe60:910e swp54
fe80::4ab0:2dff:fea7:7852 swp53
fe80::4ab0:2dff:fec8:8fb9 swp52
fe80::4ab0:2dff:feff:e147 swp51
10.10.10.2 0 on 15 fe80::4ab0:2dff:fe2d:495c peerlink.4094
10.10.10.3 0 on 15 fe80::4ab0:2dff:fe60:910e swp54
fe80::4ab0:2dff:fea7:7852 swp53
fe80::4ab0:2dff:fec8:8fb9 swp52
fe80::4ab0:2dff:feff:e147 swp51
10.10.10.4 0 on 15 fe80::4ab0:2dff:fe60:910e swp54
fe80::4ab0:2dff:fea7:7852 swp53
fe80::4ab0:2dff:fec8:8fb9 swp52
fe80::4ab0:2dff:feff:e147 swp51
10.10.10.63 0 on 15 fe80::4ab0:2dff:fe60:910e swp54
fe80::4ab0:2dff:fea7:7852 swp53
fe80::4ab0:2dff:fec8:8fb9 swp52
fe80::4ab0:2dff:feff:e147 swp51
10.10.10.64 0 on 15 fe80::4ab0:2dff:fe60:910e swp54
fe80::4ab0:2dff:fea7:7852 swp53
fe80::4ab0:2dff:fec8:8fb9 swp52
fe80::4ab0:2dff:feff:e147 swp51
To show information about a specific next hop, run the vtysh show bgp vrf default nexthop <ip-address> command or run these NVUE commands:
nv show vrf <vrf-id> router bgp nexthop ipv4 ip-address <ip-address> -o json for IPv4
nv show vrf <vrf-id> router bgp nexthop ipv6 ip-address <ip-address> -o json for IPv6
To verify that FRR learns the neighboring link-local IPv6 address through the IPv6 neighbor discovery router advertisements on a given interface, run the vtysh show interface <interface> command or the net show interface <interface> command.
If you do not enable ipv6 nd suppress-ra on both ends of the interface, Neighbor address(s): shows the link-local address of the other end (the address that BGP uses when that interface uses BGP).
Cumulus Linux automatically enables IPv6 router advertisements (RAs) on an interface with IPv6 addresses. You do not need to run the no ipv6 nd suppress-ra command for BGP unnumbered.
cumulus@switch:~$ sudo vtysh
...
switch# show interface swp51
Interface swp51 is up, line protocol is up
Link ups: 0 last: (never)
Link downs: 0 last: (never)
PTM status: disabled
vrf: default
OS Description: leaf to spine
index 8 metric 0 mtu 9216 speed 1000
flags: <UP,BROADCAST,RUNNING,MULTICAST>
Type: Ethernet
HWaddr: 10:d8:68:d4:a6:81
inet6 fe80::12d8:68ff:fed4:a681/64
Interface Type Other
protodown: off
ND advertised reachable time is 0 milliseconds
ND advertised retransmit interval is 0 milliseconds
ND advertised hop-count limit is 64 hops
ND router advertisements sent: 217 rcvd: 216
ND router advertisements are sent every 10 seconds
ND router advertisements lifetime tracks ra-interval
ND router advertisement default router preference is medium
Hosts use stateless autoconfig for addresses.
Neighbor address(s):
inet6 fe80::f208:5fff:fe12:cc8c/128
Troubleshoot IPv4 Prefixes Learned with IPv6 Next Hops
To show IPv4 prefixes learned with IPv6 next hops, run the following commands.
The following examples show an IPv4 prefix learned from a BGP peer over an IPv6 session using IPv6 global addresses, but where the next hop installed by BGP is a link-local IPv6 address. This occurs when the session is directly between peers, and the BGP update for the prefix includes both link-local and global IPv6 addresses as next hops. If both global and link-local next hops exist, BGP prefers the link-local address for route installation.
cumulus@spine01:mgmt:~$ sudo vtysh
...
spine01# show ip bgp ipv4 unicast summary
BGP router identifier 10.10.10.101, local AS number 65199 vrf-id 0
BGP table version 3
RIB entries 3, using 576 bytes of memory
Peers 1, using 21 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
leaf01(2001:db8:2::a00:1) 4 65101 22 22 0 0 0 00:01:00 0
Total number of neighbors 1
cumulus@spine01:mgmt:~$ sudo vtysh
...
spine01# show ip bgp ipv4 unicast
BGP table version is 3, local router ID is 10.10.10.101, vrf id 0
Default local pref 100, local AS 65199
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
10.10.10.101/32 fe80::a00:27ff:fea6:b9fe 0 0 32768 i
Displayed 1 routes and 1 total paths
cumulus@spine01:~$ sudo vtysh
...
spine01# show ip bgp ipv4 unicast 10.10.10.101/32
BGP routing table entry for 10.10.10.101/32
Paths: (1 available, best #1, table default)
Advertised to non peer-group peers:
Leaf01(2001:db8:0002::0a00:1)
3
2001:db8:0002::0a00:1 from Leaf01(2001:db8:0002::0a00:1) (10.10.10.101)
(fe80::a00:27ff:fea6:b9fe) (used)
Origin IGP, metric 0, valid, external, bestpath-from-AS 3, best (First path received)
AddPath ID: RX 0, TX 3
Last update: Mon Oct 22 08:09:22 2018
The example output below shows the results of installing the route in the FRR RIB as well as the kernel FIB. The next hop installed in the FRR RIB is the link-local IPv6 address, which Cumulus Linux converts into an IPv4 link-local address, as required for installation into the kernel FIB.
cumulus@spine01:~$ sudo vtysh
...
spine01# show ip route 10.10.10.101/32
RIB entry for 10.10.10.101/32
===========================
Routing entry for 10.10.10.101/32
Known via "bgp", distance 20, metric 0, best
Last update 2d17h05m ago
* fe80::a00:27ff:fea6:b9fe, via swp1
FIB entry for 10.10.10.101/32
===========================
10.10.10.101/32 via 10.0.1.0 dev swp1 proto bgp metric 20 onlink
If BGP learns an IPv4 prefix with only an IPv6 global next hop address (when it learns the route through a route reflector), the command output shows the IPv6 global address as the next hop value. The command also shows that it learns recursively through the link-local address of the route reflector. When you use a global IPv6 address as a next hop for route installation in the FRR RIB, the switch still converts it into an IPv4 link-local address for installation into the kernel.
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip bgp ipv4 unicast summary
BGP router identifier 10.10.10.1, local AS number 65101 vrf-id 0
BGP table version 1
RIB entries 1, using 152 bytes of memory
Peers 1, using 19 KiB of memory
Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd
Spine01(2001:db8:0002::0a00:2) 4 1 74 68 0 0 0 00:00:45 1
Total number of neighbors 1
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip bgp ipv4 unicast
BGP table version is 1, local router ID is 10.10.10.1
Status codes: s suppressed, d damped, h history, * valid, > best, = multipath,
i internal, r RIB-failure, S Stale, R Removed
Origin codes: i - IGP, e - EGP, ? - incomplete
Network Next Hop Metric LocPrf Weight Path
*>i10.1.10.0/24 2001:2:2::4 0 100 0 i
Displayed 1 routes and 1 total paths
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip bgp ipv4 unicast 10.10.10.101/32
BGP routing table entry for 10.10.10.101/32
Paths: (1 available, best #1, table default)
Not advertised to any peer
Local
2001:2:2::4 from Spine01(2001:1:1::1) (10.10.10.104)
Origin IGP, metric 0, localpref 100, valid, internal, bestpath-from-AS Local, best (First path received)
Originator: 10.0.0.14, Cluster list: 10.10.10.111
AddPath ID: RX 0, TX 5
Last update: Mon Oct 22 14:25:30 2018
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip route 10.10.10.1/32
RIB entry for 10.10.10.1/32
===========================
Routing entry for 10.10.10.1/32
Known via "bgp", distance 200, metric 0, best
Last update 00:01:13 ago
2001:2:2::4 (recursive)
* fe80::a00:27ff:fe5a:84ae, via swp1
FIB entry for 10.10.10.1/32
===========================
10.10.10.1/32 via 10.0.1.1 dev swp1 proto bgp metric 20 onlink
To only use IPv6 global addresses for route installation into the FRR RIB, you must add an additional route map to the neighbor or peer group statement in the appropriate address family. When the route map command set ipv6 next-hop prefer-global applies to a neighbor, if both a link-local and global IPv6 address are in the BGP update for a prefix, BGP uses the IPv6 global address for route installation.
With this additional configuration, the output in the FRR RIB changes in the direct neighbor case as shown below:
router bgp 65101
bgp router-id 10.10.10.1
neighbor 2001:db8:2::a00:1 remote-as internal
neighbor 2001:db8:2::a00:1 capability extended-nexthop
!
address-family ipv4 unicast
neighbor 2001:db8:2::a00:1 route-map GLOBAL in
exit-address-family
!
route-map GLOBAL permit 20
set ipv6 next-hop prefer-global
!
The resulting FRR RIB output is as follows:
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR,
> - selected route, * - FIB route
B 0.0.0.0/0 [200/0] via 2001:2:2::4, swp2, 00:01:00
K 0.0.0.0/0 [0/0] via 10.0.2.2, eth0, 1d02h29m
C>* 10.0.0.9/32 is directly connected, lo, 5d18h32m
C>* 10.0.2.0/24 is directly connected, eth0, 03:51:31
B>* 172.16.4.0/24 [200/0] via 2001:2:2::4, swp2, 00:01:00
C>* 172.16.10.0/24 is directly connected, swp3, 5d18h32m
When the switch learns the route through a route reflector, it appears like this:
router bgp 65101
bgp router-id 10.10.10.1
neighbor 2001:db8:2::a00:2 remote-as internal
neighbor 2001:db8:2::a00:2 capability extended-nexthop
!
address-family ipv6 unicast
neighbor 2001:db8:2::a00:2 activate
neighbor 2001:db8:2::a00:2 route-map GLOBAL in
exit-address-family
!
route-map GLOBAL permit 10
set ipv6 next-hop prefer-global
cumulus@leaf01:~$ sudo vtysh
...
leaf01# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR,
> - selected route, * - FIB route
B 0.0.0.0/0 [200/0] via 2001:2:2::4, 00:00:01
K 0.0.0.0/0 [0/0] via 10.0.2.2, eth0, 3d00h26m
C>* 10.0.0.8/32 is directly connected, lo, 3d00h26m
C>* 10.0.2.0/24 is directly connected, eth0, 03:39:18
C>* 172.16.3.0/24 is directly connected, swp2, 3d00h26m
B> 172.16.4.0/24 [200/0] via 2001:2:2::4 (recursive), 00:00:01
* via 2001:1:1::1, swp1, 00:00:01
C>* 172.16.10.0/24 is directly connected, swp3, 3d00h26m
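The next hop choice described in this section can be summarized in a short Python sketch. This is a simplified model, not FRR code: the function and parameter names are assumptions, and prefer_global stands in for a route map that applies set ipv6 next-hop prefer-global to the neighbor.

```python
def next_hop_for_install(link_local, global_addr, prefer_global=False):
    """Choose which IPv6 next hop BGP uses for route installation.

    link_local and global_addr are the next hops carried in the BGP update
    (None if absent).
    """
    if link_local and global_addr:
        # Both present: link-local wins unless the route map prefers global.
        return global_addr if prefer_global else link_local
    # Only one present (for example, a route learned through a route
    # reflector carries only the global address).
    return link_local or global_addr


print(next_hop_for_install("fe80::a00:27ff:fea6:b9fe", "2001:db8:2::a00:1"))
print(next_hop_for_install("fe80::a00:27ff:fea6:b9fe", "2001:db8:2::a00:1",
                           prefer_global=True))
print(next_hop_for_install(None, "2001:2:2::4"))
```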
Neighbor State Change Log
Cumulus Linux records the changes that a neighbor goes through in syslog and in the /var/log/frr/frr.log file. For example:
2020-10-05T15:51:32.621773-07:00 leaf01 bgpd[10104]: %NOTIFICATION: sent to neighbor peerlink.4094 6/7 (Cease/Connection collision resolution) 0 bytes
2020-10-05T15:51:32.623023-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor peerlink.4094(leaf02) in vrf default Up
2020-10-05T15:51:32.623156-07:00 leaf01 bgpd[10104]: %NOTIFICATION: sent to neighbor peerlink.4094 6/7 (Cease/Connection collision resolution) 0 bytes
2020-10-05T15:51:32.623496-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor peerlink.4094(leaf02) in vrf default Down No AFI/SAFI activated for peer
2020-10-05T15:51:33.040332-07:00 leaf01 bgpd[10104]: [EC 33554454] swp53 [Error] bgp_read_packet error: Connection reset by peer
2020-10-05T15:51:33.279468-07:00 leaf01 bgpd[10104]: [EC 33554454] swp52 [Error] bgp_read_packet error: Connection reset by peer
2020-10-05T15:51:33.339487-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor swp54(spine04) in vrf default Up
2020-10-05T15:51:33.340893-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor swp53(spine03) in vrf default Up
2020-10-05T15:51:33.341648-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor swp52(spine02) in vrf default Up
2020-10-05T15:51:33.342369-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor swp51(spine01) in vrf default Up
2020-10-05T15:51:33.627958-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor peerlink.4094(leaf02) in vrf default Up
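When a log contains many state changes, it can help to reduce it to the most recent state per neighbor. The following Python snippet is a minimal sketch that assumes the %ADJCHANGE line layout shown in the excerpt above; the function name last_states and the regular expression are illustrative and not part of Cumulus Linux or FRR.

```python
import re

# Matches bgpd %ADJCHANGE lines in the layout shown in the excerpt above (assumed format).
ADJCHANGE = re.compile(
    r"^(?P<time>\S+) (?P<host>\S+) bgpd\[\d+\]: %ADJCHANGE: "
    r"neighbor (?P<neighbor>\S+?)\((?P<peer>[^)]*)\) in vrf (?P<vrf>\S+) "
    r"(?P<state>Up|Down)(?: (?P<reason>.*))?$"
)

def last_states(lines):
    """Return {neighbor: (state, reason)} from the latest %ADJCHANGE event per neighbor."""
    states = {}
    for line in lines:
        m = ADJCHANGE.match(line.strip())
        if m:
            states[m["neighbor"]] = (m["state"], m["reason"] or "")
    return states

sample = [
    "2020-10-05T15:51:32.623023-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor peerlink.4094(leaf02) in vrf default Up",
    "2020-10-05T15:51:32.623496-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor peerlink.4094(leaf02) in vrf default Down No AFI/SAFI activated for peer",
    "2020-10-05T15:51:33.339487-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor swp54(spine04) in vrf default Up",
    "2020-10-05T15:51:33.627958-07:00 leaf01 bgpd[10104]: %ADJCHANGE: neighbor peerlink.4094(leaf02) in vrf default Up",
]

for neighbor, (state, reason) in last_states(sample).items():
    print(neighbor, state, reason)
```

In practice you would pass the lines of /var/log/frr/frr.log instead of the inline sample; lines that do not match the pattern (such as %NOTIFICATION entries) are ignored.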
Clear BGP Routes
NVUE provides commands to clear and refresh routes in the BGP table. You can clear routes for the IPv4, IPv6, and EVPN address families. The BGP clear commands do not clear counters in the kernel or hardware.
BGP clear route commands that specify a direction (in or out) do not reset BGP neighbor adjacencies.
When the switch has a neighbor configured with soft-reconfiguration inbound enabled, performing a clear in or soft clear in clears the routes in the soft reconfiguration table for the address family. This results in reevaluating routes in the BGP table against any applied input policies.
When the switch has a neighbor configured without the soft-reconfiguration inbound option enabled, performing a clear in or soft in sends the peer a route refresh message.
Outbound BGP clear commands (either out or soft out) readvertise all routes to BGP peers.
BGP soft clear commands that do not specify a direction (in or out) do not reset BGP neighbor adjacencies, and affect both inbound and outbound routes as described above depending on whether soft-reconfiguration inbound is enabled.
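The rules above amount to a small decision table. The following Python sketch restates them for illustration only; the function name bgp_clear_effect and the returned strings are hypothetical and do not correspond to any NVUE or FRR API.

```python
def bgp_clear_effect(direction, soft_reconfig_inbound):
    """Describe the effect of a directional or soft BGP clear, per the rules in this section.

    direction: "in", "out", or "both" (a soft clear with no direction).
    soft_reconfig_inbound: True if the neighbor has soft-reconfiguration inbound enabled.
    None of these cases resets the BGP neighbor adjacency.
    """
    inbound = (
        "re-evaluate stored routes against inbound policy"
        if soft_reconfig_inbound
        else "send a route refresh message to the peer"
    )
    outbound = "readvertise all routes to the peer"
    if direction == "in":
        return [inbound]
    if direction == "out":
        return [outbound]
    if direction == "both":
        return [inbound, outbound]
    raise ValueError(f"unknown direction: {direction!r}")

print(bgp_clear_effect("in", soft_reconfig_inbound=False))
print(bgp_clear_effect("both", soft_reconfig_inbound=True))
```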
This section shows a BGP configuration example based on the reference topology. The example configures BGP unnumbered on all leafs and spines, and MLAG on leaf01 and leaf02, and on leaf03 and leaf04.
cumulus@leaf01:mgmt:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:mgmt:~$ nv set interface swp1-3,swp49-52
cumulus@leaf01:mgmt:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:mgmt:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:mgmt:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:mgmt:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:mgmt:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:mgmt:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf01:mgmt:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:mgmt:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:mgmt:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:mgmt:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:mgmt:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:mgmt:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:mgmt:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:mgmt:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:mgmt:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:mgmt:~$ nv set bridge domain br_default untagged 1
cumulus@leaf01:mgmt:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:mgmt:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:mgmt:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf01:mgmt:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf01:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.1/32
cumulus@leaf01:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf01:mgmt:~$ nv config apply
cumulus@leaf02:mgmt:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:mgmt:~$ nv set interface swp1-3,swp49-52
cumulus@leaf02:mgmt:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:mgmt:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:mgmt:~$ nv set interface bond3 bond member swp3
cumulus@leaf02:mgmt:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:mgmt:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:mgmt:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf02:mgmt:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf02:mgmt:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:mgmt:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:mgmt:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:mgmt:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:mgmt:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:mgmt:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:mgmt:~$ nv set interface vlan30 ip address 10.1.30.3/24
cumulus@leaf02:mgmt:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf02:mgmt:~$ nv set bridge domain br_default untagged 1
cumulus@leaf02:mgmt:~$ nv set router bgp autonomous-system 65102
cumulus@leaf02:mgmt:~$ nv set router bgp router-id 10.10.10.2
cumulus@leaf02:mgmt:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf02:mgmt:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf02:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.2/32
cumulus@leaf02:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf02:mgmt:~$ nv config apply
cumulus@leaf03:mgmt:~$ nv set interface lo ip address 10.10.10.3/32
cumulus@leaf03:mgmt:~$ nv set interface swp1-3,swp49-52
cumulus@leaf03:mgmt:~$ nv set interface bond1 bond member swp1
cumulus@leaf03:mgmt:~$ nv set interface bond2 bond member swp2
cumulus@leaf03:mgmt:~$ nv set interface bond3 bond member swp3
cumulus@leaf03:mgmt:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf03:mgmt:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf03:mgmt:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf03:mgmt:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf03:mgmt:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf03:mgmt:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf03:mgmt:~$ nv set mlag backup 10.10.10.4
cumulus@leaf03:mgmt:~$ nv set mlag peer-ip linklocal
cumulus@leaf03:mgmt:~$ nv set interface vlan40 ip address 10.1.40.4/24
cumulus@leaf03:mgmt:~$ nv set interface vlan50 ip address 10.1.50.4/24
cumulus@leaf03:mgmt:~$ nv set interface vlan60 ip address 10.1.60.4/24
cumulus@leaf03:mgmt:~$ nv set bridge domain br_default vlan 40,50,60
cumulus@leaf03:mgmt:~$ nv set bridge domain br_default untagged 1
cumulus@leaf03:mgmt:~$ nv set router bgp autonomous-system 65103
cumulus@leaf03:mgmt:~$ nv set router bgp router-id 10.10.10.3
cumulus@leaf03:mgmt:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf03:mgmt:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf03:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.3/32
cumulus@leaf03:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf03:mgmt:~$ nv config apply
cumulus@leaf04:mgmt:~$ nv set interface lo ip address 10.10.10.4/32
cumulus@leaf04:mgmt:~$ nv set interface swp1-3,swp49-52
cumulus@leaf04:mgmt:~$ nv set interface bond1 bond member swp1
cumulus@leaf04:mgmt:~$ nv set interface bond2 bond member swp2
cumulus@leaf04:mgmt:~$ nv set interface bond3 bond member swp3
cumulus@leaf04:mgmt:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf04:mgmt:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf04:mgmt:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf04:mgmt:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf04:mgmt:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf04:mgmt:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf04:mgmt:~$ nv set mlag backup 10.10.10.3
cumulus@leaf04:mgmt:~$ nv set mlag peer-ip linklocal
cumulus@leaf04:mgmt:~$ nv set interface vlan40 ip address 10.1.40.5/24
cumulus@leaf04:mgmt:~$ nv set interface vlan50 ip address 10.1.50.5/24
cumulus@leaf04:mgmt:~$ nv set interface vlan60 ip address 10.1.60.5/24
cumulus@leaf04:mgmt:~$ nv set bridge domain br_default vlan 40,50,60
cumulus@leaf04:mgmt:~$ nv set bridge domain br_default untagged 1
cumulus@leaf04:mgmt:~$ nv set router bgp autonomous-system 65104
cumulus@leaf04:mgmt:~$ nv set router bgp router-id 10.10.10.4
cumulus@leaf04:mgmt:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf04:mgmt:~$ nv set vrf default router bgp neighbor swp52 remote-as external
cumulus@leaf04:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.4/32
cumulus@leaf04:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected
cumulus@leaf04:mgmt:~$ nv config apply
cumulus@spine01:mgmt:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:mgmt:~$ nv set interface swp1-4
cumulus@spine01:mgmt:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:mgmt:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:mgmt:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine01:mgmt:~$ nv set vrf default router bgp neighbor swp2 remote-as external
cumulus@spine01:mgmt:~$ nv set vrf default router bgp neighbor swp3 remote-as external
cumulus@spine01:mgmt:~$ nv set vrf default router bgp neighbor swp4 remote-as external
cumulus@spine01:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.101/32
cumulus@spine01:mgmt:~$ nv config apply
cumulus@spine02:mgmt:~$ nv set interface lo ip address 10.10.10.102/32
cumulus@spine02:mgmt:~$ nv set interface swp1-4
cumulus@spine02:mgmt:~$ nv set router bgp autonomous-system 65199
cumulus@spine02:mgmt:~$ nv set router bgp router-id 10.10.10.102
cumulus@spine02:mgmt:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine02:mgmt:~$ nv set vrf default router bgp neighbor swp2 remote-as external
cumulus@spine02:mgmt:~$ nv set vrf default router bgp neighbor swp3 remote-as external
cumulus@spine02:mgmt:~$ nv set vrf default router bgp neighbor swp4 remote-as external
cumulus@spine02:mgmt:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.102/32
cumulus@spine02:mgmt:~$ nv config apply
The NVUE nv config save command saves the configuration in the /etc/nvue.d/startup.yaml file. For example:
cumulus@leaf03:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.3/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto bond2
iface bond2
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 2
auto bond3
iface bond3
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 3
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.4
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 180
auto vlan40
iface vlan40
address 10.1.40.4/24
hwaddress 44:38:39:22:01:bb
vlan-raw-device br_default
vlan-id 40
auto vlan50
iface vlan50
address 10.1.50.4/24
hwaddress 44:38:39:22:01:bb
vlan-raw-device br_default
vlan-id 50
auto vlan60
iface vlan60
address 10.1.60.4/24
hwaddress 44:38:39:22:01:bb
vlan-raw-device br_default
vlan-id 60
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink
hwaddress 44:38:39:22:01:bb
bridge-vlan-aware yes
bridge-vids 40 50 60
bridge-pvid 1
cumulus@leaf04:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.4/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp49
iface swp49
auto swp50
iface swp50
auto swp51
iface swp51
auto swp52
iface swp52
auto bond1
iface bond1
bond-slaves swp1
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 1
auto bond2
iface bond2
bond-slaves swp2
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 2
auto bond3
iface bond3
bond-slaves swp3
bond-mode 802.3ad
bond-lacp-bypass-allow no
clag-id 3
auto peerlink
iface peerlink
bond-slaves swp49 swp50
bond-mode 802.3ad
bond-lacp-bypass-allow no
auto peerlink.4094
iface peerlink.4094
clagd-peer-ip linklocal
clagd-backup-ip 10.10.10.3
clagd-sys-mac 44:38:39:BE:EF:AA
clagd-args --initDelay 180
auto vlan40
iface vlan40
address 10.1.40.5/24
hwaddress 44:38:39:22:01:c1
vlan-raw-device br_default
vlan-id 40
auto vlan50
iface vlan50
address 10.1.50.5/24
hwaddress 44:38:39:22:01:c1
vlan-raw-device br_default
vlan-id 50
auto vlan60
iface vlan60
address 10.1.60.5/24
hwaddress 44:38:39:22:01:c1
vlan-raw-device br_default
vlan-id 60
auto br_default
iface br_default
bridge-ports bond1 bond2 bond3 peerlink
hwaddress 44:38:39:22:01:c1
bridge-vlan-aware yes
bridge-vids 40 50 60
bridge-pvid 1
cumulus@spine01:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
cumulus@spine02:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
auto swp3
iface swp3
auto swp4
iface swp4
This simulation starts with the example BGP configuration. The demo is pre-configured using NVUE commands.
To validate the configuration, run the commands listed in the Troubleshooting-BGP section.
Open Shortest Path First - OSPF
OSPF is a link-state routing protocol that routers use to exchange information about routes and the cost to reach a destination. OSPF routers describe their links, prefixes, and associated costs in link-state advertisements (LSAs), which together build a topology database. Each router within an area has an identical database and calculates its own routing table using the shortest path first (SPF) algorithm. Cumulus Linux runs the SPF algorithm any time routing information in the network changes. OSPF uses areas to limit the size of the topology database on each router. Routers that belong to more than one area are area border routers (ABRs); they simplify the information in LSAs when advertising them from one area to another. ABRs are the routers in OSPF that implement route filtering or route summarization.
You can configure OSPF using either numbered interfaces or unnumbered interfaces.
OSPFv2 Numbered
To configure OSPF using numbered interfaces, you specify the router ID, IP subnet prefix, and area address. You must put all the interfaces on the switch with an IP address that matches the network subnet into the specified area. OSPF attempts to discover other OSPF routers on those interfaces. Cumulus Linux adds all matching interface network addresses to a type-1 LSA and advertises it to discovered neighbors for proper reachability.
If you do not want to bring up an OSPF adjacency on certain interfaces, but want to advertise those networks in the OSPF database, you can configure the interfaces as passive interfaces. A passive interface creates a database entry but does not send or receive OSPF hello packets. For example, in a data center topology, the host-facing interfaces do not need to run OSPF; however, you need to advertise the corresponding IP addresses to neighbors.
Network statements can be as inclusive or generic as necessary to cover the interface networks.
The following example commands configure OSPF numbered on leaf01 and spine01.
leaf01
The loopback address is 10.10.10.1/32
The IP address on swp51 is 10.0.1.0/31
The router ID is 10.10.10.1
All the interfaces on the switch with an IP address that matches subnet 10.10.10.1/32 and swp51 with IP address 10.0.1.0/31 are in area 0
swp1 and swp2 are passive interfaces
spine01
The loopback address is 10.10.10.101/32
The IP address on swp1 is 10.0.1.1/31
The router ID is 10.10.10.101
All the interfaces on the switch with an IP address that matches subnet 10.10.10.101/32 and swp1 with IP address 10.0.1.1/31 are in area 0
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp51 ip address 10.0.1.0/31
cumulus@leaf01:~$ nv set vrf default router ospf router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router ospf area 0 network 10.10.10.1/32
cumulus@leaf01:~$ nv set vrf default router ospf area 0 network 10.0.1.0/31
cumulus@leaf01:~$ nv set interface swp1 router ospf passive on
cumulus@leaf01:~$ nv set interface swp2 router ospf passive on
cumulus@leaf01:~$ nv config apply
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1 ip address 10.0.1.1/31
cumulus@spine01:~$ nv set vrf default router ospf router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router ospf area 0 network 10.10.10.101/32
cumulus@spine01:~$ nv set vrf default router ospf area 0 network 10.0.1.1/31
cumulus@spine01:~$ nv config apply
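To see which interfaces a given network statement would place into an area, you can compare interface addresses against the prefix. The following Python sketch uses the standard ipaddress module and the leaf01 addresses from this example; the helper name interfaces_in_network is hypothetical, and the containment check is a simplified model of the matching, not FRR's implementation.

```python
import ipaddress

def interfaces_in_network(interfaces, network):
    """Return the names of interfaces whose IP address falls inside the network prefix."""
    # strict=False accepts a prefix written with host bits set, such as 10.0.1.1/31.
    net = ipaddress.ip_network(network, strict=False)
    return [
        name
        for name, addr in interfaces.items()
        if ipaddress.ip_interface(addr).ip in net
    ]

leaf01 = {"lo": "10.10.10.1/32", "swp51": "10.0.1.0/31"}

print(interfaces_in_network(leaf01, "10.0.1.0/31"))  # the point-to-point link only
print(interfaces_in_network(leaf01, "10.0.0.0/8"))   # a broader statement covers both
```

The sketch also shows that 10.0.1.0/31 on leaf01 and 10.0.1.1/31 on spine01 describe the same two-address subnet, which is why both ends of the link end up in area 0.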
Edit the /etc/frr/daemons file to enable the ospf daemon, then start the FRR service (see FRRouting).
Edit the /etc/network/interfaces file to configure the IP address for the loopback and swp51:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
auto swp51
iface swp51
address 10.0.1.0/31
Run the ifreload -a command to load the new configuration:
cumulus@leaf01:~$ sudo ifreload -a
You can use the passive-interface default command to set all interfaces as passive and selectively bring up protocol adjacency on certain interfaces:
spine01(config)# router ospf
spine01(config-router)# passive-interface default
spine01(config-router)# no passive-interface swp1
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router ospf
ospf router-id 10.10.10.101
network 10.10.10.101/32 area 0
network 10.0.1.1/31 area 0
...
OSPFv2 Unnumbered
Unnumbered interfaces are interfaces without unique IP addresses; multiple interfaces share the same IP address. In OSPFv2, unnumbered interfaces do not need unique IP addresses on leaf and spine interfaces and simplify the OSPF database, which reduces the memory footprint and improves SPF convergence times.
To configure an unnumbered interface, take the IP address of the loopback interface (called the anchor) and use it as the IP address of the unnumbered interface.
The following example commands configure OSPF unnumbered on leaf01 and spine01.
leaf01
The loopback address is 10.10.10.1/32
The IP address of the unnumbered interface (swp51) is 10.10.10.1/32
The router ID is 10.10.10.1
OSPF is on the loopback interface and on swp51 in area 0
swp1 and swp2 are passive interfaces
swp51 is a point-to-point interface (Cumulus Linux requires point-to-point for unnumbered interfaces)
spine01
The loopback address is 10.10.10.101/32
The IP address of the unnumbered interface (swp1) is 10.10.10.101/32
The router ID is 10.10.10.101
OSPF is on the loopback interface and on swp1 in area 0
swp1 is a point-to-point interface (Cumulus Linux requires point-to-point for unnumbered interfaces)
Configure the unnumbered interface:
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp51 ip address 10.10.10.1/32
cumulus@leaf01:~$ nv config apply
Configure OSPF:
cumulus@leaf01:~$ nv set vrf default router ospf router-id 10.10.10.1
cumulus@leaf01:~$ nv set interface lo router ospf area 0
cumulus@leaf01:~$ nv set interface swp51 router ospf area 0
cumulus@leaf01:~$ nv set interface swp1 router ospf passive on
cumulus@leaf01:~$ nv set interface swp2 router ospf passive on
cumulus@leaf01:~$ nv set interface swp51 router ospf network-type point-to-point
cumulus@leaf01:~$ nv config apply
Configure the unnumbered interface:
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1 ip address 10.10.10.101/32
cumulus@spine01:~$ nv config apply
Configure OSPF:
cumulus@spine01:~$ nv set vrf default router ospf router-id 10.10.10.101
cumulus@spine01:~$ nv set interface lo router ospf area 0
cumulus@spine01:~$ nv set interface swp1 router ospf area 0
cumulus@spine01:~$ nv set interface swp1 router ospf network-type point-to-point
cumulus@spine01:~$ nv config apply
Edit the /etc/frr/daemons file to enable the ospf daemon, then start the FRR service (see FRRouting).
Edit the /etc/network/interfaces file to configure the loopback and unnumbered interface address:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
auto swp51
iface swp51
address 10.10.10.1/32
Run the ifreload -a command to load the new configuration:
cumulus@leaf01:~$ sudo ifreload -a
From the vtysh shell, configure OSPF:
cumulus@leaf01:~$ sudo vtysh
...
leaf01# configure terminal
leaf01(config)# router ospf
leaf01(config-router)# ospf router-id 10.10.10.1
leaf01(config-router)# interface swp51
leaf01(config-if)# ip ospf area 0
leaf01(config-if)# ip ospf network point-to-point
leaf01(config-if)# exit
leaf01(config)# interface lo
leaf01(config-if)# ip ospf area 0
leaf01(config-if)# exit
leaf01(config)# router ospf
leaf01(config-router)# passive-interface swp1,swp2
leaf01(config-router)# end
leaf01# write memory
leaf01# exit
You can use the passive-interface default command to set all interfaces as passive and selectively bring up protocol adjacency on certain interfaces:
leaf01(config)# router ospf
leaf01(config-router)# passive-interface default
leaf01(config-router)# no passive-interface swp51
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface lo
ip ospf area 0
interface swp51
ip ospf area 0
ip ospf network point-to-point
router ospf
ospf router-id 10.10.10.1
passive-interface swp1,swp2
...
Edit the /etc/frr/daemons file to enable the ospf daemon, then start the FRR service (see FRRouting).
Edit the /etc/network/interfaces file to configure the loopback and unnumbered interface address:
cumulus@spine01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto swp1
iface swp1
address 10.10.10.101/32
Run the ifreload -a command to load the new configuration:
cumulus@spine01:~$ sudo ifreload -a
From the vtysh shell, configure OSPF:
cumulus@spine01:~$ sudo vtysh
...
spine01# configure terminal
spine01(config)# router ospf
spine01(config-router)# ospf router-id 10.10.10.101
spine01(config-router)# interface swp1
spine01(config-if)# ip ospf area 0
spine01(config-if)# ip ospf network point-to-point
spine01(config-if)# exit
spine01(config)# interface lo
spine01(config-if)# ip ospf area 0
spine01(config-if)# exit
spine01(config)# end
spine01# write memory
spine01# exit
You can use the passive-interface default command to set all interfaces as passive and selectively bring up protocol adjacency on certain interfaces:
spine01(config)# router ospf
spine01(config-router)# passive-interface default
spine01(config-router)# no passive-interface swp1
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface lo
ip ospf area 0
interface swp1
ip ospf area 0
ip ospf network point-to-point
router ospf
ospf router-id 10.10.10.101
...
Optional OSPFv2 Configuration
This section describes optional configuration. The steps provided in this section assume that you already configured basic OSPFv2 as described in Basic OSPF Configuration, above.
Interface Parameters
You can define the following OSPF parameters per interface:
Network type (point-to-point or broadcast). Broadcast is the default setting. Configure the interface as point-to-point unless you intend to use the Ethernet media as a LAN with multiple connected routers. Point-to-point provides a simplified adjacency state machine, so there is no need for DR/BDR election and LSA reflection. See RFC 5309 for more information.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface swp51
ip ospf network point-to-point
...
The following command example sets the hello interval to 5 seconds and the dead interval to 60 seconds. The hello interval and dead interval can be any value between 1 and 65535 seconds.
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# interface swp51
switch(config-if)# ip ospf hello-interval 5
switch(config-if)# ip ospf dead-interval 60
switch(config-if)# end
switch# write memory
switch# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface swp51
ip ospf hello-interval 5
ip ospf dead-interval 60
...
The following command example sets the priority to 5 for swp51. The priority can be any value between 0 and 255. A priority of 0 configures the interface to never become the OSPF Designated Router (DR) on a broadcast interface.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface swp51
ip ospf priority 5
...
To see the configured OSPF interface parameter values, run the vtysh show ip ospf interface command.
SPF Timer Defaults
OSPF uses the following default timers to prevent consecutive SPF runs from overburdening the CPU:
0 milliseconds from the initial event until SPF runs
50 milliseconds between consecutive SPF runs (the number doubles with each SPF, until it reaches the maximum time between SPF runs)
5000 milliseconds maximum between SPFs
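The doubling behavior of the second timer can be sketched as a short sequence. The following Python generator models only what the list above states (a hold time that starts at 50 milliseconds, doubles with each SPF run, and is capped at 5000 milliseconds); the name spf_hold_times is hypothetical, and any reset of the hold time after a quiet period is deliberately not modeled.

```python
from itertools import islice

def spf_hold_times(initial_hold_ms=50, max_hold_ms=5000):
    """Yield the wait between consecutive SPF runs, doubling until the maximum is reached."""
    hold = initial_hold_ms
    while True:
        yield min(hold, max_hold_ms)
        if hold < max_hold_ms:
            hold *= 2

# First nine hold times with the default timers.
print(list(islice(spf_hold_times(), 9)))
# [50, 100, 200, 400, 800, 1600, 3200, 5000, 5000]
```

With the defaults, the hold time reaches the 5000 millisecond ceiling on the eighth consecutive run and stays there.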
The following example commands change the number of milliseconds from the initial event until SPF runs to 80, the number of milliseconds between consecutive SPF runs to 100, and the maximum number of milliseconds between SPFs to 6000.
To see the configured SPF timer values, run the vtysh show ip ospf command.
MD5 Authentication
To configure MD5 authentication on the switch, you need to create a key and a key ID, then enable MD5 authentication. The key ID must be a value between 1 and 255 that identifies the key used to create the message digest. This value must be consistent across all routers on a link. The key is the shared secret used to generate the message digest; it can be up to 16 characters long (longer strings truncate).
The following example commands create key ID 1 with the key thisisthekey and enable MD5 authentication on swp51 on leaf01 and on swp1 on spine01.
cumulus@leaf01:~$ nv set interface swp51 router ospf authentication message-digest-key 1
cumulus@leaf01:~$ nv set interface swp51 router ospf authentication md5-key thisisthekey
cumulus@leaf01:~$ nv set interface swp51 router ospf authentication enable on
cumulus@leaf01:~$ nv config apply
cumulus@spine01:~$ nv set interface swp1 router ospf authentication message-digest-key 1
cumulus@spine01:~$ nv set interface swp1 router ospf authentication md5-key thisisthekey
cumulus@spine01:~$ nv set interface swp1 router ospf authentication enable on
cumulus@spine01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
...
leaf01# configure terminal
leaf01(config)# interface swp51
leaf01(config-if)# ip ospf authentication message-digest
leaf01(config-if)# ip ospf message-digest-key 1 md5 thisisthekey
leaf01(config-if)# end
leaf01# write memory
leaf01# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface swp51
ip ospf authentication message-digest
ip ospf message-digest-key 1 md5 thisisthekey
...
cumulus@spine01:~$ sudo vtysh
...
spine01# configure terminal
spine01(config)# interface swp1
spine01(config-if)# ip ospf authentication message-digest
spine01(config-if)# ip ospf message-digest-key 1 md5 thisisthekey
spine01(config-if)# end
spine01# write memory
spine01# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface swp1
ip ospf authentication message-digest
ip ospf message-digest-key 1 md5 thisisthekey
...
To remove existing MD5 authentication hashes, run the vtysh no ip ospf command (no ip ospf message-digest-key 1 md5 thisisthekey).
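Before you push MD5 settings to both ends of a link, you can sanity-check them against the constraints in this section (key ID between 1 and 255, key truncated at 16 characters, same values on both routers). The following Python sketch is illustrative only; normalize_md5_key and link_keys_agree are hypothetical helper names, not Cumulus Linux commands.

```python
MAX_KEY_LEN = 16  # longer keys truncate, per the description in this section

def normalize_md5_key(key_id, key):
    """Validate the key ID and return (key_id, key) with the key truncated to 16 characters."""
    if not 1 <= key_id <= 255:
        raise ValueError(f"key ID must be between 1 and 255, got {key_id}")
    return key_id, key[:MAX_KEY_LEN]

def link_keys_agree(end_a, end_b):
    """Return True if both ends of a link use the same key ID and effective key."""
    return normalize_md5_key(*end_a) == normalize_md5_key(*end_b)

leaf01_swp51 = (1, "thisisthekey")
spine01_swp1 = (1, "thisisthekey")
print(link_keys_agree(leaf01_swp51, spine01_swp1))       # True
print(link_keys_agree(leaf01_swp51, (1, "adifferentkey")))  # False
```

A mismatch in either the key ID or the key is worth catching early, because the two routers cannot authenticate each other's OSPF packets unless both values match.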
Summarization and Prefix Range
By default, an ABR creates a summary (type-3) LSA for each route in an area and advertises it in adjacent areas. Prefix range configuration optimizes this behavior by creating and advertising one summary LSA for multiple routes. OSPF only allows route summarization between areas on an ABR.
The following example shows a topology divided into area 0 and area 1. border01 and border02 are ABRs that have links to multiple areas and perform a set of specialized tasks, such as SPF computation per area and summarization of routes across areas.
On border01:
swp1 is in area 1 with IP addresses 10.0.0.24/31, 172.16.1.1/32, 172.16.1.2/32, and 172.16.1.3/32
swp51 is in area 0 with IP address 10.0.1.9/31
These commands create a summary route for all the routes in the range 172.16.1.0/24 in area 0:
cumulus@border01:~$ nv set vrf default router ospf area 0 range 172.16.1.0/24
cumulus@border01:~$ nv config apply
cumulus@border01:~$ sudo vtysh
...
border01# configure terminal
border01(config)# router ospf
border01(config-router)# area 0 range 172.16.1.0/24
border01(config-router)# end
border01# write memory
border01# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
cumulus@border01:mgmt:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip ospf area 0
interface swp1
ip ospf area 1
interface swp2
ip ospf area 1
interface swp51
ip ospf area 0
interface swp52
ip ospf area 0
router ospf
ospf router-id 10.10.10.63
area 0 range 172.16.1.0/24
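To check which addresses a prefix range would summarize, you can test each route against the range. The following Python sketch uses the standard ipaddress module with the border01 addresses listed above; the variable names are illustrative.

```python
import ipaddress

summary = ipaddress.ip_network("172.16.1.0/24")

# Addresses configured on swp1 of border01 in this example.
routes = ["10.0.0.24/31", "172.16.1.1/32", "172.16.1.2/32", "172.16.1.3/32"]

covered = [r for r in routes if ipaddress.ip_network(r).subnet_of(summary)]
uncovered = [r for r in routes if r not in covered]

print("summarized by", summary, ":", covered)
print("advertised individually:", uncovered)
```

The three 172.16.1.x/32 routes fall inside 172.16.1.0/24 and collapse into one summary LSA, while 10.0.0.24/31 lies outside the range.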
Stub Areas
External routes are the routes redistributed into OSPF from another protocol. They have an AS-wide flooding scope. Typically, external link states make up a large percentage of the link-state database (LSDB). Stub areas reduce the LSDB size by not flooding AS-external LSAs.
All routers must agree that an area is a stub, otherwise they do not become OSPF neighbors.
To configure a stub area:
cumulus@switch:~$ nv set vrf default router ospf area 1 type stub
cumulus@switch:~$ nv config apply
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router ospf
ospf router-id 10.10.10.63
area 1 stub
...
Stub areas still receive information about networks that belong to other areas of the same OSPF domain. If summarization is not configured (or is not comprehensive), the information can be overwhelming for the nodes. Totally stubby areas address this issue. Routers in totally stubby areas keep only information about routing within their area in their LSDB.
To configure a totally stubby area:
cumulus@switch:~$ nv set vrf default router ospf area 1 type totally-stub
cumulus@switch:~$ nv config apply
Stub area: LSA types 1, 2, 3, 4 area-scoped, no type 5 externals, inter-area routes summarized
Totally stubby area: LSA types 1, 2 area-scoped, default summary, no type 3, 4, 5 LSA types allowed
Auto-cost Reference Bandwidth
When you set the auto-cost reference bandwidth, Cumulus Linux dynamically calculates the OSPF interface cost to support higher speed links. The default value is 100000 for 100Gbps link speed. The cost of interfaces with link speeds lower than 100Gbps is higher.
To avoid routing loops, set the bandwidth to a consistent value across all OSPF routers.
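The relationship between reference bandwidth and interface cost can be illustrated with a short calculation. The following Python sketch assumes the conventional OSPF formula (cost equals reference bandwidth divided by link bandwidth, both in Mbps, with a minimum of 1 and a 16-bit maximum); the rounding to the nearest integer is an assumption, and ospf_cost is a hypothetical helper name rather than a Cumulus Linux function.

```python
def ospf_cost(link_mbps, reference_mbps=100000):
    """Return the OSPF interface cost for a link speed, given a reference bandwidth in Mbps."""
    if link_mbps <= 0:
        raise ValueError("link speed must be positive")
    cost = int(reference_mbps / link_mbps + 0.5)  # round to nearest (assumed behavior)
    return max(1, min(cost, 65535))               # cost is at least 1 and fits in 16 bits

for speed in (100000, 25000, 10000, 1000):
    print(f"{speed} Mbps -> cost {ospf_cost(speed)}")
```

With the default reference of 100000, a 100Gbps link has cost 1 and a 10Gbps link has cost 10. If two routers used different reference values, they would assign different costs to links of the same speed, which is why the value should be consistent across the OSPF domain.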
The following example commands configure the auto-cost reference bandwidth for 90Gbps link speed:
Cumulus Linux uses the administrative distance to choose which routing protocol to use when two different protocols provide route information for the same destination. The smaller the distance, the more reliable the protocol. For example, if the switch receives a route from OSPF with an administrative distance of 110 and the same route from BGP with an administrative distance of 100, the switch chooses BGP.
Cumulus Linux provides several commands to change the distance for OSPF routes. The default value is 110.
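The selection rule can be sketched in a few lines of Python. This is a simplified model of the comparison described above, not the routing daemon's implementation; the `Candidate` type and the prefix are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    protocol: str
    prefix: str
    distance: int


def select_route(candidates: list[Candidate]) -> Candidate:
    """Pick the candidate with the smallest administrative distance."""
    return min(candidates, key=lambda c: c.distance)


candidates = [
    Candidate("ospf", "10.1.10.0/24", 110),
    Candidate("bgp", "10.1.10.0/24", 100),
]
print(select_route(candidates).protocol)  # bgp
```

Raising the OSPF distance to 150, as in the examples that follow, makes OSPF routes even less preferred relative to a protocol with distance 100.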
The following example commands set the distance for an entire group of routes:
The following example commands change the OSPF administrative distance to 150 for internal routes to a subnet or network inside the same area as the router:
The following example commands change the OSPF administrative distance to 150 for internal routes to a subnet in an area of which the router is not a part:
When you remove a router or OSPF interface, LSA updates trigger throughout the network to inform all routers of the topology change. When the switch receives the LSA and runs SPF, a routing update occurs. This can cause short-duration outages while the network detects the failure and updates the OSPF database.
With a planned outage (such as during a maintenance window), you can configure the OSPF router with an OSPF max-metric to notify its neighbors not to use it as part of the OSPF topology. While the network converges, the max-metric router still forwards all traffic that it receives. After the network converges, the max-metric router no longer receives any transit traffic and you can safely modify it. To remove a single interface, you can configure the OSPF cost for that specific interface.
For failure events, traffic loss can occur during reconvergence (until SPF on all nodes computes an alternative path around the failed link or node to each of the destinations).
To configure the max-metric (for all interfaces):
cumulus@switch:~$ nv set vrf default router ospf max-metric administrative on
cumulus@switch:~$ nv config apply
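The effect of max-metric on path selection can be illustrated with a small shortest-path calculation. The sketch below assumes that a max-metric router advertises cost 65535 on its own links (the stub router behavior described in RFC 6987); the four-node topology and link costs are hypothetical.

```python
import heapq

MAX_METRIC = 65535  # cost a max-metric router advertises on its own links


def shortest_path(graph, source, target):
    """Plain Dijkstra; graph maps node -> {neighbor: advertised cost}."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, link_cost in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + link_cost, neighbor, path + [neighbor]))
    return None


def apply_max_metric(graph, router):
    """Return a copy in which the router advertises MAX_METRIC on its links."""
    updated = {node: dict(links) for node, links in graph.items()}
    for neighbor in updated[router]:
        updated[router][neighbor] = MAX_METRIC
    return updated


graph = {
    "leaf01": {"spine01": 100, "spine02": 100},
    "leaf02": {"spine01": 100, "spine02": 100},
    "spine01": {"leaf01": 100, "leaf02": 100},
    "spine02": {"leaf01": 100, "leaf02": 100},
}

drained = apply_max_metric(graph, "spine01")
print(shortest_path(drained, "leaf01", "leaf02"))   # transit avoids spine01
print(shortest_path(drained, "leaf01", "spine01"))  # spine01 itself stays reachable
```

In this model, traffic between the leafs moves to spine02, while spine01 remains reachable as a destination.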
NVUE provides several commands to show OSPF interface and OSPF neighbor configuration and statistics.
The NVUE commands show brief output. To show more detailed operational information, run the NVUE commands with the --operational -o json option or run the vtysh commands.
The following NVUE nv show commands support OSPF numbered only.
NVUE Command
Description
nv show vrf <vrf> router ospf interface
Shows all OSPF interfaces.
nv show vrf <vrf> router ospf interface <interface>
Shows information about a specific OSPF interface.
nv show vrf <vrf> router ospf interface <interface> local-ip
Shows the local IP addresses for the specified OSPF interface.
nv show vrf <vrf> router ospf interface <interface> local-ip <IPv4_address>
Shows statistics for a specific OSPF interface local IP address.
nv show vrf <vrf> router ospf neighbor
Shows the OSPF neighbor ID and the OSPF interface for all OSPF neighbors.
nv show vrf <vrf> router ospf neighbor <IPv4-address>
Shows the interface and local IP addresses for a specific OSPF neighbor.
nv show vrf <vrf> router ospf neighbor <IPv4-address> interface
Shows the local IP addresses of all the interfaces for an OSPF neighbor.
FRR (vtysh) provides several OSPF troubleshooting commands:
vtysh Command
Description
show ip ospf neighbor
Shows OSPF neighbor information.
show ip ospf database
Shows if the LSDB synchronizes across all routers in the network.
show ip route ospf
Shows the OSPF routes in the routing table, which helps determine why Cumulus Linux does not forward an OSPF route properly.
show ip ospf interface
Shows OSPF interfaces.
show ip ospf
Shows information about the OSPF process.
The following example shows OSPF neighbor information:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ip ospf neighbor
Neighbor ID Pri State Dead Time Address Interface RXmtL RqstL DBsmL
10.10.10.101 1 Full/Backup 30.307s 10.0.1.1 swp51:10.0.1.0 0 0 0
The following example shows the OSPF routes in the routing table, which helps determine why Cumulus Linux does not forward an OSPF route properly:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ip route ospf
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR, f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route
O 10.0.1.0/31 [110/100] is directly connected, swp51, weight 1, 00:02:37
O 10.10.10.1/32 [110/0] is directly connected, lo, weight 1, 00:02:37
O>* 10.10.10.101/32 [110/100] via 10.0.1.1, swp51, weight 1, 00:00:57
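In each route line, the bracketed pair is the administrative distance followed by the OSPF metric, and the > and * flags mark the selected and FIB routes. If you need to check these values in a script, a minimal Python sketch such as the following can parse the text output shown above. The regular expression and function name are assumptions based on this sample output only; they are not a stable interface.

```python
import re

ROUTE_RE = re.compile(
    r"^O(?P<flags>[>*]*)\s+(?P<prefix>\S+)\s+\[(?P<distance>\d+)/(?P<metric>\d+)\]"
)

sample = """\
O   10.0.1.0/31 [110/100] is directly connected, swp51, weight 1, 00:02:37
O   10.10.10.1/32 [110/0] is directly connected, lo, weight 1, 00:02:37
O>* 10.10.10.101/32 [110/100] via 10.0.1.1, swp51, weight 1, 00:00:57
"""


def parse_ospf_routes(text):
    """Extract prefix, distance, metric, and flags from OSPF route lines."""
    routes = []
    for line in text.splitlines():
        match = ROUTE_RE.match(line.strip())
        if match:
            routes.append({
                "prefix": match["prefix"],
                "distance": int(match["distance"]),
                "metric": int(match["metric"]),
                "selected": ">" in match["flags"],
                "in_fib": "*" in match["flags"],
            })
    return routes


for route in parse_ospf_routes(sample):
    print(route)
```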
To capture OSPF packets, run the sudo tcpdump -v -i swp1 ip proto ospf command.
Clear OSPF Counters
You can run the following commands to clear the OSPF counters shown in the NVUE show commands.
nv action clear vrf <vrf> router ospf interface clears all counters for all OSPF interfaces.
nv action clear vrf <vrf> router ospf interface <interface> clears all counters for a specific OSPF interface.
The following example command clears all counters for OSPF interface swp51:
cumulus@switch:~$ nv action clear vrf default router ospf interface swp51
OSPFv3 is a revised version of OSPFv2 and supports the IPv6 address family.
IETF has defined extensions to OSPFv3 to support multiple address families (both IPv6 and IPv4). FRR does not support multiple address families.
Basic OSPFv3 Configuration
You can configure OSPF using either numbered interfaces or unnumbered interfaces.
NVUE commands are not supported for OSPFv3.
OSPFv3 Numbered
To configure OSPF using numbered interfaces, you specify the router ID, IP subnet prefix, and area address. All the interfaces on the switch with an IP address that matches the network subnet go into the specified area. OSPF attempts to discover other OSPF routers on those interfaces. Cumulus Linux adds all matching interface network addresses to a Type-1 Router LSA and advertises to discovered neighbors for proper reachability.
If you do not want to bring up an OSPF adjacency on certain interfaces, but want to advertise those networks in the OSPF database, you can configure the interfaces as passive interfaces. A passive interface creates a database entry but does not send or receive OSPF hello packets. For example, in a data center topology, the host-facing interfaces do not need to run OSPF; however, you must advertise the corresponding IP addresses to neighbors.
The following example commands configure OSPF numbered on leaf01 and spine01.
leaf01
The loopback address is 2001:db8::a0a:0a01/128
The IP address on swp51 is 2001:db8::a00:0101/127
The router ID is 10.10.10.1
All the interfaces on the switch with an IP address that matches subnet 2001:db8::a0a:0a01/128 and swp51 with IP address 2001:db8::a00:0101/127 are in area 0.0.0.0
swp1 and swp2 are passive interfaces
spine01
The loopback address is 2001:db8::a0a:0a65/128
The IP address on swp1 is 2001:db8::a00:0100/127
The router ID is 10.10.10.101
All interfaces on the switch with an IP address that matches subnet 2001:db8::a0a:0a65/128 and swp1 with IP address 2001:db8::a00:0100/127 are in area 0.0.0.0.
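As a quick sanity check of the addressing plan above, the following Python sketch uses the standard ipaddress module to confirm that the two ends of the leaf01 to spine01 link share one /127 prefix. It also shows that the low 32 bits of these example addresses correspond to the IPv4 addresses used elsewhere in this guide; that correspondence is an observation about the example values, not a requirement. The helper name is hypothetical.

```python
import ipaddress

leaf01_swp51 = ipaddress.ip_interface("2001:db8::a00:0101/127")
spine01_swp1 = ipaddress.ip_interface("2001:db8::a00:0100/127")

# Both ends of the point-to-point link share one /127 prefix.
same_subnet = leaf01_swp51.network == spine01_swp1.network
print(same_subnet)  # True


def embedded_ipv4(address: str) -> ipaddress.IPv4Address:
    """Return the IPv4 address encoded in the low 32 bits of an IPv6 address."""
    value = int(ipaddress.IPv6Address(address))
    return ipaddress.IPv4Address(value & 0xFFFFFFFF)


print(embedded_ipv4("2001:db8::a0a:0a01"))  # 10.10.10.1
print(embedded_ipv4("2001:db8::a00:0101"))  # 10.0.1.1
```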
Edit the /etc/frr/daemons file to enable the ospf6 daemon, then start the FRR service (see FRRouting).
Edit the /etc/network/interfaces file to configure the IP address for the loopback and swp51:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 2001:db8::a0a:0a01/128
auto swp51
iface swp51
address 2001:db8::a00:0101/127
Run the ifreload -a command to load the new configuration:
cumulus@leaf01:~$ sudo ifreload -a
Edit the /etc/frr/daemons file to enable the ospf6 daemon, then start the FRR service (see FRRouting).
Edit the /etc/network/interfaces file to configure the IP address for the loopback and swp1:
cumulus@spine01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 2001:db8::a0a:0a65/128
auto swp1
iface swp1
address 2001:db8::a00:0100/127
Run the ifreload -a command to load the new configuration:
cumulus@spine01:~$ sudo ifreload -a
From the vtysh shell, configure OSPF:
cumulus@spine01:~$ sudo vtysh
...
spine01# configure terminal
spine01(config)# router ospf6
spine01(config-ospf6)# ospf6 router-id 10.10.10.101
spine01(config-ospf6)# interface lo area 0.0.0.0
spine01(config-ospf6)# interface swp1 area 0.0.0.0
spine01(config-ospf6)# end
spine01# write memory
spine01# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router ospf6
ospf6 router-id 10.10.10.1
interface lo area 0.0.0.0
interface swp51 area 0.0.0.0
interface swp1
ipv6 ospf6 passive
interface swp2
ipv6 ospf6 passive
...
...
router ospf6
ospf6 router-id 10.10.10.101
interface lo area 0.0.0.0
interface swp1 area 0.0.0.0
...
OSPFv3 Unnumbered
Unnumbered interfaces are interfaces without unique IP addresses; multiple interfaces share the same IP address.
To configure an unnumbered interface, take the IP address of another interface (called the anchor) and use that as the IP address of the unnumbered interface. The anchor is typically the loopback interface on the switch.
The following example commands configure OSPFv3 unnumbered on leaf01 and spine01.
leaf01
The loopback address is 2001:db8::a0a:0a01/128
The router ID is 10.10.10.1
OSPF is on the loopback interface and on swp51 in area 0.0.0.0
swp1 and swp2 are passive interfaces
swp51 is a point-to-point interface (unnumbered interfaces require point-to-point)
spine01
The loopback address is 2001:db8::a0a:0a65/128
The router ID is 10.10.10.101
OSPF is on the loopback interface and on swp1 in area 0.0.0.0
swp1 is a point-to-point interface (unnumbered interfaces require point-to-point)
Edit the /etc/frr/daemons file to enable the ospf6 daemon, then start the FRR service (see FRRouting).
Edit the /etc/network/interfaces file to configure the IP address for the loopback and swp51:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 2001:db8::a0a:0a01/128
auto swp51
iface swp51
address 2001:db8::a0a:0a01/128
Run the ifreload -a command to load the new configuration:
cumulus@leaf01:~$ sudo ifreload -a
Edit the /etc/frr/daemons file to enable the ospf6 daemon, then start the FRR service (see FRRouting).
Edit the /etc/network/interfaces file to configure the IP address for the loopback and swp1:
cumulus@spine01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 2001:db8::a0a:0a65/128
auto swp1
iface swp1
address 2001:db8::a0a:0a65/128
Run the ifreload -a command to load the new configuration:
cumulus@spine01:~$ sudo ifreload -a
...
router ospf6
ospf6 router-id 10.10.10.101
interface lo area 0.0.0.0
interface swp1 area 0.0.0.0
interface swp1
ipv6 ospf6 network point-to-point
...
Optional OSPFv3 Configuration
This section describes optional configuration. The steps provided in this section assume that you already configured basic OSPFv3 as described in Basic OSPFv3 Configuration, above.
Interface Parameters
You can define the following OSPF parameters per interface:
Network type (point-to-point or broadcast). Broadcast is the default setting. Configure the interface as point-to-point unless you intend to use the Ethernet media as a LAN with multiple connected routers. Point-to-point provides a simplified adjacency state machine so there is no need for DR/BDR election and LSA reflection. See RFC 5309 for more information.
The following command example sets the hello interval to 5 seconds, the dead interval to 60 seconds, and the priority to 5 for swp51. The hello interval and dead interval can be any value between 1 and 65535 seconds. The priority can be any value between 0 and 255 (0 configures the interface to never become the OSPF Designated Router (DR) on a broadcast interface).
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface swp51
ipv6 ospf6 cost 1
...
To show the configured OSPF interface parameter values, run the vtysh show ipv6 ospf6 interface command.
SPF Timer Defaults
OSPFv3 uses the following default timers to prevent consecutive SPF runs from overburdening the CPU:
0 milliseconds from the initial event until SPF runs
50 milliseconds between consecutive SPF runs (the number doubles with each SPF, until it reaches the maximum time between SPF runs)
5000 milliseconds maximum between SPFs
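The doubling behavior described above can be sketched as follows. This Python example models only the hold time sequence from the list above (start at the configured value, double after each run, cap at the maximum); the function name is an assumption, and the real daemon also resets the hold time after a quiet period, which this sketch does not model.

```python
def spf_hold_times(initial_hold_ms: int = 50, max_hold_ms: int = 5000, runs: int = 9) -> list[int]:
    """Hold time (ms) applied before each consecutive SPF run."""
    hold = initial_hold_ms
    times = []
    for _ in range(runs):
        times.append(hold)
        # Double the hold time, but never exceed the configured maximum.
        hold = min(hold * 2, max_hold_ms)
    return times


print(spf_hold_times())
# [50, 100, 200, 400, 800, 1600, 3200, 5000, 5000]
```

With the default values, the hold time reaches the 5000 millisecond maximum on the eighth consecutive SPF run.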
The following example commands change the number of milliseconds from the initial event until SPF runs to 80, the number of milliseconds between consecutive SPF runs to 100, and the maximum number of milliseconds between SPFs to 6000.
To see the configured SPF timer values, run the vtysh show ipv6 ospf6 command.
Configure the OSPFv3 Area
You can use different areas to control routing. You can:
Limit an OSPFv3 area from reaching another area.
Manage the size of the routing table by creating a summary route for all the routes in a particular address range.
The following section provides command examples.
The following example command removes the 3:3::/64 route from the routing table. Without a route in the table, any destinations in that network are not reachable.
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router ospf6
switch(config-ospf6)# area 0.0.0.0 range 3:3::/64 not-advertise
switch(config-ospf6)# end
switch# write memory
switch# exit
The following example command creates a summary route for all the routes in the range 2001::/64:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router ospf6
switch(config-ospf6)# area 0.0.0.0 range 2001::/64 advertise
switch(config-ospf6)# end
switch# write memory
switch# exit
You can also configure the cost for a summary route, which Cumulus Linux uses to determine the shortest paths to the destination. The value for cost must be between 0 and 16777215.
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router ospf6
switch(config-ospf6)# area 0.0.0.0 range 2001::/64 cost 160
switch(config-ospf6)# end
switch# write memory
switch# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router ospf6
ospf6 router-id 10.10.10.1
area 0.0.0.0 range 3:3::/64 not-advertise
area 0.0.0.0 range 2001::/64 advertise
area 0.0.0.0 range 2001::/64 cost 160
...
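To check which prefixes a summary range covers before you configure it, you can use a short Python sketch like the one below. It relies on the standard ipaddress module; the function names and the sample prefixes are hypothetical, and the cost limit comes from the range stated above.

```python
import ipaddress

SUMMARY_RANGE = ipaddress.ip_network("2001::/64")
MAX_RANGE_COST = 16777215


def covered_by_range(prefix: str) -> bool:
    """Return True if the prefix falls inside the summary range."""
    return ipaddress.ip_network(prefix).subnet_of(SUMMARY_RANGE)


def valid_range_cost(cost: int) -> bool:
    """Return True if the cost is within the allowed 0..16777215 range."""
    return 0 <= cost <= MAX_RANGE_COST


print(covered_by_range("2001:0:0:0:1::/80"))  # True
print(covered_by_range("2001:0:0:1::/64"))    # False
print(valid_range_cost(160))                  # True
```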
Stub Areas
External routes are the routes redistributed into OSPF from another protocol. They have an AS-wide flooding scope. Typically, external link states make up a large percentage of the LSDB. Stub areas reduce the LSDB size by not flooding AS-external LSAs.
All routers in an area must agree that the area is a stub; otherwise, they do not become OSPF neighbors.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router ospf6
ospf6 router-id 10.10.10.63
area 0.0.0.1 stub
...
Stub areas still receive information about networks that belong to other areas of the same OSPF domain. If summarization is not configured (or is not comprehensive), the information can be overwhelming for the nodes. Totally stubby areas address this issue. Routers in totally stubby areas keep only information about routing within their area in their LSDB.
Stub area
LSA types 1, 2, 3, 4 area-scoped, no type 5 externals, inter-area routes summarized
Totally stubby area
LSA types 1, 2 area-scoped, default summary, no type 3, 4, 5 LSA types allowed
Auto-cost Reference Bandwidth
When you set the auto-cost reference bandwidth, Cumulus Linux dynamically calculates the OSPF interface cost to support higher speed links. The default value is 100000 for 100Gbps link speed. The cost of interfaces with link speeds lower than 100Gbps is higher.
To avoid routing loops, set the bandwidth to a consistent value across all OSPF routers.
The following example commands configure the auto-cost reference bandwidth for 90Gbps link speed:
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router ospf6
ospf6 router-id 10.10.10.1
interface lo area 0.0.0.0
interface swp51 area 0.0.0.0
auto-cost reference-bandwidth 90000
...
Administrative Distance
Cumulus Linux uses the administrative distance to choose which routing protocol to use when two different protocols provide route information for the same destination. The smaller the distance, the more reliable the protocol. For example, if the switch receives a route from OSPFv3 with an administrative distance of 110 and the same route from BGP with an administrative distance of 100, the switch chooses BGP.
Cumulus Linux provides several commands to change the administrative distance for OSPF routes. The default value is 110.
This example command sets the distance for an entire group of routes, rather than a specific route.
The vtysh commands save the configuration to the /etc/frr/frr.conf file. For example:
...
router ospf6
ospf6 router-id 10.10.10.1
interface lo area 0.0.0.0
distance ospf6 intra-area 150 inter-area 150 external 220
...
Troubleshooting
Cumulus Linux provides several OSPFv3 troubleshooting commands:
To
vtysh Command
Show neighbor states
show ipv6 ospf6 neighbor
Verify that the LSDB is the same across all routers in the network
show ipv6 ospf6 database
Determine why Cumulus Linux does not forward an OSPF route correctly
show ipv6 ospf6 route
Show OSPF interfaces
show ipv6 ospf6 interface
Help visualize the network view
show ipv6 ospf6 spf tree
Show information about the OSPFv3 process
show ipv6 ospf6
The following example shows the vtysh show ipv6 ospf6 neighbor command output:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ipv6 ospf6 neighbor
Neighbor ID Pri DeadTime State/IfState Duration I/F[State]
10.10.10.101 1 00:00:34 Full/BDR 00:02:58 swp51[DR]
The following example shows the vtysh show ipv6 ospf6 route command output:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ipv6 ospf6 route
Codes: K - kernel route, C - connected, S - static, R - RIPng,
O - OSPFv3, I - IS-IS, B - BGP, N - NHRP, T - Table,
v - VNC, V - VNC-Direct, A - Babel, D - SHARP, F - PBR,
f - OpenFabric,
> - selected route, * - FIB route, q - queued route, r - rejected route
O 2001:db8::a00:100/127 [110/100] is directly connected, swp51, weight 1, 00:00:20
O 2001:db8::a0a:a01/128 [110/10] is directly connected, lo, weight 1, 00:01:40
O>* 2001:db8::a0a:a65/128 [110/110] via fe80::4638:39ff:fe00:2, swp51, weight 1, 00:00:15
To capture OSPFv3 packets, run the sudo tcpdump -v -i swp1 ip6 proto 89 command (OSPFv3 runs directly over IPv6 as protocol 89).
This section shows an OSPF configuration example based on the reference topology.
The example configuration configures:
OSPFv2 unnumbered on all leafs and spines
MLAG on leaf01 and leaf02, and on border01 and border02
leaf01, leaf02, spine01, and spine02 in area 0
border01 and border02 (ABRs) in area 0 and area 1
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp51 ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp52 ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface bond1 bond member swp1
cumulus@leaf01:~$ nv set interface bond2 bond member swp2
cumulus@leaf01:~$ nv set interface bond3 bond member swp3
cumulus@leaf01:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf01:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf01:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf01:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf01:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf01:~$ nv set mlag backup 10.10.10.2
cumulus@leaf01:~$ nv set mlag peer-ip linklocal
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.2/24
cumulus@leaf01:~$ nv set interface vlan20 ip address 10.1.20.2/24
cumulus@leaf01:~$ nv set interface vlan30 ip address 10.1.30.2/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf01:~$ nv set interface vlan10 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf01:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf01:~$ nv set interface vlan20 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf01:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf01:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf01:~$ nv set interface vlan30 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf01:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf01:~$ nv set bridge domain br_default untagged 1
cumulus@leaf01:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf01:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf01:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf01:~$ nv set vrf default router ospf router-id 10.10.10.1
cumulus@leaf01:~$ nv set interface lo router ospf area 0
cumulus@leaf01:~$ nv set interface swp51 router ospf area 0
cumulus@leaf01:~$ nv set interface swp52 router ospf area 0
cumulus@leaf01:~$ nv set interface swp51 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp52 router ospf network-type point-to-point
cumulus@leaf01:~$ nv set interface swp51 router ospf timers hello-interval 5
cumulus@leaf01:~$ nv set interface swp51 router ospf timers dead-interval 60
cumulus@leaf01:~$ nv set interface swp52 router ospf timers hello-interval 5
cumulus@leaf01:~$ nv set interface swp52 router ospf timers dead-interval 60
cumulus@leaf01:~$ nv set interface vlan10 router ospf area 0
cumulus@leaf01:~$ nv set interface vlan20 router ospf area 0
cumulus@leaf01:~$ nv set interface vlan30 router ospf area 0
cumulus@leaf01:~$ nv set interface vlan10 router ospf passive on
cumulus@leaf01:~$ nv set interface vlan20 router ospf passive on
cumulus@leaf01:~$ nv set interface vlan30 router ospf passive on
cumulus@leaf01:~$ nv set router ospf timers spf delay 80
cumulus@leaf01:~$ nv set router ospf timers spf holdtime 100
cumulus@leaf01:~$ nv set router ospf timers spf max-holdtime 6000
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp51 ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp52 ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface bond1 bond member swp1
cumulus@leaf02:~$ nv set interface bond2 bond member swp2
cumulus@leaf02:~$ nv set interface bond3 bond member swp3
cumulus@leaf02:~$ nv set interface bond1 bond mlag id 1
cumulus@leaf02:~$ nv set interface bond2 bond mlag id 2
cumulus@leaf02:~$ nv set interface bond3 bond mlag id 3
cumulus@leaf02:~$ nv set interface bond1 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond2 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond3 bond lacp-bypass on
cumulus@leaf02:~$ nv set interface bond1-3 bridge domain br_default
cumulus@leaf02:~$ nv set interface peerlink bond member swp49-50
cumulus@leaf02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@leaf02:~$ nv set mlag backup 10.10.10.1
cumulus@leaf02:~$ nv set mlag peer-ip linklocal
cumulus@leaf02:~$ nv set interface vlan10 ip address 10.1.10.3/24
cumulus@leaf02:~$ nv set interface vlan20 ip address 10.1.20.3/24
cumulus@leaf02:~$ nv set interface vlan30 ip address 10.1.30.3/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr address 10.1.10.1/24
cumulus@leaf02:~$ nv set interface vlan10 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf02:~$ nv set interface vlan10 ip vrr state up
cumulus@leaf02:~$ nv set interface vlan20 ip vrr address 10.1.20.1/24
cumulus@leaf02:~$ nv set interface vlan20 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf02:~$ nv set interface vlan20 ip vrr state up
cumulus@leaf02:~$ nv set interface vlan30 ip vrr address 10.1.30.1/24
cumulus@leaf02:~$ nv set interface vlan30 ip vrr mac-address 00:00:5e:00:01:00
cumulus@leaf02:~$ nv set interface vlan30 ip vrr state up
cumulus@leaf02:~$ nv set bridge domain br_default vlan 10,20,30
cumulus@leaf02:~$ nv set bridge domain br_default untagged 1
cumulus@leaf02:~$ nv set interface bond1 bridge domain br_default access 10
cumulus@leaf02:~$ nv set interface bond2 bridge domain br_default access 20
cumulus@leaf02:~$ nv set interface bond3 bridge domain br_default access 30
cumulus@leaf02:~$ nv set vrf default router ospf router-id 10.10.10.2
cumulus@leaf02:~$ nv set interface lo router ospf area 0
cumulus@leaf02:~$ nv set interface swp51 router ospf area 0
cumulus@leaf02:~$ nv set interface swp52 router ospf area 0
cumulus@leaf02:~$ nv set interface swp51 router ospf network-type point-to-point
cumulus@leaf02:~$ nv set interface swp52 router ospf network-type point-to-point
cumulus@leaf02:~$ nv set interface swp51 router ospf timers hello-interval 5
cumulus@leaf02:~$ nv set interface swp51 router ospf timers dead-interval 60
cumulus@leaf02:~$ nv set interface swp52 router ospf timers hello-interval 5
cumulus@leaf02:~$ nv set interface swp52 router ospf timers dead-interval 60
cumulus@leaf02:~$ nv set interface vlan10 router ospf area 0
cumulus@leaf02:~$ nv set interface vlan20 router ospf area 0
cumulus@leaf02:~$ nv set interface vlan30 router ospf area 0
cumulus@leaf02:~$ nv set interface vlan10 router ospf passive on
cumulus@leaf02:~$ nv set interface vlan20 router ospf passive on
cumulus@leaf02:~$ nv set interface vlan30 router ospf passive on
cumulus@leaf02:~$ nv set router ospf timers spf delay 80
cumulus@leaf02:~$ nv set router ospf timers spf holdtime 100
cumulus@leaf02:~$ nv set router ospf timers spf max-holdtime 6000
cumulus@leaf02:~$ nv config apply
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp1 ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp2 ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp5 ip address 10.10.10.101/32
cumulus@spine01:~$ nv set interface swp6 ip address 10.10.10.101/32
cumulus@spine01:~$ nv set vrf default router ospf router-id 10.10.10.101
cumulus@spine01:~$ nv set interface lo router ospf area 0
cumulus@spine01:~$ nv set interface swp1 router ospf area 0
cumulus@spine01:~$ nv set interface swp1 router ospf network-type point-to-point
cumulus@spine01:~$ nv set interface swp1 router ospf timers hello-interval 5
cumulus@spine01:~$ nv set interface swp1 router ospf timers dead-interval 60
cumulus@spine01:~$ nv set interface swp2 router ospf area 0
cumulus@spine01:~$ nv set interface swp2 router ospf network-type point-to-point
cumulus@spine01:~$ nv set interface swp2 router ospf timers hello-interval 5
cumulus@spine01:~$ nv set interface swp2 router ospf timers dead-interval 60
cumulus@spine01:~$ nv set interface swp5 router ospf area 0
cumulus@spine01:~$ nv set interface swp5 router ospf network-type point-to-point
cumulus@spine01:~$ nv set interface swp5 router ospf timers hello-interval 5
cumulus@spine01:~$ nv set interface swp5 router ospf timers dead-interval 60
cumulus@spine01:~$ nv set interface swp6 router ospf area 0
cumulus@spine01:~$ nv set interface swp6 router ospf network-type point-to-point
cumulus@spine01:~$ nv set interface swp6 router ospf timers hello-interval 5
cumulus@spine01:~$ nv set interface swp6 router ospf timers dead-interval 60
cumulus@spine01:~$ nv set router ospf timers spf delay 80
cumulus@spine01:~$ nv set router ospf timers spf holdtime 100
cumulus@spine01:~$ nv set router ospf timers spf max-holdtime 6000
cumulus@spine01:~$ nv config apply
cumulus@spine02:~$ nv set interface lo ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp1 ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp2 ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp5 ip address 10.10.10.102/32
cumulus@spine02:~$ nv set interface swp6 ip address 10.10.10.102/32
cumulus@spine02:~$ nv set vrf default router ospf router-id 10.10.10.102
cumulus@spine02:~$ nv set interface lo router ospf area 0
cumulus@spine02:~$ nv set interface swp1 router ospf area 0
cumulus@spine02:~$ nv set interface swp1 router ospf network-type point-to-point
cumulus@spine02:~$ nv set interface swp1 router ospf timers hello-interval 5
cumulus@spine02:~$ nv set interface swp1 router ospf timers dead-interval 60
cumulus@spine02:~$ nv set interface swp2 router ospf area 0
cumulus@spine02:~$ nv set interface swp2 router ospf network-type point-to-point
cumulus@spine02:~$ nv set interface swp2 router ospf timers hello-interval 5
cumulus@spine02:~$ nv set interface swp2 router ospf timers dead-interval 60
cumulus@spine02:~$ nv set interface swp5 router ospf area 0
cumulus@spine02:~$ nv set interface swp5 router ospf network-type point-to-point
cumulus@spine02:~$ nv set interface swp5 router ospf timers hello-interval 5
cumulus@spine02:~$ nv set interface swp5 router ospf timers dead-interval 60
cumulus@spine02:~$ nv set interface swp6 router ospf area 0
cumulus@spine02:~$ nv set interface swp6 router ospf network-type point-to-point
cumulus@spine02:~$ nv set interface swp6 router ospf timers hello-interval 5
cumulus@spine02:~$ nv set interface swp6 router ospf timers dead-interval 60
cumulus@spine02:~$ nv set router ospf timers spf delay 80
cumulus@spine02:~$ nv set router ospf timers spf holdtime 100
cumulus@spine02:~$ nv set router ospf timers spf max-holdtime 6000
cumulus@spine02:~$ nv config apply
cumulus@border01:~$ nv set interface lo ip address 10.10.10.63/32
cumulus@border01:~$ nv set interface swp51 ip address 10.10.10.63/32
cumulus@border01:~$ nv set interface swp52 ip address 10.10.10.63/32
cumulus@border01:~$ nv set interface bond1 bond member swp1
cumulus@border01:~$ nv set interface bond2 bond member swp2
cumulus@border01:~$ nv set interface bond1 bond mlag id 1
cumulus@border01:~$ nv set interface bond2 bond mlag id 2
cumulus@border01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@border01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@border01:~$ nv set interface bond1 bridge domain br_default access 2001
cumulus@border01:~$ nv set interface bond2 bridge domain br_default access 2001
cumulus@border01:~$ nv set interface bond1-2 bridge domain br_default
cumulus@border01:~$ nv set interface vlan2001
cumulus@border01:~$ nv set interface vlan2001 ip address 10.1.201.2/24
cumulus@border01:~$ nv set interface vlan2001 ip vrr address 10.1.201.1/24
cumulus@border01:~$ nv set interface vlan2001 ip vrr mac-address 00:00:5e:00:01:00
cumulus@border01:~$ nv set interface vlan2001 ip vrr state up
cumulus@border01:~$ nv set interface peerlink bond member swp49-50
cumulus@border01:~$ nv set mlag mac-address 44:38:39:BE:EF:FF
cumulus@border01:~$ nv set mlag backup 10.10.10.64
cumulus@border01:~$ nv set mlag peer-ip linklocal
cumulus@border01:~$ nv set bridge domain br_default untagged 1
cumulus@border01:~$ nv set vrf default router ospf router-id 10.10.10.63
cumulus@border01:~$ nv set interface lo router ospf area 0
cumulus@border01:~$ nv set interface swp51 router ospf area 0
cumulus@border01:~$ nv set interface swp51 router ospf network-type point-to-point
cumulus@border01:~$ nv set interface swp51 router ospf timers hello-interval 5
cumulus@border01:~$ nv set interface swp51 router ospf timers dead-interval 60
cumulus@border01:~$ nv set interface swp52 router ospf area 0
cumulus@border01:~$ nv set interface swp52 router ospf network-type point-to-point
cumulus@border01:~$ nv set interface swp52 router ospf timers hello-interval 5
cumulus@border01:~$ nv set interface swp52 router ospf timers dead-interval 60
cumulus@border01:~$ nv set interface vlan2001 router ospf area 1
cumulus@border01:~$ nv set router ospf timers spf delay 80
cumulus@border01:~$ nv set router ospf timers spf holdtime 100
cumulus@border01:~$ nv set router ospf timers spf max-holdtime 6000
cumulus@border01:~$ nv config apply
cumulus@border02:~$ nv set interface lo ip address 10.10.10.64/32
cumulus@border02:~$ nv set interface swp51 ip address 10.10.10.64/32
cumulus@border02:~$ nv set interface swp52 ip address 10.10.10.64/32
cumulus@border02:~$ nv set interface bond1 bond member swp1
cumulus@border02:~$ nv set interface bond2 bond member swp2
cumulus@border02:~$ nv set interface bond1 bond mlag id 1
cumulus@border02:~$ nv set interface bond2 bond mlag id 2
cumulus@border02:~$ nv set interface bond1 bond lacp-bypass on
cumulus@border02:~$ nv set interface bond2 bond lacp-bypass on
cumulus@border02:~$ nv set interface bond1 bridge domain br_default access 2001
cumulus@border02:~$ nv set interface bond2 bridge domain br_default access 2001
cumulus@border02:~$ nv set interface bond1-2 bridge domain br_default
cumulus@border02:~$ nv set interface vlan2001
cumulus@border02:~$ nv set interface vlan2001 ip address 10.1.201.3/24
cumulus@border02:~$ nv set interface vlan2001 ip vrr address 10.1.201.1/24
cumulus@border02:~$ nv set interface vlan2001 ip vrr mac-address 00:00:5e:00:01:00
cumulus@border02:~$ nv set interface vlan2001 ip vrr state up
cumulus@border02:~$ nv set interface peerlink bond member swp49-50
cumulus@border02:~$ nv set mlag mac-address 44:38:39:BE:EF:FF
cumulus@border02:~$ nv set mlag backup 10.10.10.63
cumulus@border02:~$ nv set mlag peer-ip linklocal
cumulus@border02:~$ nv set bridge domain br_default untagged 1
cumulus@border02:~$ nv set vrf default router ospf router-id 10.10.10.64
cumulus@border02:~$ nv set interface lo router ospf area 0
cumulus@border02:~$ nv set interface swp51 router ospf area 0
cumulus@border02:~$ nv set interface swp51 router ospf network-type point-to-point
cumulus@border02:~$ nv set interface swp51 router ospf timers hello-interval 5
cumulus@border02:~$ nv set interface swp51 router ospf timers dead-interval 60
cumulus@border02:~$ nv set interface swp52 router ospf area 0
cumulus@border02:~$ nv set interface swp52 router ospf network-type point-to-point
cumulus@border02:~$ nv set interface swp52 router ospf timers hello-interval 5
cumulus@border02:~$ nv set interface swp52 router ospf timers dead-interval 60
cumulus@border02:~$ nv set interface vlan2001 router ospf area 1
cumulus@border02:~$ nv set router ospf timers spf max-holdtime 6000
cumulus@border02:~$ nv set router ospf timers spf holdtime 100
cumulus@border02:~$ nv config apply
cumulus@leaf01:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
bridge:
domain:
br_default:
untagged: 1
vlan:
'10': {}
'20': {}
'30': {}
interface:
bond1:
bond:
lacp-bypass: on
member:
swp1: {}
mlag:
enable: on
id: 1
bridge:
domain:
br_default:
access: 10
type: bond
bond2:
bond:
lacp-bypass: on
member:
swp2: {}
mlag:
enable: on
id: 2
bridge:
domain:
br_default:
access: 20
type: bond
bond3:
bond:
lacp-bypass: on
member:
swp3: {}
mlag:
enable: on
id: 3
bridge:
domain:
br_default:
access: 30
type: bond
lo:
ip:
address:
10.10.10.1/32: {}
router:
ospf:
area: 0
enable: on
type: loopback
peerlink:
bond:
member:
swp49: {}
swp50: {}
type: peerlink
peerlink.4094:
base-interface: peerlink
type: sub
vlan: 4094
swp51:
ip:
address:
10.10.10.1/32: {}
router:
ospf:
area: 0
enable: on
network-type: point-to-point
timers:
dead-interval: 60
hello-interval: 5
type: swp
swp52:
ip:
address:
10.10.10.1/32: {}
router:
ospf:
area: 0
enable: on
network-type: point-to-point
timers:
dead-interval: 60
hello-interval: 5
type: swp
vlan10:
ip:
address:
10.1.10.2/24: {}
vrr:
address:
10.1.10.1/24: {}
enable: on
mac-address: 00:00:5e:00:01:00
state:
up: {}
router:
ospf:
area: 0
enable: on
passive: on
type: svi
vlan: 10
vlan20:
ip:
address:
10.1.20.2/24: {}
vrr:
address:
10.1.20.1/24: {}
enable: on
mac-address: 00:00:5e:00:01:00
state:
up: {}
router:
ospf:
area: 0
enable: on
passive: on
type: svi
vlan: 20
vlan30:
ip:
address:
10.1.30.2/24: {}
vrr:
address:
10.1.30.1/24: {}
enable: on
mac-address: 00:00:5e:00:01:00
state:
up: {}
router:
ospf:
area: 0
enable: on
passive: on
type: svi
vlan: 30
mlag:
backup:
10.10.10.2: {}
enable: on
mac-address: 44:38:39:BE:EF:AA
peer-ip: linklocal
router:
ospf:
enable: on
timers:
spf:
delay: 80
holdtime: 100
max-holdtime: 6000
vrr:
enable: on
system:
hostname: leaf01
vrf:
default:
router:
ospf:
enable: on
router-id: 10.10.10.1
cumulus@leaf02:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
bridge:
domain:
br_default:
untagged: 1
vlan:
'10': {}
'20': {}
'30': {}
interface:
bond1:
bond:
lacp-bypass: on
member:
swp1: {}
mlag:
enable: on
id: 1
bridge:
domain:
br_default:
access: 10
type: bond
bond2:
bond:
lacp-bypass: on
member:
swp2: {}
mlag:
enable: on
id: 2
bridge:
domain:
br_default:
access: 20
type: bond
bond3:
bond:
lacp-bypass: on
member:
swp3: {}
mlag:
enable: on
id: 3
bridge:
domain:
br_default:
access: 30
type: bond
lo:
ip:
address:
10.10.10.2/32: {}
router:
ospf:
area: 0
enable: on
type: loopback
peerlink:
bond:
member:
swp49: {}
swp50: {}
type: peerlink
peerlink.4094:
base-interface: peerlink
type: sub
vlan: 4094
swp51:
ip:
address:
10.10.10.2/32: {}
router:
ospf:
area: 0
enable: on
network-type: point-to-point
timers:
dead-interval: 60
hello-interval: 5
type: swp
swp52:
ip:
address:
10.10.10.2/32: {}
router:
ospf:
area: 0
enable: on
network-type: point-to-point
timers:
dead-interval: 60
hello-interval: 5
type: swp
vlan10:
ip:
address:
10.1.10.3/24: {}
vrr:
address:
10.1.10.1/24: {}
enable: on
mac-address: 00:00:5e:00:01:00
state:
up: {}
router:
ospf:
area: 0
enable: on
passive: on
type: svi
vlan: 10
vlan20:
ip:
address:
10.1.20.3/24: {}
vrr:
address:
10.1.20.1/24: {}
enable: on
mac-address: 00:00:5e:00:01:00
state:
up: {}
router:
ospf:
area: 0
enable: on
passive: on
type: svi
vlan: 20
vlan30:
ip:
address:
10.1.30.3/24: {}
vrr:
address:
10.1.30.1/24: {}
enable: on
mac-address: 00:00:5e:00:01:00
state:
up: {}
router:
ospf:
area: 0
enable: on
passive: on
type: svi
vlan: 30
mlag:
backup:
10.10.10.1: {}
enable: on
mac-address: 44:38:39:BE:EF:AA
peer-ip: linklocal
router:
ospf:
enable: on
timers:
spf:
delay: 80
holdtime: 100
max-holdtime: 6000
vrr:
enable: on
system:
hostname: leaf02
vrf:
default:
router:
ospf:
enable: on
router-id: 10.10.10.2
cumulus@spine01:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
address 10.10.10.101/32
auto swp2
iface swp2
address 10.10.10.101/32
auto swp5
iface swp5
address 10.10.10.101/32
auto swp6
iface swp6
address 10.10.10.101/32
cumulus@spine02:~$ cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
address 10.10.10.102/32
auto swp2
iface swp2
address 10.10.10.102/32
auto swp5
iface swp5
address 10.10.10.102/32
auto swp6
iface swp6
address 10.10.10.102/32
cumulus@border01:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.63/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
cumulus@leaf01:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip ospf area 0
interface swp51
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp52
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface vlan10
ip ospf area 0
router ospf
passive-interface vlan10
interface vlan20
ip ospf area 0
router ospf
passive-interface vlan20
interface vlan30
ip ospf area 0
router ospf
passive-interface vlan30
vrf default
exit-vrf
vrf mgmt
exit-vrf
router ospf
ospf router-id 10.10.10.1
timers throttle spf 80 100 6000
! end of router ospf block
cumulus@leaf02:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip ospf area 0
interface swp51
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp52
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface vlan10
ip ospf area 0
router ospf
passive-interface vlan10
interface vlan20
ip ospf area 0
router ospf
passive-interface vlan20
interface vlan30
ip ospf area 0
router ospf
passive-interface vlan30
vrf default
exit-vrf
vrf mgmt
exit-vrf
router ospf
ospf router-id 10.10.10.2
timers throttle spf 80 100 6000
! end of router ospf block
cumulus@spine01:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip ospf area 0
interface swp1
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp2
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp5
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp6
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
vrf default
exit-vrf
vrf mgmt
exit-vrf
router ospf
ospf router-id 10.10.10.101
timers throttle spf 0 100 6000
! end of router ospf block
cumulus@spine02:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip ospf area 0
interface swp1
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp2
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp5
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp6
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
vrf default
exit-vrf
vrf mgmt
exit-vrf
router ospf
ospf router-id 10.10.10.102
timers throttle spf 0 100 6000
! end of router ospf block
cumulus@border01:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip ospf area 0
interface swp51
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp52
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface vlan2001
ip ospf area 1
vrf default
exit-vrf
vrf mgmt
exit-vrf
router ospf
ospf router-id 10.10.10.63
timers throttle spf 0 100 6000
! end of router ospf block
cumulus@border02:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip ospf area 0
interface swp51
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface swp52
ip ospf area 0
ip ospf network point-to-point
ip ospf hello-interval 5
ip ospf dead-interval 60
interface vlan2001
ip ospf area 1
vrf default
exit-vrf
vrf mgmt
exit-vrf
router ospf
ospf router-id 10.10.10.64
timers throttle spf 0 100 6000
! end of router ospf block
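The frr.conf listings above carry the SPF timers on a single `timers throttle spf` line. Comparing the leaf01 NVUE YAML (delay 80, holdtime 100, max-holdtime 6000) with its frr.conf line (`timers throttle spf 80 100 6000`) suggests the order is delay, holdtime, max-holdtime. The following Python sketch splits such a line into named values under that assumption; the function name `parse_spf_throttle` is hypothetical and not part of Cumulus Linux or FRR.

```python
def parse_spf_throttle(line: str) -> dict[str, int]:
    """Split a 'timers throttle spf <delay> <holdtime> <max-holdtime>' line.

    Assumption: the three values appear in the order delay, holdtime,
    max-holdtime, as the leaf01 YAML and frr.conf listings suggest.
    """
    fields = line.split()
    if len(fields) != 6 or fields[:3] != ["timers", "throttle", "spf"]:
        raise ValueError(f"not a 'timers throttle spf' line: {line!r}")
    delay, holdtime, max_holdtime = (int(value) for value in fields[3:])
    return {"delay": delay, "holdtime": holdtime, "max-holdtime": max_holdtime}


print(parse_spf_throttle("timers throttle spf 80 100 6000"))
# {'delay': 80, 'holdtime': 100, 'max-holdtime': 6000}
```

Leading whitespace is ignored, so the function accepts the line whether or not it is indented in the file.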
This simulation starts with the example OSPF configuration. The demo is pre-configured using NVUE commands.
To validate the configuration, run the commands listed in the Troubleshooting section.
Virtual routing and forwarding (VRF) enables you to use multiple independent routing tables that work simultaneously on the same switch. Other implementations call this feature VRF-Lite.
You typically use VRFs in the data center to carry multiple isolated traffic streams for multi-tenant environments. The traffic streams can cross over only at configured boundary points, such as a firewall or IDS. You can also use VRFs to burst traffic from private clouds to enterprise networks where the burst point is at layer 3.
VRF is fully supported in the Linux kernel and has the following characteristics:
The VRF is a layer 3 master network device with its own associated routing table.
You can associate any layer 3 interface with a VRF, such as an SVI, swp port or bond, or a VLAN subinterface of a swp port or bond.
The layer 3 interfaces associated with the VRF belong to that VRF; IP rules direct FIB lookups to the routing table for the VRF device.
The VRF device can have its own IP address, known as a VRF-local loopback.
By default, applications on the switch run against the default VRF. Services started by systemd run in the default VRF unless you use the VRF instance.
Connected and local routes go in appropriate VRF tables.
Neighbor entries continue to be per-interface. You can view all entries for a VRF device.
A VRF does not map to its own network namespace; however, you can nest VRFs in a network namespace.
You can use existing Linux tools, such as tcpdump, to interact with a VRF.
Configure a VRF
Cumulus Linux calls each routing table a VRF table, which has its own table ID.
To configure VRF, you associate a subset of interfaces to a VRF routing table and configure an instance of the routing protocol (BGP or OSPFv2) for each routing table. Configuring a VRF is similar to configuring other network interfaces. Keep in mind the following:
A VRF table can have an IP address, which is a loopback interface for the VRF.
Cumulus Linux adds the associated rules automatically.
You can add a default route to avoid skipping across tables when the kernel forwards a packet.
VRF table names can be a maximum of 15 characters. However, you cannot use the name mgmt; Cumulus Linux uses this name for the management VRF.
Cumulus Linux supports up to 255 VRFs on a switch.
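You can check the naming and count limits listed above before you push a configuration. The following Python sketch is illustrative only; the names `validate_vrf_name` and `validate_vrf_plan` are hypothetical helpers, not part of Cumulus Linux or NVUE, and rejecting an empty name is an added assumption.

```python
MAX_VRF_NAME_LEN = 15           # maximum VRF table name length stated above
RESERVED_VRF_NAMES = {"mgmt"}   # reserved for the management VRF
MAX_VRFS = 255                  # maximum number of VRFs per switch stated above


def validate_vrf_name(name: str) -> None:
    """Raise ValueError if the name breaks one of the naming rules above."""
    if not name:
        # Assumption: an empty name is never valid.
        raise ValueError("VRF name must not be empty")
    if len(name) > MAX_VRF_NAME_LEN:
        raise ValueError(
            f"VRF name {name!r} is longer than {MAX_VRF_NAME_LEN} characters"
        )
    if name in RESERVED_VRF_NAMES:
        raise ValueError(f"VRF name {name!r} is reserved")


def validate_vrf_plan(names: list[str]) -> None:
    """Check every name and the total VRF count for a planned configuration."""
    if len(names) > MAX_VRFS:
        raise ValueError(f"{len(names)} VRFs exceed the limit of {MAX_VRFS}")
    for name in names:
        validate_vrf_name(name)


validate_vrf_plan(["RED", "BLUE"])  # passes silently
```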
The following example commands configure VRF BLUE and assign a table ID automatically.
cumulus@switch:~$ nv set vrf BLUE table auto
cumulus@switch:~$ nv set interface swp1 ip vrf BLUE
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file to add the VRF and assign a table ID automatically:
...
auto swp1
iface swp1
vrf BLUE
auto BLUE
iface BLUE
vrf-table auto
...
To load the new configuration, run ifreload -a:
cumulus@switch:~$ sudo ifreload -a
Specify a Table ID
Instead of assigning a table ID for the VRF automatically, you can specify your own table ID in the configuration. Cumulus Linux saves the table ID to name mapping in the /etc/iproute2/rt_tables.d/ directory. Instead of using the auto option as shown above, specify the table ID. For example:
cumulus@switch:~$ nv set vrf BLUE table 1016
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file:
...
auto swp1
iface swp1
vrf BLUE
auto BLUE
iface BLUE
vrf-table 1016
...
To load the new configuration, run ifreload -a:
cumulus@switch:~$ sudo ifreload -a
The table ID must be between 1001 and 1255. Cumulus Linux reserves this range for VRF table IDs.
Bring a VRF Up After You Run ifdown
If you take down a VRF using ifdown, run one of the following commands to bring the VRF back up:
ifup --with-depends <vrf-name>
ifreload -a
For example:
cumulus@switch:~$ sudo ifdown BLUE
cumulus@switch:~$ sudo ifup --with-depends BLUE
Use the vrf Command
Run the vrf command to show information about VRF tables not available in other Linux commands, such as iproute.
To show a list of VRF tables, run the vrf list command:
cumulus@switch:~$ vrf list
VRF Table
---------------- -----
BLUE 1016
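If you want to use this output in a script, a small parser is enough. The following Python sketch assumes the two-column layout shown above (a header row, a row of dashes, then one name and table ID per line); `parse_vrf_list` is a hypothetical helper name.

```python
def parse_vrf_list(output: str) -> dict[str, int]:
    """Map each VRF name to its table ID from 'vrf list' style output."""
    tables: dict[str, int] = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly two fields and a numeric table ID;
        # this skips the header row and the row of dashes.
        if len(fields) == 2 and fields[1].isdigit():
            tables[fields[0]] = int(fields[1])
    return tables


sample = """VRF              Table
---------------- -----
BLUE             1016
"""
print(parse_vrf_list(sample))  # {'BLUE': 1016}
```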
To show a list of processes and PIDs for a specific VRF table, run the ip vrf pids <vrf-name> command. For example:
cumulus@switch:~$ ip vrf pids BLUE
VRF: BLUE
-----------------------
dhclient 2508
sshd 2659
bash 2681
su 2702
bash 2720
vrf 2829
To determine which VRF table associates with a particular PID, run the ip vrf identify <pid> command. For example:
cumulus@switch:~$ ip vrf identify 2829
BLUE
IPv4 and IPv6 Commands in a VRF Context
You can execute non-VRF-specific Linux commands and perform other tasks against a given VRF table. This typically applies to single-use commands started from a login shell, because the VRF context affects only the AF_INET and AF_INET6 sockets that the command opens; it has no impact on the netlink sockets associated with the ip command.
To execute such a command against a VRF table, run ip vrf exec <vrf-name> <command>. For example, to SSH from the switch to a device accessible through VRF BLUE:
cumulus@switch:~$ sudo ip vrf exec BLUE ssh user@host
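When you call such commands from a script, it helps to build the argument list in one place. The following Python sketch only assembles the command in the `sudo ip vrf exec <vrf-name> <command>` form shown above and does not run it; `vrf_exec_argv` is a hypothetical helper name.

```python
import shlex


def vrf_exec_argv(vrf_name: str, command: list[str]) -> list[str]:
    """Build the argument list that runs a command in a VRF context."""
    if not command:
        raise ValueError("command must contain at least the program name")
    return ["sudo", "ip", "vrf", "exec", vrf_name, *command]


argv = vrf_exec_argv("BLUE", ["ssh", "user@host"])
print(shlex.join(argv))  # sudo ip vrf exec BLUE ssh user@host
```

You can pass the resulting list to `subprocess.run(argv, check=True)` on the switch; keeping the arguments as a list avoids shell quoting problems.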
Services in VRFs
For services that need to run against a specific VRF, Cumulus Linux uses systemd instances, where the instance is the VRF. You start a service within a VRF with the systemctl start <service>@<vrf-name> command. For example, to run the dhcpd service in the BLUE VRF:
cumulus@switch:~$ sudo systemctl start dhcpd@BLUE
In most cases, you need to stop the instance running in the default VRF before a VRF instance can start. This is because the instance running in the default VRF owns the port across all VRFs (it is VRF global). Cumulus Linux stops systemd-based services when you restart networking or run an ifdown/ifup sequence. Refer to management VRF for details.
The following services work with VRF instances:
chef-client
collectd
dhcpd
dhcrelay
hsflowd
netq-agent
ntp (can only run in the default or management VRF)
puppet
snmptrapd
ssh
zabbix-agent
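The list above, together with the `<service>@<vrf-name>` pattern, can be captured in a small helper that builds the systemd instance name and rejects combinations the list rules out. This Python sketch is an illustration only; `vrf_unit_name` is a hypothetical name, and the `.service` suffix follows the standard systemd unit naming.

```python
VRF_AWARE_SERVICES = {
    "chef-client", "collectd", "dhcpd", "dhcrelay", "hsflowd", "netq-agent",
    "ntp", "puppet", "snmptrapd", "ssh", "zabbix-agent",
}


def vrf_unit_name(service: str, vrf_name: str) -> str:
    """Return the systemd instance unit for a service in a non-default VRF."""
    if service not in VRF_AWARE_SERVICES:
        raise ValueError(f"{service!r} is not in the list of VRF-aware services")
    if vrf_name == "default":
        # Services run in the default VRF unless you use a VRF instance,
        # so there is no instance unit to build for it.
        raise ValueError("the default VRF does not use a systemd instance")
    if service == "ntp" and vrf_name != "mgmt":
        # ntp can only run in the default or management VRF.
        raise ValueError("ntp can only run in the default or management VRF")
    return f"{service}@{vrf_name}.service"


print(vrf_unit_name("dhcpd", "BLUE"))  # dhcpd@BLUE.service
```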
If systemd instances do not work, use a service-specific configuration option instead. For example, to configure rsyslogd to send messages to remote systems over a VRF:
action(type="omfwd" Target="hostname or ip here" Device="mgmt" Port=514
Protocol="udp")
VRF Route Leaking
You typically use VRFs when you want multiple independent routing and forwarding tables; however, you might want to reach destinations in one VRF from another VRF, as in the following cases:
To make a service, such as a firewall, available to multiple VRFs.
To enable routing to external networks or the Internet for multiple VRFs, where the external network itself is reachable through a specific VRF.
You can assign an interface to only one VRF; Cumulus Linux routes any packets arriving on that interface using the associated VRF routing table.
You cannot route leak overlapping addresses.
You can use VRF route leaking with EVPN in a symmetric routing configuration only.
You cannot use VRF route leaking between the tenant VRF and the default VRF with onlink next hops (BGP unnumbered).
Configure Route Leaking
With route leaking, a destination VRF wants to know the routes of a source VRF. As routes come and go in the source VRF, they dynamically leak to the destination VRF through BGP. If BGP learns the routes in the source VRF, you do not need to perform any additional configuration. If OSPF learns the routes in the source VRF, you configure the routes statically, or you need to reach directly connected networks, you must first redistribute the routes into BGP (in the source VRF).
You can also use route leaking to reach remote destinations as well as directly connected destinations in another VRF. Multiple VRFs can import routes from a single source VRF and a VRF can import routes from multiple source VRFs. You can use this method when a single VRF provides connectivity to external networks or a shared service for other VRFs. You can control the routes leaked dynamically across VRFs with a route map.
Because route leaking happens through BGP, the underlying mechanism relies on the BGP constructs of the Route Distinguisher (RD) and Route Targets (RTs). However, you do not need to configure these parameters; Cumulus Linux derives them automatically when you enable route leaking between a pair of VRFs.
When you use route leaking:
You cannot reach the loopback address of a VRF (the address assigned to the VRF device) from another VRF.
You must use the redistribute command in BGP to leak non-BGP routes (connected or static routes); you cannot use the network command.
Cumulus Linux does not leak routes in the management VRF with the next hop as eth0 or the management interface.
You can leak routes in a VRF that iBGP or multi-hop eBGP learns even if their next hops become unreachable. NVIDIA recommends route leaking for routes that BGP learns through single-hop eBGP.
You cannot configure VRF instances of BGP in multiple autonomous systems (AS) or an AS that is not the same as the global AS.
Do not use the default VRF as a shared service VRF. Create another VRF for shared services.
An EVPN symmetric routing configuration has certain limitations when leaking routes between the default VRF and non-default VRFs. The default VRF has routes to VTEP addresses that you cannot leak to any tenant VRFs. If you need to leak routes between the default VRF and a non-default VRF, you must filter out routes to the VTEP addresses to prevent leaking these routes. Use caution with such a configuration. Run common services in a separate VRF (service VRF) instead of the default VRF to simplify configuration and avoid using route maps for filtering.
Cumulus Linux does not copy extended communities to the destination VRF.
In the following example commands, routes in the BGP routing table of VRF BLUE dynamically leak into VRF RED.
cumulus@switch:~$ nv set vrf RED router bgp address-family ipv4-unicast route-import from-vrf list BLUE
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router bgp 65001 vrf RED
switch(config-router)# address-family ipv4 unicast
switch(config-router-af)# import vrf BLUE
switch(config-router-af)# end
switch# write memory
switch# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router bgp 65001 vrf RED
!
address-family ipv4 unicast
import vrf BLUE
...
Exclude Certain Prefixes
To exclude certain prefixes from the import process, configure the prefixes in a route map.
The following example configures a route map to match the source protocol BGP and imports the routes from VRF BLUE to VRF RED. For the imported routes, the community is 11:11 in VRF RED.
cumulus@switch:~$ nv set vrf RED router bgp address-family ipv4-unicast route-import from-vrf list BLUE
cumulus@switch:~$ nv set router policy route-map BLUEtoRED rule 10 match type ipv4
cumulus@switch:~$ nv set router policy route-map BLUEtoRED rule 10 match source-protocol bgp
cumulus@switch:~$ nv set router policy route-map BLUEtoRED rule 10 action permit
cumulus@switch:~$ nv set router policy route-map BLUEtoRED rule 10 set community 11:11
cumulus@switch:~$ nv set vrf RED router bgp address-family ipv4-unicast route-import from-vrf route-map BLUEtoRED
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router bgp 65001 vrf RED
switch(config-router)# address-family ipv4 unicast
switch(config-router-af)# import vrf BLUE
switch(config-router-af)# route-map BLUEtoRED permit 10
switch(config-route-map)# match source-protocol bgp
switch(config-route-map)# set community 11:11
switch(config-route-map)# exit
switch(config)# router bgp 65001 vrf RED
switch(config-router)# address-family ipv4 unicast
switch(config-router-af)# import vrf route-map BLUEtoRED
switch(config-router-af)# end
switch# write memory
switch# exit
Routes from eBGP Multihop Neighbors
If the routes you want to leak are connected routes sourced from an eBGP multihop neighbor, you must disable the next hop connection verification process for eBGP multihop peering sessions in the target VRF so that Cumulus Linux can add these routes to the routing table.
To disable the next hop connection verification process, you need to run vtysh commands; NVUE does not provide commands for this option.
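The following example disables the next hop connection verification process for eBGP multihop peering sessions in the target VRF BLUE:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router bgp 65001 vrf BLUE
switch(config-router)# bgp disable-ebgp-connected-route-check
switch(config-router)# end
switch# write memory
switch# exit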
If you need to force Cumulus Linux to reimport the routes into the target VRF, run the clear ip bgp vrf <source-vrf> * command on the VRF from which you are leaking routes.
Verify Route Leaking Configuration
To check the status of VRF route leaking, run the NVUE nv show vrf <vrf-name> router bgp address-family ipv4-unicast route-import command or the vtysh show ip bgp vrf <vrf-name> ipv4|ipv6 unicast route-leak command. For example:
cumulus@switch:~$ nv show vrf RED router bgp address-family ipv4-unicast route-import
operational applied
-------------- ------------ ---------
from-vrf
enable on
route-map BLUEtoRED
[list] BLUE BLUE
[route-target] 10.10.10.1:3
To show more detailed status information, you can run the following NVUE commands:
nv show vrf <vrf-name> router bgp address-family ipv4-unicast route-import from-vrf
nv show vrf <vrf-name> router bgp address-family ipv4-unicast route-import from-vrf list
nv show vrf <vrf-name> router bgp address-family ipv4-unicast route-import from-vrf list <leak-vrf-id>
To view the BGP routing table, run the NVUE nv show vrf <vrf-name> router bgp address-family ipv4-unicast command or the vtysh show ip bgp vrf <vrf-name> ipv4|ipv6 unicast command.
To view the FRR IP routing table, run the vtysh show ip route vrf <vrf-name> command or the net show route vrf <vrf-name> command. These commands show all routes, including routes leaked from other VRFs.
The following example commands show all routes in VRF RED, including routes leaked from VRF BLUE:
cumulus@switch:~$ sudo vtysh
switch# show ip route vrf RED
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
F - PBR,
> - selected route, * - FIB route
VRF RED:
K * 0.0.0.0/0 [255/8192] unreachable (ICMP unreachable), 6d07h01m
C>* 10.1.1.1/32 is directly connected, BLUE, 6d07h01m
B>* 10.0.100.1/32 [200/0] is directly connected, RED(vrf RED), 6d05h10m
B>* 10.0.200.0/24 [20/0] via 10.10.2.2, swp1.11, 5d05h10m
B>* 10.0.300.0/24 [200/0] via 10.20.2.2, swp1.21(vrf RED), 5d05h10m
C>* 10.10.2.0/30 is directly connected, swp1.11, 6d07h01m
C>* 10.10.3.0/30 is directly connected, swp2.11, 6d07h01m
C>* 10.10.4.0/30 is directly connected, swp3.11, 6d07h01m
B>* 10.20.2.0/30 [200/0] is directly connected, swp1.21(vrf RED), 6d05h10m
Delete Route Leaking Configuration
The following example commands delete leaked routes from VRF BLUE to VRF RED:
cumulus@switch:~$ nv unset vrf RED router bgp address-family ipv4-unicast route-import from-vrf list BLUE
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router bgp 65001 vrf RED
switch(config-router)# address-family ipv4 unicast
switch(config-router-af)# no import vrf BLUE
switch(config-router-af)# end
switch# write memory
switch# exit
Cumulus Linux no longer supports kernel commands. To avoid issues with VRF route leaking in FRR, do not use the kernel commands.
FRRouting in a VRF
Cumulus Linux supports BGP, OSPFv2 and static routing for both IPv4 and IPv6 within a VRF context. Various FRRouting routing constructs, such as routing tables, nexthops, router-id, and related processing are also VRF-aware.
FRR learns of VRFs on the system as well as interface attachment to a VRF through notifications from the kernel.
The following sections show example VRF configurations with BGP and OSPF. For an example VRF configuration with static routing, see static routing.
BGP
Because BGP is VRF-aware, Cumulus Linux supports per-VRF neighbors, both iBGP and eBGP, as well as numbered and unnumbered interfaces. Non-interface-based VRF neighbors bind to the VRF, so you can have overlapping address spaces in different VRFs. Each VRF can have its own parameters, such as address families and redistribution. Incoming connections rely on the Linux kernel for VRF-global sockets. You can track BGP neighbors with BFD, both for single and multiple hops. You can configure multiple BGP instances, associating each with a VRF.
The following example shows a BGP unnumbered interface configuration in VRF RED. In BGP unnumbered, there are no addresses on any interface. However, debugging tools like traceroute need at least a single IP address per node as the source IP address. Typically, this address is the loopback device. With VRF, you can associate an IP address with the VRF device, which acts as the loopback interface for that VRF.
cumulus@switch:~$ nv set vrf RED table auto
cumulus@switch:~$ nv set vrf RED loopback ip address 10.10.10.1/32
cumulus@switch:~$ nv set interface swp51 ip vrf RED
cumulus@switch:~$ nv set vrf RED router bgp router-id 10.10.10.1
cumulus@switch:~$ nv set vrf RED router bgp autonomous-system 65001
cumulus@switch:~$ nv set vrf RED router bgp neighbor swp51 remote-as external
cumulus@switch:~$ nv set vrf RED router bgp address-family ipv4-unicast redistribute connected enable on
cumulus@switch:~$ nv set vrf RED router bgp neighbor swp51 address-family ipv4-unicast enable on
cumulus@switch:~$ nv config apply
/etc/network/interfaces file configuration:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto RED
iface RED
address 10.10.10.1/32
vrf-table auto
auto swp51
iface swp51
vrf RED
...
OSPF
A VRF-aware OSPFv2 configuration supports numbered and unnumbered interfaces, and layer 3 interfaces such as SVIs, subinterfaces and physical interfaces. The VRF supports types 1 through 5 (ABR and ASBR - external LSAs) and types 9 through 11 (opaque LSAs) link state advertisements, redistribution of other routing protocols, connected and static routes, and route maps. You can track OSPF neighbors with BFD.
Cumulus Linux does not support multiple VRFs in multi-instance OSPF.
The following example shows an OSPF configuration in VRF RED.
cumulus@switch:~$ nv set vrf RED loopback ip address 10.10.10.1/32
cumulus@switch:~$ nv set interface swp51 ip address 10.0.1.0/31
cumulus@switch:~$ nv set vrf RED router ospf enable on
cumulus@switch:~$ nv set vrf RED router ospf router-id 10.10.10.1
cumulus@switch:~$ nv set vrf RED router ospf redistribute connected
cumulus@switch:~$ nv set vrf RED router ospf redistribute bgp
cumulus@switch:~$ nv set vrf RED router ospf area 0.0.0.0 network 10.10.10.1/32
cumulus@switch:~$ nv set vrf RED router ospf area 0.0.0.0 network 10.0.1.0/31
cumulus@switch:~$ nv config apply
The /etc/network/interfaces file configuration:
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto RED
iface RED
address 10.10.10.1/32
vrf-table auto
auto swp51
iface swp51
address 10.0.1.0/31
vtysh commands:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# router ospf vrf RED
switch(config-router)# ospf router-id 10.10.10.1
switch(config-router)# redistribute connected
switch(config-router)# redistribute bgp
switch(config-router)# network 10.10.10.1/32 area 0.0.0.0
switch(config-router)# network 10.0.1.0/31 area 0.0.0.0
switch(config-router)# end
switch# write memory
switch# exit
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
router ospf vrf RED
ospf router-id 10.10.10.1
network 10.10.10.1/32 area 0.0.0.0
network 10.0.1.0/31 area 0.0.0.0
redistribute connected
redistribute bgp
...
DHCP with VRF
Because you can use VRF to bind IPv4 and IPv6 sockets to non-default VRF tables, you can start DHCP servers and relays in any non-default VRF table using the dhcpd and dhcrelay services. systemd must manage these services and the /etc/vrf/systemd.conf file must list the services. By default, this file already lists these two services, as well as others. You can add more services as needed, such as dhcpd6 and dhcrelay6 for IPv6.
If you edit /etc/vrf/systemd.conf, run sudo systemctl daemon-reload to generate the systemd instance files for the newly added services. Then you can start the service in the VRF using systemctl start <service>@<vrf-name>.service, where <service> is the name of the service (such as dhcpd or dhcrelay) and <vrf-name> is the name of the VRF.
For example, to start the dhcrelay service after you configure a VRF named BLUE, run:
cumulus@switch:~$ sudo systemctl start dhcrelay@BLUE.service
In addition, you need to create a separate default file in the /etc/default directory for every instance of a DHCP server or relay in a non-default VRF. To run multiple instances of any of these services, you need a separate file for each instance. The files must have the following names:
isc-dhcp-server-<vrf-name>
isc-dhcp-server6-<vrf-name>
isc-dhcp-relay-<vrf-name>
isc-dhcp-relay6-<vrf-name>
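The naming scheme above is mechanical, so a script can derive the file path from the role, IP version and VRF name. The following Python sketch is illustrative; `dhcp_default_file` and the `role` and `ip_version` parameters are hypothetical names.

```python
DEFAULT_FILE_PREFIXES = {
    ("server", 4): "isc-dhcp-server",
    ("server", 6): "isc-dhcp-server6",
    ("relay", 4): "isc-dhcp-relay",
    ("relay", 6): "isc-dhcp-relay6",
}


def dhcp_default_file(role: str, ip_version: int, vrf_name: str) -> str:
    """Return the /etc/default path for one DHCP instance in a VRF."""
    try:
        prefix = DEFAULT_FILE_PREFIXES[(role, ip_version)]
    except KeyError:
        raise ValueError(
            f"unsupported combination: role={role!r}, ip_version={ip_version!r}"
        ) from None
    return f"/etc/default/{prefix}-{vrf_name}"


print(dhcp_default_file("server", 4, "RED"))   # /etc/default/isc-dhcp-server-RED
print(dhcp_default_file("server", 6, "BLUE"))  # /etc/default/isc-dhcp-server6-BLUE
```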
See the example configuration below for more details.
Cumulus Linux does not support DHCP server and relay across VRFs; the server and host cannot be in different VRF tables. In addition, the server and relay cannot be in different VRF tables.
Typically, a service running in the default VRF owns a port across all VRFs. If you prefer the VRF local instance, first disable and stop the global instance.
VRF is a layer 3 routing feature; only run programs that use AF_INET and AF_INET6 sockets in a VRF. VRF context does not affect any other aspects of the operation of a program.
This method only works with systemd-based services.
Example Configuration
In the following example, there is one IPv4 network with a VRF named RED and one IPv6 network with a VRF named BLUE.
IPv4 DHCP Server/relay network
IPv6 DHCP Server/relay network
Configure each DHCP server and relay as follows:
Create the file isc-dhcp-server-RED in /etc/default/. Here is sample content:
# Defaults for isc-dhcp-server initscript
# sourced by /etc/init.d/isc-dhcp-server
# installed at /etc/default/isc-dhcp-server by the maintainer scripts
#
# This is a POSIX shell fragment
#
# Path to dhcpd's config file (default: /etc/dhcp/dhcpd.conf).
DHCPD_CONF="-cf /etc/dhcp/dhcpd-RED.conf"
# Path to dhcpd's PID file (default: /var/run/dhcpd.pid).
DHCPD_PID="-pf /var/run/dhcpd-RED.pid"
# Additional options to start dhcpd with.
# Don't use options -cf or -pf here; use DHCPD_CONF/ DHCPD_PID instead
#OPTIONS=""
# On what interfaces should the DHCP server (dhcpd) serve DHCP requests?
# Separate multiple interfaces with spaces, e.g. "eth0 eth1".
INTERFACES="swp2"
cumulus@switch:~$ sudo ip vrf exec RED /usr/sbin/dhcpd -f -q -cf \
    /etc/dhcp/dhcpd-RED.conf -pf /var/run/dhcpd-RED.pid swp2
Create the file isc-dhcp-server6-BLUE in /etc/default/. Here is sample content:
# Defaults for isc-dhcp-server initscript
# sourced by /etc/init.d/isc-dhcp-server
# installed at /etc/default/isc-dhcp-server by the maintainer scripts
#
# This is a POSIX shell fragment
#
# Path to dhcpd's config file (default: /etc/dhcp/dhcpd.conf).
DHCPD_CONF="-cf /etc/dhcp/dhcpd6-BLUE.conf"
# Path to dhcpd's PID file (default: /var/run/dhcpd.pid).
DHCPD_PID="-pf /var/run/dhcpd6-BLUE.pid"
# Additional options to start dhcpd with.
# Don't use options -cf or -pf here; use DHCPD_CONF/ DHCPD_PID instead
#OPTIONS=""
# On what interfaces should the DHCP server (dhcpd) serve DHCP requests?
# Separate multiple interfaces with spaces, e.g. "eth0 eth1".
INTERFACES="swp3"
cumulus@switch:~$ sudo ip vrf exec BLUE dhcpd -6 -q -cf \
    /etc/dhcp/dhcpd6-BLUE.conf -pf /var/run/dhcpd6-BLUE.pid swp3
Create the file isc-dhcp-relay-RED in /etc/default/. Here is sample content:
# Defaults for isc-dhcp-relay initscript
# sourced by /etc/init.d/isc-dhcp-relay
# installed at /etc/default/isc-dhcp-relay by the maintainer scripts
#
# This is a POSIX shell fragment
#
# What servers should the DHCP relay forward requests to?
SERVERS="102.0.0.2"
# On what interfaces should the DHCP relay (dhcrelay) serve DHCP requests?
# Always include the interface towards the DHCP server.
# This variable requires a -i for each interface configured above.
# This will be used in the actual dhcrelay command
# For example, "-i eth0 -i eth1"
INTF_CMD="-i swp2s2 -i swp2s3"
# Additional options that are passed to the DHCP relay daemon?
OPTIONS=""
cumulus@switch:~$ sudo ip vrf exec RED /usr/sbin/dhcrelay -d -q -i \
    swp2s2 -i swp2s3 102.0.0.2
Create the file isc-dhcp-relay6-BLUE in /etc/default/. Here is sample content:
# Defaults for isc-dhcp-relay initscript
# sourced by /etc/init.d/isc-dhcp-relay
# installed at /etc/default/isc-dhcp-relay by the maintainer scripts
#
# This is a POSIX shell fragment
#
# What servers should the DHCP relay forward requests to?
#SERVERS="103.0.0.2"
# On what interfaces should the DHCP relay (dhcrelay) serve DHCP requests?
# Always include the interface towards the DHCP server.
# This variable requires a -i for each interface configured above.
# This will be used in the actual dhcrelay command
# For example, "-i eth0 -i eth1"
INTF_CMD="-l swp18s0 -u swp18s1"
# Additional options that are passed to the DHCP relay daemon?
OPTIONS="-pf /var/run/dhcrelay6@BLUE.pid"
cumulus@switch:~$ sudo ip vrf exec BLUE /usr/sbin/dhcrelay -d -q -6 -l \
    swp18s0 -u swp18s1 -pf /var/run/dhcrelay6@BLUE.pid
Use ping or traceroute on a VRF
You can run ping or traceroute on a VRF from the default VRF.
To ping a VRF from the default VRF, run the ping -I <vrf-name> command. For example:
cumulus@switch:~$ ping -I BLUE
To run traceroute on a VRF from the default VRF, run the traceroute -i <vrf-name> command. For example:
cumulus@switch:~$ sudo traceroute -i BLUE
Troubleshooting
You can use vtysh or Linux show commands to troubleshoot VRFs.
To show all VRFs learned by FRR from the kernel, run the show vrf command. The table ID shows the corresponding routing table in the kernel.
cumulus@switch:~$ sudo vtysh
...
switch# show vrf
vrf RED id 14 table 1012
vrf BLUE id 21 table 1013
To show the VRFs configured in BGP (including the default VRF), run the show bgp vrfs command. A non-zero ID is a VRF that you define in the /etc/network/interfaces file.
cumulus@switch:~$ sudo vtysh
...
switch# show bgp vrfs
Type Id RouterId #PeersCfg #PeersEstb Name
DFLT 0 6.0.0.7 0 0 Default
VRF 14 6.0.2.7 6 6 RED
VRF 21 6.0.3.7 6 6 BLUE
Total number of VRFs (including default): 3
To show interfaces known to FRR and attached to a specific VRF, run the show interface vrf <vrf-name> command. For example:
cumulus@switch:~$ sudo vtysh
switch# show interface vrf vrf1012
Interface br2 is up, line protocol is down
PTM status: disabled
vrf: RED
index 13 metric 0 mtu 1500
flags: <UP,BROADCAST,MULTICAST>
inet 20.7.2.1/24
inet6 fe80::202:ff:fe00:a/64
ND advertised reachable time is 0 milliseconds
ND advertised retransmit interval is 0 milliseconds
ND router advertisements are sent every 600 seconds
ND router advertisements lifetime tracks ra-interval
ND router advertisement default router preference is medium
Hosts use stateless autoconfig for addresses.
To show VRFs configured in OSPF, run the show ip ospf vrfs command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show ip ospf vrfs
Name Id RouterId
Default-IP-Routing-Table 0 0.0.0.0
RED 57 0.0.0.10
BLUE 58 0.0.0.20
Total number of OSPF VRFs (including default): 3
To show all OSPF routes in a VRF, run the show ip ospf vrf all route command. For example:
cumulus@switch:~$ sudo vtysh
...
switch# show ip ospf vrf all route
============ OSPF network routing table ============
N 7.0.0.0/24 [10] area: 0.0.0.0
directly attached to swp2
============ OSPF router routing table =============
============ OSPF external routing table ===========
============ OSPF network routing table ============
N 8.0.0.0/24 [10] area: 0.0.0.0
directly attached to swp1
============ OSPF router routing table =============
============ OSPF external routing table ===========
To see the routing table for each VRF, use the show ip route vrf all command. The OSPF route is in the row that starts with O.
cumulus@switch:~$ sudo vtysh
...
switch# show ip route vrf all
Codes: K - kernel route, C - connected, S - static, R - RIP,
O - OSPF, I - IS-IS, B - BGP, P - PIM, E - EIGRP, N - NHRP,
T - Table, v - VNC, V - VNC-Direct, A - Babel,
> - selected route, * - FIB route
VRF BLUE:
K>* 0.0.0.0/0 [0/8192] unreachable (ICMP unreachable)
O 7.0.0.0/24 [110/10] is directly connected, swp2, 00:28:35
C>* 7.0.0.0/24 is directly connected, swp2
C>* 7.0.0.5/32 is directly connected, BLUE
C>* 7.0.0.100/32 is directly connected, BLUE
C>* 50.1.1.0/24 is directly connected, swp31s1
VRF RED:
K>* 0.0.0.0/0 [0/8192] unreachable (ICMP unreachable)
O 8.0.0.0/24 [110/10] is directly connected, swp1, 00:23:26
C>* 8.0.0.0/24 is directly connected, swp1
C>* 8.0.0.5/32 is directly connected, RED
C>* 8.0.0.100/32 is directly connected, RED
C>* 50.0.1.0/24 is directly connected, swp31s0
To list all VRFs, and include the VRF ID and table ID, run the ip -d link show type vrf command. For example:
cumulus@switch:~$ ip -d link show type vrf
14: vrf1012: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 46:96:c7:64:4d:fa brd ff:ff:ff:ff:ff:ff promiscuity 0
vrf table 1012 addrgenmode eui64
21: vrf1013: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether 7a:8a:29:0f:5e:52 brd ff:ff:ff:ff:ff:ff promiscuity 0
vrf table 1013 addrgenmode eui64
28: vrf1014: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 1000
link/ether e6:8c:4d:fc:eb:b1 brd ff:ff:ff:ff:ff:ff promiscuity 0
vrf table 1014 addrgenmode eui64
To show the interfaces attached to a specific VRF, run the ip -d link show vrf <vrf-name> command. For example:
cumulus@switch:~$ ip -d link show vrf vrf1012
8: swp1.2@swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf1012 state UP mode DEFAULT group default
link/ether 00:02:00:00:00:07 brd ff:ff:ff:ff:ff:ff promiscuity 0
vlan protocol 802.1Q id 2 <REORDER_HDR>
vrf_slave addrgenmode eui64
9: swp2.2@swp2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf1012 state UP mode DEFAULT group default
link/ether 00:02:00:00:00:08 brd ff:ff:ff:ff:ff:ff promiscuity
vlan protocol 802.1Q id 2 <REORDER_HDR>
vrf_slave addrgenmode eui64
10: swp3.2@swp3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf1012 state UP mode DEFAULT group default
link/ether 00:02:00:00:00:09 brd ff:ff:ff:ff:ff:ff promiscuity 0
vlan protocol 802.1Q id 2 <REORDER_HDR>
vrf_slave addrgenmode eui64
11: swp4.2@swp4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf1012 state UP mode DEFAULT group default
link/ether 00:02:00:00:00:0a brd ff:ff:ff:ff:ff:ff promiscuity 0
vlan protocol 802.1Q id 2 <REORDER_HDR>
vrf_slave addrgenmode eui64
12: swp5.2@swp5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master vrf1012 state UP mode DEFAULT group default
link/ether 00:02:00:00:00:0b brd ff:ff:ff:ff:ff:ff promiscuity 0
vlan protocol 802.1Q id 2 <REORDER_HDR>
vrf_slave addrgenmode eui64
13: br2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue master vrf1012 state DOWN mode DEFAULT group default
link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff promiscuity 0
bridge forward_delay 100 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768
vlan_filtering 0 vlan_protocol 802.1Q bridge_id 8000.0:0:0:0:0:0 designated_root 8000.0:0:0:0:0:0
root_port 0 root_path_cost 0 topology_change 0 topology_change_detected 0 hello_timer 0.00
tcn_timer 0.00 topology_change_timer 0.00 gc_timer 202.23 vlan_default_pvid 1 group_fwd_mask 0
group_address 01:80:c2:00:00:00 mcast_snooping 1 mcast_router 1 mcast_query_use_ifaddr 0 mcast_querier 0
mcast_hash_elasticity 4096 mcast_hash_max 4096 mcast_last_member_count 2 mcast_startup_query_count 2
mcast_last_member_interval 100 mcast_membership_interval 26000 mcast_querier_interval 25500
mcast_query_interval 12500 mcast_query_response_interval 1000 mcast_startup_query_interval 3125
nf_call_iptables 0 nf_call_ip6tables 0 nf_call_arptables 0
vrf_slave addrgenmode eui64
To show IPv4 routes in a VRF, run the ip route show table <vrf-name> command. For example:
cumulus@switch:~$ ip route show table RED
unreachable default metric 240
broadcast 20.7.2.0 dev br2 proto kernel scope link src 20.7.2.1 dead linkdown
20.7.2.0/24 dev br2 proto kernel scope link src 20.7.2.1 dead linkdown
local 20.7.2.1 dev br2 proto kernel scope host src 20.7.2.1
broadcast 20.7.2.255 dev br2 proto kernel scope link src 20.7.2.1 dead linkdown
broadcast 169.254.2.8 dev swp1.2 proto kernel scope link src 169.254.2.9
169.254.2.8/30 dev swp1.2 proto kernel scope link src 169.254.2.9
local 169.254.2.9 dev swp1.2 proto kernel scope host src 169.254.2.9
broadcast 169.254.2.11 dev swp1.2 proto kernel scope link src 169.254.2.9
broadcast 169.254.2.12 dev swp2.2 proto kernel scope link src 169.254.2.13
169.254.2.12/30 dev swp2.2 proto kernel scope link src 169.254.2.13
local 169.254.2.13 dev swp2.2 proto kernel scope host src 169.254.2.13
broadcast 169.254.2.15 dev swp2.2 proto kernel scope link src 169.254.2.13
broadcast 169.254.2.16 dev swp3.2 proto kernel scope link src 169.254.2.17
169.254.2.16/30 dev swp3.2 proto kernel scope link src 169.254.2.17
local 169.254.2.17 dev swp3.2 proto kernel scope host src 169.254.2.17
broadcast 169.254.2.19 dev swp3.2 proto kernel scope link src 169.254.2.17
To show IPv6 routes in a VRF, run the ip -6 route show table <vrf-name> command. For example:
cumulus@switch:~$ ip -6 route show table RED
local fe80:: dev lo proto none metric 0 pref medium
local fe80:: dev lo proto none metric 0 pref medium
local fe80:: dev lo proto none metric 0 pref medium
local fe80:: dev lo proto none metric 0 pref medium
local fe80::202:ff:fe00:7 dev lo proto none metric 0 pref medium
local fe80::202:ff:fe00:8 dev lo proto none metric 0 pref medium
local fe80::202:ff:fe00:9 dev lo proto none metric 0 pref medium
local fe80::202:ff:fe00:a dev lo proto none metric 0 pref medium
fe80::/64 dev br2 proto kernel metric 256 dead linkdown pref medium
fe80::/64 dev swp1.2 proto kernel metric 256 pref medium
fe80::/64 dev swp2.2 proto kernel metric 256 pref medium
fe80::/64 dev swp3.2 proto kernel metric 256 pref medium
ff00::/8 dev br2 metric 256 dead linkdown pref medium
ff00::/8 dev swp1.2 metric 256 pref medium
ff00::/8 dev swp2.2 metric 256 pref medium
ff00::/8 dev swp3.2 metric 256 pref medium
unreachable default dev lo metric 240 error -101 pref medium
To see a list of links associated with a particular VRF table, run the ip link list <vrf-name> command. For example:
cumulus@switch:~$ ip link list RED
VRF: RED
--------------------
swp1.10@swp1 UP 6c:64:1a:00:5a:0c <BROADCAST,MULTICAST,UP,LOWER_UP>
swp2.10@swp2 UP 6c:64:1a:00:5a:0d <BROADCAST,MULTICAST,UP,LOWER_UP>
To see a list of routes associated with a particular VRF table, run the ip route list <vrf-name> command. For example:
cumulus@switch:~$ ip route list RED
VRF: RED
--------------------
unreachable default metric 8192
10.1.1.0/24 via 10.10.1.2 dev swp2.10
10.1.2.0/24 via 10.99.1.2 dev swp1.10
broadcast 10.10.1.0 dev swp2.10 proto kernel scope link src 10.10.1.1
10.10.1.0/28 dev swp2.10 proto kernel scope link src 10.10.1.1
local 10.10.1.1 dev swp2.10 proto kernel scope host src 10.10.1.1
broadcast 10.10.1.15 dev swp2.10 proto kernel scope link src 10.10.1.1
broadcast 10.99.1.0 dev swp1.10 proto kernel scope link src 10.99.1.1
10.99.1.0/30 dev swp1.10 proto kernel scope link src 10.99.1.1
local 10.99.1.1 dev swp1.10 proto kernel scope host src 10.99.1.1
broadcast 10.99.1.3 dev swp1.10 proto kernel scope link src 10.99.1.1
local fe80:: dev lo proto none metric 0 pref medium
local fe80:: dev lo proto none metric 0 pref medium
local fe80::6e64:1aff:fe00:5a0c dev lo proto none metric 0 pref medium
local fe80::6e64:1aff:fe00:5a0d dev lo proto none metric 0 pref medium
fe80::/64 dev swp1.10 proto kernel metric 256 pref medium
fe80::/64 dev swp2.10 proto kernel metric 256 pref medium
ff00::/8 dev swp1.10 metric 256 pref medium
ff00::/8 dev swp2.10 metric 256 pref medium
unreachable default dev lo metric 8192 error -101 pref medium
You can also show routes in a VRF using the ip [-6] route show vrf <vrf-name> command. This command omits local and broadcast routes, which can clutter the output.
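If you already have output from ip route show table <vrf-name>, a filter can reduce the same clutter. This is a minimal sketch: the function name strip_local_broadcast is hypothetical, and the filter only approximates the shorter listing by dropping lines that begin with local or broadcast.

```shell
#!/bin/sh
# Sketch only: drop local and broadcast entries from
# "ip route show table <vrf-name>" output read on stdin.
strip_local_broadcast() {
    grep -Ev '^(local|broadcast) '
}

# Sample lines copied from the RED table output shown earlier.
strip_local_broadcast <<'EOF'
unreachable default metric 240
broadcast 20.7.2.0 dev br2 proto kernel scope link src 20.7.2.1 dead linkdown
20.7.2.0/24 dev br2 proto kernel scope link src 20.7.2.1 dead linkdown
local 20.7.2.1 dev br2 proto kernel scope host src 20.7.2.1
EOF
```

On a switch, the equivalent pipeline would be ip route show table RED | strip_local_broadcast.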
Considerations
Cumulus Linux bases table selection on the incoming interface only; packet attributes or output-interface-based selection are not available.
Setting the router ID outside of BGP using the router-id option causes all BGP instances to get the same router ID. If you want each BGP instance to have its own router ID, specify the router-id under the BGP instance using bgp router-id. If you specify both router-id and bgp router-id, the ID under the BGP instance overrides the one you provide outside BGP.
When you take down a VRF using ifdown, Cumulus Linux removes all routes associated with that VRF from FRR but it does not remove the routes from the kernel.
Management VRF
Management VRF is a subset of Virtual Routing and Forwarding (VRF) that separates the out-of-band management network from the in-band data plane network. For VRFs, the main routing table is the default table for the data plane switch ports. With management VRF, the switch uses a second table, mgmt, for routing through the Ethernet ports of the switch. The name mgmt is special-cased to distinguish the management VRF from a data plane VRF.
Cumulus Linux only supports eth0 (or eth1, depending on the switch platform) for out-of-band management. The Ethernet ports are software-only ports that are not hardware accelerated by switchd. VLAN subinterfaces, bonds, bridges, and the front panel switch ports are not supported as OOB management interfaces.
In-band management of Cumulus Linux is possible using loopbacks and SVIs (switch virtual interfaces).
Cumulus Linux enables Management VRF by default. IPv4 and IPv6 networking applications (for example, Ansible, Chef, and apt-get) run by an administrator communicate out the management network by default. This default context does not impact services run through systemd and the systemctl command, and does not impact commands examining the state of the switch, such as the ip command to list links, neighbors, or routes.
The management VRF configurations in this section contain a localhost loopback IPv4 address of 127.0.0.1/8 and IPv6 address of ::1/128. Management VRF must have an IPv6 address as well as an IPv4 address to work correctly. Adding the loopback address to the layer 3 domain of the management VRF prevents issues with applications that expect the loopback IP address to exist in the VRF, such as NTP.
Bring Up the Management VRF
If you take down the management VRF using ifdown, to bring it back up you need to do one of two things:
Run the ifup --with-depends mgmt command
Run the ifreload -a command
The following command example brings down the management VRF, then brings it back up with the ifup --with-depends mgmt command:
cumulus@switch:~$ sudo ifdown mgmt
cumulus@switch:~$ sudo ifup --with-depends mgmt
Running ifreload -a disconnects the session for any interface configured as auto.
Run Services within the Management VRF
At installation, the only two enabled services that run in the management VRF are NTP (ntp@mgmt.service) and netqd (netqd@mgmt). However, you can run a variety of services within the management VRF instead of the default VRF. When you run a systemd service inside the management VRF, that service runs only on eth0. You cannot configure the same service to run in both the management VRF and the default VRF; you must stop and disable the normal service with systemctl.
You must disable the following services in the default VRF if you want to run them in the management VRF:
chef-client
collectd
hsflowd
netq-agent
netq-notifier
puppet
snmpd
snmptrapd
ssh
zabbix-agent
You can configure certain services (such as snmpd) to use multiple routing tables, some in the management VRF, some in the default or additional VRFs. The kernel provides a sysctl that allows a single instance to accept connections over all VRFs.
For TCP, connected sockets bind to the VRF on which the first packet arrives.
The following steps show how to enable the SNMP service to run in the management VRF. You can enable any of the services listed above, except for dhcrelay (see DHCP Relays).
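The steps can be sketched as follows, assuming the <service>@<vrf-name>.service naming that systemd uses for VRF instances. The helper name move_service_to_vrf is hypothetical, and the final enable step (to keep the instance across reboots) is an assumption. The function only prints the commands so that you can review them before you run them.

```shell
#!/bin/sh
# Sketch only: print (do not run) the systemctl commands that move a
# service from the default VRF into the named VRF.
move_service_to_vrf() {
    service=$1
    vrf=$2
    # Stop and disable the instance in the default VRF first.
    echo "sudo systemctl stop ${service}.service"
    echo "sudo systemctl disable ${service}.service"
    # Start the per-VRF instance; enabling it is an assumed extra step.
    echo "sudo systemctl start ${service}@${vrf}.service"
    echo "sudo systemctl enable ${service}@${vrf}.service"
}

move_service_to_vrf snmpd mgmt
```

The order matters: the same service cannot run in both the default VRF and the management VRF, so the default instance stops before the VRF instance starts.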
Run the following command to show the process IDs associated with the management VRF:
cumulus@switch:~$ ip vrf pids mgmt
1149 ntpd
1159 login
1227 bash
16178 vi
948 dhclient
20934 sshd
20975 bash
21343 sshd
21384 bash
21477 ip
Run the following command to show the VRF association of the specified process:
cumulus@switch:~$ ip vrf identify 2055
mgmt
Run ip vrf help for additional ip vrf commands.
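To check from a script whether a particular daemon runs in a VRF, you can search the ip vrf pids output. This is a minimal sketch: the function name vrf_has_process is hypothetical, and it assumes the two-column output (process ID, then process name) shown in the example above.

```shell
#!/bin/sh
# Sketch only: succeed if a process with the given name appears in
# "ip vrf pids <vrf-name>" output read on stdin.
vrf_has_process() {
    awk -v name="$1" '$2 == name { found = 1 } END { exit found ? 0 : 1 }'
}

# Sample lines copied from the ip vrf pids mgmt output above.
if printf '1149 ntpd\n948 dhclient\n20934 sshd\n' | vrf_has_process ntpd; then
    echo "ntpd runs in the VRF"
fi
```

On a switch, the equivalent check would be ip vrf pids mgmt | vrf_has_process snmpd.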
Enable Polling with snmpd in a Management VRF
When you enable snmpd to run in the management VRF, you need to specify that VRF so that snmpd listens on eth0 in the management VRF; you can also configure snmpd to listen on other ports. In Cumulus Linux, SNMP configuration is VRF aware so snmpd can bind to multiple IP addresses each configured with a particular VRF (routing table). The snmpd daemon responds to polling requests on the interfaces of the VRF on which the request comes in. For information about configuring SNMP version 1, 2c, and 3 Traps and (v3) Inform messages, refer to Simple Network Management Protocol - SNMP.
The message Duplicate IPv4 address detected, some interfaces may not be visible in IP-MIB displays after starting snmpd in the management VRF. This is because the IP-MIB assumes that you cannot use the same IP address twice on the same device; the IP-MIB is not VRF aware. This message is a warning that the SNMP IP-MIB detects overlapping IP addresses on the system; it does not indicate a problem and does not impact the operation of the switch.
ping or traceroute on the Management VRF
By default, when you issue a ping or traceroute, the packet goes to the data plane network (the main routing table). To use ping or traceroute on the management network, use ping -I mgmt or traceroute -i mgmt. To select a source address within the management VRF, use the -s flag for traceroute.
To run services in the management VRF as a non-root user, you need to create a custom service based on the original service file. The following example commands configure the SSH service to run in the management VRF as a non-root user.
Run the following command to create a custom service file in the /etc/systemd/system directory.
FRR is VRF-aware and sends packets based on the switch port routing table. This includes BGP peering through loopback interfaces. BGP looks up routes in the default table. However, depending on how you redistribute your routes, you can perform the following modification.
Management VRF uses the mgmt table, including local routes. This does not affect route redistribution when you use routing protocols, such as OSPF and BGP.
To redistribute the routes in your network, use the redistribute connected command under BGP or OSPF. This enables the directly connected network out of eth0 to advertise to its neighbor.
This also creates a route on the neighbor device to the management network through the data plane.
NVIDIA recommends route maps to control advertised networks that you redistribute with the redistribute connected command.
cumulus@switch:~$ nv set router policy route-map REDISTRIBUTE rule 10 match type ipv4
cumulus@switch:~$ nv set router policy route-map REDISTRIBUTE rule 10 match interface eth0
cumulus@switch:~$ nv set router policy route-map REDISTRIBUTE rule 10 action deny
cumulus@switch:~$ nv set vrf default router bgp address-family ipv4-unicast redistribute connected route-map REDISTRIBUTE
cumulus@switch:~$ nv config apply
To limit SSH to listen on only the management VRF or to a specific IP address on the management VRF, see SSH and VRFs.
If you SSH to the switch through a switch port, SSH works as expected. If you need to SSH from the switch out of a switch port, use the ip vrf exec default ssh <switch-port-ip-address> command. For example:
cumulus@switch:~$ sudo ip vrf exec default ssh 10.3.3.3
View the Routing Tables
When you use ip route get to return information about a single route, the command resolves over the mgmt table by default. To show information about the route in the switching silicon, run this command:
cumulus@switch:~$ ip route get <ip-address>
To get the route for any VRF, run the ip route get <ip-address> oif <vrf-name> command. For example, to show the route for the management VRF, run:
cumulus@switch:~$ ip route get <ip-address> oif mgmt
mgmt Interface Class
ifupdown2 uses interface classes to create a user-defined grouping for interfaces. The special class mgmt is available to separate the management interfaces of the switch from the data interfaces. This allows you to manage the data interfaces by default using ifupdown2 commands. Performing operations on the mgmt interfaces requires specifying the --allow=mgmt option, which prevents inadvertent outages on the management interfaces. By default, Cumulus Linux brings up all interfaces in both the auto (default) class and the mgmt interface class when the switch boots.
You configure the management interface in the /etc/network/interfaces file. The example below adds the management interface eth0 and the management VRF stanzas to the mgmt interface class:
...
auto lo
iface lo inet loopback
allow-mgmt eth0
iface eth0 inet dhcp
vrf mgmt
allow-mgmt mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
...
When you run ifupdown2 commands against the interfaces in the mgmt class, include --allow=mgmt with the commands. For example, to see which interfaces are in the mgmt interface class, run:
cumulus@switch:~$ ifquery -l --allow=mgmt
eth0
mgmt
To reload the configurations for interfaces in the mgmt class, run:
cumulus@switch:~$ sudo ifreload --allow=mgmt
You can still bring the management interface up and down using ifup eth0 and ifdown eth0.
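To see how class membership follows from the configuration file, the following sketch extracts the names on allow-mgmt lines. It is only an illustration: the function name mgmt_class_members is hypothetical, it does not follow source directives or other files, and ifquery remains the authoritative tool.

```shell
#!/bin/sh
# Sketch only: print every interface named on an "allow-mgmt" line of an
# interfaces file read on stdin.
mgmt_class_members() {
    awk '$1 == "allow-mgmt" { for (i = 2; i <= NF; i++) print $i }'
}

# Stanzas copied from the /etc/network/interfaces example above.
mgmt_class_members <<'EOF'
auto lo
iface lo inet loopback
allow-mgmt eth0
iface eth0 inet dhcp
    vrf mgmt
allow-mgmt mgmt
iface mgmt
    address 127.0.0.1/8
    address ::1/128
    vrf-table auto
EOF
```

For the sample stanzas, the sketch prints eth0 and mgmt, the same members that the ifquery example reports.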
Management VRF and DNS
Cumulus Linux supports both DHCP and static DNS entries over management VRF through IP FIB rules, which it adds to direct lookups to the DNS addresses out of the management VRF.
For DNS to use the management VRF, the static DNS entries must reference the management VRF in the /etc/resolv.conf file. You cannot specify the same DNS server address twice to associate it with different VRFs.
For example, to specify DNS servers and associate some of them with the management VRF, run the following commands:
cumulus@switch:~$ nv set service dns default server 192.0.2.1
cumulus@switch:~$ nv set service dns mgmt server 198.51.100.31
cumulus@switch:~$ nv set service dns mgmt server 203.0.113.13
cumulus@switch:~$ nv config apply
Edit the /etc/resolv.conf file to add the DNS servers and associate some of them with the management VRF. For example:
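A minimal sketch of the entries, assuming that a trailing # vrf <vrf-name> comment on a nameserver line is what associates that server with a VRF. The helper name nameserver_line is hypothetical, and the addresses are the ones used in the NVUE commands above.

```shell
#!/bin/sh
# Sketch only: print one resolv.conf nameserver line. The optional second
# argument names the VRF; the "# vrf" comment syntax is an assumption.
nameserver_line() {
    if [ -n "${2:-}" ]; then
        printf 'nameserver %s # vrf %s\n' "$1" "$2"
    else
        printf 'nameserver %s\n' "$1"
    fi
}

nameserver_line 192.0.2.1
nameserver_line 198.51.100.31 mgmt
nameserver_line 203.0.113.13 mgmt
```

Each server address appears only once, which matches the rule that you cannot list the same DNS server twice for different VRFs.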
Run the ifreload -a command to load the new configuration:
cumulus@switch:~$ ifreload -a
Because FIB rules force DNS lookups out of the management interface, this can affect data plane ports if you use overlapping addresses. For example, when the switch learns the DNS server IP address over the management VRF, it creates a FIB rule for that IP address. When DHCP relay has the same IP address, the switch forwards any DHCP discover packet arriving on the front panel port out of the management interface (eth0) even though a route is present out the front-panel port.
If you do not specify a DNS server and you lose in-band connectivity, DNS does not work through the management VRF. Cumulus Linux does not assume all DNS servers are reachable through the management VRF.
Protocol Independent Multicast - PIM
PIM is a multicast control plane protocol that advertises multicast sources and receivers over a routed layer 3 network. Layer 3 multicast relies on PIM to advertise information about multicast capable routers, and the location of multicast senders and receivers. Multicast does not go through a routed network without PIM.
PIM operates in PIM-SM or PIM-DM mode. Cumulus Linux supports PIM-SM only.
PIM-SM is a pull multicast distribution method; multicast traffic only goes through the network if receivers explicitly ask for it. When a receiver pulls multicast traffic, it must notify the network periodically that it wants to continue the multicast stream.
PIM-SM has three configuration options:
ASM relies on a multicast rendezvous point (RP) to connect multicast senders and receivers that dynamically determine the shortest path through the network.
SSM requires multicast receivers to know from which source they want to receive multicast traffic instead of relying on an RP.
BiDir forwards all traffic through the RP instead of tracking multicast source IPs, which allows for greater scale but can cause inefficient traffic forwarding.
Cumulus Linux does not support IPv6 multicast routing with PIM.
Example PIM Topology
The following illustration shows a basic PIM ASM configuration:
leaf01 is the FHR, which controls the PIM register process. The FHR is the device to which the multicast sources connect.
leaf02 is the LHR, which is the last router in the path and attaches to an interested multicast receiver.
spine01 is the RP, which receives multicast data from sources and forwards traffic down a shared distribution tree to the receivers.
Basic PIM Configuration
To configure PIM:
Enable PIM on all interfaces that connect to a multicast source or receiver, and on the interface with the RP address.
With NVUE, you must also run the nv set router pim enable on command to enable and start the PIM service. This is not required for vtysh configuration.
Enable IGMP on all interfaces that attach to a host and all interfaces that attach to a multicast receiver. IGMP version 3 is the default. Only specify the version if you want to use IGMP version 2. For SSM, you must use IGMP version 3.
For ASM, on each PIM enabled switch, specify the IP address of the RP for the multicast group. You can also configure PIM to send traffic from specific multicast groups to specific RPs.
SSM uses prefix lists to configure a receiver to only allow traffic to a multicast address from a single source. This removes the need for an RP because the receiver must know the source before accepting traffic. To enable SSM, you only need to enable PIM and IGMPv3 on the interfaces.
These example commands configure leaf01, leaf02 and spine01 as shown in the topology example above.
cumulus@leaf01:~$ nv set router pim enable on
cumulus@leaf01:~$ nv set interface vlan10 router pim
cumulus@leaf01:~$ nv set interface vlan10 ip igmp
cumulus@leaf01:~$ nv set interface swp51 router pim
cumulus@leaf01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set router pim enable on
cumulus@leaf02:~$ nv set interface vlan20 router pim
cumulus@leaf02:~$ nv set interface vlan20 ip igmp
cumulus@leaf02:~$ nv set interface swp51 router pim
cumulus@leaf02:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101
cumulus@leaf02:~$ nv config apply
cumulus@spine01:~$ nv set router pim enable on
cumulus@spine01:~$ nv set interface swp1 router pim
cumulus@spine01:~$ nv set interface swp2 router pim
cumulus@spine01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101
cumulus@spine01:~$ nv config apply
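The three command sets follow one pattern, which the following sketch captures. The helper name pim_nvue_commands is hypothetical; it only prints NVUE commands of the form shown above for a given RP address and interface lists, and it does not run them.

```shell
#!/bin/sh
# Sketch only: print the NVUE commands for one switch.
# $1 = RP address
# $2 = space-separated interfaces that run PIM
# $3 = space-separated interfaces that run IGMP (can be empty)
pim_nvue_commands() {
    rp=$1
    echo "nv set router pim enable on"
    for ifname in $2; do
        echo "nv set interface $ifname router pim"
    done
    for ifname in $3; do
        echo "nv set interface $ifname ip igmp"
    done
    echo "nv set vrf default router pim address-family ipv4-unicast rp $rp"
    echo "nv config apply"
}

# leaf01 from the topology above: PIM on vlan10 and swp51, IGMP on vlan10.
pim_nvue_commands 10.10.10.101 "vlan10 swp51" "vlan10"
```

For spine01, which has no attached hosts, pass an empty IGMP list: pim_nvue_commands 10.10.10.101 "swp1 swp2" "".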
The FRR package includes PIM. For proper PIM operation, PIM depends on Zebra. You must configure unicast routing and a routing protocol or static routes.
Edit the /etc/frr/daemons file and add pimd=yes to the end of the file:
Restarting FRR restarts all the routing protocol daemons that are enabled and running.
In the vtysh shell, run the following commands to configure the PIM interfaces. PIM must be on all interfaces facing multicast sources or multicast receivers, as well as on the interface with the RP address.
For ASM, configure a group mapping for a static RP:
spine01(config)# ip pim rp 10.10.10.101
spine01(config)# end
spine01# write memory
spine01# exit
The above commands configure the switch to send all multicast traffic to RP 10.10.10.101. The following commands configure PIM to send traffic from multicast group 224.10.0.0/16 to RP 10.10.10.101 and traffic from multicast group 224.10.2.0/24 to RP 10.10.10.102:
cumulus@leaf01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101 group-range 224.10.0.0/16
cumulus@leaf01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.102 group-range 224.10.2.0/24
cumulus@spine01:~$ sudo vtysh
...
spine01# configure terminal
spine01(config)# ip pim rp 10.10.10.101 224.10.0.0/16
spine01(config)# ip pim rp 10.10.10.102 224.10.2.0/24
spine01(config)# end
spine01# exit
The following commands use a prefix list to configure PIM to send traffic from multicast group 224.10.0.0/16 to RP 10.10.10.101 and traffic from multicast group 224.10.2.0/24 to RP 10.10.10.102:
cumulus@leaf01:~$ nv set router policy prefix-list MCAST1 rule 1 action permit
cumulus@leaf01:~$ nv set router policy prefix-list MCAST1 rule 1 match 224.10.0.0/16
cumulus@leaf01:~$ nv set router policy prefix-list MCAST2 rule 1 action permit
cumulus@leaf01:~$ nv set router policy prefix-list MCAST2 rule 1 match 224.10.2.0/24
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101 prefix-list MCAST1
cumulus@leaf01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.102 prefix-list MCAST2
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
...
spine01# configure terminal
spine01(config)# ip prefix-list MCAST1 seq 1 permit 224.10.0.0/16
spine01(config)# ip prefix-list MCAST2 seq 1 permit 224.10.2.0/24
spine01(config)# ip pim rp 10.10.10.101 prefix-list MCAST1
spine01(config)# ip pim rp 10.10.10.102 prefix-list MCAST2
spine01(config)# end
spine01# exit
You can either configure RP mappings for different multicast groups or use a prefix list to specify the RP to group mapping. You cannot use both methods at the same time.
NVIDIA recommends that you do not use a spine switch as an RP when using eBGP in a Clos network. See the PIM Overview knowledge-base article.
zebra does not resolve the next hop for the RP through the default route. To prevent multicast forwarding from failing, either provide a specific route to the RP or run the vtysh ip nht resolve-via-default configuration command to resolve the next hop for the RP through the default route.
Optional PIM Configuration
This section describes optional configuration procedures.
ASM SPT Infinity
When the LHR receives the first multicast packet, it sends a PIM (S,G) join towards the FHR to forward traffic through the network. This builds the SPT, or the tree that is the shortest path to the source. When the traffic arrives over the SPT, a PIM (S,G) RPT prune goes up the shared tree towards the RP. This removes multicast traffic from the shared tree; multicast data only goes over the SPT.
You can configure SPT switchover per group (SPT infinity), which allows for some groups to never switch to a shortest path tree. The LHR now sends both (*,G) joins and (S,G) RPT prune messages towards the RP.
When you use a prefix list in Cumulus Linux to match a multicast group destination address (GDA) range, you must include the /32 operator. In the NVUE command example below, max-prefix-len 32 after the group match range specifies the /32 operator. In the vtysh command example, ge 32 after the group permit range specifies the /32 operator.
To configure a group to never follow the SPT, create the necessary prefix lists, then configure SPT switchover for the prefix list:
cumulus@switch:~$ nv set router policy prefix-list SPTrange rule 1 match 235.0.0.0/8 max-prefix-len 32
cumulus@switch:~$ nv set router policy prefix-list SPTrange rule 1 action permit
cumulus@switch:~$ nv set router policy prefix-list SPTrange rule 2 match 238.0.0.0/8 max-prefix-len 32
cumulus@switch:~$ nv set router policy prefix-list SPTrange rule 2 action permit
cumulus@switch:~$ nv set vrf default router pim address-family ipv4-unicast spt-switchover prefix-list SPTrange
cumulus@switch:~$ nv set vrf default router pim address-family ipv4-unicast spt-switchover action infinity
cumulus@switch:~$ nv config apply
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# ip prefix-list spt-range permit 235.0.0.0/8 ge 32
switch(config)# ip prefix-list spt-range permit 238.0.0.0/8 ge 32
switch(config)# ip pim spt-switchover infinity prefix-list spt-range
switch(config)# end
switch# exit
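To see why the prefix list needs ge 32 (or max-prefix-len 32 in NVUE), the following Python sketch models one prefix-list entry. It assumes the common ge/le semantics: a candidate matches if it falls inside the entry prefix and its prefix length is within the allowed range, and an entry with neither ge nor le matches only the exact prefix length. The entry_matches function is a hypothetical helper, not an FRR or NVUE API.

```python
import ipaddress

def entry_matches(entry_prefix, candidate, ge=None, le=None):
    """Return True if candidate (a prefix such as 235.1.2.3/32) matches the entry."""
    entry = ipaddress.ip_network(entry_prefix)
    cand = ipaddress.ip_network(candidate)
    if not cand.subnet_of(entry):
        return False
    if ge is None and le is None:
        # Without ge/le, only the exact prefix length matches.
        return cand.prefixlen == entry.prefixlen
    low = ge if ge is not None else entry.prefixlen
    high = le if le is not None else 32
    return low <= cand.prefixlen <= high

if __name__ == "__main__":
    group = "235.1.2.3/32"  # a single multicast group is a /32
    print(entry_matches("235.0.0.0/8", group))          # no ge/le
    print(entry_matches("235.0.0.0/8", group, ge=32))   # with ge 32
```

In this model, a single group address such as 235.1.2.3 is a /32, so it matches 235.0.0.0/8 only when the entry allows prefix lengths up to 32.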
To view the configured prefix list, run the vtysh show ip mroute command. The following command shows that SPT switchover (pimreg) is on 235.0.0.0.
cumulus@switch:~$ sudo vtysh
...
switch# show ip mroute
Source Group Proto Input Output TTL Uptime
* 235.0.0.0 IGMP swp1 pimreg 1 00:03:3
IGMP vlan10 1 00:03:38
* 238.0.0.0 IGMP swp1 vlan10 1 00:02:08
SSM Multicast Group Ranges
232.0.0.0/8 is the default multicast group range reserved for SSM. To modify the SSM multicast group range, define a prefix list and apply it. You can change (expand) the default group or add additional groups to this range.
You must include 232.0.0.0/8 in the prefix list as this is the reserved SSM range. Using a prefix-list, you can expand the SSM range but all devices in the source tree must agree on the SSM range. When you use a prefix list in Cumulus Linux to match a multicast group destination address (GDA) range, you must include the /32 operator. In the NVUE command example below, max-prefix-len 32 after the group match range specifies the /32 operator. In the vtysh command example, ge 32 after the group permit range specifies the /32 operator.
Create a prefix list with the permit keyword to match address ranges that you want to treat as multicast groups and the deny keyword for the address ranges you do not want to treat as multicast groups:
cumulus@switch:~$ nv set router policy prefix-list MyCustomSSMrange rule 5 match 232.0.0.0/8 max-prefix-len 32
cumulus@switch:~$ nv set router policy prefix-list MyCustomSSMrange rule 5 action permit
cumulus@switch:~$ nv set router policy prefix-list MyCustomSSMrange rule 10 match 238.0.0.0/8 max-prefix-len 32
cumulus@switch:~$ nv set router policy prefix-list MyCustomSSMrange rule 10 action permit
Create a prefix list with the permit keyword to match address ranges that you want to treat as multicast groups and the deny keyword for the address ranges you do not want to treat as multicast groups:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# ip prefix-list ssm-range seq 5 permit 232.0.0.0/8 ge 32
switch(config)# ip prefix-list ssm-range seq 10 permit 238.0.0.0/8 ge 32
To view the configured prefix lists, run the vtysh show ip prefix-list my-custom-ssm-range command:
switch# show ip prefix-list my-custom-ssm-range
ZEBRA: ip prefix-list my-custom-ssm-range: 1 entries
seq 5 permit 232.0.0.0/8 ge 32
PIM: ip prefix-list my-custom-ssm-range: 1 entries
seq 10 permit 232.0.0.0/8 ge 32
PIM and ECMP
PIM uses RPF to choose an upstream interface to build a forwarding state. If you configure ECMP, PIM chooses the RPF based on the ECMP hash algorithm.
You can configure PIM to use all the available next hops when installing mroutes. For example, if you have four-way ECMP, PIM spreads the S,G and *,G mroutes across the four different paths.
You can also configure PIM to recalculate all stream paths over one of the ECMP paths if the switch loses a path. Otherwise, only the streams that are using the lost path move to alternate ECMP paths. This recalculation does not affect existing groups.
Recalculating all stream paths over one of the ECMP paths can cause some packet loss.
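The difference between the two behaviors can be sketched in Python. This is only an illustration: it assumes that each stream maps to a next hop through a hash of the source and group, and it uses zlib.crc32 as a stand-in for the hash algorithm the switch actually uses. The function names and interface names are hypothetical.

```python
import zlib

def pick_next_hop(source, group, next_hops):
    """Map one (S,G) stream to a next hop with an illustrative hash."""
    key = f"{source},{group}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

def assign(streams, next_hops):
    """Spread all streams across the available next hops."""
    return {s: pick_next_hop(s[0], s[1], next_hops) for s in streams}

def handle_path_loss(assignment, lost, next_hops, rebalance):
    """Return the new assignment after the path called lost goes away."""
    survivors = [nh for nh in next_hops if nh != lost]
    updated = {}
    for stream, nh in assignment.items():
        if rebalance or nh == lost:
            # Recalculate this stream over the remaining paths.
            updated[stream] = pick_next_hop(stream[0], stream[1], survivors)
        else:
            # Streams on healthy paths stay where they are.
            updated[stream] = nh
    return updated

if __name__ == "__main__":
    hops = ["swp51", "swp52", "swp53", "swp54"]
    streams = [(f"10.1.10.{i}", f"239.1.1.{i}") for i in range(1, 21)]
    before = assign(streams, hops)
    for mode in (False, True):
        after = handle_path_loss(before, "swp52", hops, rebalance=mode)
        moved = sum(1 for s in streams if before[s] != after[s])
        print(f"rebalance={mode}: {moved} of {len(streams)} streams moved")
```

With rebalance off, only the streams that used the lost path move. With rebalance on, every stream is recalculated, which is why some packet loss is possible.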
To configure PIM to use all the available next hops when installing mroutes:
cumulus@switch:~$ nv set vrf default router pim ecmp enable on
cumulus@switch:~$ nv config apply
To recalculate all stream paths over one of the ECMP paths if the switch loses a path:
cumulus@switch:~$ nv set vrf default router pim ecmp rebalance on
cumulus@switch:~$ nv config apply
To show the next hop for a specific source or group, run the vtysh show ip pim nexthop command:
cumulus@switch:~$ sudo vtysh
...
switch# show ip pim nexthop
Number of registered addresses: 3
Address Interface Nexthop
-------------------------------------------
6.0.0.9 swp31s0 169.254.0.9
6.0.0.9 swp31s1 169.254.0.25
6.0.0.11 lo 0.0.0.0
6.0.0.10 swp31s0 169.254.0.9
6.0.0.10 swp31s1 169.254.0.25
IP Multicast Boundaries
Use multicast boundaries to limit the distribution of multicast traffic and push multicast to a subset of the network. With boundaries in place, the switch drops or accepts incoming IGMP or PIM joins according to a prefix list. To configure the boundary, apply an IP multicast boundary OIL (outgoing interface list) on an interface.
First create a prefix list consisting of multicast group addresses, then run the following commands:
You can use MSDP to connect multiple PIM-SM multicast domains using the PIM-SM RPs. If you configure anycast RPs with the same IP address on multiple multicast switches (on the loopback interface), you can use more than one RP per multicast group.
When an RP discovers a new source (a PIM-SM register message), it sends an SA message to each MSDP peer. The peer then determines if there are any interested receivers.
Cumulus Linux supports MSDP for anycast RP, not multiple multicast domains. You must configure each MSDP peer in a full mesh. The switch does not forward received SA messages.
Cumulus Linux only supports one MSDP mesh group.
The following steps configure a Cumulus switch to use MSDP:
Add an anycast IP address to the loopback interface for each RP in the domain:
cumulus@rp01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@rp01:~$ nv set interface lo ip address 10.100.100.100/32
On every multicast switch, configure the group to RP mapping using the anycast address:
cumulus@switch:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.100.100.100 group-range 224.0.0.0/4
cumulus@switch:~$ nv config apply
Configure the MSDP mesh group for all active RPs. The following example uses three RPs:
The mesh group must include all RPs in the domain as members, with a unique address as the source. This configuration results in MSDP peerings between all RPs.
Inject the anycast IP address into the IGP of the domain. If the network uses unnumbered BGP as the IGP, avoid using the anycast IP address to establish unicast or multicast peerings. For PIM-SM, ensure that you use the unique address as the PIM hello source by setting the source:
cumulus@rp01:~$ nv set interface lo router pim address-family ipv4-unicast use-source 10.100.100.100
cumulus@rp01:~$ nv config apply
Edit the /etc/network/interfaces file to add an anycast IP address to the loopback interface for each RP in the domain. For example:
cumulus@rp01:~$ sudo nano /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.101/32
address 10.100.100.100/32
...
Run the ifreload -a command to load the new configuration:
cumulus@rp01:~$ sudo ifreload -a
On every multicast switch, configure the group to RP mapping using the anycast address:
cumulus@rp01:~$ sudo vtysh
...
rp01# configure terminal
rp01(config)# ip pim rp 10.100.100.100 224.0.0.0/4
Configure the MSDP mesh group for all active RPs (the following example uses three RPs):
The mesh group must include all RPs in the domain as members, with a unique address as the source. This configuration results in MSDP peerings between all RPs.
rp01(config)# ip msdp mesh-group cumulus member 100.1.1.2
rp01(config)# ip msdp mesh-group cumulus member 100.1.1.3
rp02(config)# ip msdp mesh-group cumulus member 100.1.1.1
rp02(config)# ip msdp mesh-group cumulus member 100.1.1.3
rp03(config)# ip msdp mesh-group cumulus member 100.1.1.1
rp03(config)# ip msdp mesh-group cumulus member 100.1.1.2
Pick the local loopback address as the source of the MSDP control packets:
rp01(config)# ip msdp mesh-group cumulus source 10.10.10.101
rp02(config)# ip msdp mesh-group cumulus source 10.10.10.102
rp03(config)# ip msdp mesh-group cumulus source 10.10.10.103
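Because every RP must list every other RP as a member, the per-switch commands are easy to get wrong as the number of RPs grows. The following Python sketch generates the vtysh lines for a full mesh from one list of unique RP addresses. The mesh_group_commands helper is hypothetical, and the sketch assumes that each RP uses the same unique address both as its own source and as the member address its peers configure.

```python
def mesh_group_commands(group, rp_addresses):
    """Return {local_rp: [vtysh config lines]} for a full MSDP mesh."""
    commands = {}
    for local in rp_addresses:
        # Every other RP is a member of the mesh group on this switch.
        lines = [
            f"ip msdp mesh-group {group} member {peer}"
            for peer in rp_addresses
            if peer != local
        ]
        # The unique local address is the source of MSDP control packets.
        lines.append(f"ip msdp mesh-group {group} source {local}")
        commands[local] = lines
    return commands

if __name__ == "__main__":
    rps = ["10.10.10.101", "10.10.10.102", "10.10.10.103"]
    for rp, lines in mesh_group_commands("cumulus", rps).items():
        print(f"# on the RP with address {rp}")
        print("\n".join(lines))
```

For three RPs, each switch gets two member lines and one source line, which results in three MSDP peerings in total.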
Inject the anycast IP address into the IGP of the domain. If the network uses unnumbered BGP as the IGP, avoid using the anycast IP address to establish unicast or multicast peerings. For PIM-SM, ensure that you use the unique address as the PIM hello source by setting the source:
rp01(config)# interface lo
rp01(config-if)# ip pim use-source 100.100.100.100
rp01(config-if)# end
rp01# write memory
rp01# exit
PIM in a VRF
VRFs divide the routing table on a per-tenant basis to provide separate layer 3 networks over a single layer 3 infrastructure. With a VRF, each tenant has its own virtualized layer 3 network so IP addresses can overlap between tenants.
PIM in a VRF enables PIM trees and multicast data traffic to run inside a layer 3 virtualized network, with a separate tree per domain or tenant. Each VRF has its own multicast tree with its own RPs, sources, and so on. Therefore, you can have one tenant per corporate division, client, or product.
If you do not enable MP-BGP MPLS VPN, VRFs on different switches typically connect or peer over subinterfaces, where each subinterface is in its own VRF.
To configure PIM in a VRF:
Add the VRFs and associate them with switch ports:
cumulus@switch:~$ nv set vrf RED
cumulus@switch:~$ nv set vrf BLUE
cumulus@switch:~$ nv set interface swp1 ip vrf RED
cumulus@switch:~$ nv set interface swp2 ip vrf BLUE
Add PIM configuration:
cumulus@switch:~$ nv set interface swp1 router pim
cumulus@switch:~$ nv set interface swp2 router pim
cumulus@switch:~$ nv set vrf RED router bgp autonomous-system 65001
cumulus@switch:~$ nv set vrf BLUE router bgp autonomous-system 65000
cumulus@switch:~$ nv set vrf RED router bgp router-id 10.1.1.1
cumulus@switch:~$ nv set vrf BLUE router bgp router-id 10.1.1.2
cumulus@switch:~$ nv set vrf RED router bgp neighbor swp1 remote-as external
cumulus@switch:~$ nv set vrf BLUE router bgp neighbor swp2 remote-as external
cumulus@switch:~$ nv config apply
Edit the /etc/network/interfaces file to add the VRFs and associate them with switch ports, then run ifreload -a to reload the configuration.
cumulus@switch:~$ sudo nano /etc/network/interfaces
...
auto swp1
iface swp1
vrf RED
auto swp2
iface swp2
vrf BLUE
auto RED
iface RED
vrf-table auto
auto BLUE
iface BLUE
vrf-table auto
...
You can use BFD for PIM neighbors to detect link failures. When you configure an interface, include the pim bfd option. The following example commands configure BFD between leaf01 and spine01:
cumulus@leaf01:~$ nv set interface swp51 router pim bfd enable on
cumulus@leaf01:~$ nv config apply
cumulus@spine01:~$ nv set interface swp1 router pim bfd enable on
cumulus@spine01:~$ nv config apply
To begin receiving multicast traffic for a group, a receiver expresses its interest in the group by sending an IGMP membership report on its connected LAN. The LHR receives this report and begins to build a multicast routing tree back towards the source. To build this tree, another router known both to the LHR and to the multicast source needs to exist to act as an RP for senders and receivers. The LHR looks up the RP for the group specified by the receiver and sends a PIM Join message towards the RP. Per RFC 7761, intermediary routers between the LHR and the RP must check that the RP for the group matches the one in the PIM Join and, if it does not, drop the Join.
In some configurations, it is desirable to configure the LHR with an RP address that does not match the actual RP address for the group. In this case, you must configure the upstream routers to accept the Join and propagate it towards the appropriate RP for the group, ignoring the mismatched RP address in the PIM Join and replacing it with its own RP for the group.
You can configure the switch to allow joins from all upstream neighbors or you can provide a prefix list so that the switch only accepts joins with an upstream neighbor address.
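The decision an upstream router makes for each received Join can be sketched as follows. This Python example is a simplified model, not the FRR implementation; it assumes that the optional prefix list is evaluated against the RP address carried in the Join. The accept_join function and its parameters are hypothetical.

```python
import ipaddress

def accept_join(join_rp, local_rp, allow_rp=False, allowed_rp_prefixes=None):
    """Decide whether to accept a (*,G) Join whose RP may differ from ours."""
    if join_rp == local_rp:
        return True   # Normal case: the RP in the Join matches.
    if not allow_rp:
        return False  # Default behavior: drop the mismatched Join.
    if allowed_rp_prefixes is None:
        return True   # The RP check is ignored for all joins.
    addr = ipaddress.ip_address(join_rp)
    # Ignore the RP check only for RP addresses that the prefix list permits.
    return any(addr in ipaddress.ip_network(p) for p in allowed_rp_prefixes)

if __name__ == "__main__":
    print(accept_join("10.10.10.200", "10.10.10.101"))
    print(accept_join("10.10.10.200", "10.10.10.101", allow_rp=True))
    print(accept_join("10.10.10.200", "10.10.10.101", allow_rp=True,
                      allowed_rp_prefixes=["10.10.10.200/32"]))
```

When the router accepts a mismatched Join, it propagates the Join towards its own RP for the group, as described above.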
The following example command configures PIM to ignore the RP check for all upstream neighbors:
cumulus@switch:~$ nv set interface swp50 router pim address-family ipv4-unicast allow-rp enable on
cumulus@switch:~$ nv config apply
The following example command configures PIM to only ignore the RP check for RP addresses in the prefix list called allowRP:
hello-interval
The interval in seconds at which the PIM router sends hello messages to discover PIM neighbors and maintain PIM neighbor relationships. You can specify a value between 1 and 180. The default setting is 30 seconds. With vtysh, you set the hello interval for a specific PIM enabled interface. With NVUE, you can set the hello interval globally for all PIM enabled interfaces or for a specific PIM enabled interface.
holdtime
The number of seconds during which the neighbor must be in a reachable state. auto (the default setting) uses three and a half times the hello-interval. You can specify a value between 1 and 180. With vtysh, you set the holdtime for a specific PIM enabled interface. With NVUE, you can set the holdtime globally for all PIM enabled interfaces or for a specific PIM enabled interface.
join-prune-interval
The interval in seconds at which a PIM router sends join/prune messages to its upstream neighbors for a state update. You can specify a value between 60 and 600. The default setting is 60 seconds. You set the join-prune-interval globally for all PIM enabled interfaces. NVUE also provides the option of setting the join-prune-interval for a specific VRF.
keep-alive
The timeout value for the S,G stream in seconds. You can specify a value between 31 and 60000. The default setting is 210 seconds. You can set the keep-alive timer globally for all PIM enabled interfaces or for a specific VRF.
register-suppress
The number of seconds during which to stop sending register messages to the RP. You can specify a value between 5 and 60000. The default setting is 60 seconds. You can set the register-suppress timer globally for all PIM enabled interfaces or for a specific VRF.
rp-keep-alive
NVUE only. The timeout value for the RP in seconds. You can specify a value between 31 and 60000. The default setting is 185 seconds. You can set the rp-keep-alive timer globally for all PIM enabled interfaces or for a specific VRF.
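The ranges and the automatic holdtime described above can be collected into a short Python sketch for checking planned values before you apply them. The TIMER_RANGES table, check_timer, and auto_holdtime are hypothetical helpers; the ranges and the 3.5 multiplier come from the descriptions in this section.

```python
# Valid ranges in seconds, as listed in the timer descriptions above.
TIMER_RANGES = {
    "hello-interval": (1, 180),
    "holdtime": (1, 180),
    "join-prune-interval": (60, 600),
    "keep-alive": (31, 60000),
    "register-suppress": (5, 60000),
    "rp-keep-alive": (31, 60000),
}

def check_timer(name, value):
    """Return True if value is inside the documented range for the timer."""
    low, high = TIMER_RANGES[name]
    return low <= value <= high

def auto_holdtime(hello_interval):
    """Holdtime that the auto setting derives from the hello interval."""
    return 3.5 * hello_interval

if __name__ == "__main__":
    planned = {"join-prune-interval": 100, "keep-alive": 10000,
               "register-suppress": 20000}
    for name, value in planned.items():
        print(name, value, "ok" if check_timer(name, value) else "out of range")
    print("auto holdtime for the default hello interval:", auto_holdtime(30))
```

With the default hello interval of 30 seconds, the auto holdtime works out to 105 seconds.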
The following example commands set the join-prune-interval to 100 seconds, the keep-alive timer to 10000 seconds, and the register-suppress time to 20000 seconds globally for all PIM enabled interfaces:
cumulus@switch:~$ nv set router pim timers join-prune-interval 100
cumulus@switch:~$ nv set router pim timers keep-alive 10000
cumulus@switch:~$ nv set router pim timers register-suppress 20000
cumulus@switch:~$ nv config apply
The following example commands set the hello-interval to 60 seconds for swp51:
The following example commands set the rp-keep-alive to 10000 for VRF RED:
cumulus@switch:~$ nv set vrf RED router pim timers rp-keep-alive 10000
cumulus@switch:~$ nv config apply
The following example commands set the join-prune-interval to 100 seconds, the keep-alive timer to 10000 seconds, and the register-suppress time to 20000 seconds globally for all PIM enabled interfaces:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# ip pim join-prune-interval 100
switch(config)# ip pim keep-alive-timer 10000
switch(config)# ip pim register-suppress-time 20000
switch(config)# end
switch# write memory
switch# exit
The following example command sets the hello-interval to 60 seconds and the holdtime to 120 for swp51:
The following example command sets the keep-alive-timer to 10000 seconds for VRF RED:
cumulus@switch:~$ sudo vtysh
...
switch# configure terminal
switch(config)# vrf RED
switch(config-vrf)# ip pim keep-alive-timer 10000
switch(config-vrf)# end
switch# write memory
switch# exit
Improve Multicast Convergence
For large multicast environments, the default CoPP policer might be too restrictive. You can adjust the policer to improve multicast convergence.
The default PIM forwarding rate and burst rate is set to 2000 packets per second.
The default IGMP forwarding rate and burst rate is set to 1000 packets per second.
To adjust the policer:
The following example commands set the PIM forwarding and burst rate to 400 packets per second:
cumulus@switch:~$ nv set system control-plane policer pim-ospf-rip rate 400
cumulus@switch:~$ nv set system control-plane policer pim-ospf-rip burst 400
cumulus@switch:~$ nv config apply
The following example commands set the IGMP forwarding rate to 400 and the IGMP burst rate to 200 packets per second:
cumulus@switch:~$ nv set system control-plane policer igmp rate 400
cumulus@switch:~$ nv set system control-plane policer igmp burst 200
cumulus@switch:~$ nv config apply
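To build intuition for how the rate and burst values interact, the following Python sketch models a policer as a token bucket. This is an assumption made for illustration only; this section does not describe how the switch implements the policer. The police function is hypothetical.

```python
def police(arrival_times, rate, burst):
    """Count packets accepted by a token bucket with the given rate and burst.

    arrival_times: packet arrival times in seconds, in ascending order.
    rate: tokens (packets) added per second.
    burst: bucket depth in packets; the bucket starts full.
    """
    tokens = float(burst)
    last = 0.0
    accepted = 0
    for t in arrival_times:
        # Refill for the elapsed time, never beyond the bucket depth.
        tokens = min(float(burst), tokens + (t - last) * rate)
        last = t
        if tokens >= 1:
            tokens -= 1
            accepted += 1
    return accepted

if __name__ == "__main__":
    # 1000 packets arriving at the same instant: only the burst gets through.
    print(police([0.0] * 1000, rate=400, burst=200))
    # A steady 100 packets per second stays under the rate.
    print(police([i / 100 for i in range(100)], rate=400, burst=200))
```

In this model, the burst value limits how many packets arriving at once are accepted, and the rate value limits the sustained packets per second.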
Edit the /etc/cumulus/control-plane/policers.conf file:
To tune the PIM forwarding and burst rate, change the copp.pim_ospf_rip.rate and copp.pim_ospf_rip.burst parameters.
To tune the IGMP forwarding and burst rate, change the copp.igmp.rate and copp.igmp.burst parameters.
The following example changes the PIM forwarding rate and the PIM burst rate to 400 packets per second, the IGMP forwarding rate to 400 packets per second and the IGMP burst rate to 200 packets per second:
When a multicast sender attaches to an MLAG bond, the sender hashes the outbound multicast traffic over a single member of the bond. Traffic arrives on one of the MLAG enabled switches. Regardless of which switch receives the traffic, it goes over the MLAG peer link to the other MLAG-enabled switch, because the peerlink is always the multicast router port and always receives the multicast stream.
Traffic from multicast sources attached to an MLAG bond always goes over the MLAG peerlink. Be sure to size the peerlink appropriately to accommodate this traffic.
The PIM DR for the VLAN where the source resides sends the PIM register towards the RP. The PIM DR is the PIM speaker with the highest IP address on the segment. After the PIM register process is complete and traffic is flowing along the SPT, either MLAG switch forwards traffic towards the receivers.
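The DR selection rule stated above (highest IP address on the segment) can be sketched in Python. The example assumes that all PIM speakers use the same DR priority, so only the address decides the election. The elect_dr function is a hypothetical helper.

```python
import ipaddress

def elect_dr(speakers):
    """Return the PIM speaker with the numerically highest IP address."""
    return max(speakers, key=ipaddress.ip_address)

if __name__ == "__main__":
    segment = ["10.1.10.2", "10.1.10.3"]
    print(elect_dr(segment))
    # Compare addresses numerically, not as strings.
    print(elect_dr(["10.1.10.9", "10.1.10.10"]))
```

The second call returns 10.1.10.10. A plain string comparison would pick 10.1.10.9 instead, which is why the sketch converts each address before comparing.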
PIM joins sent towards the source can be ECMP load shared by upstream PIM neighbors. Either MLAG member can receive the PIM join and forward traffic, regardless of DR status.
A dual-attached multicast receiver sends an IGMP join on the attached VLAN. One of the MLAG switches receives the IGMP join, then adds the IGMP join to the IGMP Join table and layer 2 MDB table. The layer 2 MDB table, like the unicast MAC address table, synchronizes through MLAG control messages over the peerlink. This allows both MLAG switches to program IGMP and MDB table forwarding information. Both switches send *,G PIM Join messages towards the RP. If the source is already sending, both MLAG switches receive the multicast stream.
Traditionally, the PIM DR is the only node to send the PIM *,G Join. To provide resiliency in case of failure, both MLAG switches send PIM *,G Joins towards the RP to receive the multicast stream.
To prevent duplicate multicast packets, PIM elects a DF, which is the primary member of the MLAG pair. The MLAG secondary switch does not put the VLAN in the OIL, which prevents duplicate multicast traffic.
Example Traffic Flow
The examples below show the flow of traffic between server02 and server03:
Step 1
1. server02 sends traffic to leaf02.
2. leaf02 forwards traffic to leaf01 because the peerlink is a multicast router port.
3. spine01 receives a PIM register from leaf01, the DR.
4. leaf02 syncs the *,G table from leaf01 as an MLAG active-active peer.
Step 2
1. leaf02 has the *,G route indicating that it must forward traffic towards spine01.
2. Either leaf02 or leaf01 sends this traffic directly based on which MLAG switch receives it from the attached source.
3. In this case, leaf02 receives the traffic on the MLAG bond and forwards it directly upstream.
Configure PIM with MLAG
You can use a multicast sender or receiver over a dual-attached MLAG bond. On the VLAN interface where multicast sources or receivers exist, configure PIM active-active and IGMP. Enabling PIM active-active automatically enables PIM on that interface.
cumulus@leaf01:~$ nv set interface vlan10 router pim active-active on
cumulus@leaf01:~$ nv set interface vlan10 ip igmp
cumulus@leaf01:~$ nv config apply
cumulus@leaf01:~$ sudo vtysh
...
leaf01# configure terminal
leaf01(config)# interface vlan10
leaf01(config-if)# ip pim active-active
leaf01(config-if)# ip igmp
leaf01(config-if)# end
leaf01# write memory
leaf01# exit
To verify PIM active-active configuration, run the vtysh show ip pim mlag summary command or the net show pim mlag summary command:
cumulus@leaf01:mgmt:~$ sudo vtysh
...
leaf01# show ip pim mlag summary
MLAG daemon connection: up
MLAG peer state: up
Zebra peer state: up
MLAG role: PRIMARY
Local VTEP IP: 0.0.0.0
Anycast VTEP IP: 0.0.0.0
Peerlink: peerlink.4094
Session flaps: mlagd: 0 mlag-peer: 0 zebra-peer: 0
Message Statistics:
mroute adds: rx: 5, tx: 5
mroute dels: rx: 0, tx: 0
peer zebra status updates: 1
PIM status updates: 0
VxLAN updates: 0
Troubleshooting
This section provides commands to examine your PIM configuration and provides troubleshooting tips.
PIM Show Commands
To show the contents of the IP multicast routing table, run the vtysh show ip mroute command or the net show mroute command. You can verify the (S,G) and (*,G) state entries from the flags and check that the incoming and outgoing interfaces are correct:
cumulus@fhr:~$ sudo vtysh
...
fhr# show ip mroute
IP Multicast Routing Table
Flags: S - Sparse, C - Connected, P - Pruned
R - RP-bit set, F - Register flag, T - SPT-bit set
Source Group Flags Proto Input Output TTL Uptime
10.1.10.101 239.1.1.1 SFP none vlan10 none 0 --:--:--
To see the active source on the switch, run the vtysh show ip pim upstream command or the net show pim upstream command.
cumulus@fhr:~$ sudo vtysh
...
fhr# show ip pim upstream
Iif Source Group State Uptime JoinTimer RSTimer KATimer RefCnt
vlan10 10.1.10.101 239.1.1.1 Prune 00:07:40 --:--:-- 00:00:36 00:02:50 1
To show upstream information for S,Gs and the desire to join the multicast tree, run the vtysh show ip pim upstream-join-desired command or the net show pim upstream-join-desired command.
cumulus@fhr:~$ sudo vtysh
...
fhr# show ip pim upstream-join-desired
Source Group EvalJD
10.1.10.101 239.1.1.1 yes
To show the PIM interfaces on the switch, run the vtysh show ip pim interface command or the net show pim interface command.
cumulus@fhr:mgmt:~$ sudo vtysh
...
fhr# show ip pim interface
Interface State Address PIM Nbrs PIM DR FHR IfChannels
lo up 10.10.10.1 0 local 0 0
swp51 up 10.10.10.1 1 10.10.10.101 0 0
vlan10 up 10.1.10.1 0 local 1 0
The vtysh show ip pim interface detail command and the net show pim interface detail command show more detail about the PIM interfaces on the switch:
cumulus@fhr:~$ sudo vtysh
...
fhr# show ip pim interface detail
...
Interface : vlan10
State : up
Address : 10.1.10.1 (primary)
fe80::4638:39ff:fe00:31/64
Designated Router
-----------------
Address : 10.1.10.1
Priority : 1(0)
Uptime : --:--:--
Elections : 1
Changes : 0
FHR - First Hop Router
----------------------
239.1.1.1 : 10.1.10.101 is a source, uptime is 00:03:08
...
To show local membership information for a PIM interface, run the vtysh show ip pim local-membership command or the net show pim local-membership command.
cumulus@lhr:~$ sudo vtysh
...
lhr# show ip pim local-membership
Interface Address Source Group Membership
vlan20 10.2.10.1 * 239.1.1.1 INCLUDE
To show information about known S,Gs, the IIF and the OIL, run the vtysh show ip pim state command or the net show pim state command.
cumulus@fhr:~$ sudo vtysh
...
fhr# show ip pim state
Codes: J -> Pim Join, I -> IGMP Report, S -> Source, * -> Inherited from (*,G), V -> VxLAN, M -> Muted
Active Source Group RPT IIF OIL
1 10.1.10.101 239.1.1.1 n vlan10
To verify that the receiver is sending IGMP reports (joins) for the group, run the vtysh show ip igmp groups command or the net show igmp groups command.
cumulus@lhr:~$ sudo vtysh
...
lhr# show ip igmp groups
Total IGMP groups: 1
Watermark warn limit(Not Set): 0
Interface Address Group Mode Timer Srcs V Uptime
vlan20 10.2.10.1 239.1.1.1 EXCL 00:02:18 1 3 05:27:33
To show IGMP source information, run the vtysh show ip igmp sources command or the net show igmp sources command.
cumulus@lhr:~$ sudo vtysh
...
lhr# show ip igmp sources
Interface Address Group Source Timer Fwd Uptime
vlan20 10.2.10.1 239.1.1.1 * 03:13 Y 05:28:42
FHR Stuck in the Registering Process
When a multicast source starts, the FHR sends unicast PIM register messages from the RPF interface towards the source. After the RP receives the PIM register, it sends a PIM register stop message to the FHR to end the register process. If an issue occurs with this communication, the FHR becomes stuck in the registering process, which can result in high CPU (the FHR CPU generates and sends PIM register packets to the RP CPU).
To assess this issue, review the FHR. You can see the output interface of pimreg here. If this does not change to an interface within a couple of seconds, it is possible that the FHR remains in the registering process.
cumulus@fhr:~$ sudo vtysh
...
fhr# show ip mroute
Source Group Proto Input Output TTL Uptime
10.1.10.101 239.2.2.3 PIM vlan10 pimreg 1 00:03:59
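If you capture this output repeatedly, a short script can flag streams whose output interface is still pimreg. The following Python sketch parses text in the seven-column layout shown above (Source, Group, Proto, Input, Output, TTL, Uptime). The registering_streams function is a hypothetical helper, and the parsing assumes this exact layout; output that includes additional columns, such as Flags, needs a different field index.

```python
SAMPLE = """\
Source          Group           Proto  Input   Output  TTL  Uptime
10.1.10.101     239.2.2.3       PIM    vlan10  pimreg  1    00:03:59
"""

def registering_streams(mroute_output):
    """Return (source, group) pairs whose output interface is pimreg."""
    stuck = []
    for line in mroute_output.splitlines():
        fields = line.split()
        # Expect Source, Group, Proto, Input, Output, TTL, Uptime.
        if len(fields) == 7 and fields[4] == "pimreg":
            stuck.append((fields[0], fields[1]))
    return stuck

if __name__ == "__main__":
    for source, group in registering_streams(SAMPLE):
        print(f"({source},{group}) still points at pimreg")
```

A stream that appears in two captures taken a few seconds apart is a candidate for the troubleshooting steps in this section.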
To troubleshoot the issue:
Validate that the FHR can reach the RP. If the RP and FHR cannot communicate, the registration process fails:
cumulus@fhr:~$ ping 10.10.10.101
PING 10.10.10.101 (10.10.10.101) from 10.1.10.1: 56(84) bytes of data.
^C
--- 10.10.10.101 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3000ms
On the RP, use tcpdump to see if the PIM register packets arrive:
cumulus@rp01:~$ sudo tcpdump -i swp1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on swp1, link-type EN10MB (Ethernet), capture size 262144 bytes
23:33:17.524982 IP 10.1.10.101 > 10.10.10.101: PIMv2, Register, length 66
If the switch is receiving PIM registration packets, verify that PIM sees them by running the vtysh debug pim packets command:
cumulus@rp01:~$ sudo vtysh -c "debug pim packets"
PIM Packet debugging is on
cumulus@rp01:~$ sudo tail /var/log/frr/frr.log
2016/10/19 23:46:51 PIM: Recv PIM REGISTER packet from 172.16.5.1 to 10.0.0.21 on swp30: ttl=255 pim_version=2 pim_msg_size=64 checksum=a681
Repeat the process on the FHR to see that it receives PIM register stop messages and passes them to the PIM process:
cumulus@fhr:~$ sudo tcpdump -i swp51
23:58:59.841625 IP 172.16.5.1 > 10.0.0.21: PIMv2, Register, length 28
23:58:59.842466 IP 10.0.0.21 > 172.16.5.1: PIMv2, Register Stop, length 18
cumulus@fhr:~$ sudo vtysh -c "debug pim packets"
PIM Packet debugging is on
cumulus@fhr:~$ sudo tail -f /var/log/frr/frr.log
2016/10/19 23:59:38 PIM: Recv PIM REGSTOP packet from 10.10.10.101 to 10.10.10.1 on swp51: ttl=255 pim_version=2 pim_msg_size=18 checksum=5a39
LHR Does Not Build *,G
If you do not enable both PIM and IGMP on an interface facing a receiver, the LHR does not build *,G.
cumulus@lhr:~$ sudo vtysh
...
lhr# show run
!
interface vlan20
ip igmp
ip pim
To troubleshoot this issue, ensure that the receiver sends IGMPv3 joins when you enable both PIM and IGMP:
cumulus@lhr:~$ sudo tcpdump -i vlan20 igmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan20, link-type EN10MB (Ethernet), capture size 262144 bytes
00:03:55.789744 IP 10.2.10.1 > igmp.mcast.net: igmp v3 report, 1 group record(s)
No mroute Created on the FHR
To troubleshoot this issue:
Verify that the FHR is receiving multicast traffic:
cumulus@fhr:~$ sudo tcpdump -i vlan10
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan10, link-type EN10MB (Ethernet), capture size 262144 bytes
19:57:58.429632 IP 10.1.10.101.42420 > 239.1.1.1.1000: UDP, length 8
19:57:59.431250 IP 10.1.10.101.42420 > 239.1.1.1.1000: UDP, length 8
Verify PIM configuration on the interface facing the source:
cumulus@fhr:~$ sudo vtysh
...
fhr# show run
!
interface vlan10
ip igmp
ip pim
!
Verify that the RPF interface for the source matches the interface that receives multicast traffic:
fhr# show ip rpf 10.1.10.1
Routing entry for 10.1.10.0/24 using Unicast RIB
Known via "connected", distance 0, metric 0, best
Last update 1d00h26m ago
* directly connected, vlan10
Verify RP configuration for the multicast group:
fhr# show ip pim rp-info
RP address group/prefix-list OIF I am RP Source
10.10.10.101 224.0.0.0/4 swp51 no Static
No S,G on the RP for an Active Group
An RP does not build an mroute when there are no active receivers for a multicast group, even though the FHR creates the mroute.
cumulus@rp01:~$ sudo vtysh
...
rp01# show ip mroute
Source Group Flags Proto Input Output TTL Uptime
You can see the active source on the RP with either the vtysh show ip pim upstream command or the net show pim upstream command.
cumulus@rp01:~$ sudo vtysh
...
rp01# show ip pim upstream
Iif Source Group State Uptime JoinTimer RSTimer KATimer RefCnt
vlan10 10.1.10.101 239.1.1.1 Prune 00:08:03 --:--:-- --:--:-- 00:02:20 1
No mroute Entry in Hardware
To check whether the number of hardware IP multicast entries has reached the maximum value, run the cl-resource-query | grep Mcast command or the net show system asic | grep Mcast command.
cumulus@switch:~$ cl-resource-query | grep Mcast
Total Mcast Routes: 450, 0% of maximum value 450
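If you collect this output from many switches, a small script can flag the ones that have exhausted the table. The following Python sketch assumes the line format shown above; the function name and the sample line are hypothetical, and the script compares the two counts directly instead of relying on the printed percentage.

```python
import re

# Matches lines such as: "Total Mcast Routes: 45, 10% of maximum value 450"
LINE_RE = re.compile(
    r"Total Mcast Routes:\s*(\d+),\s*\d+% of maximum value\s*(\d+)"
)


def mcast_usage(line):
    """Return (used, maximum, is_full) parsed from one cl-resource-query line."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("unrecognized line: %r" % line)
    used, maximum = int(match.group(1)), int(match.group(2))
    return used, maximum, used >= maximum


# Hypothetical sample line in the same format as the output above.
sample = "Total Mcast Routes: 45, 10% of maximum value 450"
print(mcast_usage(sample))
```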
To verify the state of MSDP sessions, run the vtysh show ip msdp mesh-group command or the net show msdp mesh-group command.
cumulus@switch:~$ sudo vtysh
...
switch# show ip msdp mesh-group
Mesh group : pod1
Source : 10.1.10.101
Member State
10.1.10.102 established
10.1.10.103 established
cumulus@switch:~$ sudo vtysh
switch# show ip msdp peer
Peer Local State Uptime SaCnt
10.1.10.102 10.1.10.101 established 00:07:21 0
10.1.10.103 10.1.10.101 established 00:07:21 0
View the Active Sources
To review the active sources that the switch learns locally (through PIM registers) and from MSDP peers, run the vtysh show ip msdp sa command or the net show msdp sa command.
cumulus@switch:~$ sudo vtysh
...
switch# show ip msdp sa
Source Group RP Local SPT Uptime
10.1.10.101 239.1.1.1 10.10.10.101 n n 00:00:40
10.1.10.101 239.1.1.2 100.10.10.101 n n 00:00:25
Configuration Example
The following example configures PIM and BGP on leaf01, leaf02, and spine01.
server01 (the source) connects to leaf01 (the FHR) through a VLAN-aware bridge (VLAN 10).
leaf01 connects to spine01 (the RP) through swp51.
spine01 connects to leaf02 (the LHR) through swp2.
leaf02 connects to server02 (the receiver) through a VLAN-aware bridge (VLAN 20).
Traffic Flow along the Shared Tree
1. The FHR receives a multicast data packet from the source, encapsulates the packet in a unicast PIM register message, then sends it to the RP.
2. The RP builds an (S,G) mroute, decapsulates the multicast packet, then forwards it along the (*,G) tree towards the receiver.
3. The LHR receives multicast traffic and sees that it has a shorter path to the source. It requests the multicast stream from leaf01 and simultaneously sends the multicast stream to the receiver.
Traffic Flow for the Shortest Path Tree
1. The FHR hears a PIM join directly from the LHR and forwards multicast traffic directly to it.
2. The LHR receives the multicast packet both from the FHR and the RP. The LHR discards the packet from the RP and prunes itself from the RP.
3. The RP receives a prune message from the LHR and instructs the FHR to stop sending PIM register messages.
4. Traffic continues directly between the FHR and the LHR.
cumulus@leaf01:~$ nv set router pim enable on
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1,swp49,swp51
cumulus@leaf01:~$ nv set interface swp1 bridge domain br_default
cumulus@leaf01:~$ nv set interface swp1 bridge domain br_default access 10
cumulus@leaf01:~$ nv set bridge domain br_default vlan 10
cumulus@leaf01:~$ nv set interface vlan10 ip address 10.1.10.1/24
cumulus@leaf01:~$ nv set router bgp autonomous-system 65101
cumulus@leaf01:~$ nv set router bgp router-id 10.10.10.1
cumulus@leaf01:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.1/32
cumulus@leaf01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.1.10.0/24
cumulus@leaf01:~$ nv set interface lo router pim
cumulus@leaf01:~$ nv set interface swp51 router pim
cumulus@leaf01:~$ nv set interface vlan10 router pim
cumulus@leaf01:~$ nv set interface vlan10 ip igmp
cumulus@leaf01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101
cumulus@leaf01:~$ nv config apply
cumulus@leaf02:~$ nv set router pim enable on
cumulus@leaf02:~$ nv set interface lo ip address 10.10.10.2/32
cumulus@leaf02:~$ nv set interface swp2,swp49,swp51
cumulus@leaf02:~$ nv set interface swp2 bridge domain br_default
cumulus@leaf02:~$ nv set interface swp2 bridge domain br_default access 20
cumulus@leaf02:~$ nv set bridge domain br_default vlan 20
cumulus@leaf02:~$ nv set interface vlan20 ip address 10.2.10.1/24
cumulus@leaf02:~$ nv set router bgp autonomous-system 65102
cumulus@leaf02:~$ nv set router bgp router-id 10.10.10.2
cumulus@leaf02:~$ nv set vrf default router bgp neighbor swp51 remote-as external
cumulus@leaf02:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.2/32
cumulus@leaf02:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.2.10.0/24
cumulus@leaf02:~$ nv set interface lo router pim
cumulus@leaf02:~$ nv set interface swp51 router pim
cumulus@leaf02:~$ nv set interface vlan20 router pim
cumulus@leaf02:~$ nv set interface vlan20 ip igmp
cumulus@leaf02:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101
cumulus@leaf02:~$ nv config apply
cumulus@spine01:~$ nv set router pim enable on
cumulus@spine01:~$ nv set interface lo ip address 10.10.10.101/32
cumulus@spine01:~$ nv set router bgp autonomous-system 65199
cumulus@spine01:~$ nv set router bgp router-id 10.10.10.101
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp1 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp neighbor swp2 remote-as external
cumulus@spine01:~$ nv set vrf default router bgp address-family ipv4-unicast network 10.10.10.101/32
cumulus@spine01:~$ nv set interface lo router pim
cumulus@spine01:~$ nv set interface swp1 router pim
cumulus@spine01:~$ nv set interface swp2 router pim
cumulus@spine01:~$ nv set vrf default router pim address-family ipv4-unicast rp 10.10.10.101
cumulus@spine01:~$ nv config apply
cumulus@leaf01:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
bridge:
domain:
br_default:
vlan:
'10': {}
interface:
lo:
ip:
address:
10.10.10.1/32: {}
router:
pim:
enable: on
type: loopback
swp1:
bridge:
domain:
br_default: {}
type: swp
swp49:
type: swp
swp51:
router:
pim:
enable: on
type: swp
vlan10:
ip:
address:
10.1.10.1/24: {}
igmp:
enable: on
router:
pim:
enable: on
type: svi
vlan: 10
router:
bgp:
autonomous-system: 65101
enable: on
router-id: 10.10.10.1
pim:
enable: on
system:
hostname: leaf01
vrf:
default:
router:
bgp:
address-family:
ipv4-unicast:
enable: on
network:
10.1.10.0/24: {}
10.10.10.1/32: {}
enable: on
neighbor:
swp51:
remote-as: external
type: unnumbered
pim:
address-family:
ipv4-unicast:
rp:
10.10.10.101: {}
enable: on
cumulus@leaf02:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
bridge:
domain:
br_default:
vlan:
'20': {}
interface:
lo:
ip:
address:
10.10.10.2/32: {}
router:
pim:
enable: on
type: loopback
swp2:
bridge:
domain:
br_default: {}
type: swp
swp49:
type: swp
swp51:
router:
pim:
enable: on
type: swp
vlan20:
ip:
address:
10.2.10.1/24: {}
igmp:
enable: on
router:
pim:
enable: on
type: svi
vlan: 20
router:
bgp:
autonomous-system: 65102
enable: on
router-id: 10.10.10.2
pim:
enable: on
system:
hostname: leaf02
vrf:
default:
router:
bgp:
address-family:
ipv4-unicast:
enable: on
network:
10.2.10.0/24: {}
10.10.10.2/32: {}
enable: on
neighbor:
swp51:
remote-as: external
type: unnumbered
pim:
address-family:
ipv4-unicast:
rp:
10.10.10.101: {}
enable: on
cumulus@spine01:mgmt:~$ sudo cat /etc/nvue.d/startup.yaml
- set:
interface:
lo:
ip:
address:
10.10.10.101/32: {}
router:
pim:
enable: on
type: loopback
swp1:
router:
pim:
enable: on
type: swp
swp2:
router:
pim:
enable: on
type: swp
router:
bgp:
autonomous-system: 65199
enable: on
router-id: 10.10.10.101
pim:
enable: on
system:
hostname: spine01
vrf:
default:
router:
bgp:
address-family:
ipv4-unicast:
enable: on
network:
10.10.10.101/32: {}
enable: on
neighbor:
swp1:
remote-as: external
type: unnumbered
swp2:
remote-as: external
type: unnumbered
pim:
address-family:
ipv4-unicast:
rp:
10.10.10.101: {}
enable: on
cumulus@leaf01:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
bridge-access 10
auto swp49
iface swp49
auto swp51
iface swp51
auto vlan10
iface vlan10
address 10.1.10.1/24
hwaddress 44:38:39:22:01:b1
vlan-raw-device br_default
vlan-id 10
auto br_default
iface br_default
bridge-ports swp1
hwaddress 44:38:39:22:01:b1
bridge-vlan-aware yes
bridge-vids 10
bridge-pvid 1
cumulus@leaf02:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.2/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp2
iface swp2
bridge-access 20
auto swp49
iface swp49
auto swp51
iface swp51
auto vlan20
iface vlan20
address 10.2.10.1/24
hwaddress 44:38:39:22:01:af
vlan-raw-device br_default
vlan-id 20
auto br_default
iface br_default
bridge-ports swp2
hwaddress 44:38:39:22:01:af
bridge-vlan-aware yes
bridge-vids 20
bridge-pvid 1
cumulus@spine01:mgmt:~$ sudo cat /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp2
iface swp2
cumulus@server01:~$ sudo cat /etc/network/interfaces
# The loopback network interface
auto lo
iface lo inet loopback
# The OOB network interface
auto eth0
iface eth0 inet dhcp
# The data plane network interfaces
auto eth1
iface eth1 inet manual
address 10.1.10.101
netmask 255.255.255.0
mtu 9000
post-up ip route add 10.0.0.0/8 via 10.1.10.1
cumulus@server02:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
# The OOB network interface
auto eth0
iface eth0 inet dhcp
# The data plane network interfaces
auto eth2
iface eth2 inet manual
address 10.2.10.102
netmask 255.255.255.0
mtu 9000
post-up ip route add 10.0.0.0/8 via 10.2.10.1
cumulus@leaf01:mgmt:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip pim
interface swp51
ip pim
interface vlan10
ip igmp
ip igmp version 3
ip igmp query-interval 125
ip igmp last-member-query-interval 10
ip igmp query-max-response-time 100
ip pim
vrf default
ip pim rp 10.10.10.101 224.0.0.0/4
exit-vrf
vrf mgmt
exit-vrf
router bgp 65101 vrf default
bgp router-id 10.10.10.1
timers bgp 3 9
bgp deterministic-med
! Neighbors
neighbor swp51 interface remote-as external
neighbor swp51 timers 3 9
neighbor swp51 timers connect 10
neighbor swp51 advertisement-interval 0
neighbor swp51 capability extended-nexthop
! Address families
address-family ipv4 unicast
network 10.1.10.0/24
network 10.10.10.1/32
maximum-paths ibgp 64
maximum-paths 64
distance bgp 20 200 200
neighbor swp51 activate
exit-address-family
cumulus@leaf02:mgmt:~$ sudo cat /etc/frr/frr.conf
...
interface lo
ip pim
interface swp51
ip pim
interface vlan20
ip igmp
ip igmp version 3
ip igmp query-interval 125
ip igmp last-member-query-interval 10
ip igmp query-max-response-time 100
ip pim
vrf default
ip pim rp 10.10.10.101 224.0.0.0/4
exit-vrf
vrf mgmt
exit-vrf
router bgp 65102 vrf default
bgp router-id 10.10.10.2
timers bgp 3 9
bgp deterministic-med
! Neighbors
neighbor swp51 interface remote-as external
neighbor swp51 timers 3 9
neighbor swp51 timers connect 10
neighbor swp51 advertisement-interval 0
neighbor swp51 capability extended-nexthop
! Address families
address-family ipv4 unicast
network 10.10.10.2/32
network 10.2.10.0/24
maximum-paths ibgp 64
maximum-paths 64
distance bgp 20 200 200
neighbor swp51 activate
exit-address-family
This simulation starts with the example PIM configuration. To simplify the example, only one spine and two leafs are in the topology. The demo is pre-configured using NVUE commands.
To show the multicast routing table, run the NCLU net show mroute command on the FHR (leaf01), RP (spine01), or LHR (leaf02).
To see the active source on the RP, run the net show pim upstream command on spine01.
To show information about known S,Gs, the IIF and the OIL, run the net show pim state command on the FHR (leaf01), RP (spine01), or LHR (leaf02).
To further validate the configuration, run the PIM show commands listed in the troubleshooting section above.
Considerations
Cumulus Linux does not support non-native forwarding (register decapsulation). Expect initial packet loss while the PIM *,G tree is building from the RP to the FHR to trigger native forwarding.
Cumulus Linux does not build an S,G mroute when forwarding over an *,G tree.
GRE Tunneling
GRE is a tunneling protocol that encapsulates network layer protocols inside virtual point-to-point links over an Internet Protocol network. The tunnel source and tunnel destination addresses on each side identify the two endpoints.
GRE packets travel directly between the two endpoints through a virtual tunnel. As a packet comes across other routers, there is no interaction with its payload; the routers only parse the outer IP packet. When the packet reaches the endpoint of the GRE tunnel, the switch de-encapsulates the outer packet, parses the payload, then forwards it to its ultimate destination.
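To make the encapsulation step concrete, the following Python sketch builds and strips the basic 4-byte GRE header defined in RFC 2784 (no checksum, version 0). It is an illustration only: the function names are hypothetical, the outer IPv4 header is left out, and the sketch says nothing about how the switch hardware performs the operation.

```python
import struct

# EtherType values that identify the encapsulated payload.
GRE_PROTO_IPV4 = 0x0800
GRE_PROTO_IPV6 = 0x86DD


def gre_encapsulate(payload, proto=GRE_PROTO_IPV4):
    """Prefix payload with a basic GRE header: 16 flag/version bits, 16 protocol bits."""
    flags_and_version = 0  # no checksum, version 0
    return struct.pack("!HH", flags_and_version, proto) + payload


def gre_decapsulate(packet):
    """Return (protocol type, payload) from a packet with a basic GRE header."""
    if len(packet) < 4:
        raise ValueError("packet shorter than a GRE header")
    flags_and_version, proto = struct.unpack("!HH", packet[:4])
    if flags_and_version != 0:
        raise ValueError("only the basic header is handled in this sketch")
    return proto, packet[4:]


inner = b"inner packet bytes"
proto, payload = gre_decapsulate(gre_encapsulate(inner))
print(hex(proto), payload == inner)
```

Transit routers forward on the outer IP header only, so they never run the decapsulation step; only the tunnel endpoint does.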
GRE uses multiple protocols over a single-protocol backbone and is less demanding than some of the alternative solutions, such as VPN. You can use GRE to transport protocols that the underlying network does not support, work around networks with limited hops, connect non-contiguous subnets, and allow VPNs across wide area networks.
You can use only static IPv4 routes as a destination for the tunnel interface.
You can only configure IPv4 endpoints.
You can only configure point-to-point GRE tunnels; only one remote tunnel per interface.
You cannot configure two tunnels with the same local and remote tunnel IP address.
GRE tunnels cannot coexist with VXLAN or MPLS on the switch.
Cumulus Linux supports a maximum of 256 GRE tunnels.
You can only configure GRE tunnels in the default VRF.
GRE tunnels do not support layer 3 protocols, ECMP, QoS, ACLs or NAT.
All GRE tunnels share the same TTL value; Cumulus Linux uses the TTL value of the tunnel you configure last.
You cannot configure the MTU on GRE tunnel interfaces. The GRE tunnel MTU is the maximum supported MTU on the switch by default.
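Several of the restrictions above are easy to check before you apply a configuration. The following Python sketch is a hypothetical pre-check, not an NVIDIA tool; it covers only the IPv4-endpoint rule, the duplicate local and remote pair rule, and the 256-tunnel limit, and the input format (a dictionary of tunnel names to endpoint pairs) is an assumption.

```python
import ipaddress

MAX_GRE_TUNNELS = 256  # limit stated in the list above


def validate_tunnels(tunnels):
    """Check {name: (local_ip, remote_ip)} and return a list of problem strings."""
    problems = []
    if len(tunnels) > MAX_GRE_TUNNELS:
        problems.append(
            "%d tunnels exceed the maximum of %d" % (len(tunnels), MAX_GRE_TUNNELS)
        )
    seen = {}
    for name, (local, remote) in tunnels.items():
        for label, value in (("local", local), ("remote", remote)):
            if ipaddress.ip_address(value).version != 4:
                problems.append("%s: %s endpoint %s is not IPv4" % (name, label, value))
        key = (local, remote)
        if key in seen:
            problems.append("%s: same endpoints as %s" % (name, seen[key]))
        else:
            seen[key] = name
    return problems


print(validate_tunnels({"tunnelR2": ("10.10.10.1", "10.10.10.3")}))
```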
The following example shows two sites that use IPv4 addresses. Using GRE tunneling, the two end points can encapsulate an IPv4 or IPv6 payload inside an IPv4 packet. The switch routes the packet based on the destination in the outer IPv4 header.
Configure GRE Tunneling
To configure GRE tunneling, you create a GRE tunnel interface with routes for tunneling on both endpoints as follows:
Create a tunnel interface by specifying an interface name, the tunnel mode as gre, the source (local) and destination (remote) underlay IP address, and the ttl (optional).
Assign an IP address to the tunnel interface.
Add route entries to encapsulate the packets using the tunnel interface.
The following configuration example shows the commands used to set up a bidirectional GRE tunnel between two endpoints: tunnelR1 and tunnelR2. The local tunnel endpoint for tunnelR1 is 10.10.10.1 and the remote endpoint is 10.10.10.3. The local tunnel endpoint for tunnelR2 is 10.10.10.3 and the remote endpoint is 10.10.10.1.
In NVUE, if you create the GRE interface with a name that starts with tunnel, NVUE automatically sets the interface type to tunnel. If you create a GRE interface with a name that does not start with tunnel, you must set the interface type to tunnel with the nv set interface <interface-name> type tunnel command.
cumulus@leaf01:~$ nv set interface lo ip address 10.10.10.1/32
cumulus@leaf01:~$ nv set interface swp1 ip address 10.2.1.1/24
cumulus@leaf01:~$ nv set interface tunnelR2 ip address 10.1.100.1/30
cumulus@leaf01:~$ nv set interface tunnelR2 tunnel mode gre
cumulus@leaf01:~$ nv set interface tunnelR2 tunnel dest-ip 10.10.10.3
cumulus@leaf01:~$ nv set interface tunnelR2 tunnel source-ip 10.10.10.1
cumulus@leaf01:~$ nv set interface tunnelR2 tunnel ttl 255
cumulus@leaf01:~$ nv set vrf default router static 10.1.1.0/24 via tunnelR2
cumulus@leaf01:~$ nv config apply
cumulus@leaf03:~$ nv set interface lo ip address 10.10.10.3/32
cumulus@leaf03:~$ nv set interface swp1 ip address 10.1.1.1/24
cumulus@leaf03:~$ nv set interface tunnelR1 ip address 10.1.100.2/30
cumulus@leaf03:~$ nv set interface tunnelR1 tunnel mode gre
cumulus@leaf03:~$ nv set interface tunnelR1 tunnel dest-ip 10.10.10.1
cumulus@leaf03:~$ nv set interface tunnelR1 tunnel source-ip 10.10.10.3
cumulus@leaf03:~$ nv set interface tunnelR1 tunnel ttl 255
cumulus@leaf03:~$ nv set vrf default router static 10.2.1.0/24 via tunnelR1
cumulus@leaf03:~$ nv config apply
Edit the /etc/network/interfaces file to add the tunnel interface:
cumulus@leaf01:~$ sudo nano /etc/network/interfaces
...
auto lo
iface lo inet loopback
address 10.10.10.1/32
auto swp1
iface swp1
address 10.2.1.1/24
auto tunnelR2
iface tunnelR2
address 10.1.100.1/30
tunnel-mode gre
tunnel-local 10.10.10.1
tunnel-endpoint 10.10.10.3
tunnel-ttl 255
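The tunnel stanza above follows a fixed pattern, so you can generate it for each endpoint from a handful of values. The following Python sketch is one hypothetical way to do that; the function name is invented for the example, the four-space indentation is a stylistic assumption, and the keywords are the ones shown in the stanza above.

```python
def render_gre_stanza(name, address, local, remote, ttl=None):
    """Return an /etc/network/interfaces stanza for one GRE tunnel interface."""
    lines = [
        "auto %s" % name,
        "iface %s" % name,
        "    address %s" % address,
        "    tunnel-mode gre",
        "    tunnel-local %s" % local,
        "    tunnel-endpoint %s" % remote,
    ]
    if ttl is not None:  # the TTL is optional
        lines.append("    tunnel-ttl %d" % ttl)
    return "\n".join(lines) + "\n"


# Values taken from the leaf01 example in this section.
print(render_gre_stanza("tunnelR2", "10.1.100.1/30", "10.10.10.1", "10.10.10.3", ttl=255))
```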
Run the ifreload -a command to load the configuration.
To delete a GRE tunnel, remove the tunnel interface, and remove the routes configured with the tunnel interface. Either run the NVUE nv unset commands or remove the tunnel configuration from the /etc/network/interfaces file and run the ifreload -a command.
Troubleshooting
To check GRE tunnel settings, run the NVUE nv show interface <interface> tunnel command, or run the Linux ip tunnel show or ifquery --check command. For example:
cumulus@leaf01:mgmt:~$ nv show interface tunnelR2 tunnel
operational applied description
--------- ----------- ---------- -------------------------------
dest-ip 10.10.10.3 10.10.10.3 Destination underlay IP address
mode gre gre tunnel mode
source-ip 10.10.10.1 10.10.10.1 Source underlay IP address
ttl 255 time to live
cumulus@leaf01:mgmt:~$ ip tunnel show
gre0: gre/ip remote any local any ttl inherit nopmtudisc
tunnelR2: gre/ip remote 10.10.10.3 local 10.10.10.1 ttl 255
cumulus@leaf01:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.1/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
address 10.2.1.1/24
auto swp51
iface swp51
auto swp52
iface swp52
auto tunnelR2
iface tunnelR2
address 10.1.100.1/30
tunnel-mode gre
tunnel-local 10.10.10.1
tunnel-endpoint 10.10.10.3
tunnel-ttl 255
cumulus@leaf03:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.3/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
address 10.1.1.1/24
auto swp51
iface swp51
auto swp52
iface swp52
auto tunnelR1
iface tunnelR1
address 10.1.100.2/30
tunnel-mode gre
tunnel-local 10.10.10.3
tunnel-endpoint 10.10.10.1
tunnel-ttl 255
cumulus@spine01:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.101/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp3
iface swp3
cumulus@spine02:mgmt:~$ sudo cat /etc/network/interfaces
auto lo
iface lo inet loopback
address 10.10.10.102/32
auto mgmt
iface mgmt
address 127.0.0.1/8
address ::1/128
vrf-table auto
auto eth0
iface eth0 inet dhcp
ip-forward off
ip6-forward off
vrf mgmt
auto swp1
iface swp1
auto swp3
iface swp3
cumulus@server01:mgmt:~$ sudo cat /etc/network/interfaces
auto eth0
iface eth0 inet dhcp
post-up sysctl -w net.ipv6.conf.eth0.accept_ra=2
auto eth1
iface eth1
address 10.2.1.2/24
post-up ip route add 10.0.0.0/8 via 10.2.1.1
cumulus@server04:mgmt:~$ sudo cat /etc/network/interfaces
auto eth0
iface eth0 inet dhcp
post-up sysctl -w net.ipv6.conf.eth0.accept_ra=2
auto eth1
iface eth1
address 10.1.1.2/24
post-up ip route add 10.0.0.0/8 via 10.1.1.1
This simulation starts with the example GRE configuration. The demo is pre-configured using NVUE commands.
To validate the configuration, run the commands listed in the troubleshooting section.
Network Address Translation - NAT
Network Address Translation (NAT) enables your network to use one set of IP addresses for internal traffic and a second set of addresses for external traffic.
NAT overcomes addressing problems due to the explosive growth of the Internet. In addition to preventing the depletion of IPv4 addresses, NAT enables you to use the private address space internally and still have a way to access the Internet.
Cumulus Linux supports both static NAT and dynamic NAT. Static NAT provides a permanent mapping between one private IP address and a single public address. Dynamic NAT maps private IP addresses to public addresses; these public IP addresses come from a pool. Cumulus Linux creates the translations as needed dynamically, so that a large number of private addresses can share a smaller pool of public addresses.
Static and dynamic NAT both support:
Basic NAT, which only translates the IP address in the packet: the source IP address in the outbound direction and the destination IP address in the inbound direction.
Port Address Translation (PAT), which translates both the IP address and layer 4 port: the source IP address and port in the outbound direction and the destination IP address and port in the inbound direction.
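The two directions of a static PAT mapping can be illustrated with a lookup table. The following Python sketch is a conceptual model only, not how the switch programs its hardware; the table and function names are hypothetical, and the addresses and ports reuse the static PAT example in this section (10.0.0.1 port 5000 mapped to 172.30.58.80 port 6000).

```python
# Hypothetical static PAT table: private (ip, port) -> public (ip, port).
STATIC_PAT = {("10.0.0.1", 5000): ("172.30.58.80", 6000)}

# Reverse table for the inbound direction.
REVERSE_PAT = {public: private for private, public in STATIC_PAT.items()}


def translate_outbound(src_ip, src_port):
    """Rewrite the source of an outbound packet; unmatched packets pass unchanged."""
    return STATIC_PAT.get((src_ip, src_port), (src_ip, src_port))


def translate_inbound(dst_ip, dst_port):
    """Rewrite the destination of an inbound packet; unmatched packets pass unchanged."""
    return REVERSE_PAT.get((dst_ip, dst_port), (dst_ip, dst_port))


print(translate_outbound("10.0.0.1", 5000))
print(translate_inbound("172.30.58.80", 6000))
```

Basic NAT is the same idea with the port left out of the key, so only the address changes.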
Static NAT supports double NAT (also known as twice NAT) where the switch translates both the source and destination IP addresses as a packet crosses address realms. You use double NAT when the address space in a private network overlaps with IP addresses in the public space.
The following illustration shows a basic NAT configuration.
Only NVIDIA Spectrum-2 and Spectrum-3 switches support NAT.
You can configure NAT on physical and bond interfaces only; logical interfaces such as the loopback, SVIs, and subinterfaces do not support NAT.
You can only configure NAT in the default VRF.
You can enable both static NAT and dynamic NAT at the same time.
You cannot translate IPv6 rules to IPv4 rules.
NAT does not support multicast traffic.
Static NAT
Static NAT provides a one-to-one mapping between a private IP address inside your network and a public IP address. For example, if you have a web server with the private IP address 10.0.0.10 and you want a remote host to make a request to the web server using the IP address 172.30.58.80, you configure a static NAT mapping between the two IP addresses.
Static NAT entries do not time out from the translation table.
Configure Static NAT
For static NAT, create a rule that matches a source or destination IP address and translates the IP address to a public IP address.
For static PAT, create a rule that matches a source or destination IP address together with the layer 4 port and translates the IP address and port to a public IP address and port.
For NVIDIA switches with Spectrum-2 and later, you can include the outgoing or incoming interface.
To create rules, use cl-acltool.
To add NAT rules using cl-acltool, either edit an existing file in the /etc/cumulus/acl/policy.d directory and add rules under [iptables] or create a new file in the /etc/cumulus/acl/policy.d directory and add rules under an [iptables] section. For example:
The following rule matches UDP packets with source IP address 10.0.0.1 and source port 5000, and translates the IP address to 172.30.58.80 and the port to 6000.
The following rule matches UDP packets with destination IP address 172.30.58.80 and destination port 6000 on interface swp51, and translates the IP address to 10.0.0.1 and the port to 5000.
The following double NAT rule translates both the source and destination IP addresses of incoming and outgoing ICMP packets:
For outgoing messages, NAT changes the inside local IP address 172.16.10.2 to the inside global IP address 130.1.100.10 and the outside local IP address 26.26.26.26 to the outside global IP address 140.1.1.2.
For incoming messages, NAT changes the inside global IP address 130.1.100.10 to the inside local IP address 172.16.10.2 and the outside global IP address 140.1.1.2 to the outside local IP address 26.26.26.26.
When you configure a static SNAT rule for outgoing traffic, you must also configure a static DNAT rule for the reverse traffic so that traffic goes in both directions.
To delete a static NAT rule, remove the rule from the policy file in the /etc/cumulus/acl/policy.d directory, then run the sudo cl-acltool -i command.
Dynamic NAT
Dynamic NAT maps private IP addresses and ports to a public IP address and port range or a public IP address range and port range. Cumulus Linux assigns IP addresses from a pool of addresses dynamically. When the switch releases entries after a period of inactivity, it maps new incoming connections dynamically to the freed up addresses and ports.
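The allocate-and-release behavior described above can be modeled with a small pool. The following Python sketch is a conceptual illustration, not the switchd implementation; the class name, the single public address, the two-port pool, and the 300-second idle timeout are assumptions chosen to keep the example short.

```python
class DynamicNatPool:
    """Toy model of dynamic PAT: map private (ip, port) pairs to pooled public ones."""

    def __init__(self, public_ip, ports, idle_timeout):
        self.free = [(public_ip, port) for port in ports]
        self.idle_timeout = idle_timeout
        self.entries = {}  # private (ip, port) -> [public (ip, port), last_seen]

    def translate(self, private, now):
        """Return the public mapping for private, allocating one if needed."""
        entry = self.entries.get(private)
        if entry is None:
            if not self.free:
                return None  # pool exhausted
            entry = [self.free.pop(0), now]
            self.entries[private] = entry
        entry[1] = now  # refresh the activity timestamp
        return entry[0]

    def age_out(self, now):
        """Release entries that have been idle for at least idle_timeout seconds."""
        for private, (public, last_seen) in list(self.entries.items()):
            if now - last_seen >= self.idle_timeout:
                del self.entries[private]
                self.free.append(public)


pool = DynamicNatPool("172.30.58.80", range(1024, 1026), idle_timeout=300)
print(pool.translate(("10.0.0.5", 40000), now=0))
```

When the pool runs out of free entries, new connections get no translation until an idle entry ages out and returns to the free list.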
Enable Dynamic NAT
To enable dynamic NAT, edit the /etc/cumulus/switchd.conf file and uncomment the nat.dynamic_enable = TRUE option:
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Optional Dynamic NAT Settings
The /etc/cumulus/switchd.conf file includes the following configuration options for dynamic NAT. Only change these options if you enable dynamic NAT.
Option | Description
nat.age_poll_interval | The period of inactivity before switchd releases a NAT entry from the translation table. The default value is 5 minutes. The minimum value is 1 minute. The maximum value is 24 hours.
nat.table_size | The maximum number of dynamic snat and dnat entries in the translation table. The default value is 1024. NVIDIA Spectrum-2 switches support a maximum of 8192 entries.
nat.config_table_size | The maximum number of rules allowed. The default value is 64. The minimum value is 64. The maximum value for the NVIDIA Spectrum-2 switch is 1024. The maximum value for the NVIDIA Spectrum-3 switch is 8192.
After you change any of the dynamic NAT configuration options, restart switchd.
Restarting the switchd service causes all network ports to reset, interrupting network services, in addition to resetting the switch hardware configuration.
Configure Dynamic NAT
For dynamic NAT, create a rule that matches an IP address in CIDR notation and translates the address to a public IP address or IP address range.
For dynamic PAT, create a rule that matches an IP address in CIDR notation and translates the address to a public IP address and port range or an IP address range and port range. You can also match on an IP address in CIDR notation and port.
For NVIDIA Spectrum-2 switches, you can include the outgoing or incoming interface in the rule. See the examples below.
To add NAT rules using cl-acltool, either edit an existing file in the /etc/cumulus/acl/policy.d directory and add rules under [iptables] or create a new file in the /etc/cumulus/acl/policy.d directory and add rules under an [iptables] section. For example:
The following rule matches TCP packets with source IP address in the range 10.0.0.0/24 on outbound interface swp5 and translates the address dynamically to an IP address in the range 172.30.58.0-172.30.58.80.
The following rule matches UDP packets with source IP address in the range 10.0.0.0/24 and translates the addresses dynamically to IP address 172.30.58.80 with layer 4 ports in the range 1024-1200:
The following rule matches UDP packets with source IP address in the range 10.0.0.0/24 on source port 5000 and translates the addresses dynamically to IP address 172.30.58.80 with layer 4 ports in the range 1024-1200:
The following rule matches TCP packets with destination IP address in the range 10.1.0.0/24 and translates the address dynamically to IP address range 172.30.58.0-172.30.58.80 with layer 4 ports in the range 1024-1200:
The following rule matches ICMP packets with source IP address in the range 10.0.0.0/24 and destination IP address in the range 10.1.0.0/24. The rule translates the address dynamically to IP address range 172.30.58.0-172.30.58.80 with layer 4 ports in the range 1024-1200:
To delete a dynamic NAT rule, remove the rule from the policy file in the /etc/cumulus/acl/policy.d directory, then run the sudo cl-acltool -i command.
Show Configured NAT Rules
To see the NAT rules configured on the switch, run the sudo iptables -t nat -v -L or the sudo cl-acltool -L ip -v command. For example:
cumulus@switch:~$ sudo iptables -t nat -v -L -n
...
Chain POSTROUTING (policy ACCEPT 27 packets, 3249 bytes)
pkts bytes target prot opt in out source destination
0 0 SNAT tcp -- any any 10.0.0.1 anywhere to:172.30.58.80
Show Conntrack Flows
To see the active connection tracking (conntrack) flows, run the sudo cat /proc/net/nf_conntrack command. The hardware offloaded flows contain [OFFLOAD] in the output.
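To list only the hardware offloaded flows, you can filter the output for the [OFFLOAD] marker. The following Python sketch shows the idea; the function name is hypothetical and the two sample lines are simplified, made-up entries, not real output from a switch.

```python
def offloaded_flows(lines):
    """Return the conntrack lines that carry the [OFFLOAD] marker."""
    return [line.strip() for line in lines if "[OFFLOAD]" in line]


# Simplified, hypothetical conntrack entries for illustration only.
sample = [
    "ipv4 2 udp 17 src=10.0.0.1 dst=172.30.10.5 sport=5000 dport=53 [OFFLOAD] use=2\n",
    "ipv4 2 tcp 6 src=10.0.0.2 dst=172.30.10.6 sport=40000 dport=443 [ASSURED] use=2\n",
]

for flow in offloaded_flows(sample):
    print(flow)
```

On a switch, you can pass the open /proc/net/nf_conntrack file object to the function in place of the sample list; reading that file requires root privileges.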
When using NAT, you must enable proxy ARP for intra-subnet ARP requests when:
The addresses you define in the static NAT and source NAT pool are in the same subnet as the ingress interface.
The addresses in the original destination address entry in the destination NAT rules are in the same subnet as the ingress interface.
To enable proxy ARP for intra-subnet ARP requests:
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/network/interfaces file to set /proc/sys/net/ipv4/conf/<interface>/proxy_arp_pvlan to 1 in the interface stanza, then run the ifreload -a command.
Bidirectional Forwarding Detection (BFD) provides low overhead and rapid detection of failures in the paths between two network devices. It provides a unified mechanism for link detection over all media and protocol layers. Use BFD to detect failures for IPv4 and IPv6 single or multihop paths between any two network devices, including unidirectional path failure detection.
Cumulus Linux does not support:
BFD demand mode
Dynamic BFD timer negotiation on an existing session. Any change to the timer values takes effect only when the session goes down and comes back up.
BFD Multihop Routed Paths
BFD multihop sessions build over arbitrary paths between two systems, which results in some complexity that does not exist for single hop sessions. To avoid spoofing with multihop paths, configure the maximum hop count (max_hop_cnt) for each peer, which limits the number of hops for a BFD session. The switch drops all BFD packets exceeding the maximum hop count.
Cumulus Linux supports multihop BFD sessions for both IPv4 and IPv6 peers.
Configure BFD
You can configure BFD with NVUE or vtysh commands or by specifying the configuration in the PTM topology.dot file. However, the topology file has some limitations:
The topology file supports BFD IPv4 and IPv6 single hop sessions only; you cannot specify IPv4 or IPv6 multihop sessions in the topology file.
The topology file supports BFD sessions for only link-local IPv6 peers; BFD sessions for global IPv6 peers discovered on the link are not created.
Use FRR to register multihop peers with PTM and BFD, and monitor the connectivity to the remote BGP multihop peer. FRR can dynamically register and unregister both IPv4 and IPv6 peers with BFD when the BFD-enabled peer connectivity starts or stops. Also, you can configure BFD parameters for each BGP or OSPF peer.
The BFD parameter in the topology file takes precedence over the client-configured BFD parameters for a BFD session that both the topology file and FRR create.
Every BFD interface requires an IP address. The neighbor IP address for a single hop BFD session must exist in the ARP table before BFD can start sending control packets.
When you configure BFD, you can set the following parameters for both IPv4 and IPv6 sessions. If you do not set these parameters, Cumulus Linux uses the default values.
The required minimum interval between the received BFD control packets. The default value is 300ms.
The minimum interval for transmitting BFD control packets. The default value is 300ms.
The detection time multiplier. The default value is 3.
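These three parameters together determine how quickly BFD declares a path down. The following Python sketch applies the asynchronous-mode rule from RFC 5880: the detection time is the multiplier received from the remote system times the larger of the local required minimum receive interval and the remote minimum transmit interval. The function name is hypothetical, and the sketch assumes both peers use the same values.

```python
def detection_time_ms(remote_multiplier, local_min_rx_ms, remote_min_tx_ms):
    """Return the asynchronous-mode BFD detection time in milliseconds."""
    negotiated_interval = max(local_min_rx_ms, remote_min_tx_ms)
    return remote_multiplier * negotiated_interval


# Default values listed above: multiplier 3, both intervals 300 ms.
print(detection_time_ms(3, 300, 300))

# Multiplier 4 with both intervals set to 400 ms.
print(detection_time_ms(4, 400, 400))
```

With the default values, a peer that stops sending control packets is detected after roughly 900 ms.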
BFD in BGP
When you configure BFD in BGP, PTM registers and de-registers neighbors dynamically.
To configure BFD in BGP, run the following commands.
You can configure BFD for a peer group or for an individual neighbor.
The following example configures BFD for swp51 and uses the default intervals.
cumulus@switch:~$ nv set vrf default router bgp neighbor swp51 bfd enable on
cumulus@switch:~$ nv config apply
The following example configures BFD for the peer group fabric and sets the interval multiplier to 4, the minimum interval between received BFD control packets to 400, and the minimum interval for sending BFD control packets to 400.
To see neighbor information in BGP, including BFD status, run the vtysh show ip bgp neighbor <interface> command or the net show bgp neighbor <interface> command. For example:
cumulus@switch:~$ sudo vtysh
switch# show ip bgp neighbor swp51
...
BFD: Type: single hop
Detect Mul: 4, Min Rx interval: 400, Min Tx interval: 400
Status: Down, Last update: 0:00:00:08
...
BFD in OSPF
When you enable or disable BFD in OSPF, PTM registers and de-registers neighbors dynamically. When you enable BFD on the interface, a neighbor registers with BFD when two-way adjacency starts and de-registers when adjacency goes down. The BFD configuration is per interface and any IPv4 and IPv6 neighbors discovered on that interface inherit the configuration.
The following example configures BFD in OSPF for interface swp1 and sets the interval multiplier to 4, the minimum interval between received BFD control packets to 400, and the minimum interval for sending BFD control packets to 400.
The vtysh commands save the configuration in the /etc/frr/frr.conf file. For example:
...
interface swp1
ipv6 ospf6 bfd 4 400 400
...
You can run different commands to show neighbor information in OSPF, including BFD status.
To show IPv6 OSPF interface information, run the vtysh show ip ospf6 interface <interface> command.
To show IPv4 OSPF interface information, run the vtysh show ip ospf interface <interface> command.
The following example shows IPv6 OSPF interface information.
cumulus@switch:~$ sudo vtysh
switch# show ip ospf6 interface swp2s0
swp2s0 is up, type BROADCAST
Interface ID: 4
Internet Address:
inet : 11.0.0.21/30
inet6: fe80::4638:39ff:fe00:6c8e/64
Instance ID 0, Interface MTU 1500 (autodetect: 1500)
MTU mismatch detection: enabled
Area ID 0.0.0.0, Cost 10
State PointToPoint, Transmit Delay 1 sec, Priority 1
Timer intervals configured:
Hello 10, Dead 40, Retransmit 5
DR: 0.0.0.0 BDR: 0.0.0.0
Number of I/F scoped LSAs is 2
0 Pending LSAs for LSUpdate in Time 00:00:00 [thread off]
0 Pending LSAs for LSAck in Time 00:00:00 [thread off]
BFD: Detect Mul: 3, Min Rx interval: 300, Min Tx interval: 300
To show IPv6 OSPF neighbor details, run the vtysh show ip ospf6 neighbor detail command.
To show IPv4 OSPF neighbor details, run the vtysh show ip ospf neighbor detail command.
The following example shows IPv6 OSPF neighbor details.
cumulus@switch:~$ sudo vtysh
switch# show ip ospf6 neighbor detail
Neighbor 0.0.0.4%swp2s0
Area 0.0.0.0 via interface swp2s0 (ifindex 4)
His IfIndex: 3 Link-local address: fe80::202:ff:fe00:a
State Full for a duration of 02:32:33
His choice of DR/BDR 0.0.0.0/0.0.0.0, Priority 1
DbDesc status: Slave SeqNum: 0x76000000
Summary-List: 0 LSAs
Request-List: 0 LSAs
Retrans-List: 0 LSAs
0 Pending LSAs for DbDesc in Time 00:00:00 [thread off]
0 Pending LSAs for LSReq in Time 00:00:00 [thread off]
0 Pending LSAs for LSUpdate in Time 00:00:00 [thread off]
0 Pending LSAs for LSAck in Time 00:00:00 [thread off]
BFD: Type: single hop
Detect Mul: 3, Min Rx interval: 300, Min Tx interval: 300
Status: Up, Last update: 0:00:00:20
Scripts
ptmd executes the script at /etc/ptm.d/bfd-sess-down when a BFD session goes down and the script at /etc/ptm.d/bfd-sess-up when a BFD session comes up. Modify these default scripts as needed.
Echo Function
Cumulus Linux supports the echo function for IPv4 single hops only, and with the asynchronous operating mode only (Cumulus Linux does not support demand mode).
Use the echo function to test the forwarding path on a remote system. To enable the echo function, set echoSupport to 1 in the topology file.
After the remote system loops the echo packets, the BFD control packets can send at a much lower rate. You configure this lower rate by setting the slowMinTx parameter in the topology file to a non-zero value in milliseconds.
You can use more aggressive detection times for echo packets because the round-trip time is less; echo packets access the forwarding path. You can configure the detection interval by setting the echoMinRx parameter in the topology file. The minimum setting is 50 milliseconds. After you configure this setting, BFD control packets send at this required minimum echo Rx interval. This indicates to the peer that the local system can loop back the echo packets. Echo packets transmit if the peer supports receiving echo packets.
About the Echo Packet
Cumulus Linux encapsulates BFD echo packets into UDP packets over destination and source UDP port number 3785. The BFD echo packet format is vendor-specific. BFD echo packets that originate from Cumulus Linux are eight bytes long and have the following format:
0 | 1 | 2 | 3
Version | Length | Reserved | Reserved
My Discriminator (spans columns 0 through 3)
Where:
Version is the version of the BFD echo packet.
Length is the length of the BFD echo packet.
My Discriminator is a non-zero value that uniquely identifies a BFD session on the transmitting side. When the originating node receives the packet after the receiving system loops it back, this value uniquely identifies the BFD session.
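The eight-byte layout above can be sketched with Python's struct module. This is an illustration only: the field widths (one byte each for Version and Length, two reserved bytes, four bytes for My Discriminator) are inferred from the layout and the eight-byte total, and the network byte order and the version value of 1 are assumptions, not values taken from this guide.

```python
import struct

# Assumed layout: 1-byte version, 1-byte length, 2 reserved bytes,
# 4-byte My Discriminator, packed in network byte order with no padding.
ECHO_FORMAT = "!BBHI"
ECHO_LENGTH = struct.calcsize(ECHO_FORMAT)


def build_echo_packet(my_discriminator: int, version: int = 1) -> bytes:
    """Pack an echo packet; the default version value is a placeholder."""
    if my_discriminator == 0:
        raise ValueError("My Discriminator must be non-zero")
    return struct.pack(ECHO_FORMAT, version, ECHO_LENGTH, 0, my_discriminator)


def parse_echo_packet(packet: bytes) -> dict:
    """Unpack a looped-back echo packet and return its named fields."""
    version, length, _reserved, my_discriminator = struct.unpack(ECHO_FORMAT, packet)
    return {"version": version, "length": length, "my_discriminator": my_discriminator}


packet = build_echo_packet(my_discriminator=0x1234)
parsed = parse_echo_packet(packet)
print(len(packet), parsed)
```

When the packet comes back, the originating side would use the my_discriminator field to look up the session it belongs to.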
Transmit and Receive Echo Packets
Cumulus Linux transmits BFD echo packets for a BFD session only when the peer advertises a non-zero value for the required minimum echo receive interval (the echoMinRx setting) in the BFD control packet when the BFD session starts. The switch bases the transmit rate of the echo packets on the peer advertised echo receive value in the control packet.
Cumulus Linux loops BFD echo packets back to the originating node for a BFD session only if you configure both echoMinRx and echoSupport locally to non-zero values.
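The two conditions above can be expressed as a pair of small checks. This is a simplified sketch, not the ptmd implementation; the function names are hypothetical, and it assumes the switch transmits at exactly the interval the peer advertises.

```python
from typing import Optional


def echo_transmit_interval_ms(peer_echo_min_rx_ms: int) -> Optional[int]:
    """Return the echo transmit interval, or None when echo must not be sent."""
    # A peer that advertises 0 cannot receive echo packets.
    if peer_echo_min_rx_ms <= 0:
        return None
    # Simplification: transmit at the rate the peer advertised.
    return peer_echo_min_rx_ms


def should_loop_back_echo(echo_support: int, echo_min_rx_ms: int) -> bool:
    """Loop echo packets back only when both local settings are non-zero."""
    return echo_support != 0 and echo_min_rx_ms != 0


print(echo_transmit_interval_ms(50), echo_transmit_interval_ms(0))
print(should_loop_back_echo(1, 50), should_loop_back_echo(0, 0))
```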
Echo Function Parameters
You configure the echo function by setting the following parameters in the topology file at the global, template and port level:
echoSupport enables and disables echo mode. Set to 1 to enable the echo function. It defaults to 0 (disable).
echoMinRx is the minimum interval between echo packets the local system is capable of receiving. The BFD control packet advertises this value. When you enable the echo function, it defaults to 50. If you disable the echo function, this parameter is automatically 0, which indicates the port or the node cannot process or receive echo packets.
slowMinTx is the minimum interval between transmitting BFD control packets when the switch exchanges echo packets.
Troubleshooting
To troubleshoot BFD, run the net show bfd detail command or the Linux ptmctl -b command.
cumulus@switch:~$ net show bfd detail
----------------------------------------------------------------------------------------
port peer state local type diag det tx_timeout rx_timeout
mult
----------------------------------------------------------------------------------------
swp1 fe80::202:ff:fe00:1 Up N/A singlehop N/A 3 300 900
swp1 3101:abc:bcad::2 Up N/A singlehop N/A 3 300 900
#continuation of output
---------------------------------------------------------------------
echo echo max rx_ctrl tx_ctrl rx_echo tx_echo
tx_timeout rx_timeout hop_cnt
---------------------------------------------------------------------
0 0 N/A 187172 185986 0 0
0 0 N/A 501 533 0 0
ARP is a communication protocol that discovers the link layer address, such as a MAC address, associated with a network layer address. The Cumulus Linux ARP implementation differs from standard Debian Linux ARP behavior because Cumulus Linux is an operating system for routers and switches, not servers.
Standard Debian ARP Behavior and the Tunable ARP Parameters
Debian has these five tunable ARP parameters:
arp_accept
arp_announce
arp_filter
arp_ignore
arp_notify
For a full description of these parameters, refer to the Linux documentation.
The standard Debian installation sets these ARP parameters to 0, leaving the router as wide open and unrestricted as possible. The Linux IP addresses are a property of the device, not an individual interface. Therefore, you can send an ARP request or reply on one interface with an address that resides on a different interface. While this unrestricted behavior makes sense for a server, it is not the normal behavior of a router. Routers expect the MAC and IP address mappings that ARP provides to match the physical topology, so that the IP addresses match the interfaces on which they reside. With these tunable ARP parameters, Cumulus Linux is able to specify the behavior to match the expectations of a router.
ARP Tunable Parameter Settings in Cumulus Linux
Parameter
Default Setting
Type
Description
arp_accept
0
BOOL
Defines the behavior for gratuitous ARP frames when the IP address is not already in the ARP table:
0: Do not create new entries in the ARP table.
1: Create new entries in the ARP table.
You can set arp_accept on an individual interface to a value that differs from the rest of the switch (see below).
arp_announce
2
INT
Defines different restriction levels for announcing the local source IP address from IP packets in ARP requests that send on an interface:
0: Use any local address configured on any interface.
1: Avoid local addresses that are not in the target subnet for this interface. You can use this mode when target hosts reachable through this interface require the source IP address in ARP requests to be part of their logical network configured on the receiving interface. When Cumulus Linux generates the request, it checks all subnets that include the target IP address and preserves the source address if it is from such a subnet. If there is no such subnet, Cumulus Linux selects the source address according to the rules for level 2.
2: Always use the best local address for this target. In this mode, Cumulus Linux ignores the source address in the IP packet and tries to select the local address preferred for talks with the target host. To select the local address, Cumulus Linux looks for primary IP addresses on all the subnets on the outgoing interface that include the target IP address. If there is no suitable local address, Cumulus Linux selects the first local address on the outgoing interface or on all other interfaces, so that it receives a reply for the request regardless of the announced source IP address.
The default Debian behavior (arp_announce is 0) sends gratuitous ARPs or ARP requests using any local source IP address and does not limit the IP source of the ARP packet to an address residing on the interface that sends the packet.
Routers expect a different relationship between the IP address and the physical network. Adjoining routers look for MAC and IP addresses to reach a next hop residing on a connecting interface for transiting traffic. By setting the arp_announce parameter to 2, Cumulus Linux uses the best local address for each ARP request, preferring the primary addresses on the interface that sends the ARP.
arp_filter
0
BOOL
0: The kernel can respond to ARP requests with addresses from other interfaces to increase the chance of successful communication. The complete host on Linux (not specific interfaces) owns the IP addresses. For more complex configurations, such as load balancing, this behavior can cause problems.
1: Allows you to have multiple network interfaces on the same subnet and to answer the ARPs for each interface based on whether the kernel routes a packet from the ARPd IP address out of that interface (you must use source based routing).
arp_filter for the interface is on if at least one of conf/{all,interface}/arp_filter is TRUE; otherwise, it is off.
Cumulus Linux uses the default Debian Linux arp_filter setting of 0. The switch uses arp_filter when multiple interfaces reside in the same subnet and allows certain interfaces to respond to ARP requests. For OSPF with IP unnumbered interfaces, multiple interfaces appear in the same subnet and contain the same address. If you use multiple interfaces between a pair of routers and set arp_filter to 1, forwarding can fail.
The arp_filter parameter allows a response on any interface in the subnet, whereas the arp_ignore setting (below) limits cross-interface ARP behavior.
arp_ignore
1
INT
Defines different modes for sending replies in response to received ARP requests that resolve local target IP addresses:
0: Reply for any local target IP address on any interface.
1: Reply only if the target IP address is the local address on the incoming interface.
2: Reply only if the target IP address is the local address on the incoming interface and the sender IP address is part of same subnet on this interface.
3: Do not reply for local addresses with scope host; the switch replies only for global and link addresses.
4-7: Reserved.
8: Do not reply for all local addresses.
The switch uses the maximum value from conf/{all,interface}/arp_ignore when the {interface} receives the ARP request.
The default Debian arp_ignore setting of 0 allows the device to reply to an ARP request for any IP address on any interface. While this matches the expectation that an IP address belongs to the device, not an interface, it can cause some unexpected behavior on a router.
For example, if arp_ignore is 0 and the switch receives an ARP request on one interface for the IP address residing on a different interface, the switch responds with an ARP reply even if the interface of the target address is down. This can cause traffic loss because the switch does not know if it can reach the next hops and results in troubleshooting challenges for failure conditions.
If you set arp_ignore to 2, the switch only replies to ARP requests if the target IP address is a local address and both the sender and target IP addresses are part of the same subnet on the incoming interface. The router does not create stale neighbor entries when a peer device sends an ARP request from a source IP address that is not on the connected subnet. Eventually, the switch sends ARP requests to the host to try to keep the entry fresh. If the host responds, the switch now has reachable neighbor entries for hosts that are not on the connected subnet.
arp_notify
1
BOOL
Defines the mode to notify address and device changes.
0: Do nothing.
1: Generate gratuitous ARP requests when the device comes up or the hardware address changes.
The default Debian arp_notify setting is to remain silent when an interface comes up or the hardware address changes. Because Cumulus Linux often acts as a next hop for several end hosts, it notifies attached devices when an interface comes up or the address changes, which speeds up new information convergence and provides the most rapid support for changes.
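To make the arp_ignore modes in the table concrete, the following sketch models the reply decision for modes 0, 1, 2, and 8. It is a simplified illustration rather than the kernel's code; the function name, the iface_addrs mapping, and the example addresses are all hypothetical.

```python
import ipaddress


def should_reply(arp_ignore, target_ip, sender_ip, incoming_iface, iface_addrs):
    """Decide whether to answer an ARP request.

    iface_addrs maps an interface name to a list of 'address/prefix' strings.
    """
    target = ipaddress.ip_address(target_ip)
    sender = ipaddress.ip_address(sender_ip)
    all_local = [ipaddress.ip_interface(a)
                 for addrs in iface_addrs.values() for a in addrs]
    incoming = [ipaddress.ip_interface(a)
                for a in iface_addrs.get(incoming_iface, [])]
    if arp_ignore == 0:
        # Any local address on any interface.
        return any(i.ip == target for i in all_local)
    if arp_ignore == 1:
        # Only addresses configured on the incoming interface.
        return any(i.ip == target for i in incoming)
    if arp_ignore == 2:
        # As mode 1, and the sender must be in the same subnet.
        return any(i.ip == target and sender in i.network for i in incoming)
    if arp_ignore == 8:
        # Never reply for local addresses.
        return False
    raise ValueError("mode not modeled in this sketch")


addrs = {"swp1": ["10.0.0.1/24"], "swp2": ["10.0.1.1/24"]}
# A request for the swp2 address arrives on swp1.
cross_iface = [should_reply(m, "10.0.1.1", "10.0.0.5", "swp1", addrs) for m in (0, 1, 2)]
# A request for the swp1 address arrives on swp1 from an off-subnet sender.
off_subnet = [should_reply(m, "10.0.0.1", "192.168.1.5", "swp1", addrs) for m in (0, 1, 2)]
print(cross_iface, off_subnet)
```

In the first case only mode 0 replies, which is the cross-interface behavior the Cumulus Linux default of 1 avoids. In the second case modes 0 and 1 reply, but mode 2 does not.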
Change Tunable ARP Parameters
You can change the ARP parameter settings in several places, including:
/proc/sys/net/ipv4/conf/all/arp* (all interfaces)
/proc/sys/net/ipv4/conf/default/arp* (default for future interfaces)
The ARP parameter changes in Cumulus Linux use the default file locations.
The all and default locations sound similar but they operate in different ways. The all location can potentially change the value for all interfaces running IP, both now and in the future. The all value applies to each parameter using either MAX or OR logic between the all and any port-specific settings, as the following table shows:
ARP Parameter | Condition
arp_accept | OR
arp_announce | MAX
arp_filter | OR
arp_ignore | MAX
arp_notify | MAX
For example, if you set the /proc/sys/net/ipv4/conf/all/arp_ignore value to 1 and the /proc/sys/net/ipv4/conf/swp1/arp_ignore value to 0 to try to disable it on a per-port basis, interface swp1 still uses the value of 1; the port-specific setting does not override the global all setting. Instead, the MAX value between the all value and port-specific value defines the actual behavior.
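The MAX and OR rules can be sketched as a small lookup. The mapping below copies the conditions listed above; the function name is illustrative only.

```python
# How the all value combines with a port-specific value, per parameter.
COMBINE = {
    "arp_accept": "OR",
    "arp_announce": "MAX",
    "arp_filter": "OR",
    "arp_ignore": "MAX",
    "arp_notify": "MAX",
}


def effective_value(parameter: str, all_value: int, port_value: int) -> int:
    """Return the value an interface actually uses for an ARP parameter."""
    if COMBINE[parameter] == "MAX":
        return max(all_value, port_value)
    # OR logic: the parameter is on if either location turns it on.
    return int(bool(all_value) or bool(port_value))


# all/arp_ignore is 1 and swp1/arp_ignore is 0: swp1 still behaves as 1.
print(effective_value("arp_ignore", 1, 0))
```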
The default location /proc/sys/net/ipv4/conf/default/arp* defines the values for all future IP interfaces. Changing the default setting of an ARP parameter does not impact interfaces that already have an IP address. If you make changes to a running system that already has assigned IP addresses, use port-specific settings instead.
Cumulus Linux copies the value of the default parameter to every port-specific location, excluding those that already have an IP address. There is no complicated logic between the default setting and the port-specific setting (unlike the all location).
To determine the current ARP parameter settings for each of the locations, run the following commands:
To make the change persist through reboots, edit the /etc/sysctl.d/arp.conf file and add your port-specific ARP setting.
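The settings in the all, default, and port-specific locations can also be read programmatically. The following is a minimal sketch that walks the /proc/sys/net/ipv4/conf directory and collects the five ARP parameters for each location; the function name and its conf_root argument are hypothetical, and the sketch only reads values.

```python
from pathlib import Path

ARP_PARAMETERS = ("arp_accept", "arp_announce", "arp_filter",
                  "arp_ignore", "arp_notify")


def read_arp_settings(conf_root="/proc/sys/net/ipv4/conf"):
    """Return {location: {parameter: value}} for all, default, and each port."""
    settings = {}
    root = Path(conf_root)
    if not root.is_dir():
        # Not a Linux system, or the path does not exist.
        return settings
    for location in sorted(p for p in root.iterdir() if p.is_dir()):
        values = {}
        for name in ARP_PARAMETERS:
            entry = location / name
            if entry.is_file():
                values[name] = int(entry.read_text().strip())
        settings[location.name] = values
    return settings


for location, values in read_arp_settings().items():
    print(location, values)
```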
Configure Proxy ARP
When you enable proxy ARP, if the switch receives an ARP request for which it has a route to the destination IP address, the switch sends a proxy ARP reply that contains its own MAC address. The host that sent the ARP request then sends its packets to the switch and the switch forwards the packets to the intended host.
Proxy ARP works with IPv4 only; ARP is an IPv4-only protocol.
The following example commands enable proxy ARP on swp1.
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/network/interfaces file to set /proc/sys/net/ipv4/conf/<interface>/proxy_arp to 1 in the interface stanza, then run the ifreload -a command.
If you are running two interfaces in the same broadcast domain (typically seen when using VRR, which creates a -v0 interface in the same broadcast domain), set /proc/sys/net/ipv4/conf/<INTERFACE>/medium_id to 2 on both the base SVI interface and the -v0 interface so that only one of the two interfaces replies when getting an ARP request. This prevents the v0 interface from proxy replying on behalf of the SVI (and the SVI from proxy replying on behalf of the v0 interface). You can only prevent duplicate replies when the ARP request is for the SVI or the v0 interface directly.
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/network/interfaces file, then run the ifreload -a command. For example:
If you are running proxy ARP on a VRR interface, add a post-up line to the VRR interface stanza similar to the following. For example, if vlan100 is the VRR interface for the configuration above:
Cumulus Linux does not provide NVUE commands for this setting.
Edit the /etc/network/interfaces file, then run the ifreload -a command. For example:
In centralized VXLAN environments with ARP and ND suppression, if the SVIs on the leafs do not have an IP address within the subnet, problems occur with the Duplicate Address Detection process on Microsoft Windows hosts. For example, in a pure layer 2 scenario or with SVIs that have the ip-forward option off, the SVI does not have an IP address. The neighmgrd service selects a source IP address for an ARP probe based on the subnet match on the neighbor IP address. Because the SVI that learns this neighbor does not have an IP address, the subnet match fails and neighmgrd uses UNSPEC (0.0.0.0 for IPv4) as the source IP address in the ARP probe.
To work around this issue, run the neighmgrctl setsrcipv4 <ipaddress> command to specify a non-0.0.0.0 address for the source; for example:
cumulus@switch:~$ neighmgrctl setsrcipv4 10.1.0.2
The configuration above does not persist if you reboot the switch. To make the changes apply persistently:
Create a new file called /etc/cumulus/neighmgr.conf and add the setsrcipv4 <ipaddress> option; for example:
Cumulus Linux does not interact directly with end systems as much as end systems interact with each other. Therefore, after ARP places a neighbor into a reachable state, if Cumulus Linux does not interact with the client again for a long enough period of time, the neighbor can move into a stale state. To keep neighbors in the reachable state, Cumulus Linux includes a background process (/usr/bin/neighmgrd). The background process can track neighbors that move into a stale, delay, or probe state, and attempt to refresh their state before removing them from the Linux kernel and from hardware forwarding. If you want the neighmgrd process to add a neighbor only if the sender IP address in the ARP packet is in one of the SVI’s subnets, create the /etc/cumulus/neighmgr.conf file and add the subnet_checks=1 parameter under the [snooper] header. By default, the subnet_checks option is set to 0 (disabled) so that neighmgrd allows out-of-network neighbors to be processed from SVIs.
The ARP refresh timer defaults to 1080 seconds (18 minutes).
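The effect of the subnet_checks option can be sketched as follows. This is an illustration of the described behavior, not neighmgrd code: it assumes the file can be parsed as INI text with Python's configparser, and the function names and example subnets are hypothetical.

```python
import configparser
import ipaddress

# Contents of /etc/cumulus/neighmgr.conf as described above.
EXAMPLE_CONF = """
[snooper]
subnet_checks=1
"""


def subnet_checks_enabled(conf_text: str) -> bool:
    """Return True when subnet_checks=1 appears under the [snooper] header."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # The option is treated as 0 (disabled) when it is absent.
    return parser.getint("snooper", "subnet_checks", fallback=0) == 1


def accept_neighbor(sender_ip: str, svi_subnets, subnet_checks: bool) -> bool:
    """Accept any sender when checks are off; otherwise require a subnet match."""
    if not subnet_checks:
        return True
    sender = ipaddress.ip_address(sender_ip)
    return any(sender in ipaddress.ip_network(s) for s in svi_subnets)


enabled = subnet_checks_enabled(EXAMPLE_CONF)
print(enabled, accept_neighbor("10.1.20.5", ["10.1.10.0/24"], enabled))
```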
ND allows different devices on the same link to advertise their existence to their neighbors and to learn about the existence of their neighbors. ND is the IPv6 equivalent of IPv4 ARP for layer 2 address resolution.
ND is on by default. Cumulus Linux provides a set of configuration options to support IPv6 networks and adjust your security settings.
ND Configuration Options
Cumulus Linux provides options to configure:
Router Advertisement
IPv6 prefixes
Recursive DNS servers
DNS Search Lists
Home Agents
MTU for neighbor discovery messages
Router Advertisement
Router Advertisement is disabled by default. To enable Router Advertisement for an interface:
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement enable on
cumulus@leaf01:mgmt:~$ nv config apply
Allow consecutive Router Advertisement packets to transmit more frequently than every three seconds (fast retransmit). You can set this parameter to on or off. The default setting is on.
Set the hop limit value advertised in a Router Advertisement message. You can set a value between 0 and 255. The default value is 64.
Set the interval between unsolicited multicast router advertisements from the interface. You can set a value between 70 and 1800000 milliseconds. The default value is 600000 milliseconds.
Set the maximum amount of time that Router Advertisement messages can exist on the route. You can set a value between 0 and 9000 seconds. The default value is 1800.
Allow a dynamic host to use a managed protocol, such as DHCPv6 to configure IP addresses automatically (managed configuration). Set this parameter to on or off. By default, this parameter is not set.
Allow a dynamic host to use a managed protocol to configure additional information through DHCPv6. Set this parameter to on or off. By default, this parameter is not set.
Set the amount of time that an IPv6 node is reachable. You can set a value between 0 and 3600000 milliseconds. The default value is 0.
Set the interval at which neighbor solicitation messages retransmit. You can set a value between 0 and 4294967295 milliseconds. The default value is 0.
Allow hosts to use router preference to select the default router. You can set a value of high, medium, or low. The default value is medium.
The following example commands set:
The Router Advertisement interval to 60000 milliseconds (60 seconds).
The router preference to high.
The amount of time that an IPv6 node is reachable to 3600000.
The interval at which neighbor solicitation messages retransmit to 4294967295.
The hop limit value in the Router Advertisement message to 100.
The maximum amount of time that Router Advertisement messages exist on the route to 4000.
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement interval 60000
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement router-preference high
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement reachable-time 3600000
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement retransmit-time 4294967295
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement hop-limit 100
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement lifetime 4000
cumulus@leaf01:mgmt:~$ nv config apply
The following example commands set fast retransmit to off and managed configuration to on:
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement fast-retransmit off
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery router-advertisement managed-config on
cumulus@leaf01:mgmt:~$ nv config apply
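Before applying Router Advertisement settings, it can help to check values against the documented ranges. The following is a minimal sketch; the option names come from the nv set commands above, the ranges come from the parameter descriptions, and the pairing of each name with its range and the helper function itself are assumptions for illustration.

```python
# Documented ranges for the numeric Router Advertisement options.
RA_RANGES = {
    "hop-limit": (0, 255),
    "interval": (70, 1800000),           # milliseconds
    "lifetime": (0, 9000),               # seconds
    "reachable-time": (0, 3600000),      # milliseconds
    "retransmit-time": (0, 4294967295),  # milliseconds
}


def out_of_range(settings: dict) -> dict:
    """Return the subset of settings whose values fall outside their range."""
    bad = {}
    for name, value in settings.items():
        low, high = RA_RANGES[name]
        if not low <= value <= high:
            bad[name] = value
    return bad


# The values used in the example commands above.
example = {
    "interval": 60000,
    "reachable-time": 3600000,
    "retransmit-time": 4294967295,
    "hop-limit": 100,
    "lifetime": 4000,
}
print(out_of_range(example))
```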
To configure IPv6 prefixes, you must specify the IPv6 prefixes you want to include in router advertisements. In addition, you can configure these optional settings:
Set the amount of time that the prefix is valid for on-link determination. You can set a value between 0 and 4294967295 seconds. The default value is 2592000.
Set the amount of time that addresses generated from a prefix remain preferred. You can set a value between 0 and 4294967295 seconds. The default value is 604800.
Enable advertisement to make no statement about prefix on-link or off-link properties. By default, this setting is off.
Enable the specified prefix to use IPv6 autoconfiguration. By default, this setting is on.
Indicate to hosts on the local link that the specified prefix contains a complete IP address by setting the R flag. By default, this setting is off.
The following example commands set the IPv6 prefix to 2001:db8:1::100/32, the amount of time that the prefix is valid for on-link determination to 2000000000, and the amount of time that addresses generated from a prefix remain preferred to 1000000000.
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery prefix 2001:db8:1::100/32 valid-lifetime 2000000000
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery prefix 2001:db8:1::100/32 preferred-lifetime 1000000000
cumulus@leaf01:mgmt:~$ nv config apply
The following example commands set advertisement to make no statement about prefix on-link or off-link properties, enable the specified prefix to use IPv6 autoconfiguration, and indicate to hosts on the local link that the specified prefix contains a complete IP address.
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery prefix 2001:db8:1::100/32 off-link on
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery prefix 2001:db8:1::100/32 autoconfig on
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery prefix 2001:db8:1::100/32 router-address on
cumulus@leaf01:mgmt:~$ nv config apply
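A prefix and its lifetimes can be sanity-checked with Python's ipaddress module. This is a sketch under stated assumptions: the range limits come from the descriptions above, while the rule that the preferred lifetime should not exceed the valid lifetime comes from RFC 4861 rather than from this guide, and the function name is hypothetical.

```python
import ipaddress

MAX_LIFETIME = 4294967295  # upper bound for both lifetimes, in seconds


def check_prefix(prefix: str, valid_lifetime: int = 2592000,
                 preferred_lifetime: int = 604800):
    """Return the normalized network and a list of problems found."""
    # strict=False accepts a prefix with host bits set and masks them off.
    network = ipaddress.ip_network(prefix, strict=False)
    problems = []
    if network.version != 6:
        problems.append("prefix is not IPv6")
    for name, value in (("valid-lifetime", valid_lifetime),
                        ("preferred-lifetime", preferred_lifetime)):
        if not 0 <= value <= MAX_LIFETIME:
            problems.append(name + " is out of range")
    # Assumption from RFC 4861: preferred must not exceed valid.
    if preferred_lifetime > valid_lifetime:
        problems.append("preferred-lifetime exceeds valid-lifetime")
    return network, problems


network, problems = check_prefix("2001:db8:1::100/32", 2000000000, 1000000000)
print(network, problems)
```

The example prefix has host bits set, so the module reports the covering network 2001:db8::/32.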
To configure recursive DNS servers (RDNSS), you must specify the IPv6 address of each RDNSS you want to advertise.
An optional parameter lets you set the maximum amount of time you want to use the RDNSS for domain name resolution. You can set a value between 0 and 4294967295 seconds or use the keyword infinite to set the time to never expire. If you set the value to 0, Cumulus Linux no longer advertises the RDNSS address.
The following example commands set the RDNSS address to 2001:db8:1::100 and the lifetime to infinite:
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery rdnss 2001:db8:1::100 lifetime infinite
cumulus@leaf01:mgmt:~$ nv config apply
To configure DNS search lists (DNSSL), you must specify the domain suffix you want to advertise.
An optional parameter lets you set the maximum amount of time you want to use the domain suffix for domain name resolution. You can set a value between 0 and 4294967295 seconds or use the keyword infinite to set the time to never expire. If you set the value to 0, the host does not use the DNSSL.
The following example command sets the domain suffix to accounting.nvidia.com and the maximum amount of time you want to use the domain suffix to infinite:
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery dnssl accounting.nvidia.com lifetime infinite
cumulus@leaf01:mgmt:~$ nv config apply
Mobile IPv6 defines an additional flag in the router advertisement message that indicates if the advertising router is capable of being a Home Agent. Each Home Agent on the home link sets this flag when it sends router advertisements.
You can configure the switch to be a Home Agent with these settings:
Set the maximum amount of time you want the router to act as a Home Agent. You can set a value between 0 and 65520 seconds. The default value is 0 (the router is not a Home Agent).
Set the Home Agent router preference. You can set a value between 0 and 65535. The default value is 0 (the lowest preference).
The following example commands configure the switch as a Home Agent by setting the maximum amount of time the router acts as a Home Agent to 20000 seconds and the router preference to 100:
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery home-agent preference 100
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery home-agent lifetime 20000
cumulus@leaf01:mgmt:~$ nv config apply
When you run the above commands, NVUE adds the ipv6 nd home-agent-config-flag line under the interface stanza in the /etc/frr/frr.conf file in addition to the ipv6 nd home-agent-preference and ipv6 nd home-agent-lifetime lines.
To disable ND, run the NVUE nv set interface <interface> ip neighbor-discovery enable off command:
cumulus@leaf01:mgmt:~$ nv set interface swp1 ip neighbor-discovery enable off
cumulus@leaf01:mgmt:~$ nv config apply
Troubleshooting
To show the ND settings for an interface, run the NVUE nv show interface <interface-id> ip neighbor-discovery command:
cumulus@leaf01:mgmt:~$ nv show interface swp1 ip neighbor-discovery
applied description
-------------------- ------------------ ----------------------------------------------------------------------
enable on Turn the feature 'on' or 'off'. The default is 'on'.
home-agent
lifetime 0 Lifetime of a home agent in seconds
preference 0 Home agent's preference value that is used to order the addresses r...
[prefix] 2001:db8:1::100/32 IPv6 prefix configuration
router-advertisement
enable on Turn the feature 'on' or 'off'. The default is 'on'.
fast-retransmit off Allow consecutive RA packets more frequently than every 3 seconds
hop-limit 100 Value in hop count field in IP header of the outgoing router advert...
interval 6000 Maximum time in milliseconds allowed between sending unsolicited mu...
interval-option on Indicates hosts that the router will use advertisement interval to...
lifetime 4000 Maximum time in seconds that the router can be treated as default g...
managed-config on Knob to allow dynamic host to use managed (stateful) protocol for a...
other-config off Knob to allow dynamic host to use managed (stateful) protocol for a...
reachable-time 3600000 Time in milliseconds that a IPv6 node is considered reachable
retransmit-time 4294967295 Time in milliseconds between retransmission of neighbor solicitatio...
router-preference high Hosts use router preference in selection of the default router
To show prefix configuration for an interface, run the nv show interface <interface> ip neighbor-discovery prefix <prefix> command.
cumulus@leaf01:mgmt:~$ nv show interface swp1 ip neighbor-discovery prefix 2001:db8:1::100/32
applied description
------------------ ------- ----------------------------------------------------------------------
autoconfig on Indicates to hosts on the local link that the specified prefix can...
off-link on Indicates that adverisement makes no statement about on-link or off...
preferred-lifetime 1000000000 Time in seconds that addresses generated from a prefix remain prefe...
router-address on Indicates to hosts on the local link that the specified prefix cont...
valid-lifetime 2000000000 Time in seconds the prefix is valid for on-link determination
To show Home Agent configuration for an interface, run the nv show interface <interface> ip neighbor-discovery home-agent command:
cumulus@leaf01:mgmt:~$ nv show interface swp1 ip neighbor-discovery home-agent
applied description
---------- ------- ----------------------------------------------------------------------
lifetime 20000 Lifetime of a home agent in seconds
preference 100 Home agent's preference value that is used to order the addresses r...
To show router advertisement configuration for an interface, run the nv show interface <interface> ip neighbor-discovery router-advertisement command:
cumulus@leaf01:mgmt:~$ nv show interface swp1 ip neighbor-discovery router-advertisement
applied description
----------------- ------- ----------------------------------------------------------------------
enable on Turn the feature 'on' or 'off'. The default is 'on'.
fast-retransmit on Allow consecutive RA packets more frequently than every 3 seconds
hop-limit 64 Value in hop count field in IP header of the outgoing router advert...
interval 600000 Maximum time in milliseconds allowed between sending unsolicited mu...
interval-option on Indicates hosts that the router will use advertisement interval to...
lifetime 1800 Maximum time in seconds that the router can be treated as default g...
managed-config off Knob to allow dynamic host to use managed (stateful) protocol for a...
other-config off Knob to allow dynamic host to use managed (stateful) protocol for a...
reachable-time 0 Time in milliseconds that a IPv6 node is considered reachable
retransmit-time 0 Time in milliseconds between retransmission of neighbor solicitatio...
router-preference medium Hosts use router preference in selection of the default router
To show RDNSS configuration for an interface, run the nv show interface <interface> ip neighbor-discovery rdnss <address> command:
cumulus@leaf01:mgmt:~$ nv show interface swp1 ip neighbor-discovery rdnss 2001:db8:1::100
applied description
-------- -------- ----------------------------------------------------------------------
lifetime infinite Maximum time in seconds for which the server may be used for domain...
To show DNSSL configuration for an interface, run the nv show interface <interface> ip neighbor-discovery dnssl <domain-suffix> command:
cumulus@leaf01:mgmt:~$ nv show interface swp1 ip neighbor-discovery dnssl accounting.nvidia.com
applied description
-------- -------- ----------------------------------------------------------------------
lifetime infinite Maximum time in seconds for which the domain suffix may be used for...
Monitoring and Troubleshooting
This chapter introduces the basics for monitoring and troubleshooting Cumulus Linux.
Serial Console
Use the serial console to debug issues if you reboot the switch often or if you do not have a reliable network connection.
The default serial console baud rate is 115200, which is the baud rate ONIE uses.
Configure the Serial Console
On x86 switches, you configure the serial console baud rate by editing the grub configuration.
Incorrect configuration settings in grub can make the switch inaccessible through the console. Review grub changes before you implement them.
The valid values for the baud rate are:
300
600
1200
2400
4800
9600
19200
38400
115200
To change the serial console baud rate:
Edit the /etc/default/grub file and provide a valid value for the --speed and console variables:
After you save your changes to the grub configuration, run the update-grub command:
cumulus@switch:~$ sudo update-grub
If you plan on accessing the switch BIOS over the serial console, you need to update the baud rate in the switch BIOS. For more information, see this knowledge base article.
Reboot the switch.
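The edit in the first step can be sketched as a small shell helper. This is a minimal illustration, not the documented procedure: valid_baud and set_baud are hypothetical names, and the two sample lines (GRUB_SERIAL_COMMAND with --unit=0, and console=ttyS0) are assumptions about what the file might contain. The helper only prints the rewritten content; it does not modify /etc/default/grub.

```shell
# valid_baud RATE: succeed only if RATE is one of the baud rates listed above.
valid_baud() {
    case "$1" in
        300|600|1200|2400|4800|9600|19200|38400|115200) return 0 ;;
        *) return 1 ;;
    esac
}

# set_baud FILE RATE: print FILE with the --speed and console values set to RATE.
set_baud() {
    valid_baud "$2" || { echo "invalid baud rate: $2" >&2; return 1; }
    sed -e "s/--speed=[0-9]*/--speed=$2/" \
        -e "s/\(console=ttyS[0-9]*,\)[0-9]*/\1$2/" "$1"
}

# Demo on a throwaway copy; the sample lines are assumed, not taken from a real switch.
sample=$(mktemp)
cat > "$sample" <<'EOF'
GRUB_SERIAL_COMMAND="serial --unit=0 --speed=115200"
GRUB_CMDLINE_LINUX="console=ttyS0,115200n8"
EOF
set_baud "$sample" 9600
rm -f "$sample"
```

Review the printed result before you copy it into the real file, because an incorrect grub setting can lock you out of the console.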
Change the Console Log Level
By default, the console prints all log messages except debug messages. To tune console logging to be less verbose so that certain levels of messages do not print, run the dmesg -n <level> command, where the log levels are:
| Level | Description |
| ----- | ----------- |
| 0 | Emergency messages (the system is about to crash or is unstable). |
| 1 | Serious conditions; you must take action immediately. |
| 2 | Critical conditions (serious hardware or software failures). |
| 3 | Error conditions (often used by drivers to indicate difficulties with the hardware). |
| 4 | Warning messages (nothing serious but might indicate problems). |
| 5 | Message notifications for many conditions, including security events. |
| 6 | Informational messages. |
| 7 | Debug messages. |
Only messages with a value lower than the level specified print to the console. For example, if you specify level 3, only level 2 (critical conditions), level 1 (serious conditions), and level 0 (emergency messages) print to the console:
cumulus@switch:~$ sudo dmesg -n 3
You can also run the dmesg --console-level <level> command, where the log levels are emerg, alert, crit, err, warn, notice, info, or debug. For example, to print critical conditions, run the following command:
cumulus@switch:~$ sudo dmesg --console-level crit
The console log level that you set with the dmesg command applies only until the next reboot.
For more details about the dmesg command, run man dmesg.
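To make the numeric threshold rule concrete, the following sketch lists which levels reach the console for a given dmesg -n value. It only encodes the rule stated above (messages with a value lower than the specified level print); console_levels is a hypothetical helper name, and the sketch does not call dmesg.

```shell
# console_levels N: print every level (number and name) that still reaches
# the console after "dmesg -n N", that is, every level lower than N.
console_levels() {
    threshold=$1
    level=0
    for name in emerg alert crit err warn notice info debug; do
        if [ "$level" -lt "$threshold" ]; then
            printf '%s (%s)\n' "$level" "$name"
        fi
        level=$((level + 1))
    done
}

console_levels 3
```

With a threshold of 3, the helper lists levels 0, 1, and 2, which matches the dmesg -n 3 example.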
Show System Information
Cumulus Linux provides commands to obtain system information and to show the version of Cumulus Linux you are running. Use these commands when performing system diagnostics, troubleshooting performance, or submitting a support request.
To show information about the version of Cumulus Linux running on the switch, run the nv show system command:
cumulus@switch:~$ nv show system
operational applied pending description
-------- ------------------- ------- ------- ------------------------------
hostname leaf01 Static hostname for the switch
build Cumulus Linux 5.5.0 system build version
uptime 6 days, 22:03:49 system uptime
timezone Etc/UTC system time zone
To show system memory information in bytes, run the nv show system memory command:
cumulus@switch:~$ nv show system memory
Type Buffers Cache Free Total Used Utilization
-------- ---------- ----------- ----------- ------------ ----------- -----------
Physical 81661952 B 571834368 B 373276672 B 1813528576 B 786755584 B 79.4%
Swap 0 B 0 B 0 B 0.0%
To show system CPU information, run the nv show system cpu command:
cumulus@switch:~$ nv show system cpu
operational applied pending description
----------- ----------------------------- ------- ------- --------------------------------
core-count 1 Core count
model QEMU Virtual CPU version 2.5+ Model name
utilization 11.8% Utilization over 2 frames of top
To show general information about the switch, run the nv show platform hardware command:
cumulus@switch:~$ nv show platform hardware
operational applied pending description
------------- ---------------------------------- ------- ------- -----------------------------------------
asic-model Spectrum-2 System on Chip (SOC) model
asic-vendor Mellanox System On Chip (SOC) vendor
base-mac 1C:34:DA:26:DB:00 The base mac address provided by eeprom
cpu x86_64 Intel Xeon D D-1527 2.20GHz System CPU Arch
disk-size 28.00 GB Hard Disk Size
manufacturer Mellanox The platform's manufacturer
memory 15.29 GB Hardware RAM
model MSN3700 The platform's model identifier
onie-version 2019.11-5.2.0020-115200 onie version
part-number MSN3700-VS2FO System part number
platform-name x86_64-mlnx_msn3700-r0 Hardware platform name
port-layout 32 x 200G-QSFP56 System port layout
product-name MSN3700 Product Name
serial-number MT2046X13056 System serial number
system-mac 1c:34:da:26:db:fd The MAC provided by eeprom for system-mac
Diagnostics Using cl-support
You can use cl-support to generate a single export file that contains various details about the switch configuration and is useful for remote debugging and troubleshooting. For more information about cl-support, read Understanding the cl-support Output File.
Run cl-support to investigate issues before you submit a support request.
cumulus@switch:~$ sudo cl-support -h
Usage: [-h (help)] [-cDjlMsv] [-d m1,m2,...] [-e m1,m2,...]
[-p prefix] [-r reason] [-S dir] [-T Timeout_seconds] [-t tag]
-h: Display this help message
-c: Run only modules matching any core files, if no -e modules
-D: Display debugging information
-d: Disable (do not run) modules in this comma separated list
-e: Enable (only run) modules in this comma separated list; "-e all" runs
all modules and sub-modules, including all optional modules
...
Send Log Files to a syslog Server
You can configure Cumulus Linux to send log files to one or more remote syslog servers.
The following example configures Cumulus Linux to send log files to the remote syslog server with the 192.168.0.254 address in the default VRF on port 514 using UDP.
You must specify a VRF in the command.
cumulus@switch:~$ nv set service syslog default server 192.168.0.254 port 514
cumulus@switch:~$ nv set service syslog default server 192.168.0.254 protocol udp
cumulus@switch:~$ nv config apply
The configuration creates the /etc/rsyslog.d/11-remotesyslog-default.conf file. The file has the following content:
cumulus@switch:~$ sudo cat /etc/rsyslog.d/11-remotesyslog-default.conf
# Auto-generated by NVUE!
# Any local modifications will prevent NVUE from re-generating this file.
# md5sum: c8e094c868c7f9be4cfa6ccec752b44b
#
# Remote syslog servers configured through CUE
#
action(type="omfwd" Target="192.168.0.254" Port="514" Protocol="udp")
Log Technical Details
rsyslog performs logging on Cumulus Linux. rsyslog provides both local logging to the syslog file and the ability to export logs to an external syslog server. All rsyslog log files use high precision timestamps:
2015-08-14T18:21:43.337804+00:00 cumulus switchd[3629]: switchd.c:1409 switchd version 1.0-cl2.5+5
Some Cumulus Linux applications write directly to a log file in the /var/log/ directory without going through rsyslog.
All Cumulus Linux rules are in separate files in /etc/rsyslog.d/, which rsyslog calls at the end of the GLOBAL DIRECTIVES section of the /etc/rsyslog.conf file. rsyslog ignores the RULES section at the end of the rsyslog.conf file; the rules in the /etc/rsyslog.d/ directory must process the messages because the last line in the /etc/rsyslog.d/99-syslog.conf file drops any messages that remain.
Local Logging
Cumulus Linux sends logs through rsyslog, which writes them to files in the /var/log directory. There are default rules in the /etc/rsyslog.d/ directory that define where the logs write:
| Rule | Purpose |
| ---- | ------- |
| 10-rules.conf | Sets defaults for log messages, including the log format and log rate limits. |
| 15-crit.conf | Logs crit, alert or emerg log messages to /var/log/crit.log to ensure they do not rotate away. |
| 20-clagd.conf | Logs clagd messages to /var/log/clagd.log for MLAG. |
| 22-linkstate.conf | Logs link state changes for all physical and logical network links to /var/log/linkstate. |
| | Logs nvued messages to /var/log/nvued.log for NVUE. |
| 45-frr.conf | Logs routing protocol messages to /var/log/frr/frr.log. This includes BGP and OSPF log messages. |
| 50-netq-agent.conf | Logs NetQ agent messages to /var/log/netq-agent.log. |
| 50-netqd.conf | Logs netqd messages to /var/log/netqd.log. |
| 55-dhcpsnoop.conf | Logs DHCP snooping messages to /var/log/dhcpsnoop.log. |
| 66-ptp4l.conf | Logs PTP messages to /var/log/ptp4l.log. |
| 99-syslog.conf | Sends messages from all remaining processes that use rsyslog to /var/log/syslog. |
Cumulus Linux rotates and compresses log files into an archive. Processes that do not use rsyslog write to their own log files within the /var/log directory. For more information on specific log files, see Troubleshooting Log Files.
Enable Remote syslog
Cumulus Linux does not send all log messages to a remote server. To send other log files (such as switchd logs) to a syslog server, follow these steps:
Create a file in /etc/rsyslog.d/. Make sure the filename starts with a number lower than 99 so that rsyslog processes it before the 99-syslog.conf file drops the remaining messages; for example, 20-clagd.conf or 25-switchd.conf. The name of the example file below is /etc/rsyslog.d/11-remotesyslog.conf. Add content similar to the following:
## Logging switchd messages to remote syslog server
@192.168.1.2:514
This configuration sends log messages to a remote syslog server for the following processes: clagd, switchd, ptmd, rdnbrd, nvued and syslog. It follows the same syntax as the /var/log/syslog file, where @ indicates UDP, 192.168.1.2 is the IP address of the syslog server, and 514 is the UDP port.
For TCP-based syslog, use two at signs (@@) before the IP address; for example, @@192.168.1.2:514.
The file numbering in /etc/rsyslog.d/ dictates the order in which rsyslog processes the rules. Lower numbered rules process first, and rsyslog processing terminates with the stop keyword. For example, the rsyslog configuration for FRR is in the 45-frr.conf file with an explicit stop at the bottom of the file. FRR messages log to the /var/log/frr/frr.log file on the local disk only (these messages do not go to a remote server using the default configuration). To log FRR messages remotely in addition to writing FRR messages to the local disk, rename the 99-syslog.conf file to 11-remotesyslog.conf. The 11-remotesyslog.conf rule (transmit to remote server) processes FRR messages first, then the 45-frr.conf file continues to process the messages (write to local disk in the /var/log/frr/frr.log file).
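To check where a new rule file lands in that order, you can list the rule files in sorted order. The sketch below is illustrative only: rule_order is a hypothetical helper, and it assumes that rsyslog reads the *.conf files in the same sorted order that the shell produces when it expands the glob.

```shell
# rule_order DIR: print the *.conf files in DIR in sorted (glob) order,
# which this sketch assumes is the order in which rsyslog processes them.
rule_order() {
    for f in "$1"/*.conf; do
        [ -e "$f" ] || continue   # skip the unexpanded pattern if DIR has no .conf files
        basename "$f"
    done
}

rule_order /etc/rsyslog.d
```

In this listing, 11-remotesyslog.conf appears before 45-frr.conf, which is why the remote rule sees FRR messages before the stop keyword in 45-frr.conf ends processing.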
Do not use the imfile module with any file written by rsyslogd.
You can write to syslog with management VRF enabled by applying the following configuration; this configuration is commented out in the /etc/rsyslog.d/11-remotesyslog.conf file.
cumulus@switch:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
## Copy all messages to the remote syslog server at 192.168.0.254 port 514
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp")
For each syslog server, configure a unique action line. For example, to configure two syslog servers at 192.168.0.254 and 10.0.0.1:
cumulus@switch:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
## Copy all messages to the remote syslog servers at 192.168.0.254 and 10.0.0.1 port 514
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp")
action(type="omfwd" Target="10.0.0.1" Device="mgmt" Port="514" Protocol="udp")
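When the server list is long, you can generate one action line per server instead of typing each line. The following is a minimal sketch; omfwd_actions is a hypothetical helper name, and the generated lines use only the omfwd parameters shown in the examples above.

```shell
# omfwd_actions DEVICE PORT PROTOCOL SERVER...: print one omfwd action line
# for each SERVER, in the same form as the examples above.
omfwd_actions() {
    device=$1
    port=$2
    protocol=$3
    shift 3
    for target in "$@"; do
        printf 'action(type="omfwd" Target="%s" Device="%s" Port="%s" Protocol="%s")\n' \
            "$target" "$device" "$port" "$protocol"
    done
}

omfwd_actions mgmt 514 udp 192.168.0.254 10.0.0.1
```

The helper only prints the lines; review them and then place them in the rule file yourself.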
If you configure remote logging to use the TCP protocol, local logging might stop when the remote syslog server is unreachable. Also, if you configure remote logging to use the UDP protocol, local logging might stop if the UDP servers are unreachable because there are no routes available for the destination IP addresses.
To avoid this behavior, configure a disk queue size and maximum retry count in your rsyslog configuration:
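As a minimal sketch of such a configuration, the shell helper below prints an omfwd action that sets a retry count and a disk-assisted queue. It relies on the rsyslog parameters action.resumeRetryCount, queue.type, queue.size, queue.filename, and queue.maxDiskSpace. The helper name, the queue file name, and every value are illustrative assumptions, not recommended settings.

```shell
# queued_omfwd_action SERVER: print an omfwd action for SERVER with a bounded
# retry count and a disk-assisted queue. All values are illustrative assumptions.
queued_omfwd_action() {
    target=$1
    cat <<EOF
action(type="omfwd" Target="$target" Device="mgmt" Port="514" Protocol="tcp"
       action.resumeRetryCount="100"
       queue.type="LinkedList" queue.size="10000"
       queue.filename="remote_fwd" queue.maxDiskSpace="100m")
EOF
}

queued_omfwd_action 192.168.0.254
```

After you add a line like this to a file in /etc/rsyslog.d/, run sudo rsyslogd -N1 to validate the configuration before you restart rsyslog.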
If you want to limit the number of syslog messages that write to the syslog file from individual processes, add the following configuration to the /etc/rsyslog.conf file. Adjust the interval and burst values to rate-limit messages to the appropriate levels required by your environment. For more information, read the rsyslog documentation.
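One way to express such a rate limit is with the imuxsock module parameters SysSock.RateLimit.Interval (in seconds) and SysSock.RateLimit.Burst (messages allowed per interval). The sketch below prints a stanza with values that you pass in; ratelimit_stanza is a hypothetical helper, and the values 2 and 50 are examples only.

```shell
# ratelimit_stanza INTERVAL BURST: print an imuxsock module statement that
# allows at most BURST messages per process every INTERVAL seconds.
ratelimit_stanza() {
    printf 'module(load="imuxsock"\n' 
    printf '       SysSock.RateLimit.Interval="%s" SysSock.RateLimit.Burst="%s")\n' "$1" "$2"
}

ratelimit_stanza 2 50
```

If the /etc/rsyslog.conf file already loads imuxsock, add the two parameters to the existing module statement instead of loading the module a second time.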
Harmless syslog Error: Failed to reset devices.list
The following message logs to /var/log/syslog when you run systemctl daemon-reload and during system boot:
systemd[1]: Failed to reset devices.list on /system.slice: Invalid argument
This message is harmless; you can ignore it. It logs when systemd attempts to change read-only group attributes. Cumulus Linux modifies the upstream version of systemd to not log this message by default.
The systemctl daemon-reload command runs when you install Debian packages. You see the message multiple times when upgrading packages.
Troubleshoot syslog
You can use the following commands to troubleshoot syslog issues.
Verify that rsyslog is Running
To verify that the rsyslog service is running, use the sudo systemctl status rsyslog.service command:
cumulus@leaf01:mgmt-vrf:~$ sudo systemctl status rsyslog.service
rsyslog.service - System Logging Service
Loaded: loaded (/lib/systemd/system/rsyslog.service; enabled)
Active: active (running) since Sat 2017-12-09 00:48:58 UTC; 7min ago
Docs: man:rsyslogd(8)
http://www.rsyslog.com/doc/
Main PID: 11751 (rsyslogd)
CGroup: /system.slice/rsyslog.service
└─11751 /usr/sbin/rsyslogd -n
Dec 09 00:48:58 leaf01 systemd[1]: Started System Logging Service.
Verify your rsyslog Configuration
After making manual changes to any files in the /etc/rsyslog.d directory, use the sudo rsyslogd -N1 command to identify any errors in the configuration files that prevent the rsyslog service from starting.
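In the following example, a closing parenthesis is missing in the 11-remotesyslog.conf file, which configures syslog for management VRF. Because the action statement never closes, rsyslogd reports the error while it parses the next file, 15-crit.conf: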
cumulus@leaf01:mgmt-vrf:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp"
cumulus@leaf01:mgmt-vrf:~$ sudo rsyslogd -N1
rsyslogd: version 8.4.2, config validation run (level 1), master config /etc/rsyslog.conf
rsyslogd: error during parsing file /etc/rsyslog.d/15-crit.conf, on or before line 3: invalid character '$' in object definition - is there an invalid escape sequence somewhere? [try http://www.rsyslog.com/e/2207 ]
rsyslogd: error during parsing file /etc/rsyslog.d/15-crit.conf, on or before line 3: syntax error on token 'crit_log' [try http://www.rsyslog.com/e/2207 ]
After correcting the invalid syntax, issuing the sudo rsyslogd -N1 command produces the following output.
cumulus@leaf01:mgmt-vrf:~$ cat /etc/rsyslog.d/11-remotesyslog.conf
action(type="omfwd" Target="192.168.0.254" Device="mgmt" Port="514" Protocol="udp")
cumulus@leaf01:mgmt-vrf:~$ sudo rsyslogd -N1
rsyslogd: version 8.4.2, config validation run (level 1), master config /etc/rsyslog.conf
rsyslogd: End of config validation run. Bye.
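Because rsyslogd can report an unclosed statement against a later file, a crude pre-check on the file you just edited can save time. The sketch below only counts opening and closing parentheses; check_parens is a hypothetical helper, it does not understand rsyslog syntax (parentheses inside comments or strings also count), and it does not replace rsyslogd -N1.

```shell
# check_parens FILE: report whether FILE has the same number of "(" and ")".
# Example use on a switch: check_parens /etc/rsyslog.d/11-remotesyslog.conf
check_parens() {
    opens=$(tr -cd '(' < "$1" | wc -c)
    closes=$(tr -cd ')' < "$1" | wc -c)
    if [ "$((opens))" -eq "$((closes))" ]; then
        echo "$1: balanced"
    else
        echo "$1: unbalanced ($((opens)) opening, $((closes)) closing)"
        return 1
    fi
}
```

A non-zero exit status points at the file to inspect first; a zero status does not prove that the configuration is valid.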
tcpdump
If a syslog server is not accessible to validate that syslog messages are exporting, you can use tcpdump.
In the following example, a syslog server uses 192.168.0.254 for UDP syslog messages on port 514:
cumulus@leaf01:mgmt-vrf:~$ sudo tcpdump -i eth0 host 192.168.0.254 and udp port 514
To generate syslog messages, run a command with sudo in another session, such as sudo date. Using sudo generates an authpriv log.
cumulus@leaf01:mgmt-vrf:~$ sudo tcpdump -i eth0 host 192.168.0.254 and udp port 514
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
00:57:15.356836 IP leaf01.lab.local.33875 > 192.168.0.254.syslog: SYSLOG authpriv.notice, length: 105
00:57:15.364346 IP leaf01.lab.local.33875 > 192.168.0.254.syslog: SYSLOG authpriv.info, length: 103
00:57:15.369476 IP leaf01.lab.local.33875 > 192.168.0.254.syslog: SYSLOG authpriv.info, length: 85
To see the contents of the syslog messages, add the -X option to the tcpdump command; for example, sudo tcpdump -X -i eth0 host 192.168.0.254 and udp port 514.
You can monitor system hardware with the following commands and utilities:
decode-syseeprom
smond
sensors
watchdog
decode-syseeprom Command
Use the decode-syseeprom command to retrieve information about the switch EEPROM. If the EEPROM is writable, you can set values on the EEPROM.
The following example shows decode-syseeprom command output. The output varies from switch to switch:
cumulus@switch:~$ decode-syseeprom
TlvInfo Header:
Id String: TlvInfo
Version: 1
Total Length: 629
TLV Name Code Len Value
-------------------- ---- --- -----
Product Name 0x21 64 MSN3700C
Part Number 0x22 20 MSN3700-CSBFO
Serial Number 0x23 24 MT2043X05294
Base MAC Address 0x24 6 1C:34:DA:24:C9:00
Manufacture Date 0x25 19 10/21/2020 20:57:29
Device Version 0x26 1 1
MAC Addresses 0x2A 2 254
Manufacturer 0x2B 8 Mellanox
Vendor Extension 0xFD 52 0x00 0x00 0x81 0x19 0x00 0x2E 0x00 0x02 0x07 0x98 0x00 0x00 0x31 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07 0x07
Platform Name 0x28 64 x86_64-mlnx_msn3700C-r0
ONIE Version 0x29 23 2019.11-5.2.0020-115200
CRC-32 0xFE 4 0x11D0954D
(checksum valid)
The decode-syseeprom command includes the following options:
| Option | Description |
| ------ | ----------- |
| -h, --help | Displays the help message and exits. |
| -a | Prints the base MAC address for switch interfaces. |
| -r | Prints the number of MAC addresses allocated for the switch interfaces. |
| -s | Sets the EEPROM content (if the EEPROM is writable). You can provide arguments in the command line in a comma separated list in the form <field>=<value>. Commas (,) and equal signs (=) are not allowed in field names and values. Any field not specified defaults to the current value. NVIDIA Spectrum switches do not support this option. |
| -j, --json | Displays JSON output. |
| -t <target> | Prints the target EEPROM information (board, psu2, psu1). |
| -e, --serial | Prints the device serial number. |
| -m | Prints the base MAC address for the management interfaces. |
| --init | Clears and initializes the board EEPROM cache. |
Run the dmidecode command to retrieve hardware configuration information populated in the BIOS.
Run apt-get to install the lshw program on the switch, which also retrieves hardware configuration information.
smond Daemon
The smond daemon monitors system units like power supply and fan, updates the corresponding LEDs, and logs the change in state. The CPLD registers detect changes in system unit state. smond reads all sources through these registers to determine the health of each unit and update the system LEDs.
Run the sudo smonctl command to display sensor information for the various system units:
cumulus@switch:~$ sudo smonctl
Board : OK
Fan : OK
PSU1 : OK
PSU2 : BAD
Temp1 (Networking ASIC Die Temp Sensor ): OK
Temp10 (Right side of the board ): OK
Temp2 (Near the CPU (Right) ): OK
Temp3 (Top right corner ): OK
Temp4 (Right side of Networking ASIC ): OK
Temp5 (Middle of the board ): OK
Temp6 (P2020 CPU die sensor ): OK
Temp7 (Left side of the board ): OK
Temp8 (Left side of the board ): OK
Temp9 (Right side of the board ): OK
When a PSU is not powered on, smonctl shows the PSU status as BAD instead of POWERED OFF or NOT DETECTED. This is a known limitation.
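To pick out only the units that need attention, you can filter the smonctl output. The following is a minimal sketch that assumes the name : status layout shown in the example output above; failed_units is a hypothetical helper, and it reads the text on standard input rather than calling smonctl itself.

```shell
# failed_units: read smonctl-style "name : status" lines on stdin and print
# the name of every unit whose status is not OK.
# Example use on a switch: sudo smonctl | failed_units
failed_units() {
    awk -F: 'NF >= 2 {
        name = $1
        status = $NF
        gsub(/^[ \t]+|[ \t]+$/, "", name)
        gsub(/^[ \t]+|[ \t]+$/, "", status)
        if (status != "OK") print name
    }'
}

printf '%s\n' 'Board : OK' 'PSU1 : OK' 'PSU2 : BAD' | failed_units
```

Keep in mind that a BAD status for a PSU can also mean that the PSU is not powered on.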
The smonctl command includes the following options:
| Option | Description |
| ------ | ----------- |
| -s <sensor>, --sensor <sensor> | Displays data for the specified sensor. |
| -v, --verbose | Displays detailed hardware sensors data. |
For more information, read man smond and man smonctl.
sensors Command
Run the sensors command to monitor the health of your switch hardware, such as power, temperature and fan speeds. This command executes lm-sensors.
Even though you can use the sensors command to monitor the health of your switch hardware, the smond daemon is the recommended method for monitoring hardware health. See smond Daemon above.
For example:
cumulus@switch:~$ sensors
tmp75-i2c-6-48
Adapter: i2c-1-mux (chan_id 0)
temp1: +39.0 C (high = +75.0 C, hyst = +25.0 C)
tmp75-i2c-6-49
Adapter: i2c-1-mux (chan_id 0)
temp1: +35.5 C (high = +75.0 C, hyst = +25.0 C)
ltc4215-i2c-7-40
Adapter: i2c-1-mux (chan_id 1)
in1: +11.87 V
in2: +11.98 V
power1: 12.98 W
curr1: +1.09 A
max6651-i2c-8-48
Adapter: i2c-1-mux (chan_id 2)
fan1: 13320 RPM (div = 1)
fan2: 13560 RPM
Output from the sensors command varies depending upon the switch.
If you only plug in one PSU, the fan is at maximum speed.
The following table shows the sensors command options.
| Option | Description |
| ------ | ----------- |
| -c, --config-file | Specify a configuration file; use - after -c to read the configuration file from stdin; by default, sensors references the configuration file in /etc/sensors.d/. |
| -s, --set | Execute set statements in the configuration file (root only); sensors -s runs one time at boot and applies all the settings to the boot drivers. |
| -f, --fahrenheit | Show temperatures in degrees Fahrenheit. |
| -A, --no-adapter | Do not show the adapter for each chip. |
| -B, --bus-list | Generate bus statements for sensors.conf. |
| -u | Generate raw output. |
| -j | Generate JSON output. |
| -v | Show the program version. |
Hardware Watchdog
Cumulus Linux includes a simplified version of the wd_keepalive(8) daemon instead of the one in the standard watchdog Debian package. wd_keepalive writes to a file called /dev/watchdog periodically (at least one time per minute) to prevent the switch from resetting. Each write delays the reboot time by another minute. After one minute of inactivity, where wd_keepalive does not write to /dev/watchdog, the switch resets itself.
Cumulus Linux enables the watchdog by default, which starts when you boot the switch (before switchd starts).
To disable the watchdog, disable and stop the wd_keepalive service with the sudo systemctl disable wd_keepalive and sudo systemctl stop wd_keepalive commands.
Data centers today have a large number of network switches manufactured by different hardware vendors running network operating systems from different providers. This section provides a set of guidelines for how network port and status LEDs appear on the front panel of a network switch. This provides you with a standard way to identify the state of a switch and its ports by looking at its front panel, irrespective of the hardware vendor or NOS.
Network Port LEDs
A network port LED indicates the state of the link, such as link UP or transmit and receive activity. Here are the requirements for these LEDs:
Number of LEDs per port - Ports that you cannot split (for example, 1G ports) must have 1 LED per port. Ports that you can split have 1 LED per split port. For example, a 40G port that you can split into 4 10G ports has 4 LEDs, one per split port.
Location - A port LED must be right above the port. This prevents drooping cables from hiding them. If you can split the port, the LED for each split port must also be above the port. The LEDs must be evenly spaced and inside the edges of the ports to prevent confusion.
Port Number Label - You must print the port number in white on the switch front panel directly under the corresponding LED.
Colors - As network port technology improves with smaller ports and higher speeds, having different colors for different types of ports or speeds is confusing. Focus on providing a simple set of indications that show basic information about the port. Use green and amber colors on the LED to differentiate between good and bad states. These colors are commonly used on network port LEDs and you can implement them easily on future switches.
Signaling - The table below shows the information you can convey with port LEDs.
Max Speed indicates the maximum speed at which the port can run. For a 10G port, if the port speed is 10G, then it is running at its maximum speed. If the 10G port is running at 1G speed, then it is running at a lower speed.
Physical Link Up/Down displays layer 2 link status.
Beaconing provides a way for you to identify a particular link. You can beacon that port from a remote location so the network operator has visual indication for that port.
Fault is also a form of beaconing. Both try to draw attention towards the port.
Blinking amber implies a blink rate of 33 ms. Slow blinking amber indicates a blink rate of 500 ms, with a 50% on and off duty cycle. For example, a slow blinking amber LED is amber for 500 ms and then off for 500 ms.
| Activity | Max Speed Indication | Lower Speed Indication |
| ------------------- | -------------------- | ---------------------- |
| Physical Link Down | Off | Off |
| Physical Link UP | Solid Green | Solid Amber |
| Link Tx/Rx Activity | Blinking Green | Blinking Amber |
| Beaconing | Slow Blinking Amber | Slow Blinking Amber |
| Fault | Slow Blinking Amber | Slow Blinking Amber |
Status LEDs
One side of a network switch has a set of status LEDs. The status LEDs provide a visual indication of what is physically wrong with the network switch. Typical LEDs on the front panel are for PSUs (power supply units), fans, and system. Locator LEDs are also on the front panel of a switch. Each component that has an LED is a unit.
Number of LEDs per unit - Each unit must have only 1 LED.
Location - All units must have their LEDs on the right-hand side of the switch after the physical ports.
Unit label - You must print the label on the front panel directly above the LED.
Colors - Provide a simple set of indications that show basic information about the unit. The following section has more information about the indications, but the standard colors are green and amber. You find these colors universally on all status LEDs.
Defined LED - You must have LEDs for the following on every network switch:
PSU
Fans
System LED
Locator LED
PSU LEDs - Each PSU must have its own LED. PSU faults are difficult to debug. If you know which PSU is faulty, you can quickly check if it powers up correctly and, if that fault persists, replace the PSU.
| Unit Activity | Indication |
| ------------- | ---------- |
| Installed and power OK | Solid Green |
| Installed, but no power | Slow Blinking Amber |
| Installed, powered, but has faults | Slow Blinking Amber |
Fan LED - A network switch might have multiple fan trays (3 through 6). It is difficult to put an LED for each fan tray on the front panel given the limited real estate. The recommendation is one LED for all fans.
| Unit Activity | Indication |
| ------------- | ---------- |
| All fans running OK | Solid Green |
| Fault on any one of the fans | Slow Blinking Amber |
System LED - A network switch must have a system LED that indicates the general state of a switch. This state can be of hardware, software, or both. It is up to the individual switch NOS to decide what this LED indicates. However, the LED can have only the following indications:
| Unit Activity | Indication |
| ------------- | ---------- |
| All OK | Solid Green |
| Not OK | Slow Blinking Amber |
Locator LED - The locator LED helps locate a particular switch in a data center full of switches. The LED must have a different color and predefined location. It must be at the top right corner on the front panel of the switch and its color must be blue.
| Unit Activity | Indication |
| ------------- | ---------- |
| Locate enabled | Blinking Blue |
| Locate disabled | Off |
Understanding the cl-support Output File
The cl-support script generates a compressed archive file of useful information for troubleshooting. Either the system creates the archive file automatically or you create it manually.
Automatic cl-support File
The system creates the cl-support archive file automatically for the following reasons:
When there is a core dump file for any application (not specific to Cumulus Linux, but something all Linux distributions support), located in /var/support/core.
When one of the monitored services fails for the first time after you reboot or power cycle the switch.
Manual cl-support File
To create the cl-support archive file manually, run the cl-support command:
cumulus@switch:~$ sudo cl-support
If the Cumulus Linux support team requests that you submit the output from cl-support to investigate issues you experience and you need to include security-sensitive information, such as the sudoers file, use the -s option:
cumulus@switch:~$ sudo cl-support -s
For information on the directories included in the cl-support archive, see:
Troubleshooting Log Files. This guide highlights the most important log files to inspect. Keep in mind, cl-support includes all log files.
Troubleshooting Log Files
The only log file unique to Cumulus Linux, compared to other Linux distributions, is switchd.log, which logs HAL (hardware abstraction layer) messages from the hardware.
| Log File | Description |
| -------- | ----------- |
| | Information from the apt utility. For example, from apt-get install and apt-get remove. |
| /var/log/audit/* | Information stored by the Linux audit daemon, auditd. |
| /var/log/autoprovision | Output generated by running the zero touch provisioning script (ZTP). |
| /var/log/boot.log | Information that the system logs when the switch boots. |
| /var/log/btmp | Information about failed login attempts. Use the last command to view the btmp file; for example, run last -f /var/log/btmp and page the output with more. |
| /var/log/clagd.log | Status of the clagd service. |
| /var/log/dpkg.log | Information that the system logs when you install or remove a package with the dpkg command. |
| /var/log/frr/* | FRR logs. Use these files to troubleshoot routing issues, such as an MD5 or MTU mismatch with OSPF. |
| /var/log/gunicorn | Error and access events in Gunicorn. |
| /var/log/installer/* | Directory containing files related to the installation of Cumulus Linux. |
| /var/log/lastlog | The last login log file. Use the lastlog command to format and print its contents. |
| /var/log/nvued.log | Log file for NVUE. |
| /var/log/nginx | Errors and processed requests in NGINX. |
| /var/log/ntpstats | Logs for Network Time Protocol (NTP). |
| /var/log/openvswitch/* | ovsdb-server logs. |
| /var/log/ptmd | Prescriptive Topology Manager (PTM) errors and information. |
| /var/log/switchd.log | The HAL log for Cumulus Linux. This is specific to Cumulus Linux. The system logs switchd crashes here. |
| /var/log/syslog | The main system log, which logs everything except auth-related messages. Search this file with grep to see what problem occurred. |
| /var/log/wtmp | Login records file. |
Troubleshooting the etc Directory
The cl-support script replicates the /etc directory; however, it excludes certain files, such as /etc/nologin, which prevents unprivileged users from logging into the system.
The following list shows the output from ls -l on the /etc directory structure, which cl-support creates.
etc Directory Contents
File
acpi
adduser.conf
alternatives
apm
apparmor
apparmor.d
apt
audisp
audit
bash.bashrc
bash_completion
bash_completion.d
bindresvport.blacklist
binfmt.d
ca-certificates
ca-certificates.conf
calendar
console-setup
containerd
cron.d
cron.daily
cron.hourly
cron.monthly
crontab
cron.weekly
cruft
cumulus
dbus-1
debconf.conf
debian_version
debsums-ignore
default
deluser.conf
dhcp
dhcpsnoop
discover.conf.d
discover-modprobe.conf
dnsmasq.conf
dnsmasq.d
docker
dpkg
e2fsck.conf
emacs
environment
etckeeper
ethertypes
fonts
freeipmi
frr
fstab
gai.conf
groff
group
group-
grub.d
gshadow
gshadow-
gss
gunicorn.conf.py
hdparm.conf
host.conf
hostname
hosts
hosts.allow
hosts.deny
hsflowd
hsflowd.conf
hw_init.d
image-release
init
init.d
initramfs-tools
inputrc
insserv
insserv.conf
insserv.conf.d
iproute2
issue
issue.net
kernel
ldap
ld.so.cache
ld.so.conf
ld.so.conf.d
libaudit.conf
libnl
linuxptp
lldpd.d
locale.alias
locale.gen
localtime
logcheck
login.defs
login.defs.cumulus
login.defs.cumulus-orig
logrotate.conf
logrotate.conf.cumulus
logrotate.conf.cumulus-orig
logrotate.d
lsb-release
lttng
lvm
machine-id
magic
magic.mime
mailcap
mailcap.order
manpath.config
mime.types
mke2fs.conf
mlx
modprobe.d
modules
modules-load.d
motd
motd.distrib
mtab
mysql
nanorc
netd.conf
netq
network
NetworkManager
networks
nginx
nsswitch.conf
ntp.conf
nvue-auth.yaml
nvue.d
openvswitch
opt
os-release
pam.conf
pam.d
passwd
passwd-
perl
profile
profile.cumulus
profile.cumulus-orig
profile.d
protocols
ptm.d
ptp4l.conf
python
python2.7
python3
python3.7
ras
rdnbrd.conf
resolv.conf
resolvconf
resolv.conf.bak
restapi.conf
rmt
rpc
rsyslog.conf
rsyslog.conf.cumulus
rsyslog.conf.cumulus-orig
rsyslog.d
runit
screenrc
securetty
security
selinux
sensors3.conf
sensors.d
services
sgml
shadow
shadow-
shells
skel
smartd.conf
smartmontools
smi.conf
snmp
ssh
ssl
subgid
subgid-
subuid
subuid-
sudoers
sudoers.d
sv
sysctl.conf
sysctl.d
systemd
terminfo
timezone
tmpfiles.d
ucf.conf
udev
ufw
update-motd.d
vim
vrf
watchdog.conf
wgetrc
what-just-happened
wireshark
X11
xattr.conf
xdg
xml
Troubleshooting Network Interfaces
The following sections describe various ways you can troubleshoot ifupdown2 and network interfaces.
Enable Network Logging
To obtain verbose logs when you run systemctl [start|restart] networking.service as well as when the switch boots, create an overrides file with the systemctl edit networking.service command and add the following lines:
Use ifquery --check to check the current running state of an interface against its configuration in the interfaces file. The command returns exit code 0 if the running state matches the configuration, or 1 if it does not. The line bond-xmit-hash-policy layer3+7 below fails because it should read bond-xmit-hash-policy layer3+4.
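Because ifquery --check reports its result through the exit code, the check is easy to script. The following is a minimal Python sketch and not part of Cumulus Linux. The function name interface_matches_config and the runner parameter are assumptions for illustration. The sketch treats exit code 0 as a match; on a switch, the command might need root privileges.

```python
import subprocess


def interface_matches_config(name, runner=subprocess.run):
    """Return True when `ifquery --check NAME` exits with code 0."""
    # runner is a hypothetical hook so the function can be exercised without
    # a switch; by default it runs the real ifquery command.
    result = runner(["ifquery", "--check", name], capture_output=True, text=True)
    return result.returncode == 0
```

A non-zero exit code only says that something differs; read the command output to find the attribute that fails.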
ifquery --syntax-help provides help on all possible attributes supported in the interfaces file. For complete syntax on the interfaces file, see man interfaces and man ifupdown-addons-interfaces.
You can use ifquery --print-savedstate to check the ifupdown2 state database. ifdown works only on interfaces present in this state database.
An easy way to debug and get details about template errors is to use the mako-render command on your interfaces template file or on /etc/network/interfaces itself.
cumulus@switch:~$ sudo mako-render /etc/network/interfaces
# This file describes the network interfaces available on your system
# and how to activate them. For more information, see interfaces(5).
# The loopback network interface
auto lo
iface lo inet loopback
# The primary network interface
auto eth0
iface eth0 inet dhcp
#auto eth1
#iface eth1 inet dhcp
# Include any platform-specific interface configuration
source /etc/network/interfaces.d/*.if
# ssim2 added
auto swp45
iface swp45
auto swp46
iface swp46
cumulus@switch:~$ sudo mako-render /etc/network/interfaces.d/<interfaces_stub_file>
ifdown Cannot Find an Interface that Exists
If you try to bring down an interface that you know exists, use ifdown with the --use-current-config option to force ifdown to check the current /etc/network/interfaces file to find the interface. For example:
cumulus@switch:~$ sudo ifdown br0
error: cannot find interfaces: br0 (interface was probably never up ?)
cumulus@switch:~$ sudo brctl show
bridge name bridge id STP enabled interfaces
br0 8000.44383900279f yes downlink
peerlink
cumulus@switch:~$ sudo ifdown br0 --use-current-config
Remove All References to a Child Interface
If you have a configuration with a child interface, whether it is a VLAN, bond, or another physical interface, and you remove that interface from a running configuration, you must remove every reference to it in the configuration. Otherwise, the parent interface continues to use the interface.
For example, consider the following configuration:
auto lo
iface lo inet loopback
auto eth0
iface eth0 inet dhcp
auto bond1
iface bond1
bond-slaves swp2 swp1
auto bond3
iface bond3
bond-slaves swp8 swp6 swp7
auto br0
iface br0
bridge-ports swp3 swp5 bond1 swp4 bond3
bridge-pathcosts swp3=4 swp5=4 swp4=4
address 11.0.0.10/24
address 2001::10/64
bond1 is a member of br0. If you remove bond1, you must remove the reference to it from the br0 configuration. Otherwise, if you reload the configuration with ifreload -a, bond1 remains part of br0.
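Before removing an interface, it can help to list every stanza that still refers to it. The following Python sketch is a simplified illustration, not an ifupdown2 tool. The function name find_references is hypothetical, and the parser only understands plain auto and iface stanzas like the ones in the example above (no templates or source lines).

```python
def find_references(interfaces_text, target):
    """Return (stanza, attribute) pairs whose values mention target."""
    references = []
    stanza = None
    for raw_line in interfaces_text.splitlines():
        tokens = raw_line.split()
        if not tokens or tokens[0].startswith("#"):
            continue
        if tokens[0] == "iface":
            stanza = tokens[1]
            continue
        if tokens[0] == "auto" or stanza is None:
            continue
        # Values such as "swp3=4" (bridge-pathcosts) name the port before "=".
        names = [value.split("=")[0] for value in tokens[1:]]
        if target in names:
            references.append((stanza, tokens[0]))
    return references


CONFIG = """\
auto bond1
iface bond1
    bond-slaves swp2 swp1
auto br0
iface br0
    bridge-ports swp3 swp5 bond1 swp4 bond3
    bridge-pathcosts swp3=4 swp5=4 swp4=4
    address 11.0.0.10/24
"""

print(find_references(CONFIG, "bond1"))  # [('br0', 'bridge-ports')]
print(find_references(CONFIG, "swp3"))
```

Every pair in the result is a line to edit before you run ifreload -a.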
MTU Set on a Logical Interface Fails with Error: “Numerical result out of range”
This error occurs when the MTU you are trying to set on an interface is higher than the MTU of the lower interface or dependent interface. Linux expects the upper interface to have an MTU less than or equal to the MTU on the lower interface.
In the example below, the swp1.100 VLAN interface is an upper interface to physical interface swp1. If you want to change the MTU to 9000 on the VLAN interface, you must include the new MTU on the lower interface swp1 as well.
auto swp1.100
iface swp1.100
mtu 9000
auto swp1
iface swp1
mtu 9000
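The rule that an upper interface MTU must not exceed the lower interface MTU can be checked before you apply a configuration. The following Python sketch is illustrative only. The function name mtu_violations is hypothetical, the 1500 value is an assumed starting MTU, and the sketch only handles VLAN subinterfaces named in the <parent>.<vlan> form.

```python
def mtu_violations(mtus):
    """Return (upper, lower) pairs where the upper MTU exceeds the lower MTU."""
    violations = []
    for name, mtu in mtus.items():
        if "." not in name:
            continue  # Not a VLAN subinterface such as swp1.100.
        lower = name.split(".", 1)[0]
        if lower in mtus and mtu > mtus[lower]:
            violations.append((name, lower))
    return violations


# Raising only the VLAN interface MTU is the case that fails.
print(mtu_violations({"swp1": 1500, "swp1.100": 9000}))  # [('swp1.100', 'swp1')]
# Raising both, as in the example above, is consistent.
print(mtu_violations({"swp1": 9000, "swp1.100": 9000}))  # []
```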
iproute2 batch Command Failures
ifupdown2 batches iproute2 commands for performance reasons. A batch command contains ip -force -batch - in the error message. The command number that failed is at the end of this line: Command failed -:1.
Below is a sample error for the command 1: link set dev host2 master bridge. There was an error adding the bond host2 to the bridge named bridge because host2 did not have a valid address.
error: failed to execute cmd 'ip -force -batch - [link set dev host2 master bridge
addr flush dev host2
link set dev host1 master bridge
addr flush dev host1
]'(RTNETLINK answers: Invalid argument
Command failed -:1)
warning: bridge configuration failed (missing ports)
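To find the failing command in a long batch, match the number after Command failed -: with the position of the command inside the brackets. The following Python sketch shows that lookup on the sample error above. The function name failed_batch_command is hypothetical, and the sketch assumes the message keeps the bracketed layout shown in the sample.

```python
import re


def failed_batch_command(error_text):
    """Return (number, command) for the batch line reported as failed."""
    match = re.search(r"Command failed -:(\d+)", error_text)
    if match is None:
        return None
    number = int(match.group(1))
    # The batched commands sit between "[" and "]'" in the error message.
    start = error_text.index("[") + 1
    end = error_text.index("]'", start)
    commands = [
        line.strip()
        for line in error_text[start:end].splitlines()
        if line.strip()
    ]
    return number, commands[number - 1]


SAMPLE = """error: failed to execute cmd 'ip -force -batch - [link set dev host2 master bridge
addr flush dev host2
link set dev host1 master bridge
addr flush dev host1
]'(RTNETLINK answers: Invalid argument
Command failed -:1)
warning: bridge configuration failed (missing ports)"""

print(failed_batch_command(SAMPLE))  # (1, 'link set dev host2 master bridge')
```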
“RTNETLINK answers: Invalid argument” Error when Adding a Port to a Bridge
This error can occur when the bridge port does not have a valid hardware address.
This can occur when the interface you add to the bridge is an incomplete bond; a bond without slaves is incomplete and does not have a valid hardware address.
MLAG Interface Drops Packets
Losing a large number of packets across an MLAG peerlink interface is often not a problem. This can occur to prevent BUM (broadcast, unknown unicast and multicast) packet looping. For more details, and for information on how to detect these drops, read the MLAG chapter.
Troubleshoot Layer 1
This chapter describes how to troubleshoot layer 1 issues that can affect the port modules connecting a switch to a network.
High Speed Ethernet Technologies
Specifications
The following specifications are useful in understanding and troubleshooting layer 1 problems:
IEEE 802.3 specifications define the technologies and link characteristics of the various types and speeds of Ethernet technologies.
SFF MSA specifications define the SFP and QSFP module hardware construction and feature implementation specifics.
Form Factors
Modern Ethernet modules come in one of two form factors:
Small Form factor Pluggables (SFP)
Quad Small Form factor Pluggables (QSFP)
Each form factor contains an EEPROM with information about the capabilities of the module and various groups of required or optional registers to query or control aspects of the module. The output from the ethtool -m <swp> command decodes the main values.
The SFF MSA specifications define the memory locations for the fields in the EEPROM and the common registers:
SFP: SFF-8472: Management Interface for SFP+ (PDF)
QSFP: SFF-8636: Management Interface for 4-lane Modules and Cables (PDF)
Identifiers used in the first byte of the module memory map:
0x03: SFP/SFP+/SFP28 - One 10G or 25G lane - Small Form factor Pluggable
0x11: QSFP28 - Four 25G or 50G lanes (100G or 200G total) - Quad SFP with 25G or 50G lanes
0x18: QSFP-DD - Eight 50G lanes (400G total) - Quad SFP with a recessed extra card-edge connector to enable 8 lanes of 50G
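As an illustration of how the identifier byte is used, the following Python sketch maps the first byte of a module memory dump to the form factors listed above. The table holds only the identifiers named in this section, the function name form_factor is hypothetical, and the bytes after the first one in the example are arbitrary placeholders.

```python
# Identifier values taken from the list above; other values exist in the
# SFF specifications but are not covered here.
FORM_FACTORS = {
    0x03: "SFP/SFP+/SFP28",
    0x11: "QSFP28",
    0x18: "QSFP-DD",
}


def form_factor(eeprom):
    """Map the first byte of a module EEPROM dump to a form factor name."""
    identifier = eeprom[0]
    return FORM_FACTORS.get(identifier, f"unknown (0x{identifier:02x})")


print(form_factor(bytes([0x03, 0x04, 0x21])))  # SFP/SFP+/SFP28
print(form_factor(bytes([0x18])))              # QSFP-DD
```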
Encoding
The output from the ethtool -m <swp> command groups two aspects of high-speed Ethernet under encoding:
The way to represent bits on the wire or optical fiber:
NRZ (also known as PAM2) has two voltage (or laser) levels signifying a 0 and 1. Negative voltage level = 0, positive voltage = 1. Used on all Ethernet technologies below lane speeds of 50Gbps.
PAM4 encoding has four signal levels signifying two bits (00,01,10,11). These levels are much closer together than the levels in NRZ/PAM2, so signal integrity is an even greater concern. PAM4 encoding is used on 50G lanes.
The way the bits are joined together into a frame to be able to facilitate functions like clock recovery and error checking and correction:
64B/66B: Two control bits with a 1/0 transition are added to the front of a 64 bit frame to ensure that the clock can sync every 66 bits.
RS-FEC encoding (528,514) or (544,514): Uses 14 or 30 bits to enable correction of bit errors on the receive side (see the FEC section below).
BaseR FEC encoding: Steals a bit in the 64B/66B control word to re-encode and add a smaller level of error correction than RS-FEC.
The relationship between lane speed and encoding methods is described in this table:
Lane Speed: Encoding
10G: Uses 64B/66B framing, then NRZ encoding; the actual rate on the wire is 10.3125 Gbps.
25G: Uses 64B/66B framing, then NRZ encoding; the actual rate on the wire is 25.78125 Gbps. Can also use RS-FEC (528,514) or Base-R FEC.
50G: Uses PAM4 encoding and RS-FEC (544,514).
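The on-the-wire rates for the 10G and 25G lanes follow directly from 64B/66B framing: every 64 data bits travel as 66 bits. The following Python sketch reproduces the two figures; the function name line_rate_64b66b is chosen for illustration only.

```python
def line_rate_64b66b(data_rate_gbps):
    """Rate on the wire after 64B/66B framing (66 bits sent per 64 data bits)."""
    return data_rate_gbps * 66 / 64


print(line_rate_64b66b(10))  # 10.3125
print(line_rate_64b66b(25))  # 25.78125
```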
The SerDes (Serial/Deserializer) is the component in the port that converts byte data to and from a set of bit streams (lanes), where:
SFP ports use 1 lane
QSFP ports use 4 lanes
QSFP-DD ports use 8 lanes
On the ASIC, the 40G, 100G and 200G SerDes devices are 4 lanes; 400G SerDes uses 8 lanes. So an SFP port is actually one lane on a four lane SerDes. Depending on the platform design, this sometimes affects how you can configure and break out SFP ports.
Port speeds are created using the following formulas:
Port Speed: Number of Lanes
1G: One 10G or 25G lane clocked at 1G. Or, on a 1G fixed copper switch, a 1G lane.
10G: One 10G lane.
25G: One 25G lane.
40G: Four 10G lanes.
50G: Two 25G lanes (NRZ) or one 50G lane (PAM4).
100G: Four 25G lanes (100G-SR4/CR4 NRZ) or two 50G lanes (100G-CR2 PAM4).
200G: Four 50G lanes.
400G: Eight 50G lanes.
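Each port speed is the lane count multiplied by the lane speed. The following Python sketch encodes the combinations from the table and checks the arithmetic. The 1G row is left out because it uses a faster lane clocked down, and the PORT_SPEEDS name is used here for illustration only.

```python
# (lanes, lane speed in Gbps) combinations taken from the table above.
PORT_SPEEDS = {
    10: [(1, 10)],
    25: [(1, 25)],
    40: [(4, 10)],
    50: [(2, 25), (1, 50)],
    100: [(4, 25), (2, 50)],
    200: [(4, 50)],
    400: [(8, 50)],
}

for port_speed, options in PORT_SPEEDS.items():
    for lanes, lane_speed in options:
        # Each combination must multiply out to the port speed.
        assert lanes * lane_speed == port_speed
        print(f"{port_speed}G = {lanes} x {lane_speed}G")
```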
Active and Passive Modules and Cables
From the point of view of the port, modules and cables can be classified as either active or passive.
Active cables and modules contain transmitters that regenerate the bit signals over the cable. All optical modules are active. 10/100/1000BaseT and 10GBaseT are active modules and contain an onboard PHY that handles the BaseT auto-negotiation and TX/RX to the remote BaseT device. For active modules, the port only has to provide a TX signal with a base level of power to the module and the module uses the power it receives on the port power bus to regenerate the signal to the remote side.
Although some copper cable assemblies are active, they are extremely rare.
Passive cables (copper DACs) connect the port side of the module directly to the copper twinax media on the other side of the module in the assembly. The port TX lines provide the power to drive the signal to the remote end. The port goes through a training sequence with the remote end port to tune the power TX and RX parameters to optimize the received signal and ensure correct clock and data recovery at each RX end.
Compliance Codes, Ethernet Type, Ethmode Type, Interface Type
Compliance codes, Ethernet type, Ethmode type, and interface type are all terms for the type of Ethernet technology that the module implements.
For the port to know the characteristics of the module that is inserted, the SFP or QSFP module EEPROMs have a standardized set of data to describe the module characteristics. These values appear in the output of ethtool -m <swp>.
The compliance codes describe the type of Ethernet technology the module implements, such as 1000Base-T, 10GBase-SR, 10GBase-CR, 40GBase-SR4, and 100GBase-CR4.
The first part of the compliance code gives the full line rate speed of the technology. The last part of the compliance code specifies the Ethernet technology and the number of lanes used:
-T: Twisted pair.
CR: Copper twinax (passive DAC). CR4 uses a bundle of 4 twinax cables for 4 lanes, CR2 uses 2 cables, CR uses 1.
SR: Optical short range. SR4 uses a bundle of 4 fibers for 4 lanes.
LR: Optical long range. LR4 uses 4 wavelengths over one fiber pair to transmit 4 lanes over long distances (kilometers).
xWDM (SWDM, CWDM, DWDM): Optical wavelength multiplexed technologies (various). Multiple lanes are transmitted by different wavelengths.
An active module with a passive module compliance code or a passive module with an active module compliance code causes the port to be set up incorrectly and may affect signal integrity.
Some modules have vendor specific coding, are older, or use a proprietary vendor technology that is not listed in the standards. As a result, they are not recognized by default and need to be overridden to the correct compliance code. On NVIDIA switches, the port firmware automatically overrides certain supported modules to the correct compliance code.
Digital Diagnostic Monitoring/Digital Optical Monitoring (DDM/DOM)
DDM/DOM is an optional capability that vendors can implement on their optical transceivers to display measurements about the optical power. The values are generally reliable within a 10% tolerance. A value of 0.0000 generally indicates the value is not implemented by the vendor.
The most useful DDM/DOM values when troubleshooting a problem link are:
RX optical power (receiver signal average optical power)
TX optical power (laser output power)
The locations of the DDM/DOM fields are standardized. If DDM/DOM capability is present on a module, the values are displayed in the output of ethtool -m <swp>.
For each DDM/DOM value there can be thresholds to mark a high or low warning or an alarm when the value exceeds that threshold.
An alarm value indicates the level required for the signal to be within the vendor’s design tolerance, and the warning level is a little bit closer to expected norms.
When a warning or alarm is triggered, the flag flips from Off to On. Reading the value with ethtool -m, NVIDIA NetQ, or other monitoring software resets the flag to Off.
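The relationship between a reading and its warning and alarm thresholds can be sketched in a few lines of Python. The threshold values below are hypothetical (real thresholds come from the module EEPROM), and the function names are chosen for illustration. The conversion uses the standard definition dBm = 10 * log10(power in mW).

```python
import math


def mw_to_dbm(power_mw):
    """Convert optical power from milliwatts to dBm (power_mw must be > 0)."""
    return 10 * math.log10(power_mw)


def classify(value, low_alarm, low_warning, high_warning, high_alarm):
    """Return the most severe threshold that a reading crosses."""
    if value < low_alarm or value > high_alarm:
        return "alarm"
    if value < low_warning or value > high_warning:
        return "warning"
    return "ok"


# Hypothetical RX power thresholds in mW; real values come from the module.
THRESHOLDS = dict(low_alarm=0.05, low_warning=0.10, high_warning=1.00, high_alarm=1.50)

print(classify(0.50, **THRESHOLDS))  # ok
print(classify(0.08, **THRESHOLDS))  # warning
print(classify(0.01, **THRESHOLDS))  # alarm
print(round(mw_to_dbm(0.5), 2))      # -3.01
```

A reading of 0.0000 cannot be converted to dBm and usually means the vendor does not implement the value, so skip such readings instead of classifying them.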
Auto-negotiation
There are 3 different types of auto-negotiation (IEEE 802.3 clauses 28, 37, 73), which apply to various Ethernet technologies that Cumulus Linux supports:
10/100/1000/10GBASE-T (twisted pair, clause 28): The original Ethernet auto-negotiation, which negotiates speed and duplex (full/half), and flow control (link pause) on full-duplex. Mandatory for 1G/10G data rates over twisted pair.
1000BASE-X (optical, clause 37): Detects unidirectional link conditions (no RX on one side). If a unidirectional link condition occurs, clause 37 auto-negotiation signals the port to bring the link down; this avoids blackholing traffic.
40G/100G/25G/50G/200G/400GBASE-CR (DAC, clause 73): Negotiates speed, performs link training to improve the bit error rate (BER), and negotiates FEC.
Many Ethernet technologies used in Cumulus Linux switches do not have auto-negotiation capability:
No 10G DAC or optical link has auto-negotiation; only 10GBASE-T and backplane links have auto-negotiation standards. Backplane links do not exist on Cumulus Linux switch ports.
No optical links, except 1G optical, have auto-negotiation.
Only about half of all modern link types support auto-negotiation. The next subsections provide guidance on when and how to enable auto-negotiation.
Ethernet Link Types and Auto-negotiation
1000BASE-T and 10GBASE-T fixed copper ports require auto-negotiation for 1G and 10G speeds. This is the default setting; you cannot disable auto-negotiation for 1G speeds. Disabling auto-negotiation on these ports requires setting the speed to 100Mbps or 10Mbps and the correct duplex setting.
1000BASE-T SFPs have an onboard PHY that performs auto-negotiation automatically on the RJ45 side without involving the port. Do not change the default auto-negotiation setting on these ports; on NVIDIA switches, auto-negotiation is ON.
For 1000BASE-X, auto-negotiation is highly recommended on 1G optical links to detect unidirectional link failures.
For all other optical modules except for 1000BASE-X, there is no auto-negotiation standard.
For 10G DACs, there is no auto-negotiation.
For DAC cables on speeds higher than 25G, auto-negotiation is unnecessary, but is useful because it can improve signal integrity by link training. It also negotiates speed and FEC, which is less useful because the neighbor speed and FEC are usually known.
General Auto-negotiation Guidance
When auto-negotiation is supported on an Ethernet type, both sides of the link must be configured with the same auto-negotiation setting.
Cumulus Linux sets a default for auto-negotiation and speed, duplex, and FEC for each port based on the ASIC and port speed. On NVIDIA platforms running Cumulus Linux, auto-negotiation defaults to ON.
If auto-negotiation is OFF (called force mode), you must also specify speed, duplex, and FEC if you want a non-default value for the port. If auto-negotiation is ON, do not specify speed, duplex, or FEC. The only exception is 1000BASE-X optical interfaces, where you set the speed to 1000 and auto-negotiation to ON to get unidirectional link detection.
If auto-negotiation is enabled on a link type that does not support auto-negotiation, the port enters autodetect mode (see the next section) to determine the most likely speed and FEC settings to bring the link up. This feature is usually successful, but if the link does not come up, it might be necessary to disable auto-negotiation and set these link settings manually.
There is no concept of auto-negotiation of MTU. To change the MTU from the default setting, configure it explicitly.
Generally, you can ignore the duplex setting, except on 10M/100M links. The default is Full. Although you can configure the duplex setting, it has no option except Full on speeds higher than 1G and is auto-negotiated on 1000BASE-T links.
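The guidance above on combining auto-negotiation with explicit settings can be expressed as a small check. The following Python sketch is illustrative only; the function name check_port_settings, its parameters, and the way a port is described are assumptions, not a Cumulus Linux interface.

```python
def check_port_settings(autoneg, speed=None, duplex=None, fec=None, link_type=None):
    """Return a list of warnings for one port, following the guidance above."""
    warnings = []
    if autoneg:
        explicit = {"speed": speed, "duplex": duplex, "fec": fec}
        for setting, value in explicit.items():
            if value is None:
                continue
            # Exception from the guidance: 1000BASE-X with speed 1000.
            if setting == "speed" and link_type == "1000BASE-X" and value == 1000:
                continue
            warnings.append(f"{setting} should not be set when auto-negotiation is on")
    return warnings


print(check_port_settings(True, speed=25000, fec="rs"))
print(check_port_settings(True, speed=1000, link_type="1000BASE-X"))  # []
print(check_port_settings(False, speed=10000, fec="off"))             # []
```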
Autodetect
As a result of the confusion about when auto-negotiation applies to a link type, many Ethernet software vendors, including Cumulus Linux, allow auto-negotiation ON to be configured on every interface type. When auto-negotiation is ON, but is not supported on a link type, the port software tries to determine the most likely link settings to bring the link up. Cumulus Linux calls this feature autodetect, but it is not directly configurable.
When auto-negotiation is enabled on a port, the behavior is as follows:
When auto-negotiation is ON, the port is always in autodetect mode. The port steps through a list of possible auto-negotiation, speed and FEC settings for the port and module combination until the link comes up. The default configuration is auto-negotiation ON.
Autodetect is a local feature. The neighbor is assumed to either be configured with auto-negotiation off and speed, duplex, and FEC set manually, or using some equivalent algorithm to determine the correct speed, duplex, and FEC settings.
To see the user configured settings for auto-negotiation, speed, duplex and FEC compared to the actual operational state on the port hardware, use the l1-show command.
The autodetect feature is usually successful, but if the link does not come up, disable auto-negotiation and configure the link settings manually.
FEC
Forward Error Correction (FEC) is an algorithm used to correct bit errors along a medium. FEC encodes the data stream so that the remote device can correct a certain number of bit errors by decoding the stream.
The target IEEE bit error rate (BER) in high-speed Ethernet is 10^-12. At 25G lane speeds and above, this might not be achievable without error correction, depending on the media type and length. See Switch Port Attributes for a more detailed discussion of FEC requirements for certain cable types.
Both sides of a link must have the same FEC encoding algorithm enabled for the link to come up. If both sides appear to have a working signal path but the link is down, there might be an auto-negotiation mismatch or FEC mismatch in the configuration.
FEC Encoding Algorithms and Settings
The Reed-Solomon RS-FEC(528,514) algorithm adds 14 bits of encoding information to a 514 bit stream. It replaces and uses the same amount of overhead in the 64B/66B encoding so that the bit rate is not affected. The algorithm can correct 7 bit errors in a 514 bit stream. RS(528,514) is used on 25G (NRZ) lanes, including 25G, 50G-CR2, and 100G-SR4/CR4 interfaces.
The Reed-Solomon RS-FEC(544,514) algorithm adds 30 bits of overhead to correct 14+ bit errors per 514 bits. FEC RS is required on 50G (PAM4) lanes; all 200G, 400G, 100G-CR2 and 50G-CR interfaces.
Base-R (also known as FireCode/FC) FEC adds 32 bits per 32 blocks of 64B/66B to correct 11 bits per 2048 bits. It replaces one bit per block, so it uses the same amount of overhead as 64B/66B encoding. It is used in 25G interfaces only. The algorithm executes faster than the RS-FEC algorithm, so latency is reduced. Both RS-FEC and Base-R FEC are implemented in hardware.
None/Off: FEC is optional and is often useful on 25G lanes, which includes 100G-SR4/CR4 and 50G-CR2 links. If the cable quality is good enough to achieve a BER of 10^-12 without FEC, then there is no reason to enable it. 10G/40G links should never require FEC. If a 10G/40G link has errors, replace the cable or module that is causing the error.
Auto: FEC can be auto-negotiated between 2 devices. When auto-negotiation is ON, the default FEC setting is auto to enable FEC capability information to be sent and received with the neighbor. The port FEC active/operational setting is set to the result of the negotiation. Auto is the default setting on NVIDIA switches (auto-negotiation ON is the default setting).
If auto-negotiation is disabled on 100G and 25G interfaces, you must set FEC to OFF, RS, or BaseR to match the neighbor. The FEC default setting of auto does not link up when auto-negotiation is disabled.
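The numbers in the RS-FEC names can be worked through with basic Reed-Solomon arithmetic: an RS(n, k) code carries n - k parity symbols and can correct up to (n - k) / 2 symbol errors per codeword. The following Python sketch assumes 10-bit symbols, as in the IEEE 802.3 definition of these codes; the function name rs_parameters is chosen for illustration.

```python
def rs_parameters(n, k, symbol_bits=10):
    """Summarize a Reed-Solomon RS(n, k) code with symbol_bits-wide symbols."""
    parity = n - k
    return {
        "parity_symbols": parity,
        "correctable_symbols": parity // 2,
        "codeword_bits": n * symbol_bits,
        "overhead_percent": round(100 * parity / k, 2),
    }


print(rs_parameters(528, 514))  # 14 parity symbols, corrects up to 7 symbols
print(rs_parameters(544, 514))  # 30 parity symbols, corrects up to 15 symbols

# Base-R FEC: 32 blocks of 64B/66B give up one bit each to make room for
# 32 parity bits, so the group of 32 x 66 bits keeps its size.
assert 32 * 65 + 32 == 32 * 66
```

The stronger RS(544,514) code spends about twice the parity of RS(528,514), which is consistent with its use on the more error-prone PAM4 lanes.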
In some cases, the configured value might be different than the operational value. In such cases, the l1-show command displays both values. For example:
Configured: Auto, Operational: RS.
Configured: RS, but link is down, so Operational is: None/Off
Signal Integrity
The goal of Ethernet protocols and technologies is to enable the bits generated on one side of a link to be received correctly on the other side. The next two sections provide information about what might be happening on the link level when the link is down or bits are not received correctly.
Link State: RX Power, Signal Detection, Signal Lock, Carrier Detection, RX Fault
Various characteristics show the state of a link. All characteristics might not be available to display on all platforms.
RX power: On optical modules with DDM/DOM capabilities, the module shows the power level of the received signal. Note that a module can receive a signal with plenty of power, but still not be able to recover the data from a signal because it is distorted.
Signal detected: A signal is received from the remote device on the local port receiver.
Signal lock: The local port receiver is locked onto a good signal that is received from the remote side.
Carrier detected: Both ends of the link are able to understand the data being sent to them. The link should be up on both sides.
RX fault (None, Local, Remote or Local/Remote): The local end or the remote end is alerting that it is not receiving and/or understanding a good signal and bit stream.
Local fault indicates the local end does not have signal lock or cannot understand the data sent to it on its RX path.
Remote fault indicates that the local end RX path has signal lock and can understand the bit stream from the neighbor, but the remote neighbor is sending alerts over that working path indicating that it has no signal lock or cannot understand the data sent to it over its own RX path.
Eyes
When a 1 or a 0 bit is transmitted across a link, it is represented on the electrical side of the port as either a high voltage level or a low voltage level. If an oscilloscope is attached to those leads, as the bit stream is transmitted across it, the transitions between 1 and 0 form a pattern in the shape of an eye.
The farther the distance between the 1 and 0, the more open the eye appears. The more open the eye is, the less likely it is for a bit to be misread. When a bit is misread, it causes a bit-error, which results in an FCS error on the entire packet being received. A lower eye measurement generally translates to a larger bit error rate (BER). FEC can correct bit errors up to a point.
Eyes are not measured on fixed copper ports and are not measured when a link is down.
Each hardware vendor implements some quantitative measurement of eyes and some kind of qualitative measurement.
On an NVIDIA switch, the eyes are assigned a height in mV and a grade. For speeds below 100G (NRZ encoding), when the grade goes below 4000, the error rate or stability of the link might be negatively impacted.
A link might have no stability problems with a measurement below these values, and FEC might correct all errors presented on such a link. For some interface types, FEC is required to remove errors up to BER levels that are expected on the media.
For 50G lanes (200G- and 400G-capable ports), the link uses PAM4 encoding, which has 3 eyes stacked on top of each other and therefore much smaller eye measurements. FEC is required on these links.
l1-show Command
Because Linux Ethernet tools do not have a unified approach to the various vendor driver implementations and areas that affect layer 1, Cumulus Linux uses the l1-show command to show all layer 1 aspects of a Cumulus Linux port and link.
You must run the l1-show command as root. The syntax for the command is:
cumulus@switch:~$ sudo l1-show PORTLIST
Here is the output from the NVIDIA SN2410 switch on the other side of the same link:
cumulus@2410-switch:~$ sudo l1-show swp43
Port: swp43
Module Info
Vendor Name: Mellanox PN: MCP2M00-A003
Identifier: 0x03 (SFP) Type: 25g-cr
Configured State
Admin: Admin Up Speed: 25G MTU: 9216
Autoneg: On FEC: Auto
Operational State
Link Status: Kernel: Up Hardware: Up
Speed: Kernel: 25G Hardware: 25G
Autoneg: On (Autodetect enabld) FEC: None
TX Power (mW): None
RX Power (mW): None
Topo File Neighbor: bcm-switch-1, swp43
LLDP Neighbor: bcm-switch-1, swp43
Port Hardware State:
Compliance Code: 100GBASE-CR4 or 25GBASE-CR CA-L
Cable Type: Passive copper cable
Speed: 25G Autodetect: Enabled
Eyes: 79 Grade: 5451
Troubleshooting Info: No issue was observed.
The output is in the following sections:
Module Info: Shows basic information about the module, according to the module EEPROM.
Configured State: Shows configuration information of the port, as defined in the kernel.
Operational State: Shows high level details of the actual link status of the port in the hardware and kernel.
Port Hardware State: Shows low level port information from the port on the switch ASIC.
The configured state reflects the configuration you apply to the kernel with ifupdown2. The switchd daemon translates the kernel state to the platform hardware state and keeps them in sync.
Configured State
Admin: Admin Up Speed: 25G MTU: 9216
Autoneg: On FEC: Auto
Admin state:
Admin Up means the kernel has enabled the port with NVUE, ifupdown2, or temporarily with ip link set <swp> up.
Admin Down means the kernel has disabled the port.
Speed:
The configured speed in the kernel.
You can lower the speed with NVUE or ifupdown2.
If you enable auto-negotiation, this output displays the negotiated or auto-detected speed.
MTU: The configured MTU setting in the kernel.
Autoneg: The configured auto-negotiation state in the kernel. See Auto-negotiation for more information.
FEC: The configured state of FEC in the kernel. See FEC, above for more information.
Operational State
The operational state shows the current state of the link in the kernel and in the switch hardware.
Operational State
Link Status: Kernel: Up Hardware: Up
Speed: Kernel: 25G Hardware: 25G
Autoneg: On (Autodetect enabld) FEC: None
TX Power (mW): None
RX Power (mW): None
Topo File Neighbor: switch-1, swp43
LLDP Neighbor: switch-1, swp43
Link Status and Speed: The kernel state and hardware state match, unless the link is in some unstable or transitory state.
Autoneg and Autodetect: See Auto-negotiation above for more information.
FEC: The operational state of FEC on the link. See FEC above for more information.
TX Power and RX Power: These values come from the module DDM/DOM fields and indicate the optical power measured on the module, if the module implements the feature. A module might report both values, only RX, or neither. This does not apply to DAC and twisted pair interfaces.
Topo File Neighbor: If you populate the /etc/ptm.d/topology.dot file and the ptmd daemon is active, the entry matching this interface shows.
LLDP Neighbor: If the lldpd daemon is running and the switch receives LLDP data from the neighbor, the neighbor information shows here.
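Because the kernel and hardware values should agree on a stable link, a mismatch is a quick indicator that the link is in transition. The following Python sketch looks for such mismatches in saved l1-show output. It is not part of Cumulus Linux: the function name is hypothetical, and the regular expression assumes the Kernel: and Hardware: labels appear on one line, as in the output shown above.

```python
import re


def kernel_hardware_mismatches(l1_show_text):
    """Return the fields whose Kernel and Hardware values differ."""
    pattern = re.compile(
        r"^\s*(Link Status|Speed):\s+Kernel:\s+(\S+)\s+Hardware:\s+(\S+)",
        re.MULTILINE,
    )
    mismatches = {}
    for field, kernel, hardware in pattern.findall(l1_show_text):
        if kernel != hardware:
            mismatches[field] = (kernel, hardware)
    return mismatches


SAMPLE = """Operational State
  Link Status: Kernel: Up Hardware: Up
  Speed: Kernel: 25G Hardware: 25G
"""

print(kernel_hardware_mismatches(SAMPLE))  # {}
print(kernel_hardware_mismatches(SAMPLE.replace("Hardware: Up", "Hardware: Down")))
```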
Port Hardware State
The port hardware state shows additional low level port information. The output varies between vendors.
Here is the output on NVIDIA platforms:
Port Hardware State:
Compliance Code: 100GBASE-CR4 or 25GBASE-CR CA-L
Cable Type: Passive copper cable
Speed: 25G Autodetect: Enabled
Eyes: 79 Grade: 5451
Troubleshooting Info: No issue was observed.
The NVIDIA port firmware automatically troubleshoots link problems and displays items of concern in the Troubleshooting Info section of this output.
Design a test that best displays the lowest level indicator of that problem behavior. The hierarchy view of l1-show is often the best tool to find this indicator.
Make changes based on the problem type that leads toward isolating the root cause of the failure. Use the test to track progress.
Identify if the issue is likely a configuration issue or a hardware issue. If unclear, start with configuration first.
For configuration issues, ensure the configuration on both ends of the link matches the guidance in this guide and in Switch Port Attributes.
For hardware issues, isolate the faulty component by methodically moving and replacing components as described in Isolate Faulty Hardware below.
After you isolate the root cause, make the changes permanent to resolve the problem. For faulty hardware, replace the failed component.
See the sections below for specific guidance for each problem type.
Isolate Faulty Hardware
When you suspect that one of the components in a link is faulty, use the following approach to determine which component is faulty.
First, identify the faulty behavior at the lowest level possible, then design a test that best displays that behavior. Use the hierarchy output of l1-show to find the best indicator. Here are some examples of tests you can use:
No RX power: Examine the RX power in the Operational State section of the l1-show output.
Remote side is sending RX Faults: Check the Troubleshooting Info field for the message Neighbor is sending remote faults.
Errors on link when FEC is not required: Examine the HwIfInErrors counters in ethtool -S <swp> to see if they are incrementing over time.
Try swapping the modules and fibers to determine which component is bad:
Swap the DAC, AOC or fiber patch cables along the path with known good cables. Does the test indicate that the symptoms change?
Swap the modules between the local and remote. Does the test indicate the symptoms move with the module or stay on the same neighbor?
Loopback tests: Move one of the modules to the neighbor and connect the two modules back-to-back in the same switch, ideally with the same cable. What does the test indicate now? Now, move both modules to the other side and repeat. Try to isolate the issue to a single fiber, module, port, platform or configuration.
Replace each module one at a time with a different module of the same type; the current module could be bad.
Replace each module with a different module from a different vendor. Use a module that the Cumulus Linux switch supports.
Troubleshoot Down or Flapping Links
A down or flapping link can exhibit any or all of the following symptoms:
l1-show returns Link Status: Kernel: Down and Hardware: Down for the operational state.
ip link show <swp> returns <NO-CARRIER,BROADCAST,MULTICAST,UP>. An up link returns something like <BROADCAST,MULTICAST,UP,LOWER_UP>.
ip link show changes every second or two, indicating the link is flapping up or down.
Log messages in /var/log/linkstate indicate the carrier is flapping up or down.
The switch does not receive any LLDP data, or the link is flapping.
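The carrier flags that ip link show reports can be classified with a short script. The following is a minimal sketch in Python, assuming you pass it the flags portion of the ip link show <swp> output as a string; the function name carrier_state is hypothetical and is not part of Cumulus Linux.

```python
import re

def carrier_state(ip_link_line: str) -> str:
    """Classify a link from the <...> flags in `ip link show` output.

    Returns "up", "no-carrier", "admin-down", or "unknown".
    """
    match = re.search(r"<([^>]*)>", ip_link_line)
    if not match:
        return "unknown"
    flags = set(match.group(1).split(","))
    if "UP" not in flags:
        return "admin-down"   # the interface is not administratively up
    if "NO-CARRIER" in flags:
        return "no-carrier"   # admin up, but the link has no carrier
    if "LOWER_UP" in flags:
        return "up"           # admin up and carrier present
    return "unknown"

print(carrier_state("<NO-CARRIER,BROADCAST,MULTICAST,UP>"))
print(carrier_state("<BROADCAST,MULTICAST,UP,LOWER_UP>"))
```

Calling the function every second or two on fresh output and comparing consecutive results is one way to spot a flapping link.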
To begin troubleshooting, examine the output of l1-show on both ends of the link if possible. The output contains all the pertinent information to help troubleshoot the link.
cumulus@switch:~$ sudo l1-show swp10
Port: swp10
Module Info
Vendor Name: FINISAR CORP. PN: FTLX8574D3BCL
Identifier: 0x03 (SFP) Type: 10g-sr
Configured State
Admin: Admin Up Speed: 10G MTU: 9216
Autoneg: On FEC: Auto
Operational State
Link Status: Kernel: Up Hardware: Up
Speed: Kernel: 10G Hardware: 10G
Autoneg: On (Autodetect enabld) FEC: None
TX Power (mW): [0.5267]
RX Power (mW): [0.5427]
Topo File Neighbor: qct-ix8-51, swp3
LLDP Neighbor: qct-ix8-51, swp3
Port Hardware State:
Compliance Code: 10G Base-SR
Cable Type: Optical Module (separated)
Speed: 10G Autodetect: Enabled
Eyes: 411 Grade: 41609
Troubleshooting Info: No issue was observed.
Working from top to bottom of the l1-show output on both sides of the link, ask the questions listed below.
Does the module vendor name and vendor part number match the module that connects to the switch? Is this port the correct port with the link problem? Is the right module installed?
Does the device at the remote end recognize the module as the same Ethernet type that the local switch recognizes? Vendor outputs differ. If the remote device is not a Cumulus Linux device, consult the vendor documentation to determine how to display the Ethernet type for the installed module.
Examine Configured State
Configured State
Admin: Admin Up Speed: 10G MTU: 9216
Autoneg: On FEC: Auto
Admin: Is the link Admin Up? Is the link configured and enabled?
Speed: Is the configured speed correct? Does it match the configured speed on the remote side?
MTU: Does the MTU match on both sides? Note that an MTU mismatch does not prevent the link from coming up, but it does affect traffic forwarding.
Autoneg: Does this setting match the configuration and is it what you expect? See Auto-negotiation.
FEC: Is the FEC setting correctly configured? See FEC.
Examine Operational State
Operational State
Link Status: Kernel: Up Hardware: Up
Speed: Kernel: 10G Hardware: 10G
Autoneg: On (Autodetect enabld) FEC: None
TX Power (mW): [0.5267]
RX Power (mW): [0.5427]
Topo File Neighbor: qct-ix8-51, swp3
LLDP Neighbor: qct-ix8-51, swp3
Link status (Kernel and Hardware): What is the current state of both?
Normally these values should be in sync.
When troubleshooting a link down issue, one or both of these values is down (usually both).
In a link flapping issue, one or both of these values might change every second or less, so the output of this field might not represent the value in the next moment in time.
Speed (Kernel and Hardware): Does the operational speed match the configured speed?
When the link is up, the kernel and hardware operational values should be in sync with each other and the configured speed.
When the link is down and auto-negotiation is on, the kernel value is Unknown! because the hardware does not synchronize to a speed.
When the link is down and auto-negotiation is off, the kernel speed displays the configured value. The Hardware field shows various values, depending on implementation of the particular hardware interface.
Autoneg and Autodetect: Normally the operational value matches the configured value. This is informational only, but it is useful to know if autodetect is on. See auto-negotiation and autodetect.
FEC: This field is only useful for informational purposes. It displays the actual FEC only when the link is up.
When the link is down, the operational FEC is None.
When the link is up, this field shows the actual FEC value working on the link.
TX Power and RX Power (optical modules only): If the module supports laser power DDM/DOM, are these values in working ranges?
Check the Laser send power high/low alarm and Laser receive power high/low warning thresholds in the output of ethtool -m <swp> to see what the expected low and high values are.
A short range module typically transmits between 0.6 and 1.0 mW and should work with receive power above 0.05 mW.
Long range optical modules have TX power above 1.0 mW.
When in doubt, consult the technical specifications for the particular module.
A value of 0.0000 or 0.0 indicates that the module does not support DDM/DOM TX or RX power, or the module is not transmitting or receiving a signal.
If the TX Power is 0.0000 or 0.0, then either the module does not support TX DDM/DOM or the module lasers are off for some reason.
If the RX Power is 0.0000 or 0.0, then either the module does not support RX DDM/DOM or the module receivers are not receiving a signal.
A value of 0.0001 indicates that the module supports DDM/DOM, but the module is not receiving or transmitting a signal.
If the TX Power is 0.0001, then the module lasers are likely disabled for some reason.
If the RX Power is 0.0001, then the module receivers are not receiving a signal.
On a QSFP module, check the value for each of the four lanes. Sometimes only one lane is failing and the entire link is down as a result.
Topo File Neighbor: If you configure a ptmd topology file on the switch, you can identify the link neighbor you expect.
LLDP Neighbor: Does this match the expected neighbor and port?
This value is the neighbor and port that LLDP reports.
If the link is down, this value is normally blank.
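The TX Power and RX Power guidance above can be turned into a per-lane check. The following is a minimal sketch in Python, not a definitive implementation. The function names are hypothetical, and the 0.05 mW default is only the short range receive guidance from this section; consult the module specifications for the real limits.

```python
import math

def mw_to_dbm(mw: float) -> float:
    """Convert optical power from milliwatts to dBm."""
    return 10 * math.log10(mw)

def classify_rx_lane(mw: float, low_threshold_mw: float = 0.05) -> str:
    """Classify one RX power reading (in mW) from the l1-show output."""
    if mw == 0.0:
        # 0.0000 or 0.0: no RX DDM/DOM support, or no signal received
        return "no DDM/DOM support or no signal"
    if mw <= 0.0001:
        # 0.0001: DDM/DOM is supported, but the lane receives no signal
        return "DDM/DOM supported, no signal"
    if mw < low_threshold_mw:
        return "low"
    return "ok"

# RX power values from the swp6 example in this section
rx_power = [0.159, 0.1732, 0.153, 0.0067]
for lane, mw in enumerate(rx_power):
    print(lane, mw, classify_rx_lane(mw))
```

On a QSFP module, run the check against each of the four lanes, because a single failing lane can take down the entire link.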
Examine Port Hardware State
The following values come from the NVIDIA port firmware:
What does the firmware assess as the problem? Although this information is at the end of the output, this is sometimes the first place to look for basic guidance.
Examples:
The port is closed by command. Please check that the interface is enabled. Configure the port so it is Admin Up.
The cable is unplugged. The firmware does not detect a module. Check to see if a module is in this port, or reseat the module.
Auto-negotiation no partner detected. The link is down because it does not see the neighbor. This message alone is not enough to determine the cause.
Check the configurations on both sides for an auto-negotiation or FEC configuration mismatch.
If the link is a fiber link and the module supports RX/TX Power DDM/DOM, check the RX Power and TX Power values in the Operational State output of l1-show to help determine which component fails.
Follow the steps in Isolate Faulty Hardware; use this value or the RX/TX Power DDM/DOM value as the test.
Force Mode no partner detected. Auto-negotiation or autodetect is off and the link is down because it does not see the neighbor. This message alone is not enough to determine the cause.
Check the configurations on both sides for a speed, auto-negotiation, or FEC configuration mismatch.
If the link is a fiber link and the module supports RX/TX Power DDM/DOM, check the RX Power and TX Power values in the Operational State output of l1-show to help determine which component fails.
Follow the steps in Isolate Faulty Hardware; use this value or the RX/TX Power DDM/DOM value as the test.
Neighbor is sending remote faults. This end of the link is receiving data from the neighbor, but the neighbor is not receiving recognizable data from the local port. See RX Fault in Signal Integrity above for details. Either the local device is not transmitting, or the remote receiver is not receiving recognizable data or is receiving broken data.
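The firmware messages listed above can be collected into a small lookup that returns the suggested next step. The following is a minimal sketch in Python; the dictionary and the next_step function are hypothetical helpers, and the guidance strings only summarize the advice in this section.

```python
# Maps a fragment of the Troubleshooting Info message to the guidance above.
NEXT_STEPS = {
    "port is closed by command":
        "Configure the port so it is Admin Up.",
    "cable is unplugged":
        "Check that a module is in the port, or reseat the module.",
    "auto-negotiation no partner detected":
        "Check both sides for an auto-negotiation or FEC mismatch, "
        "then check RX/TX power.",
    "force mode no partner detected":
        "Check both sides for a speed, auto-negotiation, or FEC mismatch, "
        "then check RX/TX power.",
    "neighbor is sending remote faults":
        "The neighbor is not receiving recognizable data from the local port.",
    "no issue was observed":
        "The firmware reports no problem on this port.",
}

def next_step(troubleshooting_info: str) -> str:
    """Return the suggested next step for a Troubleshooting Info message."""
    text = troubleshooting_info.lower()
    for fragment, step in NEXT_STEPS.items():
        if fragment in text:
            return step
    return "No guidance recorded for this message."

print(next_step("Troubleshooting Info: No issue was observed."))
```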
RX Signal Failure Examples
Here is the output from l1-show for an AOC (on swp6) with failed RX on lane 3. Because an AOC is an integrated fiber assembly, you must replace the entire assembly.
Port: swp6
Module Info
Vendor Name: XXXXX PN: AOC-XXXX
Identifier: 0x0d (QSFP+) Type: 40g-sr4
Configured State
Admin: Admin Up Speed: 40G MTU: 9216
Autoneg: Off FEC: Off
Operational State
Link Status: Kernel: Down Hardware: Down <=Link is down, Kernel and Hardware
Speed: Kernel: 40G Hardware: 40G
Autoneg: Off FEC: None (down)
TX Power (mW): [1.1645, 1.171, 1.1155, 1.0945]
RX Power (mW): [0.159, 0.1732, 0.153, 0.0067] <=Low power on lane 3
Topo File Neighbor: switch_1, swp6
LLDP Neighbor: None, None
Port Hardware State:
Rx Fault: Local <=Local RX Failed Carrier Detect: no <=No bi-directional communication
Rx Signal: Detect: YYYY Signal Lock: YYYN <=No signal lock on lane 3
Ethmode Type: 40g-sr4 Interface Type: SR4
Speed: 40G Autoneg: Off
MDIX: ForcedNormal, Normal FEC: Off
Local Advrtsd: None Remote Advrtsd: None
Eyes: L: 357, R: 326, U: 211, D: 219, L: 328, R: 312, U: 206, D: 211,
L: 359, R: 343, U: 211, D: 200, L: 0, R: 0, U: 0, D: 0 <= No valid eye on lane 3
Here is the l1-show output for an AOC with failed lanes 0 and 1. Note that signal lock is bouncing, and sometimes shows Y. You must replace the AOC.
Port: swp8
Module Info
Vendor Name: XXXX PN: AOC-XXXX
Identifier: 0x0d (QSFP+) Type: 40g-sr4
Configured State
Admin: Admin Up Speed: 40G MTU: 9216
Autoneg: Off FEC: Off
Operational State
Link Status: Kernel: Down Hardware: Down <=Link is down, Kernel and Hardware
Speed: Kernel: 40G Hardware: 40G
Autoneg: Off FEC: None (down)
TX Power (mW): [1.1762, 1.1827, 1.1272, 1.1062]
RX Power (mW): [0.0001, 0.0001, 0.5255, 0.64] <=Low power on lanes 0,1
Topo File Neighbor: switch_2, swp10
LLDP Neighbor: None, None
Port Hardware State:
Rx Fault: Local <=Local RX Failed Carrier Detect: no <=No bi-directional communication
Rx Signal: Detect: YYYY Signal Lock: YNYY <=No lock on lane 1 at moment of capture
Ethmode Type: 40g-sr4 Interface Type: SR4
Speed: 40G Autoneg: Off
MDIX: ForcedNormal, Normal FEC: Off
Local Advrtsd: None Remote Advrtsd: None
Eyes: L: 0, R: 0, U: 0, D: 0, L: 0, R: 0, U: 0, D: 0, <=No valid eyes on lanes 0,1
L: 359, R: 359, U: 214, D: 226, L: 359, R: 359, U: 243, D: 264
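The Signal Lock and Eyes fields in the two examples above can be checked per lane with a short script. The following is a minimal sketch in Python, assuming the field values are copied from the l1-show output without the <= annotations; the function names are hypothetical.

```python
import re
from typing import List

def lanes_without_lock(signal_lock: str) -> List[int]:
    """Return the lanes whose Signal Lock flag is not Y, for example YYYN."""
    return [lane for lane, flag in enumerate(signal_lock) if flag != "Y"]

def lanes_without_eye(eyes_text: str) -> List[int]:
    """Return the lanes whose four eye values (L, R, U, D) are all zero."""
    values = [int(v) for v in re.findall(r"[LRUD]:\s*(\d+)", eyes_text)]
    lanes = [values[i:i + 4] for i in range(0, len(values), 4)]
    return [lane for lane, eye in enumerate(lanes) if not any(eye)]

# Values from the swp6 example above
swp6_eyes = ("L: 357, R: 326, U: 211, D: 219, L: 328, R: 312, U: 206, D: 211, "
             "L: 359, R: 343, U: 211, D: 200, L: 0, R: 0, U: 0, D: 0")
print(lanes_without_lock("YYYN"))
print(lanes_without_eye(swp6_eyes))
```

Because signal lock can bounce on a failing lane, as in the swp8 example, a single capture of Signal Lock might not show every failed lane. The Eyes and RX Power values give a second indication.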
Troubleshoot Physical Errors on a Link
Physical errors on a link occur if you have signal integrity issues or you do not configure the required FEC type on a particular module or cable type.
The target bit error rate (BER) in high-speed Ethernet is 10^-12. When BER exceeds this value, either configure the correct FEC setting or replace a marginal module, cable, or fiber patch. If the resulting BER on a link with correctly configured FEC is still unacceptable, you need to replace one of the hardware components in the link to resolve the errors.
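To get a feel for what the target BER means in practice, you can estimate the average time between bit errors at a given line rate. The following is a minimal sketch in Python that assumes bits flow continuously at the nominal line rate and ignores encoding overhead; the function name is hypothetical.

```python
def seconds_per_bit_error(line_rate_gbps: float, ber: float = 1e-12) -> float:
    """Average seconds between bit errors at the given rate and BER."""
    bits_per_second = line_rate_gbps * 1e9
    return 1.0 / (bits_per_second * ber)

for rate in (10, 25, 100):
    print(f"{rate}G: about {seconds_per_bit_error(rate):.0f} seconds per bit error")
```

Under these assumptions, a 10G link running exactly at the target BER sees roughly one bit error every 100 seconds and a 100G link roughly one every 10 seconds. Error counters that increment much faster than this point to a FEC or hardware problem.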
To see error counters for a port, run the ethtool -S <swp> | grep Errors command. If FEC is on, these counters only count errors that FEC does not correct.
On NVIDIA switches, to see the bit error count that FEC corrects on a link, run the sudo l1-show <swp> --pcs-errors command.
Because errors can occur during link up and down transitions, it is best to check error counters over a period of time to ensure they are incrementing regularly instead of displaying stale counts from when the link last transitioned up or down. The /var/log/linkstate log files show historical link up and link down transitions on a switch.
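One way to check whether error counters are incrementing is to capture the ethtool -S <swp> output twice and compare the two snapshots. The following is a minimal sketch in Python. The function names are hypothetical, and the sample text is illustrative input, not captured switch output.

```python
from typing import Dict

def parse_ethtool_stats(text: str) -> Dict[str, int]:
    """Parse 'name: value' lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition(":")
        if sep and value.strip().isdigit():
            stats[name.strip()] = int(value.strip())
    return stats

def incrementing_errors(before: Dict[str, int],
                        after: Dict[str, int]) -> Dict[str, int]:
    """Return the error counters that grew between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if "Errors" in name and name in before and after[name] > before[name]
    }

# Illustrative snapshots taken some time apart (values are made up)
first = parse_ethtool_stats("NIC statistics:\n     HwIfInErrors: 10")
second = parse_ethtool_stats("NIC statistics:\n     HwIfInErrors: 25")
print(incrementing_errors(first, second))
```

An empty result means the counts are stale, for example left over from the last link transition.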
Troubleshoot Signal Integrity Issues
Signal integrity issues are often a root cause of different types of symptoms:
If the signal integrity is very poor or non-existent, the link stays down.
If the signal integrity is too marginal, the link might flap with or without FEC on.
If the signal integrity is marginal, the link might display physical error counts. Depending on the link speed and cable type, the module or cable can have some margin of error in signal integrity. In these cases, use FEC to correct errors to reach the target IEEE bit error rate (BER) of 10^-12 on the link. See FEC for guidance.
If FEC is on and the bit stream does not recover acceptably, the link stays down. If the signal integrity is marginal, but bad enough that FEC cannot correct an acceptable rate of errors, the link flaps when FEC signals a restart of the link to attempt to restore an acceptable bit stream.
To see error counters for a port, run the ethtool -S <swp> | grep Errors command. If FEC is on, these counters only count errors that FEC does not correct.
To see counts of bit errors that FEC corrects on a link, run the sudo l1-show <swp> --pcs-errors command.
Signal integrity issues are physical issues and usually, you must replace some hardware component in the link to fix the link. Follow the steps in Isolate Faulty Hardware to isolate and replace the failed hardware component.
On rare occasions, if the switch does not recognize a module correctly and treats it as the wrong type (for example, active instead of passive), a signal integrity issue can result.
An MTU size mismatch usually shows up as a failure in a higher layer protocol, such as OSPF adjacencies that fail to form, or as lost non-fragmentable packets. Generally, an MTU settings mismatch does not affect link operational status.
To troubleshoot a suspected MTU problem, review the Configured State section in the output of l1-show:
Configured State
Admin: Admin Up Speed: 10G MTU: 9216 <===
Autoneg: On FEC: Auto
Compare the MTU configuration with that of the neighbor. Do they match?
Note that different vendors sometimes have different interpretations on the MTU value and might vary by a few bytes from another vendor. Research the vendor documentation to determine if you need to adjust this value in the configuration to match a neighbor.
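The few-byte difference between vendors often comes from whether the configured value counts the Ethernet header, VLAN tags, or the frame check sequence (FCS). The following is a minimal sketch in Python of that arithmetic. It assumes the locally configured value is a Linux-style MTU that excludes the Ethernet header; the function name is hypothetical, and you must confirm the neighbor's interpretation in the vendor documentation.

```python
ETHERNET_HEADER_BYTES = 14   # destination MAC, source MAC, EtherType
VLAN_TAG_BYTES = 4           # one 802.1Q tag
FCS_BYTES = 4                # frame check sequence

def frame_size(mtu: int, vlan_tags: int = 0, include_fcs: bool = True) -> int:
    """Return the on-wire frame size for an MTU that excludes the L2 header."""
    size = mtu + ETHERNET_HEADER_BYTES + VLAN_TAG_BYTES * vlan_tags
    if include_fcs:
        size += FCS_BYTES
    return size

print(frame_size(9216))                     # untagged, with FCS
print(frame_size(9216, include_fcs=False))  # untagged, without FCS
print(frame_size(9216, vlan_tags=1))        # one VLAN tag, with FCS
```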
Troubleshoot High Power Module Issues
The SFF specifications allow for modules of different power consumption levels along with a request and grant procedure to enable higher levels.
An SFP module can have 3 different power classes:
1.0W
1.5W
2.0W
Cumulus Linux enables power class 2 (1.5W) by default. All Cumulus Linux switches support 1.5W across all SFP ports simultaneously.
A QSFP module can have 8 different power classes:
1.5W
2.0W
2.5W
3.5W
4.0W
4.5W
5.0W
10.0W
Low power mode is power class 1 (1.5W). This is the state during initial boot.
After hardware initialization, Cumulus Linux enables normal power mode on QSFP modules by default — power classes 2-4, 2.0W to 3.5W.
All Cumulus Linux switches support 3.5W across all QSFP ports simultaneously.
Some modules require high power modes for driving long distance lasers. Power classes 5-8 — 4.0W, 4.5W, 5.0W, 10.0W — are high power modes. If a module needs a high power mode, it can request it, which the switch grants if the port supports it.
To determine if a switch supports higher power modes, consult the hardware manufacturer specifications for power limitations for a switch.
NVIDIA switches vary in their support of high power modules. For example, on some NVIDIA Spectrum 1 switches, only the first and last two QSFP ports support up to QSFP power class 6 (4.5W) and only the first and last two SFP ports support SFP power class 3 (2.0W) modules. Other Spectrum 1 switches do not support high power ports at all. Consult the hardware manufacturer specifications for exact details of which ports support high power modules.
The total bus power rating is the default power rating per port type (SFP: 1.5W, QSFP: 3.5W) multiplied by the number of ports of each type present on the bus.
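The power classes and the total bus power rating described above can be sketched as a lookup and a short calculation. The following Python example is illustrative only. The constant and function names are hypothetical, and the port counts in the usage line are made up.

```python
# Power class to maximum power (W), as listed in this section
SFP_CLASS_WATTS = {1: 1.0, 2: 1.5, 3: 2.0}
QSFP_CLASS_WATTS = {1: 1.5, 2: 2.0, 3: 2.5, 4: 3.5,
                    5: 4.0, 6: 4.5, 7: 5.0, 8: 10.0}

DEFAULT_SFP_WATTS = 1.5    # default rating per SFP port
DEFAULT_QSFP_WATTS = 3.5   # default rating per QSFP port

def total_bus_power_rating(sfp_ports: int, qsfp_ports: int) -> float:
    """Default per-port rating multiplied by the ports of each type."""
    return sfp_ports * DEFAULT_SFP_WATTS + qsfp_ports * DEFAULT_QSFP_WATTS

def is_high_power_qsfp(power_class: int) -> bool:
    """QSFP power classes 5-8 are high power modes."""
    return power_class >= 5

# Example: a bus with 48 SFP ports and 8 QSFP ports (made-up port counts)
print(total_bus_power_rating(48, 8))
```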
To see the requested and enabled status for a high power module, review the output of sudo ethtool -m <swp>. The following output is from a module with a power class between 1 and 4 (1.5W to 3.5W). The module does not request a high power class or the switch does not enable it.
cumulus@switch:mgmt:~# sudo ethtool -m swp53
Identifier : 0x11 (QSFP28)
Extended identifier : 0x00
Extended identifier description : 1.5W max. Power consumption <= ignore for high power modules
Extended identifier description : No CDR in TX, No CDR in RX
Extended identifier description : High Power Class (> 3.5 W) not enabled <= high power mode not requested or enabled
The following is the output from a power class 7 (5.0W) module. The module is requesting power class 7, but the switch does not support or enable it. The switch only supports power class 6 on this port.
cumulus@switch:mgmt:~# sudo ethtool -m swp49
[sudo] password for cumulus:
Identifier : 0x11 (QSFP28)
Extended identifier : 0xcf
Extended identifier description : 3.5W max. Power consumption <= ignore for high power modules
Extended identifier description : CDR present in TX, CDR present in RX
Extended identifier description : 5.0W max. Power consumption, High Power Class (> 3.5 W) not enabled <= Request 5.0W, not enabled
The following is the output from a power class 6 (4.5W) module. The module is requesting power class 6 and the switch enables it.
cumulus@switch:mgmt:~# sudo ethtool -m swp3
Identifier : 0x11 (QSFP28)
Extended identifier : 0xce
Extended identifier description : 3.5W max. Power consumption <= ignore for high power modules
Extended identifier description : CDR present in TX, CDR present in RX
Extended identifier description : 4.5W max. Power consumption, High Power Class (> 3.5 W) enabled <= Request 4.5W, enabled
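The high power request and grant status in the three outputs above can be extracted with a regular expression. The following is a minimal sketch in Python, assuming the text is the ethtool -m output without the <= annotations and in the same wording as the examples in this section; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

HIGH_POWER = re.compile(
    r"(\d+(?:\.\d+)?)W max\. Power consumption,\s*"
    r"High Power Class \(> 3\.5 W\) (not enabled|enabled)"
)

def high_power_status(text: str) -> Tuple[Optional[float], Optional[bool]]:
    """Return (requested watts, enabled) from `ethtool -m` output.

    Requested watts is None when the module does not request a high
    power class; both values are None when the field is missing.
    """
    match = HIGH_POWER.search(text)
    if match:
        return float(match.group(1)), match.group(2) == "enabled"
    if "High Power Class (> 3.5 W) not enabled" in text:
        return None, False
    return None, None

line = ("Extended identifier description : 5.0W max. Power consumption, "
        "High Power Class (> 3.5 W) not enabled")
print(high_power_status(line))
```

A result with a requested wattage and an enabled value of False matches the swp49 case, where the module requests a class that the port does not support.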
Troubleshoot I2C Issues
Ethernet switches contain multiple I2C buses set up for the switch CPU to communicate low speed control information with the port modules, fans, and power supplies within the system.
On rare occasions, a port module with a defective I2C component or firmware can fail and lock up one or more I2C buses. Depending on the particular hardware design of a switch and the way in which the failure occurs, different symptoms of this failure display. Often traffic continues to work for a while in this failed condition, but sometimes the failure can cause modules to be incorrectly configured, resulting in link failures or increased error rates on a link. In the worst case, the switch reboots or locks up.
Because I2C issues occur in the low speed control circuitry of a module, high speed traffic rates and the data side of the module do not affect them. Software bugs in Cumulus Linux do not cause these issues.
When the I2C bus has issues or lockups, installed port modules might no longer show up in the output of sudo l1-show <swp> or sudo ethtool -m <swp>. A significant number of smbus or i2c or EEPROM read errors might be present in /var/log/syslog. After one module locks up the bus, some or all the other modules then exhibit problems, making it nearly impossible to tell which module is causing the failure.
Failed I2C components or defective designs in port modules cause the overwhelming majority of I2C lockups. Low priced vendor modules cause most failures. High priced, high quality modules have a higher MTBF rating, but they can also fail, only with much lower incidence.
You might resolve the issue if you remove each port module one by one until the problem clears; this might indicate which module causes the failure. However, often the bus blocks in a way that requires a reboot or power cycle to clear the I2C failure. Clearing the failure in one of these ways works for a while, but when the conditions are right again, hours, days or months later, the marginal I2C component might fail again and lock up the switch.
In the worst situations, a switch might have multiple bad or marginal I2C modules from the same vendor batch, making it difficult to determine which module or modules are bad.
Because I2C problems can be very pernicious, often showing up again much later after the problem clears, deal with them quickly and forcefully.
To verify that an I2C failure is occurring, run sudo tail -F /var/log/syslog and look for smbus or i2c or EEPROM read errors that continue to appear or appear in bursts.
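If you prefer a count to watching the log scroll, you can scan saved log lines for the same keywords. The following is a minimal sketch in Python. The exact format of these messages is not shown in this guide, so the keyword pattern is an assumption and the sample lines are made up for illustration; the function name is hypothetical.

```python
import re
from typing import Iterable

# Assumption: matching errors mention smbus, i2c, or EEPROM in the message
I2C_KEYWORDS = re.compile(r"smbus|i2c|eeprom", re.IGNORECASE)

def count_i2c_errors(lines: Iterable[str]) -> int:
    """Count log lines that mention smbus, i2c, or EEPROM."""
    return sum(1 for line in lines if I2C_KEYWORDS.search(line))

# Made-up lines for illustration only, not real Cumulus Linux log output
sample = [
    "example: i2c bus read failed",
    "example: link state changed",
    "example: EEPROM read error on module",
]
print(count_i2c_errors(sample))
```

A count that keeps growing between runs, or grows in bursts, is consistent with an ongoing I2C failure.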
Based on the failure scenario when you discover the issue, choose when to address it: immediately or during a maintenance window.
If the failure negatively affects traffic or switch operation and you cannot route traffic through a redundant network, you must act immediately.
If you can route traffic around the failing switch, reroute the traffic, then find an appropriate time to troubleshoot the failing switch.
To troubleshoot the failure and restore the switch to working, use the following options according to the urgency of the situation:
Remove port modules one-by-one to see if the condition clears. This provides low probability of clearing the I2C failure, but possibly provides lower impact to traffic. If successful, this approach might reveal the problem module.
Restart switchd by running the sudo systemctl reset-failed ; sudo systemctl restart switchd command. Verify the condition clears after the restart completes. This provides a medium probability of clearing the I2C failure and a medium impact to traffic. It does not provide a way to discover which module failed.
Reboot the switch and verify the condition clears after the reboot completes. This provides a high probability of clearing the I2C failure, but also a high impact to traffic. It does not provide a way to discover which module failed.
Power cycle the switch and verify the condition clears after the reboot completes. This provides a very high probability of clearing the I2C failure, but also a very high impact to traffic. It does not provide a way to discover which module failed.
If the I2C failure recurs soon after a power cycle, use a binary search strategy: remove half the modules at a time and power cycle the switch after each removal.
If after removing all the modules, and power cycling, the I2C errors are still occurring, the next step is to remove each power supply and fan one by one between power cycles to see if one of those devices is blocking the I2C bus.
If you remove all modules, test each power supply and fan independently, and the I2C failures are still occurring, then the final step is to replace the switch.
If the switch is operational again due to one of the above methods, but you have not identified the module that caused the failure, try the following approach:
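The strategy of removing half the modules at a time can be pictured as a bisection. The following is a minimal sketch in Python of that logic, not an automated procedure. It assumes that exactly one module is faulty and that the failure reproduces reliably after each power cycle, which is not guaranteed with intermittent I2C problems. The fails_with callback is hypothetical and stands for the manual step of leaving only the listed modules installed, power cycling, and checking for I2C errors.

```python
from typing import Callable, Sequence

def find_faulty_module(modules: Sequence[str],
                       fails_with: Callable[[Sequence[str]], bool]) -> str:
    """Narrow down a single faulty module by halving the candidate set."""
    candidates = list(modules)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if fails_with(half):
            candidates = half                    # fault is in the installed half
        else:
            candidates = candidates[len(half):]  # fault is in the removed half
    return candidates[0]

# Simulated run: pretend the module in swp6 is the faulty one
ports = [f"swp{n}" for n in range(1, 9)]
print(find_faulty_module(ports, lambda installed: "swp6" in installed))
```

With eight modules, this approach needs three test rounds instead of up to eight single-module removals.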
If there is a history in the syslog files of occasional errors on one module in advance of the failure, remove or replace that module first.
Replace any module that has caused problems in the past.
NVUE enables you to check the status of an interface, and view and clear interface counters. Interface counters provide information about an interface, such as the number of packets dropped, the number of inbound and outbound packets not transmitted because of errors, and so on.
Monitor Interface Status
To check the configuration and statistics for an interface, run the nv show interface <interface> command:
cumulus@switch:~$ nv show interface swp1
operational applied pending
------------------------ ----------------- ------- -------
type swp swp
[acl]
evpn
multihoming
uplink off
ptp
enable off
router
adaptive-routing
enable off
ospf
enable on
area none
cost auto
mtu-ignore off
network-type broadcast
passive on
priority 1
authentication
enable off
bfd
enable off
timers
dead-interval 40
hello-interval 10
retransmit-interval 5
transmit-delay 1
ospf6
enable off
pbr
[map]
pim
enable off
synce
enable off
ip
igmp
enable off
ipv4
forward on
ipv6
enable on
forward on
neighbor-discovery
enable on
[dnssl]
home-agent
enable off
[prefix]
[rdnss]
router-advertisement
enable on
fast-retransmit on
hop-limit 64
interval 600000
interval-option off
lifetime 1800
managed-config off
other-config off
reachable-time 0
retransmit-time 0
router-preference medium
vrrp
enable off
vrf default
[gateway]
link
auto-negotiate off on
duplex full full
speed 1G auto
fec auto
mtu 9216 9216
[breakout]
state up up
stats
carrier-transitions 4
in-bytes 300 Bytes
in-drops 5
in-errors 0
in-pkts 5
out-bytes 9.73 MB
out-drops 0
out-errors 0
out-pkts 140188
mac 48:b0:2d:ef:52:b8
ifindex 3
Show Interface Counters
NVUE provides the following commands to show counters (statistics) for the interfaces on the switch.
NVUE Command: Description
nv show interface --view counters: Shows all statistics for all the interfaces configured on the switch, such as the total number of received and transmitted packets, and the number of received and transmitted dropped packets and error packets.
nv show interface <interface> counters: Shows all statistics for a specific interface, such as the number of received and transmitted unicast, multicast and broadcast packets, the number of received and transmitted dropped packets and error packets, and the number of received and transmitted packets of a certain size.
nv show interface <interface> counters errors: Shows the number of error packets for a specific interface, such as the number of received and transmitted packet alignment, oversize, undersize, and jabber errors.
nv show interface <interface> counters drops: Shows the number of received and transmitted packet drops for a specific interface, such as ACL drops, buffer drops, queue drops, and non-queue drops.
nv show interface <interface> counters pktdist: Shows the number of received and transmitted packets of a certain size for a specific interface.
nv show interface <interface> counters qos: Shows QoS statistics for the specified interface. See Show Qos Counters.
nv show interface <interface> counters ptp: Shows PTP statistics for a specific interface. See Show PTP Counters.
The following example shows all statistics for all the interfaces configured on the switch:
NVUE does not show detailed statistics for logical interfaces, such as bonds, VLAN interfaces or subinterfaces. To see basic statistics for logical interfaces, run the nv show interface <interface> link stats command.
On Mellanox switches, Cumulus Linux updates physical counters to the kernel every two seconds and virtual interfaces (such as VLAN interfaces) every ten seconds. You cannot change these values. Because the update process takes a lower priority than other switchd processes, the interval might be longer when the system is under a heavy load.
Clear Interface Counters
To clear counters (statistics) for all interfaces, run the nv action clear interface counters command:
The nv action clear interface <interface> counters command does not clear counters in the hardware.
Monitoring Interfaces and Transceivers with ethtool
The ethtool command enables you to query or control the network driver and hardware settings and takes the device name, such as swp1, as an argument. When the device name is the only argument, ethtool prints the network device settings. See man ethtool(8) for details.
NVIDIA recommends using the l1-show command to monitor Ethernet data; refer to Troubleshoot Layer 1.
Monitor Interface Status
To check the status of an interface, run the ethtool <interface> command:
cumulus@switch:~$ ethtool swp1
Settings for swp1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseT/Full
10000baseT/Full
Supported pause frame use: No
Supports auto-negotiation: No
Advertised link modes: 1000baseT/Full
Advertised pause frame use: No
Advertised auto-negotiation: No
Speed: 10000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: external
Auto-negotiation: off
Current message level: 0x00000000 (0)
Link detected: yes
The switch hardware includes the active port settings. The output of ethtool <interface> shows the port settings in the kernel. The switchd process keeps the hardware and kernel in sync for the important port settings (speed, auto-negotiation, and link detected). However, some fields in ethtool, such as Supported Link Modes and Advertised Link Modes, do not update based on the actual module in the port and might show incorrect or misleading results.
To query interface statistics, run the ethtool -S <interface> command:
Interface counters provide information about an interface. You can view this information when you run cl-netstat, ifconfig, or cat /proc/net/dev. You can also run sudo cl-netstat -c to save or clear the interface counters.
To see the cl-netstat command options, run the cl-netstat -h command.
Some services, such as MLAG and DHCP, can cause drop counters to increment. This is expected and does not indicate a problem on the switch.
Monitor Switch Port SFP and QSFP Hardware Information
To see hardware capabilities and measurement information on the SFP or QSFP module in a particular port, use the ethtool -m command. If the SFP or QSFP supports Digital Optical Monitoring (the Optical diagnostics support field is Yes in the output below), the optical power levels and thresholds also show below the standard hardware details.
In the sample output below, you can see that this module is a 1000BASE-SX short-range optical module, manufactured by JDSU, part number PLRXPL-VI-S24-22. The second half of the output displays the current readings of the Tx power levels (Laser output power) and Rx power (Receiver signal average optical power), temperature, voltage and alarm threshold settings.
cumulus@switch:~$ ethtool -m swp3
Identifier : 0x03 (SFP)
Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
Connector : 0x07 (LC)
Transceiver codes : 0x00 0x00 0x00 0x01 0x20 0x40 0x0c 0x05
Transceiver type : Ethernet: 1000BASE-SX
Transceiver type : FC: intermediate distance (I)
Transceiver type : FC: Shortwave laser w/o OFC (SN)
Transceiver type : FC: Multimode, 62.5um (M6)
Transceiver type : FC: Multimode, 50um (M5)
Transceiver type : FC: 200 MBytes/sec
Transceiver type : FC: 100 MBytes/sec
Encoding : 0x01 (8B/10B)
BR, Nominal : 2100MBd
Rate identifier : 0x00 (unspecified)
Length (SMF,km) : 0km
Length (SMF) : 0m
Length (50um) : 300m
Length (62.5um) : 150m
Length (Copper) : 0m
Length (OM3) : 0m
Laser wavelength : 850nm
Vendor name : JDSU
Vendor OUI : 00:01:9c
Vendor PN : PLRXPL-VI-S24-22
Vendor rev : 1
Optical diagnostics support : Yes
Laser bias current : 21.348 mA
Laser output power : 0.3186 mW / -4.97 dBm
Receiver signal average optical power : 0.3195 mW / -4.96 dBm
Module temperature : 41.70 degrees C / 107.05 degrees F
Module voltage : 3.2947 V
Alarm/warning flags implemented : Yes
Laser bias current high alarm : Off
Laser bias current low alarm : Off
Laser bias current high warning : Off
Laser bias current low warning : Off
Laser output power high alarm : Off
Laser output power low alarm : Off
Laser output power high warning : Off
Laser output power low warning : Off
Module temperature high alarm : Off
Module temperature low alarm : Off
Module temperature high warning : Off
Module temperature low warning : Off
Module voltage high alarm : Off
Module voltage low alarm : Off
Module voltage high warning : Off
Module voltage low warning : Off
Laser rx power high alarm : Off
Laser rx power low alarm : Off
Laser rx power high warning : Off
Laser rx power low warning : Off
Laser bias current high alarm threshold : 10.000 mA
Laser bias current low alarm threshold : 1.000 mA
Laser bias current high warning threshold : 9.000 mA
Laser bias current low warning threshold : 2.000 mA
Laser output power high alarm threshold : 0.8000 mW / -0.97 dBm
Laser output power low alarm threshold : 0.1000 mW / -10.00 dBm
Laser output power high warning threshold : 0.6000 mW / -2.22 dBm
Laser output power low warning threshold : 0.2000 mW / -6.99 dBm
Module temperature high alarm threshold : 90.00 degrees C / 194.00 degrees F
Module temperature low alarm threshold : -40.00 degrees C / -40.00 degrees F
Module temperature high warning threshold : 85.00 degrees C / 185.00 degrees F
Module temperature low warning threshold : -40.00 degrees C / -40.00 degrees F
Module voltage high alarm threshold : 4.0000 V
Module voltage low alarm threshold : 0.0000 V
Module voltage high warning threshold : 3.6450 V
Module voltage low warning threshold : 2.9550 V
Laser rx power high alarm threshold : 1.6000 mW / 2.04 dBm
Laser rx power low alarm threshold : 0.0100 mW / -20.00 dBm
Laser rx power high warning threshold : 1.0000 mW / 0.00 dBm
Laser rx power low warning threshold : 0.0200 mW / -16.99 dBm
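The optical power readings above pair each milliwatt value with its dBm equivalent, where dBm = 10 × log10(mW). As a minimal sketch, the shell helper below reproduces that conversion with awk. The name mw_to_dbm is hypothetical and is not part of ethtool or Cumulus Linux; it can help when you compare a reading against a threshold that is quoted in only one unit.

```shell
# Hypothetical helper: convert optical power from milliwatts to dBm.
# dBm = 10 * log10(mW); awk only provides the natural log, so divide by log(10).
mw_to_dbm() {
  awk -v mw="$1" 'BEGIN { printf "%.2f\n", 10 * log(mw) / log(10) }'
}

# Laser output power from the sample output above (0.3186 mW).
mw_to_dbm 0.3186
```

For 0.3186 mW the helper prints -4.97, which agrees with the dBm value that ethtool reports for the laser output power.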
Network Troubleshooting
Cumulus Linux includes command line and analytical tools to help you troubleshoot issues with your network.
Use ping
Use ping to check that a host is reachable. ping also calculates the time it takes for packets to travel round trip. See man ping for details.
To test the connection to an IPv4 host:
cumulus@switch:~$ ping 192.0.2.45
PING 192.0.2.45 (192.0.2.45) 56(84) bytes of data.
64 bytes from 192.0.2.45: icmp_req=1 ttl=53 time=40.4 ms
64 bytes from 192.0.2.45: icmp_req=2 ttl=53 time=39.6 ms
...
To test the connection to an IPv6 host:
cumulus@switch:~$ ping6 -I swp1 2001::db8:ff:fe00:2
PING 2001::db8:ff:fe00:2(2001::db8:ff:fe00:2) from 2001::db8:ff:fe00:1 swp1: 56 data bytes
64 bytes from 2001::db8:ff:fe00:2: icmp_seq=1 ttl=64 time=1.43 ms
64 bytes from 2001::db8:ff:fe00:2: icmp_seq=2 ttl=64 time=0.927 ms
When troubleshooting intermittent connectivity issues, it is helpful to send continuous pings to a host.
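One way to spot intermittent loss is to save a long ping run to a file, then look for gaps in the sequence numbers. The sketch below assumes that each reply line contains icmp_seq=N (or icmp_req=N, as in the IPv4 example above). The name find_seq_gaps is a hypothetical helper, not a Cumulus Linux command.

```shell
# Hypothetical helper: read ping output on stdin and report gaps in the
# sequence numbers. Replies lost before the first or after the last received
# reply are not detected, and reordered replies are not handled.
find_seq_gaps() {
  awk '
    match($0, /icmp_(seq|req)=[0-9]+/) {
      s = substr($0, RSTART, RLENGTH)   # for example "icmp_seq=5"
      sub(/^[^=]*=/, "", s)             # keep only the number
      seq = s + 0
      if (seen && seq > last + 1)
        printf "lost seq %d-%d\n", last + 1, seq - 1
      last = seq
      seen = 1
    }'
}

# Usage on the switch (the log path is an arbitrary choice):
#   ping -c 600 192.0.2.45 > /tmp/ping.log
#   find_seq_gaps < /tmp/ping.log

# Self-contained demonstration with three sample reply lines:
printf '%s\n' \
  '64 bytes from 192.0.2.45: icmp_seq=1 ttl=53 time=40.4 ms' \
  '64 bytes from 192.0.2.45: icmp_seq=2 ttl=53 time=39.6 ms' \
  '64 bytes from 192.0.2.45: icmp_seq=5 ttl=53 time=41.0 ms' | find_seq_gaps
```

The demonstration reports that sequence numbers 3 through 4 are missing.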
Print Route Trace with traceroute
Use traceroute to track the route that packets take from an IP network on their way to a given host. See man traceroute for details.
To track the route to an IPv4 host:
cumulus@switch:~$ traceroute www.google.com
traceroute to www.google.com (74.125.239.49), 30 hops max, 60 byte packets
1 cumulusnetworks.com (192.168.1.1) 0.614 ms 0.863 ms 0.932 ms
...
5 core2-1-1-0.pao.net.google.com (198.32.176.31) 22.347 ms 22.584 ms 24.328 ms
6 216.239.49.250 (216.239.49.250) 24.371 ms 25.757 ms 25.987 ms
7 72.14.232.35 (72.14.232.35) 27.505 ms 22.925 ms 22.323 ms
8 nuq04s19-in-f17.1e100.net (74.125.239.49) 23.544 ms 21.851 ms 22.604 ms
Run Commands in a Non-default VRF
You can use ip vrf exec to run commands in a non-default VRF context, which is useful for network utilities like ping, traceroute, and nslookup.
The full syntax is ip vrf exec <vrf-name> <command> <arguments>. For example:
cumulus@switch:~$ sudo ip vrf exec Tenant1 nslookup google.com - 8.8.8.8
By default, ping and ping6, and traceroute and traceroute6 all use the default VRF and use a mechanism that checks the VRF context of the current shell, which you can see when you run ip vrf id. If the VRF context of the shell is mgmt, these commands run in the default VRF context.
ping and traceroute have additional arguments that you can use to specify an egress interface or a source address. In the default VRF, the source interface flag (ping -I or traceroute -i) specifies the egress interface for the ping or traceroute operation. However, you can use the source interface flag instead to specify a non-default VRF to use for the command. Doing so causes the routing lookup for the destination address to occur in that VRF.
With ping -I, you can specify the source interface or the source IP address but you cannot use the flag more than once. Either choose an egress interface/VRF or a source IP address. For traceroute, you can use traceroute -s to specify the source IP address.
You gain additional flexibility if you run ip vrf exec in combination with ping/ping6 or traceroute/traceroute6, as the VRF context is outside of the ping and traceroute commands. This allows for the most granular control of ping and traceroute, as you can specify both the VRF and the source interface flag.
For ping, use the following syntax:
ip vrf exec <vrf-name> [ping|ping6] -I [<egress_interface> | <source_ip>] <destination_ip>
For example:
cumulus@switch:~$ sudo ip vrf exec Tenant1 ping -I swp1 8.8.8.8
cumulus@switch:~$ sudo ip vrf exec Tenant1 ping -I 192.0.1.1 8.8.8.8
cumulus@switch:~$ sudo ip vrf exec Tenant1 ping6 -I swp1 2001:4860:4860::8888
cumulus@switch:~$ sudo ip vrf exec Tenant1 ping6 -I 2001:db8::1 2001:4860:4860::8888
For traceroute, use the following syntax:
ip vrf exec <vrf-name> [traceroute|traceroute6] -i <egress_interface> -s <source_ip> <destination_ip>
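As a sketch of how the traceroute syntax above expands, the helper below prints the command for review without running it. The name vrf_traceroute_cmd is hypothetical, and the VRF, interface, and addresses are reused from the ping examples. The helper chooses traceroute6 when the destination contains a colon, which is a simplifying assumption.

```shell
# Hypothetical helper: print (do not run) a traceroute command for a VRF.
# Arguments: <vrf-name> <egress_interface> <source_ip> <destination_ip>
vrf_traceroute_cmd() {
  vrf=$1; iface=$2; src=$3; dst=$4
  case $dst in
    *:*) tool=traceroute6 ;;   # a colon suggests an IPv6 destination
    *)   tool=traceroute ;;
  esac
  printf '%s\n' "sudo ip vrf exec $vrf $tool -i $iface -s $src $dst"
}

vrf_traceroute_cmd Tenant1 swp1 192.0.1.1 8.8.8.8
```

The call prints sudo ip vrf exec Tenant1 traceroute -i swp1 -s 192.0.1.1 8.8.8.8, which you can paste at the prompt after you check the values.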
The VRF context for the ping and traceroute commands moves automatically to the default VRF context; therefore, you must use the source interface flag to specify the management VRF. Typically, there is only a single interface in the management VRF (eth0) and only a single IPv4 address or IPv6 global unicast address assigned to it. You cannot specify both a source interface and a source IP address with ping -I.
Manipulate the System ARP Cache
The arp command manipulates or displays the kernel’s IPv4 network neighbor cache. See man arp for details.
To display the ARP cache:
cumulus@switch:~$ arp -a
? (11.0.2.2) at 00:02:00:00:00:10 [ether] on swp3
? (11.0.3.2) at 00:02:00:00:00:01 [ether] on swp4
? (11.0.0.2) at 44:38:39:00:01:c1 [ether] on swp1
To delete an ARP cache entry:
cumulus@switch:~$ arp -d 11.0.2.2
cumulus@switch:~$ arp -a
? (11.0.2.2) at <incomplete> on swp3
? (11.0.3.2) at 00:02:00:00:00:01 [ether] on swp4
? (11.0.0.2) at 44:38:39:00:01:c1 [ether] on swp1
To add a static ARP cache entry:
cumulus@switch:~$ arp -s 11.0.2.2 00:02:00:00:00:10
cumulus@switch:~$ arp -a
? (11.0.2.2) at 00:02:00:00:00:10 [ether] PERM on swp3
? (11.0.3.2) at 00:02:00:00:00:01 [ether] on swp4
? (11.0.0.2) at 44:38:39:00:01:c1 [ether] on swp1
If you need to flush or remove an ARP entry for a specific interface, you can disable dynamic ARP learning:
cumulus@switch:~$ ip link set arp off dev INTERFACE
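In the deletion example above, arp -a shows the removed entry as <incomplete>. As a minimal sketch, the helper below filters arp -a output for such entries and prints the IP address and interface of each one. The name incomplete_arp is hypothetical.

```shell
# Hypothetical helper: read `arp -a` output on stdin and print
# "<ip> <interface>" for every entry reported as <incomplete>.
incomplete_arp() {
  awk '/<incomplete>/ { gsub(/[()]/, "", $2); print $2, $NF }'
}

# Usage on the switch:
#   arp -a | incomplete_arp

# Self-contained demonstration with the sample output from this section:
printf '%s\n' \
  '? (11.0.2.2) at <incomplete> on swp3' \
  '? (11.0.3.2) at 00:02:00:00:00:01 [ether] on swp4' \
  '? (11.0.0.2) at 44:38:39:00:01:c1 [ether] on swp1' | incomplete_arp
```

The demonstration prints 11.0.2.2 swp3.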
Generate Traffic Using mz
mz (or mausezahn) is a fast traffic generator that can generate a large variety of packet types at high speed. See man mz for details.
For example, to send two sets of packets to TCP ports 23 and 24, with source IP address 11.0.0.1 and destination IP address 11.0.0.2:
cumulus@switch:~$ sudo mz swp1 -A 11.0.0.1 -B 11.0.0.2 -c 2 -v -t tcp "dp=23-24"
Mausezahn 0.40 - (C) 2007-2010 by Herbert Haas - https://packages.debian.org/unstable/mz
Use at your own risk and responsibility!
-- Verbose mode --
This system supports a high resolution clock.
The clock resolution is 4000250 nanoseconds.
Mausezahn will send 4 frames...
IP: ver=4, len=40, tos=0, id=0, frag=0, ttl=255, proto=6, sum=0, SA=11.0.0.1, DA=11.0.0.2,
payload=[see next layer]
TCP: sp=0, dp=23, S=42, A=42, flags=0, win=10000, len=20, sum=0,
payload=
IP: ver=4, len=40, tos=0, id=0, frag=0, ttl=255, proto=6, sum=0, SA=11.0.0.1, DA=11.0.0.2,
payload=[see next layer]
TCP: sp=0, dp=24, S=42, A=42, flags=0, win=10000, len=20, sum=0,
payload=
IP: ver=4, len=40, tos=0, id=0, frag=0, ttl=255, proto=6, sum=0, SA=11.0.0.1, DA=11.0.0.2,
payload=[see next layer]
TCP: sp=0, dp=23, S=42, A=42, flags=0, win=10000, len=20, sum=0,
payload=
IP: ver=4, len=40, tos=0, id=0, frag=0, ttl=255, proto=6, sum=0, SA=11.0.0.1, DA=11.0.0.2,
payload=[see next layer]
TCP: sp=0, dp=24, S=42, A=42, flags=0, win=10000, len=20, sum=0,
payload=
Create Counter ACL Rules
In Linux, all ACL rules are always counted. To create an ACL rule for counting purposes only, set the rule action to ACCEPT. For details on how to use cl-acltool to set up iptables, ip6tables, and ebtables-based ACLs, see Netfilter ACLs.
Always place your rule files under /etc/cumulus/acl/policy.d/.
To count all packets going to a Web server:
cumulus@switch:~$ cat sample_count.rules
[iptables]
-A FORWARD -p tcp --dport 80 -j ACCEPT
cumulus@switch:~$ sudo cl-acltool -i -p sample_count.rules
Using user provided rule file sample_count.rules
Reading rule file sample_count.rules ...
Processing rules in file sample_count.rules ...
Installing acl policy... done.
cumulus@switch:~$ sudo iptables -L -v
Chain INPUT (policy ACCEPT 16 packets, 2224 bytes)
pkts bytes target prot opt in out source destination
Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
2 156 ACCEPT tcp -- any any anywhere anywhere tcp dpt:http
Chain OUTPUT (policy ACCEPT 44 packets, 8624 bytes)
pkts bytes target prot opt in out source destination
The -p option clears out all other rules. The -i option reinstalls all the rules.
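To read a counter from a script instead of scanning the table by eye, you can filter the iptables -L -v output. The sketch below assumes the column layout shown above (pkts, bytes, target, and so on). The name rule_counters is hypothetical. iptables abbreviates large counters (for example, 7034K) unless you add the -x option.

```shell
# Hypothetical helper: read `iptables -L -v` output on stdin and print the
# packet and byte counters of ACCEPT rules whose line contains the given text.
rule_counters() {
  awk -v pat="$1" '$3 == "ACCEPT" && index($0, pat) { print "pkts=" $1, "bytes=" $2 }'
}

# Usage on the switch:
#   sudo iptables -L FORWARD -v -x | rule_counters dpt:http

# Self-contained demonstration with the sample output from this section:
printf '%s\n' \
  'Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)' \
  ' pkts bytes target prot opt in out source destination' \
  ' 2 156 ACCEPT tcp -- any any anywhere anywhere tcp dpt:http' | rule_counters dpt:http
```

The demonstration prints pkts=2 bytes=156 for the web server rule.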
Monitor Control Plane Traffic with tcpdump
You can use tcpdump to monitor control plane traffic (traffic sent to and coming from the switch CPUs). tcpdump does not monitor data plane traffic; use cl-acltool instead (see above).
The following example incorporates tcpdump options:
-i bond0 captures packets from bond0 to the CPU and from the CPU to bond0
host 169.254.0.2 filters for this IP address
-c 10 captures 10 packets then stops
cumulus@switch:~$ sudo tcpdump -i bond0 host 169.254.0.2 -c 10
tcpdump: WARNING: bond0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type EN10MB (Ethernet), capture size 65535 bytes
16:24:42.532473 IP 169.254.0.2 > 169.254.0.1: ICMP echo request, id 27785, seq 6, length 64
16:24:42.532534 IP 169.254.0.1 > 169.254.0.2: ICMP echo reply, id 27785, seq 6, length 64
16:24:42.804155 IP 169.254.0.2.40210 > 169.254.0.1.5342: Flags [.], seq 266275591:266277039, ack 3813627681, win 58, options [nop,nop,TS val 590400681 ecr 530346691], length 1448
16:24:42.804228 IP 169.254.0.1.5342 > 169.254.0.2.40210: Flags [.], ack 1448, win 166, options [nop,nop,TS val 530348721 ecr 590400681], length 0
16:24:42.804267 IP 169.254.0.2.40210 > 169.254.0.1.5342: Flags [P.], seq 1448:1836, ack 1, win 58, options [nop,nop,TS val 590400681 ecr 530346691], length 388
16:24:42.804293 IP 169.254.0.1.5342 > 169.254.0.2.40210: Flags [.], ack 1836, win 165, options [nop,nop,TS val 530348721 ecr 590400681], length 0
16:24:43.532389 IP 169.254.0.2 > 169.254.0.1: ICMP echo request, id 27785, seq 7, length 64
16:24:43.532447 IP 169.254.0.1 > 169.254.0.2: ICMP echo reply, id 27785, seq 7, length 64
16:24:43.838652 IP 169.254.0.1.59951 > 169.254.0.2.5342: Flags [.], seq 2555144343:2555145791, ack 2067274882, win 58, options [nop,nop,TS val 530349755 ecr 590399688], length 1448
16:24:43.838692 IP 169.254.0.1.59951 > 169.254.0.2.5342: Flags [P.], seq 1448:1838, ack 1, win 58, options [nop,nop,TS val 530349755 ecr 590399688], length 390
10 packets captured
12 packets received by filter
0 packets dropped by kernel
What Just Happened (WJH)
What Just Happened (WJH) provides real-time visibility into network problems and has two components:
The WJH agent enables you to stream detailed and contextual telemetry for off-switch analysis with tools such as NVIDIA NetQ.
The WJH service (what-just-happened) enables you to diagnose network problems by looking at dropped packets. WJH can monitor layer 1, layer 2, layer 3, tunnel, buffer and ACL related issues. Cumulus Linux enables and runs the WJH service by default.
Configure WJH
You can choose which packet drops to monitor by creating channels and setting the packet drop categories (layer 1, layer 2, layer 3, tunnel, buffer, and ACL) for each channel.
NVUE does not provide commands to set the buffer and ACL packet drop categories. You must edit the /etc/what-just-happened/what-just-happened.json file. See the Linux Commands tab.
The following example configures two separate channels:
The forwarding channel monitors layer 2, layer 3, and tunnel packet drops.
The layer-1 channel monitors layer 1 packet drops.
cumulus@switch:~$ nv set system wjh channel forwarding trigger l2
cumulus@switch:~$ nv set system wjh channel forwarding trigger l3
cumulus@switch:~$ nv set system wjh channel forwarding trigger tunnel
cumulus@switch:~$ nv set system wjh channel layer-1 trigger l1
cumulus@switch:~$ nv config apply
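You can stop monitoring specific packet drops by unsetting a category in the channel list. The following command example stops monitoring layer 2 packet drops that are in the forwarding channel:
cumulus@switch:~$ nv unset system wjh channel forwarding trigger l2
cumulus@switch:~$ nv config apply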
You can run the following commands to show information about dropped packets and diagnose problems.
To show information about packet drops for all the channels you configure, run the nv show system wjh packet-buffer command. The command output includes the reason for the drop and the recommended action to take.
You can also show the WJH configuration on the switch:
To show the configuration for a channel, run the nv show system wjh channel <channel> command. For example, nv show system wjh channel forwarding.
To show the configuration for packet drop categories in a channel, run the nv show system wjh channel <channel> trigger command. For example, nv show system wjh channel forwarding trigger.
The following example shows information about layer 1 packet drops:
You can run the following commands from the command line.
Command | Description
what-just-happened poll | Shows information about packet drops for all the channels you configure. The output includes the reason for the drop and the recommended action to take. The what-just-happened poll <channel> command shows information for the channel you specify.
what-just-happened poll --aggregate | Shows information about dropped packets aggregated by the reason for the drop. This command also shows the number of times the dropped packet occurs. The what-just-happened poll <channel> --aggregate command shows information for the channel you specify.
what-just-happened poll --export | Saves information about dropped packets to a file in PCAP format. The what-just-happened poll <channel> --export command saves information for the channel you specify.
what-just-happened poll --export --no_metadata | Saves information about dropped packets to a file in PCAP format without metadata. The what-just-happened poll <channel> --export --no_metadata command saves information for the channel you specify.
what-just-happened dump | Displays all diagnostic information on the command line.
Run the what-just-happened -h command to see all the WJH command options.
To show all dropped packets and the reason for the drop, run the NVUE nv show system wjh packet-buffer command or the what-just-happened poll command.
The following example shows packets dropped on swp4 and swp1 because the source MAC address equals the destination MAC address. The Count column shows how many times each drop occurred:
cumulus@switch:~$ what-just-happened poll --aggregate
Sample Window : 2021/06/16 12:57:23.046 - 2021/06/16 14:46:17.701
# sPort VLAN sMAC dMAC EthType Src IP:Port Dst IP:Port IP Proto Count Severity Drop reason - Recommended action
-- ------ ----- ------------------ ------------------ -------- ------------ ------------ --------- ------ --------- -----------------------------------------------
1 swp4 N/A 44:38:39:00:a4:87 44:38:39:00:a4:87 IPv4 0.0.0.0:0 0.0.0.0:0 ip 100 Error Source MAC equals destination MAC - Bad packet was received from peer
2 swp1 N/A 44:38:39:00:a4:80 44:38:39:00:a4:80 IPv4 0.0.0.0:0 0.0.0.0:0 ip 100 Error Source MAC equals destination MAC - Bad packet was received from peer
The following command saves dropped packets to a file in PCAP format:
cumulus@switch:~$ what-just-happened poll --export --no_metadata
PCAP file path : /var/log/mellanox/wjh/wjh_user_2021_06_16_12_03_15.pcap
# Timestamp sPort dPort VLAN sMAC dMAC EthType Src IP:Port Dst IP:Port IP Proto Drop Severity Drop reason - Recommended action
Group
---- ---------------------- ------ ------ ----- ------------------ ------------------ -------- ------------ ------------ --------- ------ --------- -----------------------------------------------
1 21/06/16 12:03:12.728 swp1 N/A N/A 44:38:39:00:a4:84 44:38:39:00:a4:84 IPv4 N/A N/A N/A L2 Error Source MAC equals destination MAC - Bad packet was received from peer
2 21/06/16 12:03:12.728 swp1 N/A N/A 44:38:39:00:a4:84 44:38:39:00:a4:84 IPv4 N/A N/A N/A L2 Error Source MAC equals destination MAC - Bad packet was received from peer
3 21/06/16 12:03:12.745 swp1 N/A N/A 44:38:39:00:a4:84 44:38:39:00:a4:84 IPv4 N/A N/A N/A L2 Error Source MAC equals destination MAC - Bad packet was received from peer
4 21/06/16 12:03:12.745 swp1 N/A N/A 44:38:39:00:a4:84 44:38:39:00:a4:84 IPv4 N/A N/A N/A L2 Error Source MAC equals destination MAC - Bad packet was received from peer
Considerations
Buffer Packet Drop Monitoring
Buffer packet drop monitoring is available on a switch with Spectrum-2 and later.
Buffer packet drop monitoring uses a SPAN destination. If you configure SPAN, ensure that you do not exceed the total number of SPAN destinations allowed for your switch ASIC type; see SPAN and ERSPAN. If you need to remove the SPAN destination that buffer packet drop monitoring uses, delete the buffer monitoring drop category from the /etc/what-just-happened/what-just-happened.json file and reload the what-just-happened service.
Cumulus Linux and Docker
WJH runs in a Docker container. By default, when Docker starts, it creates a bridge called docker0. However, for compatibility reasons, Cumulus Linux disables the docker0 bridge in the /etc/docker/daemon.json file with the attribute "bridge": "none".
WJH and the NVIDIA NetQ Agent
When you enable the NVIDIA NetQ agent on the switch, the WJH service stops and does not run. If you disable the NVIDIA NetQ service and want to use WJH, run the following commands to enable and start the WJH service:
cumulus@switch:~$ nv set system wjh enable on
cumulus@switch:~$ nv config apply
Monitoring System Statistics and Network Traffic with sFlow
sFlow is a monitoring protocol that samples network packets, application operations, and system counters. sFlow collects both interface counters and sampled 5-tuple packet information so that you can monitor your network traffic as well as your switch state and performance metrics. To collect and analyze this data, you need an external server called an sFlow collector.
hsflowd is the daemon that samples and sends sFlow data to configured collectors. By default, Cumulus Linux disables hsflowd and it does not start automatically when the switch boots up.
If you intend to run this service within a VRF, including the management VRF, follow these steps to configure the service.
Configure sFlow
To configure hsflowd to send to the designated collectors, either:
Use DNS service discovery (DNS-SD)
Manually configure the /etc/hsflowd.conf file
Configure sFlow with DNS-SD
You can configure your DNS zone to advertise the collectors and polling information to all interested clients. Add the following content to the zone file on your DNS server:
The above snippet instructs hsflowd to send sFlow data to collector1 on port 6343 and to collector2 on port 6344. hsflowd polls counters every 20 seconds and samples 1 out of every 2048 packets.
The hardware can deliver a maximum of 16K samples per second. You can configure the number of samples per second in the /etc/cumulus/datapath/traffic.conf file:
# Set sflow/sample ingress cpu packet rate and burst in packets/sec
# Values: {0..16384}
#sflow.rate = 16384
#sflow.burst = 16384
You do not need to configure anything else in the /etc/hsflowd.conf file.
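To judge whether a sampling rate fits under the limit of 16384 samples per second described above, you can estimate the sample rate as the packet rate divided by N for 1-in-N sampling. The sketch below is a rough estimate only. The name sflow_samples_per_sec is hypothetical, and treating the packet rate as one aggregate figure for the switch is an assumption.

```shell
# Hypothetical helper: estimate sFlow samples per second and compare the
# result with the 16384 samples/sec ceiling from traffic.conf.
# Arguments: <packets_per_second> <N for 1-in-N sampling>
sflow_samples_per_sec() {
  pps=$1; n=$2; limit=16384
  sps=$((pps / n))   # integer estimate, rounded down
  if [ "$sps" -gt "$limit" ]; then
    echo "$sps samples/sec exceeds the limit of $limit"
  else
    echo "$sps samples/sec is within the limit of $limit"
  fi
}

# 10 million packets per second sampled at 1 in 2048:
sflow_samples_per_sec 10000000 2048
```

At 10 million packets per second and 1-in-2048 sampling, the estimate is 4882 samples per second, which is within the limit.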
Manually Configure /etc/hsflowd.conf
You can set up the collectors and variables on each switch. Edit the /etc/hsflowd.conf file to set up your collectors and sampling rates. For example:
This configuration polls the counters every 20 seconds, samples 1 of every 40000 packets for 40G interfaces, and sends this information to a collector at 192.0.2.100 on port 6343 and to another collector at 192.0.2.200 on port 6344.
Some collectors require each source to transmit on a different port; others listen on only one port. Refer to the documentation for your collector for more information.
To configure the IP address for the sFlow agent, configure one of the following in the /etc/hsflowd.conf file (following the recommendations in the sFlow documentation):
The agent CIDR. For example, agent.cidr = 10.0.0.0/8. The IP address must fall within this range.
The agent interface. For example, if the agent is using eth0, select the IP address for this interface.
To check the agent IP, run this command:
cumulus@switch:~$ grep agentIP /etc/hsflowd.auto
Configure sFlow Visualization Tools
For information on configuring various sFlow visualization tools, read this knowledge base article.
Considerations
Cumulus Linux does not support sFlow egress sampling.
SPAN mirrors all packets that come in from or go out of an interface (the SPAN source), then copies and transmits the packets out of a local port or CPU (the SPAN destination) for monitoring. The SPAN destination port is also referred to as a mirror-to-port (MTP). The original packet is still switched, while a mirrored copy of the packet goes out of the MTP.
ERSPAN sends the mirrored packets to a monitoring node located anywhere across the routed network. The switch finds the outgoing port of the mirrored packets by looking up the destination IP address in its routing table. The switch encapsulates the original layer 2 packet with GRE for IP delivery. The encapsulated packets have the following format:
To configure SPAN to mirror ports on your switch, you create a port mirror session. The session ID is a number between 0 and 7.
You set the following SPAN options:
Source port
Destination port
Direction (ingress or egress)
Run the nv set system port-mirror session <session-id> span <option> command. The NVUE commands save the configuration in the /etc/cumulus/switchd.d/port-mirror.conf file.
To reduce the volume of data, you can truncate the mirrored frames at a specified number of bytes. The size must be between 4 and 4088 bytes and a multiple of 4.
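The two numeric limits above (a session ID between 0 and 7, and a truncate size that is a multiple of 4 between 4 and 4088) are easy to check before you run the NVUE commands. The sketch below is one way to do that. The name check_span_args is hypothetical, and the helper assumes that both arguments are integers.

```shell
# Hypothetical helper: validate a port mirror session ID and truncate size
# against the limits stated in this section. Assumes integer arguments.
check_span_args() {
  session=$1; size=$2
  if [ "$session" -lt 0 ] || [ "$session" -gt 7 ]; then
    echo "session ID must be between 0 and 7"
    return 1
  fi
  if [ "$size" -lt 4 ] || [ "$size" -gt 4088 ] || [ $((size % 4)) -ne 0 ]; then
    echo "truncate size must be a multiple of 4 between 4 and 4088"
    return 1
  fi
  echo "ok"
}

# Session 1 with a 40-byte truncate size:
check_span_args 1 40
```

The helper prints ok and returns zero when both values are valid; otherwise it prints the violated limit and returns a non-zero status, so you can use it in a script before calling nv set.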
Example Commands
To mirror all packets received on swp1, and copy and transmit the packets to swp2 for monitoring:
cumulus@switch:~$ nv set system port-mirror session 1 span direction ingress
cumulus@switch:~$ nv set system port-mirror session 1 span source-port swp1
cumulus@switch:~$ nv set system port-mirror session 1 span destination swp2
cumulus@switch:~$ nv config apply
To mirror all packets that go out of swp1, and copy and transmit the packets to swp2 for monitoring:
cumulus@switch:~$ nv set system port-mirror session 1 span direction egress
cumulus@switch:~$ nv set system port-mirror session 1 span source-port swp1
cumulus@switch:~$ nv set system port-mirror session 1 span destination swp2
cumulus@switch:~$ nv config apply
SPAN sessions that reference an outgoing interface create the mirrored packets according to the ingress interface before the routing decision. For example, the above commands capture traffic that is ultimately destined to leave swp1 but mirror the packets when they arrive on swp2. The mirrored packets carry the original VLAN tag and the source and destination MAC addresses from the time that swp2 originally receives the packet.
To mirror packets from all ports to swp53:
cumulus@switch:~$ nv set system port-mirror session 1 span direction ingress
cumulus@switch:~$ nv set system port-mirror session 1 span source-port swp1-54
cumulus@switch:~$ nv set system port-mirror session 1 span destination swp53
cumulus@switch:~$ nv config apply
To mirror all packets received on bond1, and copy and transmit the packets to swp53 for monitoring:
cumulus@switch:~$ nv set system port-mirror session 1 span direction ingress
cumulus@switch:~$ nv set system port-mirror session 1 span source-port bond1
cumulus@switch:~$ nv set system port-mirror session 1 span destination swp53
cumulus@switch:~$ nv config apply
To truncate the mirrored frames at 40 bytes:
cumulus@switch:~$ nv set system port-mirror session 1 span truncate size 40
cumulus@switch:~$ nv config apply
Delete SPAN Sessions
You can delete all SPAN sessions with the nv unset system port-mirror command. For example:
cumulus@switch:~$ nv unset system port-mirror
cumulus@switch:~$ nv config apply
To delete a specific SPAN session, run the nv unset system port-mirror session <session-id> command. For example:
cumulus@switch:~$ nv unset system port-mirror session 1
cumulus@switch:~$ nv config apply
SPAN sessions that reference an outgoing interface create the mirrored packets according to the ingress interface before the routing decision. For example, the following rule captures traffic that is ultimately destined to leave swp1 but mirrors the packets when they arrive on swp49. The rule transmits packets that reference the original VLAN tag, and source and destination MAC address at the time that swp49 originally receives the packet.
You can configure selective SPAN with ACLs to mirror a subset of traffic according to:
Source or destination IP address
IP protocol
TCP or UDP source or destination port
TCP flags
An ingress port
To match swp1 ingress traffic that has the source IP address 10.10.1.1 and mirror the traffic to swp2 when a match occurs:
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 match ip source-ip 10.10.1.1
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 action span swp2
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
To match OSPF packets coming in on swp1 and mirror the traffic to swp2 when a match occurs:
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 match ip protocol ospf
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 action span swp2
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
To match UDP packets coming in on bond1 and mirror the traffic to swp53 when a match occurs:
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 match ip protocol udp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 action span swp53
cumulus@switch:~$ nv set interface bond1 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
Always place your rule files in the /etc/cumulus/acl/policy.d/ directory.
A cl-acltool rule that uses --out-interface applies to transit traffic only; it does not apply to traffic sourced from the switch.
--out-interface rules cannot target bond interfaces, only the bond members tied to them. For example, to mirror all packets going out of bond1 to swp53, where bond1 members are swp1 and swp2, create the rule -A FORWARD --out-interface swp1,swp2 -j SPAN --dport swp53.
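Because the rule has to list the bond members instead of the bond, a small helper can build the rule text from a member list. The sketch below prints the rule for review and does not install it. The name bond_egress_span_rule is hypothetical.

```shell
# Hypothetical helper: print a SPAN rule for traffic that leaves a bond by
# listing the bond members, because --out-interface cannot match the bond.
# Arguments: <destination_port> <member> [<member> ...]
bond_egress_span_rule() {
  dest=$1
  shift
  members=$(IFS=,; echo "$*")   # join the remaining arguments with commas
  printf '%s\n' "-A FORWARD --out-interface $members -j SPAN --dport $dest"
}

# bond1 has the members swp1 and swp2; mirror to swp53:
bond_egress_span_rule swp53 swp1 swp2
```

The call prints the same rule as the example above. On a switch, one way to obtain the member list is the Linux bonding sysfs file /sys/class/net/bond1/bonding/slaves, which lists the members separated by spaces.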
Create a rules file in the /etc/cumulus/acl/policy.d/ directory. The following example rules mirror ICMP packets that ingress swp1 to swp54 and UDP packets that egress swp4 to swp53:
Do not run the cl-acltool -i command with the -P option. The -P option removes all existing control plane rules or other installed rules and only installs the rules defined in the specified file.
Verify that you installed the SPAN rules:
cumulus@switch:~$ sudo cl-acltool -L all | grep SPAN
38025 7034K SPAN icmp -- swp1 any anywhere anywhere dport:swp54
50832 55M SPAN udp -- any swp4 anywhere anywhere dport:swp53
Example Rules
To mirror forwarded packets from all ports matching source IP address 20.0.1.0 and destination IP address 20.0.1.2 to port swp1:
To mirror all forwarded TCP packets with only SYN set:
-A FORWARD --in-interface swp+ -p tcp --tcp-flags ALL SYN -j SPAN --dport swp1
To mirror all forwarded TCP packets with only FIN set:
-A FORWARD --in-interface swp+ -p tcp --tcp-flags ALL FIN -j SPAN --dport swp1
CPU port as the SPAN Destination
You can set the CPU port as a SPAN destination interface to mirror data plane traffic to the CPU. The SPAN traffic goes to a separate network interface called mirror, where you can analyze it with tcpdump. This feature is useful if you do not have any free external ports on the switch for monitoring. SPAN traffic does not appear on switch ports.
Cumulus Linux controls how much traffic reaches the CPU so that mirrored traffic does not overwhelm the CPU.
You configure the CPU port as the SPAN destination with ACLs.
To monitor traffic mirrored to the CPU, run the tcpdump -i mirror command.
Cumulus Linux does not support egress mirroring for control plane generated traffic to the CPU port.
When you set the CPU port as a SPAN destination interface, Cumulus Linux mirrors packets that match the rule on both ingress and egress only once to the destination interface.
To match swp1 ingress traffic that has the source IP address 10.10.1.1 and mirror the traffic to the CPU when a match occurs:
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 action span cpu
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 match ip source-ip 10.10.1.1
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
To match swp1 egress traffic that has the source IP address 10.10.1.1 and mirror the traffic to the CPU when a match occurs:
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 action span cpu
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 match ip source-ip 10.10.1.1
cumulus@switch:~$ nv set acl EXAMPLE1 type ipv4
cumulus@switch:~$ nv set interface swp1 acl EXAMPLE1 outbound
cumulus@switch:~$ nv config apply
Create a file in the /etc/cumulus/acl/policy.d/ directory and add rules.
To match swp1 ingress traffic that has the source IP address 10.10.1.1 and mirror the traffic to the CPU when a match occurs:
cumulus@switch:~$ sudo nano /etc/cumulus/acl/policy.d/span-cpu.rules
[iptables]
-A FORWARD -i swp1 -s 10.10.1.1 -j SPAN --dport cpu
To match swp1 egress traffic that has the source IP address 10.10.1.1 and mirror the traffic to the CPU when a match occurs:
-A FORWARD -o swp1 -s 10.10.1.1 -j SPAN --dport cpu
Install the rule:
cumulus@switch:~$ sudo cl-acltool -i
Do not run the cl-acltool -i command with the -P option. The -P option removes all existing control plane rules or other installed rules and only installs the rules defined in the specified file.
ERSPAN
To configure ERSPAN to mirror ports on your switch, you create a port mirror session. The session ID is a number between 0 and 7.
You can set the following ERSPAN options:
Source port
Direction (ingress or egress)
Source IP address for ERSPAN encapsulation
Destination IP address for ERSPAN encapsulation
Run the nv set system port-mirror session <session-id> erspan <option> command. The NVUE commands save the configuration in the /etc/cumulus/switchd.d/port-mirror.conf file.
To reduce the volume of data, you can truncate the mirrored frames at a specified number of bytes. The size must be between 4 and 4088 bytes and a multiple of 4.
Example Commands
The following examples configure ERSPAN encapsulation from source IP address 10.10.10.1 to destination IP address 10.10.10.234.
To mirror all packets that arrive on swp1:
cumulus@switch:~$ nv set system port-mirror session 1 erspan direction ingress
cumulus@switch:~$ nv set system port-mirror session 1 erspan source-port swp1
cumulus@switch:~$ nv set system port-mirror session 1 erspan destination source-ip 10.10.10.1
cumulus@switch:~$ nv set system port-mirror session 1 erspan destination dest-ip 10.10.10.234
cumulus@switch:~$ nv config apply
To mirror all packets that go out of swp1:
cumulus@switch:~$ nv set system port-mirror session 1 erspan direction egress
cumulus@switch:~$ nv set system port-mirror session 1 erspan source-port swp1
cumulus@switch:~$ nv set system port-mirror session 1 erspan destination source-ip 10.10.10.1
cumulus@switch:~$ nv set system port-mirror session 1 erspan destination dest-ip 10.10.10.234
cumulus@switch:~$ nv config apply
Delete ERSPAN Sessions
You can delete all ERSPAN sessions with the nv unset system port-mirror command. For example:
cumulus@switch:~$ nv unset system port-mirror
cumulus@switch:~$ nv config apply
To delete a specific ERSPAN session, run the nv unset system port-mirror session <session-id> command. For example:
cumulus@switch:~$ nv unset system port-mirror session 1
cumulus@switch:~$ nv config apply
If you use Wireshark to review the ERSPAN output, you might see the Wireshark error message Unknown version, please report or test to use fake ERSPAN preference and the trace might be unreadable. To resolve this issue, go to Protocols \ ERSPAN from the Wireshark General preferences and check the Force to decode fake ERSPAN frame option.
To set up a capture filter on the destination switch that filters for a specific IP protocol, use ip.proto == 47 to filter for GRE-encapsulated (IP protocol 47) traffic.
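Because ERSPAN traffic travels inside GRE, the Protocol field of the outer IPv4 header is 47. The following Python sketch shows the same test on raw bytes. It assumes the buffer starts at the IPv4 header (no Ethernet header), and the function name is_gre_packet is hypothetical.

```python
GRE_PROTOCOL = 47  # IP protocol number for GRE, which carries ERSPAN


def is_gre_packet(ipv4_packet: bytes) -> bool:
    """Return True if the IPv4 Protocol field (byte offset 9) is 47."""
    if len(ipv4_packet) < 20:        # shorter than a minimal IPv4 header
        return False
    if ipv4_packet[0] >> 4 != 4:     # the version nibble must be 4
        return False
    return ipv4_packet[9] == GRE_PROTOCOL


if __name__ == "__main__":
    # A minimal 20-byte IPv4 header from 10.10.10.1 to 10.10.10.234, protocol 47.
    header = bytes([0x45, 0, 0, 20, 0, 0, 0, 0, 64, 47, 0, 0,
                    10, 10, 10, 1, 10, 10, 10, 234])
    print(is_gre_packet(header))
```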
Selective ERSPAN with ACLs
You can configure selective ERSPAN with ACLs to mirror a subset of traffic according to:
Source or destination IP address
IP protocol
TCP or UDP source or destination port
TCP flags
An ingress port
The following commands mirror inbound ICMP packets from all swp interfaces (swp1 through swp54 in this example). The source IP address for ERSPAN encapsulation is 10.10.10.1 and the destination IP address for ERSPAN encapsulation is 10.10.10.234.
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 type ipv4
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 match ip protocol icmp
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 action erspan source-ip 10.10.10.1
cumulus@switch:~$ nv set acl EXAMPLE1 rule 1 action erspan dest-ip 10.10.10.234
cumulus@switch:~$ nv set interface swp1-54 acl EXAMPLE1 inbound
cumulus@switch:~$ nv config apply
Create a rules file in /etc/cumulus/acl/policy.d/. The following rule configures ERSPAN for all ICMP packets that ingress swp1. The source IP address for ERSPAN encapsulation is 10.10.10.1 and the destination IP address for ERSPAN encapsulation is 10.10.10.234.
src-ip can be any IP address, even if it does not exist in the routing table.
dst-ip must be an IP address reachable through the routing table and front-panel port (not the management port) or SVI. Use ping or ip route get to verify that the destination IP address is reachable.
Install the rules:
cumulus@switch:~$ sudo cl-acltool -i
Do not run the cl-acltool -i command with -P option. The -P option removes all existing control plane rules or other installed rules and only installs the rules defined in the specified file.
In the following example rules, the source IP address for ERSPAN encapsulation is 10.10.10.1 and the destination IP address for ERSPAN encapsulation is 10.10.10.234.
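To mirror forwarded packets from all ports matching the source IP address 20.0.0.2 and the destination IP address 20.0.1.2:
-A FORWARD --in-interface swp+ -s 20.0.0.2 -d 20.0.1.2 -j ERSPAN --src-ip 10.10.10.1 --dst-ip 10.10.10.234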
To mirror all forwarded TCP packets with only SYN set:
-A FORWARD --in-interface swp+ -p tcp --tcp-flags ALL SYN -j ERSPAN --src-ip 10.10.10.1 --dst-ip 10.10.10.234
To mirror all forwarded TCP packets with only FIN set:
-A FORWARD --in-interface swp+ -p tcp --tcp-flags ALL FIN -j ERSPAN --src-ip 10.10.10.1 --dst-ip 10.10.10.234
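The two rules above rely on the --tcp-flags option taking a mask and a comparison list. The following Python sketch illustrates why ALL SYN matches packets with only SYN set; it assumes the standard iptables meaning of the option (examine the flags in the mask, require exactly the listed ones to be set). The names FLAGS, ALL, and matches_tcp_flags are hypothetical.

```python
# TCP flag bits as they appear in the TCP header.
FLAGS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}
ALL = 0x3F  # every flag listed above


def matches_tcp_flags(packet_flags: int, mask: int, comp: int) -> bool:
    """Within the masked bits, exactly the comp bits must be set."""
    return (packet_flags & mask) == comp


if __name__ == "__main__":
    syn_only = FLAGS["SYN"]
    syn_ack = FLAGS["SYN"] | FLAGS["ACK"]
    print(matches_tcp_flags(syn_only, ALL, FLAGS["SYN"]))  # SYN alone
    print(matches_tcp_flags(syn_ack, ALL, FLAGS["SYN"]))   # SYN plus ACK
```

With a mask of ALL, a SYN-ACK packet does not match the SYN rule because the ACK bit is inside the mask but not in the comparison list.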
Show SPAN and ERSPAN Configuration
To show SPAN and ERSPAN configuration for a specific session, run the NVUE nv show system port-mirror session <session-id> command. To show SPAN and ERSPAN configuration for all sessions, run the NVUE nv show system port-mirror command.
cumulus@switch:~$ nv show system port-mirror session 1
operational applied pending
--------------- ----------- ------- -------
erspan
enable off
span
enable on
direction ingress
[destination]
[source-port] swp1
truncate
enable off
You can also run the sudo cl-acltool -L all | grep SPAN or sudo cl-acltool -L all | grep ERSPAN command.
cumulus@switch:~$ sudo cl-acltool -L all | grep SPAN
0 0 SPAN all -- any swp1 10.10.10.1 anywhere /* rule_id:1,acl_name:EXAMPLE1,dir:outbound,interface_id:swp1 */ dport:cpu
Limitations
On a switch with the Spectrum-2 ASIC or later, Cumulus Linux supports four SPAN destinations in atomic mode or eight SPAN destinations in non-atomic mode. On a switch with the Spectrum 1 ASIC, Cumulus Linux supports only a single SPAN destination in atomic mode or three SPAN destinations in non-atomic mode.
WJH buffer drop monitoring uses a SPAN destination; if you configure What Just Happened (WJH), ensure that you do not exceed the total number of SPAN destinations allowed for your switch ASIC type.
Multiple SPAN sources can point to the same SPAN destination, but a SPAN source cannot specify two SPAN destinations.
Cumulus Linux does not support IPv6 ERSPAN destinations.
You cannot use eth0 as a destination.
You cannot mirror packets that egress a bond interface (such as bond1); you can only mirror packets that egress bond members (such as swp1, swp2 and so on).
Mirrored traffic is not guaranteed. A congested MTP results in discarded mirrored packets.
An oversubscribed SPAN or ERSPAN destination interface might result in data plane buffer depletion and buffer drops. Exercise caution when enabling SPAN and ERSPAN if the aggregate speed of all source ports exceeds the speed of the destination port.
ERSPAN does not cause the kernel to send ARP requests to resolve the next hop for the ERSPAN destination. If an ARP entry for the destination or next hop does not already exist in the kernel, you need to manually resolve this before sending mirrored traffic (use ping or arping).
Mirroring to the same interface that you are monitoring causes a recursive flood of traffic and might impact traffic on other interfaces.
Cumulus VX does not support ACL-based SPAN, ERSPAN, or port mirroring. To capture packets in Cumulus VX, use the tcpdump command line network traffic analyzer.
When you configure ERSPAN sessions with the NVUE nv set system port-mirror commands, the destination IP address must be reachable from the source IP address through the default VRF.
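The SPAN destination limits at the start of this list can be expressed as a small lookup. The following Python sketch is illustrative only: the dictionary keys and the function name are hypothetical, and it assumes that WJH buffer drop monitoring consumes exactly one SPAN destination.

```python
# Limits from the list above: SPAN destinations per ASIC generation and mode.
SPAN_DESTINATION_LIMITS = {
    "spectrum-1": {"atomic": 1, "non-atomic": 3},
    "spectrum-2-or-later": {"atomic": 4, "non-atomic": 8},
}


def span_destinations_available(asic: str, mode: str, configured: int,
                                wjh_buffer_drops: bool = False) -> int:
    """Return how many more SPAN destinations fit; negative means over the limit."""
    limit = SPAN_DESTINATION_LIMITS[asic][mode]
    # Assumption: WJH buffer drop monitoring uses one SPAN destination.
    used = configured + (1 if wjh_buffer_drops else 0)
    return limit - used


if __name__ == "__main__":
    print(span_destinations_available("spectrum-1", "atomic", 1, wjh_buffer_drops=True))
```

A negative result means the planned configuration exceeds the number of SPAN destinations for that ASIC and mode.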
Simple Network Management Protocol - SNMP
SNMP is an IETF standards-based network management architecture and protocol. Cumulus Linux uses the open source Net-SNMP agent snmpd, which provides support for most of the common industry-wide MIBs, including interface counters, and TCP and UDP IP stack data. The SNMP version in Cumulus Linux adds custom MIBs and pass-through and pass-persist scripts.
SNMP Components
The main components of SNMP in Cumulus Linux include:
The SNMP network management system (NMS)
SNMP agents
The MIBs (management information bases)
SNMP Network Management System
An SNMP network management system (NMS) is a system configured to poll SNMP agents (such as Cumulus Linux switches or routers), which respond with data. A variety of command line tools exist to poll agents, such as snmpget, snmpgetnext, snmpwalk, snmpbulkget, and snmpbulkwalk. SNMP agents can also send unsolicited traps and inform messages to the NMS based on predefined criteria, such as link changes.
SNMP Agent
The SNMP agent (snmpd) running on a Cumulus Linux switch gathers information about the local system and stores the data in a MIB. Parts of the MIB tree are available and provided to incoming requests originating from an NMS host that has authenticated with the correct credentials. You can configure the Cumulus Linux switch with usernames and credentials to provide authenticated and encrypted responses to NMS requests. The snmpd agent can also proxy requests and act as a master agent to sub-agents running on other daemons, such as FRR or LLDP.
Management Information Base (MIB)
The MIB is a database for the snmpd service that runs on the agent. MIBs adhere to IETF standards but are flexible enough to allow vendor-specific additions. Cumulus Linux includes custom enterprise MIB tables in a set of text files on the switch; the files are in /usr/share/snmp/mibs/ and their names all start with Cumulus; for example, Cumulus-Counters-MIB.txt.
The MIB is a top-down hierarchical tree. Each branch that forks off has both an identifying number (starting with 1) and an identifying string that is unique for that level of the hierarchy. You can use the strings and numbers interchangeably. The parent IDs (numbers or strings) combine, starting with the most general to form an address for the MIB object. A dot in this notation represents each junction in the hierarchy so that the address is a series of ID strings or numbers separated by dots. This entire address is an object identifier (OID).
You can use various online and command line tools to translate between numbers and strings and to also provide definitions for the various MIB objects. For example, you can view the sysLocation object (in SNMPv2-MIB.txt) in the system table as either a series of numbers 1.3.6.1.2.1.1.6 or as the string iso.org.dod.internet.mgmt.mib-2.system.sysLocation. You view the definition with the snmptranslate command, which is part of the snmp Debian package in Cumulus Linux.
cumulus@switch:~$ snmptranslate -Td -On SNMPv2-MIB::sysLocation
.1.3.6.1.2.1.1.6
sysLocation OBJECT-TYPE
-- FROM SNMPv2-MIB
-- TEXTUAL CONVENTION DisplayString
SYNTAX OCTET STRING (0..255)
DISPLAY-HINT "255a"
MAX-ACCESS read-write
STATUS current
DESCRIPTION "The physical location of this node (e.g., 'telephone
closet, 3rd floor'). If the location is unknown, the
value is the zero-length string."
::= { iso(1) org(3) dod(6) internet(1) mgmt(2) mib-2(1) system(1) 6 }
In the last line above, the section 1.3.6.1 or iso.org.dod.internet is the OID that defines internet resources. The 2 or mgmt that follows is for a management subcategory. The 1 or mib-2 under that defines the MIB-2 specification. The 1 or system is the parent for child objects sysDescr, sysObjectID, sysUpTime, sysContact, sysName, sysLocation, sysServices, and so on, as you see in the tree output from the second snmptranslate command below, where sysLocation is 6.
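The relationship between the numeric and string forms of an OID can be sketched in a few lines of Python. The example only covers the path to sysLocation shown in the snmptranslate output above; the names PATH, numeric_oid, and named_oid are hypothetical and this is not a replacement for snmptranslate.

```python
# (name, number) pairs for each level of the tree down to sysLocation,
# taken from the snmptranslate output above.
PATH = [("iso", 1), ("org", 3), ("dod", 6), ("internet", 1),
        ("mgmt", 2), ("mib-2", 1), ("system", 1), ("sysLocation", 6)]


def numeric_oid(path):
    """Join the identifying numbers with dots."""
    return ".".join(str(number) for _, number in path)


def named_oid(path):
    """Join the identifying strings with dots."""
    return ".".join(name for name, _ in path)


if __name__ == "__main__":
    print(numeric_oid(PATH))
    print(named_oid(PATH))
```

Both functions walk the same list, which mirrors how each junction in the hierarchy has one number and one string that you can use interchangeably.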
The most basic SNMP configuration requires you to:
Enable and start the snmpd service.
Specify one or more IP addresses on which the SNMP agent listens.
Specify either a username (for SNMPv3) or a read-only community string (a password, for SNMPv1 or SNMPv2c).
By default, the SNMP configuration has a listening address of localhost (127.0.0.1), which allows the agent (the snmpd service) to respond to SNMP requests originating on the switch itself. This is a secure method that allows checking the SNMP configuration without exposing the switch to outside attacks. For an external SNMP NMS to poll a Cumulus Linux switch, you must configure the snmpd service running on the switch to listen to one or more IP addresses on interfaces that have a link state UP.
Use the SNMPv3 username instead of the read-only community name. The SNMPv3 username does not expose the user credentials and can encrypt packet contents. However, SNMPv1 and SNMPv2c environments require read-only community passwords so that the snmpd daemon can respond to requests. The read-only community string enables you to poll various MIB objects on the device.
Basic Configuration
Before you can use SNMP, you need to enable and start the snmpd service, and configure a listening address.
cumulus@switch:~$ nv set service snmp-server enable on
cumulus@switch:~$ nv set service snmp-server listening-address localhost
cumulus@switch:~$ nv config apply
If you intend to run this service within a VRF, including the management VRF, follow these steps for configuring the service.
You do not need to run SNMP in the management VRF if you just want to allow SNMP communication through the management VRF interfaces; see SNMP and VRFs.
To enable the snmpd service to restart automatically after failure, create a file called /etc/systemd/system/snmpd.service.d/restart.conf and add the following lines:
[Service]
Restart=always
RestartSec=60
Edit the /etc/snmp/snmpd.conf file and add the IP address, protocol and port for snmpd to listen for incoming requests. You can use multiple lines to define multiple listening addresses or use a comma-separated list on a single line.
The listening address is localhost by default so that the SNMP agent only responds to requests originating on the switch itself in the default VRF. To configure the switch to respond to requests sent to localhost in a mgmt VRF shell, see SNMP and VRFs. You can also configure listening only on the IPv6 localhost address. When using IPv6 addresses or localhost, you can use a readonly-community-v6 for SNMPv1 and SNMPv2c requests. For SNMPv3 requests, you can use the username command to restrict access. See Configure the SNMPv3 Username below.
The IP address must exist on an interface that has link UP on the switch where you use snmpd. By default, the IP address is udp:127.0.0.1:161, so snmpd only responds to requests (such as snmpwalk, snmpget, snmpgetnext) that originate from the switch. A wildcard setting of udp:161,udp6:161 forces snmpd to listen on all IPv4 and IPv6 interfaces for incoming SNMP requests.
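The listening address strings above follow a transport, optional address, and port pattern. The following Python sketch splits such a string into its parts. It only handles the forms shown in this section (for example, udp:127.0.0.1:161 and udp:161,udp6:161); the Listener type and the function name are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Listener(NamedTuple):
    transport: str           # for example "udp" or "udp6"
    address: Optional[str]   # None means all addresses for that transport
    port: int


def parse_listening_addresses(value: str) -> List[Listener]:
    """Split a comma-separated listening address list into Listener tuples."""
    listeners = []
    for item in value.split(","):
        transport, _, rest = item.strip().partition(":")
        # The port is whatever follows the last colon; anything before it is the address.
        address, separator, port = rest.rpartition(":")
        listeners.append(Listener(transport, address if separator else None, int(port)))
    return listeners


if __name__ == "__main__":
    print(parse_listening_addresses("udp:127.0.0.1:161"))
    print(parse_listening_addresses("udp:161,udp6:161"))
```

An address of None corresponds to the wildcard setting, where snmpd listens on every interface for that transport.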
You can configure multiple IP addresses and bind to a particular IP address within a particular VRF table.
To configure the listening IP addresses:
cumulus@switch:~$ nv set service snmp-server listening-address localhost
cumulus@switch:~$ nv set service snmp-server listening-address localhost-v6
cumulus@switch:~$ nv config apply
To configure the snmpd daemon to listen on all interfaces for IPv4 or IPv6 SNMP requests on UDP port 161, run the following commands, which remove all other individually configured IP addresses:
cumulus@switch:~$ nv set service snmp-server listening-address all
cumulus@switch:~$ nv set service snmp-server listening-address all-v6
cumulus@switch:~$ nv config apply
To configure snmpd to listen to a specific IPv4 or IPv6 address:
cumulus@switch:~$ nv set service snmp-server listening-address 192.168.200.11
cumulus@switch:~$ nv config apply
To configure snmpd to listen to multiple addresses for incoming SNMP queries, separate the addresses with a space:
cumulus@switch:~$ nv set service snmp-server listening-address 192.168.200.11 192.168.200.21
cumulus@switch:~$ nv config apply
Edit the /etc/snmp/snmpd.conf file and add the IP address, protocol and port for snmpd to listen for incoming requests. You can use multiple lines to define multiple listening addresses or use a comma-separated list on a single line.
Cumulus Linux provides a listening address for VRFs together with trap and inform support. You can configure snmpd to listen to a specific IPv4 or IPv6 address on an interface within a particular VRF. With VRFs, identical IP addresses can exist in different VRF tables. This command restricts listening to a particular IP address within a particular VRF. If you do not provide a VRF name, Cumulus Linux uses the default VRF.
The following command configures snmpd to listen to IP address 10.10.10.10 on eth0, the management interface in the management VRF:
cumulus@switch:~$ nv set service snmp-server listening-address 10.10.10.10 vrf mgmt
cumulus@switch:~$ nv config apply
By default, snmpd does not cross VRF table boundaries. To listen on IP addresses in different VRF tables, use multiple listening-address commands each with a VRF name:
cumulus@switch:~$ nv set service snmp-server listening-address 10.10.10.10 vrf rocket
cumulus@switch:~$ nv set service snmp-server listening-address 10.10.10.20 vrf turtle
cumulus@switch:~$ nv config apply
By default, snmpd only responds to localhost requests in the default VRF. You can configure the switch to respond to requests sent to localhost in a mgmt VRF shell. To configure the snmpd daemon to listen on localhost in the mgmt VRF, run:
cumulus@switch:~$ nv set service snmp-server listening-address localhost vrf mgmt
cumulus@switch:~$ nv config apply
To bind to a particular IP address within a particular VRF table, edit the /etc/snmp/snmpd.conf file and append @ and the name of the VRF table to the IP address (for example, 192.168.200.11@mgmt).
By default, snmpd only responds to localhost requests in the default VRF. You can configure the switch to respond to requests sent to localhost in a mgmt VRF shell. Edit the /etc/snmp/snmpd.conf file and add @mgmt to the agentaddress configuration:
Then restart snmpd with the sudo systemctl restart snmpd command.
Configure the SNMPv3 Username
NVIDIA recommends you use an SNMPv3 username and password instead of the read-only community; SNMPv3 does not expose the password in the GetRequest and GetResponse packets and can also encrypt packet contents. You can configure multiple usernames for different user roles with different levels of access to various MIBs.
The default snmpd.conf file contains the default user _snmptrapusernameX. You cannot use this username for authentication. SNMP traps require this username.
You can authenticate the user in the following ways:
With no authentication password (if you specify auth-none)
With an MD5 password
With an SHA password
The following example command requires no authentication password for the user testusernoauth:
cumulus@switch:~$ nv set service snmp-server username testusernoauth auth-none
cumulus@switch:~$ nv config apply
The following example command configures MD5 authentication for the user testuserauth:
cumulus@switch:~$ nv set service snmp-server username testuserauth auth-md5 myauthmd5password
cumulus@switch:~$ nv config apply
The following example command configures SHA authentication for the user limiteduser1:
cumulus@switch:~$ nv set service snmp-server username limiteduser1 auth-sha SHApassword1
cumulus@switch:~$ nv config apply
If you specify MD5 or SHA authentication, you can also specify an AES or DES encryption password to encrypt the contents of the request and response packets.
cumulus@switch:~$ nv set service snmp-server username testuserauth auth-md5 myauthmd5password encrypt-aes myencryptsecret
cumulus@switch:~$ nv config apply
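The authentication and encryption choices above map onto the three standard SNMPv3 security levels (noAuthNoPriv, authNoPriv, and authPriv). The following Python sketch shows that mapping, including the rule that encryption requires authentication. The function name security_level is hypothetical.

```python
from typing import Optional


def security_level(auth: Optional[str], encrypt: Optional[str]) -> str:
    """Map auth ("md5", "sha", or None) and encrypt ("aes", "des", or None) to a level."""
    if encrypt and not auth:
        # Encryption is only available together with MD5 or SHA authentication.
        raise ValueError("encryption requires MD5 or SHA authentication")
    if not auth:
        return "noAuthNoPriv"
    return "authPriv" if encrypt else "authNoPriv"


if __name__ == "__main__":
    print(security_level(None, None))    # a user created with auth-none
    print(security_level("md5", None))   # a user with auth-md5 only
    print(security_level("md5", "aes"))  # a user with auth-md5 and encrypt-aes
```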
You can restrict a user to a particular OID tree. The OID can be either a string of decimal numbers separated by periods or a unique text string that identifies an SNMP MIB object. The MIBs that Cumulus Linux includes are in the /usr/share/snmp/mibs/ directory. If the MIB you want to use does not install by default, you can install it with the latest Debian snmp-mibs-downloader package.
cumulus@switch:~$ nv set service snmp-server username testuserauth auth-md5 myauthmd5password encrypt-aes myaessecret oid 1.3.6.1.2.1.1
cumulus@switch:~$ nv config apply
You can restrict a user to a predefined view:
cumulus@switch:~$ nv set service snmp-server username testuserauth auth-md5 myauthmd5password encrypt-aes myaessecret view rocket
cumulus@switch:~$ nv config apply
The example below defines five users, each with a different combination of authentication and encryption:
cumulus@switch:~$ nv set service snmp-server username user1 auth-none
cumulus@switch:~$ nv set service snmp-server username user2 auth-md5 user2password
cumulus@switch:~$ nv set service snmp-server username user3 auth-md5 user3password encrypt-des user3encryption
cumulus@switch:~$ nv set service snmp-server username user666 auth-sha user666password encrypt-aes user666encryption
cumulus@switch:~$ nv set service snmp-server username user999 auth-md5 user999password encrypt-des user999encryption
cumulus@switch:~$ nv set service snmp-server username user1 auth-none oid 1.3.6.1.2.1
cumulus@switch:~$ nv set service snmp-server username user3 auth-sha testshax encrypt-aes testaesx oid 1.3.6.1.2.1
cumulus@switch:~$ nv config apply
Three directives define an internal SNMPv3 username that snmpd needs to retrieve information and to send built-in traps or traps you configure with the monitor command (see below):
createUser defines the default SNMPv3 username.
iquerySecName is the default SNMPv3 username you use when making internal queries to retrieve monitored expressions, either to evaluate the monitored expression or build a notification payload. These internal queries always use SNMPv3, even if you query the agent using SNMPv1 or SNMPv2c. The iquerySecName directive only defines which user to use.
rouser is the username for these SNMPv3 queries.
Edit the /etc/snmp/snmpd.conf file and add the createUser, iquerySecName, and rouser directives. The following example configuration configures _snmptrapusernameX as the username using the createUser directive.
The example below defines five users, each with a different combination of authentication and encryption:
cumulus@switch:~$ sudo nano /etc/snmp/snmpd.conf
...
# simple no auth user
#createuser user1
# user with MD5 authentication
#createuser user2 MD5 user2password
# user with MD5 for auth and DES for encryption
#createuser user3 MD5 user3password DES user3encryption
# user666 with SHA for authentication and AES for encryption
createuser user666 SHA user666password AES user666encryption
# user999 with MD5 for authentication and DES for encryption
createuser user999 MD5 user999password DES user999encryption
# restrict users to certain OIDs
# (Note: creating rouser or rwuser will give
# access regardless of the createUser command above. However,
# createUser without rouser or rwuser will not provide any access).
rouser user1 noauth 1.3.6.1.2.1
rouser user2 auth 1.3.6.1.2.1
rwuser user3 priv 1.3.6.1.2.1
rwuser user666
rwuser user999
...
The following example shows a more advanced but slightly more secure method of configuring SNMPv3 users without creating cleartext passwords:
Install the net-snmp-config script that is in the libsnmp-dev package:
Use the net-snmp-config command to create two users, one with MD5 and DES, and the next with SHA and AES.
The minimum password length is eight characters. The -a and -x arguments have different meanings in net-snmp-config than in snmpwalk.
cumulus@switch:~$ sudo net-snmp-config --create-snmpv3-user -a md5authpass -x desprivpass -A MD5 -X DES userMD5withDES
cumulus@switch:~$ sudo net-snmp-config --create-snmpv3-user -a shaauthpass -x aesprivpass -A SHA -X AES userSHAwithAES
cumulus@switch:~$ sudo systemctl start snmpd.service
This adds a createUser command in /var/lib/snmp/snmpd.conf. Do not edit this file by hand unless you are removing usernames. You can edit this file and restrict access to certain parts of the MIB by adding noauth, auth or priv to allow unauthenticated access, require authentication, or to enforce use of encryption.
The snmpd daemon reads the information from the /var/lib/snmp/snmpd.conf file and then removes the line (so that Cumulus Linux does not store the master password for that user) and replaces it with the key it derives (using the EngineID). The key is a localized key so that if someone steals the password, they cannot use it to access other agents. To remove the two users userMD5withDES and userSHAwithAES, stop the snmpd daemon and edit the /var/lib/snmp/snmpd.conf file. Remove the lines containing the username, then restart the snmpd daemon as in step 3 above.
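To illustrate what a localized key is, the following Python sketch derives one from a password and an engine ID. It assumes the password-to-key algorithm from the SNMPv3 User-based Security Model (RFC 3414); it does not reproduce the format that snmpd writes to its persistent file, and the function name localized_key is hypothetical.

```python
import hashlib


def localized_key(password: bytes, engine_id: bytes, algorithm: str = "md5") -> bytes:
    """Derive a localized key as described in RFC 3414 (algorithm: "md5" or "sha1")."""
    if not password:
        raise ValueError("password must not be empty")
    # Step 1: repeat the password until it fills exactly 1 MiB, then hash it.
    repeats, remainder = divmod(1048576, len(password))
    stream = password * repeats + password[:remainder]
    intermediate = hashlib.new(algorithm, stream).digest()
    # Step 2: localize by hashing key + engine ID + key.
    return hashlib.new(algorithm, intermediate + engine_id + intermediate).digest()


if __name__ == "__main__":
    engine_a = bytes.fromhex("000000000000000000000001")
    engine_b = bytes.fromhex("000000000000000000000002")
    # The same password yields a different key for each engine ID.
    print(localized_key(b"md5authpass", engine_a).hex())
    print(localized_key(b"md5authpass", engine_b).hex())
```

Because the engine ID is part of the second hash, a key taken from one agent is of no use against an agent with a different engine ID.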
Configure an SNMP View Definition
To restrict MIB tree exposure, you can define a view for an SNMPv3 username or community password, and a host from a restricted subnet. In doing so, any SNMP request with that username and password must have a source IP address within the configured subnet.
You can define a specific view multiple times and fine tune to provide or restrict access using the included or excluded command to specify branches of certain MIB trees.
By default, the snmpd.conf file contains many views within the systemonly view.
cumulus@switch:~$ nv set service snmp-server viewname cumulusOnly included .1.3.6.1.4.1.40310
cumulus@switch:~$ nv set service snmp-server viewname cumulusCounters included .1.3.6.1.4.1.40310.2
cumulus@switch:~$ nv set service snmp-server readonly-community simplepassword access any view cumulusOnly
cumulus@switch:~$ nv set service snmp-server username testusernoauth auth-none view cumulusOnly
cumulus@switch:~$ nv set service snmp-server username limiteduser1 auth-md5 md5password1 encrypt-aes myaessecret view cumulusCounters
cumulus@switch:~$ nv config apply
Edit the /etc/snmp/snmpd.conf file and add the view command.
rocommunity uses the systemonly view to create a password that can only access these branches of the OID tree.
cumulus@switch:~$ sudo nano /etc/snmp/snmpd.conf
...
view systemonly included .1.3.6.1.2.1.1
view systemonly included .1.3.6.1.2.1.2
view systemonly included .1.3.6.1.2.1.3
...
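The effect of included and excluded view entries can be sketched as a prefix match on the OID. The following Python example assumes the longest-matching-subtree rule from the SNMP view-based access control model (RFC 3415) and ignores subtree masks; the function names are hypothetical.

```python
def parse_oid(text):
    """Convert ".1.3.6.1" or "1.3.6.1" into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def oid_in_view(oid, entries):
    """entries is a list of (kind, subtree) pairs; kind is "included" or "excluded"."""
    target = parse_oid(oid)
    best_kind, best_length = "excluded", -1   # no matching entry means no access
    for kind, subtree in entries:
        prefix = parse_oid(subtree)
        # The most specific (longest) matching subtree decides the outcome.
        if target[:len(prefix)] == prefix and len(prefix) > best_length:
            best_kind, best_length = kind, len(prefix)
    return best_kind == "included"


if __name__ == "__main__":
    systemonly = [("included", ".1.3.6.1.2.1.1"),
                  ("included", ".1.3.6.1.2.1.2"),
                  ("included", ".1.3.6.1.2.1.3")]
    print(oid_in_view(".1.3.6.1.2.1.1.6.0", systemonly))   # sysLocation.0
    print(oid_in_view(".1.3.6.1.2.1.4.1.0", systemonly))   # outside the view
```

An excluded entry that is more specific than an included entry removes just that branch from the view.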
Configure the Community String
Cumulus Linux disables snmpd authentication for SNMPv1 and SNMPv2c by default. To enable authentication, provide a password (community string) for SNMPv1 or SNMPv2c environments so that the snmpd daemon can respond to requests. By default, this provides access to the full OID tree for such requests, regardless of their source. Cumulus Linux does not set a default password so snmpd does not respond to any requests that arrive unless you set the read-only community password.
For SNMPv1 and SNMPv2c, you can specify a read-only community string. For SNMPv3, you can specify a read-only or a read-write community string (as long as you are not using the preferred username method; see above).
You can specify a source IP address token to restrict access to only that host or network.
You can also specify a view to restrict the subset of the OID tree.
The following example configuration:
Sets the read-only community string to simplepassword for SNMP requests.
Restricts requests to only those that come from hosts in the 192.168.200.10/24 subnet.
Restricts viewing to the mysystem view, which you define with the view command.
cumulus@switch:~$ nv set service snmp-server viewname mysystem included 1.3.6.1.2.1.1
cumulus@switch:~$ nv set service snmp-server readonly-community simplepassword access 192.168.200.10/24 view mysystem
cumulus@switch:~$ nv config apply
This example creates a read-only community password showitall that allows access to the entire OID tree for requests originating from any source IP address.
cumulus@switch:~$ nv set service snmp-server readonly-community showitall access any
cumulus@switch:~$ nv config apply
To enable the community string, provide a community string, then set:
rocommunity or rwcommunity: rocommunity is for a read-only community; rwcommunity is for read-write access. Specify one or the other.
public: The plain text password or community string.
NVIDIA strongly recommends you change this password to something else.
default allows connections from any system.
localhost allows requests only from the local host. A restricted source can either be a specific hostname (or address), or a subnet, represented as IP/MASK (like 10.10.10.0/255.255.255.0), or IP/BITS (like 10.10.10.0/24), or the IPv6 equivalents.
-V restricts viewing to a specific view. For example, systemonly is one SNMP view. This is a user-defined value.
Edit the /etc/snmp/snmpd.conf file and add the community string.
In the following example, the first line sets the read-only community string to turtle for SNMP requests sourced from the 192.168.200.10/24 subnet and restricts viewing to the systemonly view defined with the -V option. The second line creates a read-only community string that allows access to the entire OID tree from any source IP address.
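The source restriction forms described in this section (default, localhost, a specific address, IP/MASK, and IP/BITS) can be checked with the Python ipaddress module. The sketch below is illustrative only; it does not resolve hostnames, and the function name source_allowed is hypothetical.

```python
import ipaddress


def source_allowed(source_ip: str, restriction: str) -> bool:
    """Return True if source_ip satisfies the community source restriction."""
    if restriction == "default":
        return True                      # any system can connect
    address = ipaddress.ip_address(source_ip)
    if restriction == "localhost":
        return address.is_loopback       # requests from the local host only
    # strict=False accepts host bits in the prefix, such as 192.168.200.10/24.
    network = ipaddress.ip_network(restriction, strict=False)
    return address.version == network.version and address in network


if __name__ == "__main__":
    print(source_allowed("192.168.200.77", "192.168.200.10/24"))
    print(source_allowed("10.10.10.5", "10.10.10.0/255.255.255.0"))
    print(source_allowed("192.168.201.1", "192.168.200.10/24"))
```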
You can configure system settings for the SNMPv2 MIB. The following example commands set:
The system physical location for the node in the SNMPv2-MIB system table (the syslocation).
The username and email address of the contact person for this managed node (the syscontact).
An administratively assigned name for the managed node (the sysname).
To set the system physical location for the node in the SNMPv2-MIB system table:
cumulus@switch:~$ nv set service snmp-server system-location my-private-bunker
cumulus@switch:~$ nv config apply
To set the username and email address of the contact person for this managed node:
cumulus@switch:~$ nv set service snmp-server system-contact myemail@example.com
cumulus@switch:~$ nv config apply
To set an administratively assigned name for the managed node, run the following command. Typically, this is the fully qualified domain name of the node.
cumulus@switch:~$ nv set service snmp-server system-name CumulusBox-1,543,567
cumulus@switch:~$ nv config apply
Edit the /etc/snmp/snmpd.conf file and add the following configuration:
SNMP supports routing MIBs in FRR. If you are running Linux commands to configure the switch, you need to configure AgentX (ASX) access in FRR.
The NVUE nv set service snmp-server enable on command automatically configures AgentX (ASX) access in FRR; you do not need to run any additional commands.
The snmpd.conf file in Cumulus Linux does not include certain MIBs by default. As a result, some default views in common network tools (such as librenms) return less than optimal data. To include more MIBs, enable the complete .1.3.6.1.2.1 range. The default SNMPv3 configuration includes:
ENTITY-MIB
ENTITY-SENSOR MIB
Parts of the BRIDGE-MIB and Q-BRIDGE-MIBs
This configuration grants access to a large number of MIBs, including all SNMPv2-MIB, which shows more data than you expect. In addition to being a security vulnerability, it consumes more CPU resources.
To enable the .1.3.6.1.2.1 range, make sure the view commands include the required MIB objects.
Set up the Custom MIBs on the NMS
You do not need to change the /etc/snmp/snmpd.conf file on the switch to support the custom MIBs. The file includes the following lines by default and provides support for both the Cumulus Counters and the Cumulus Resource Query MIBs.
The pass persist scripts in Cumulus Linux use the pass_persist extension to Net-SNMP. The scripts are in /usr/share/snmp and include:
bridge_pp.py
cl_drop_cntrs_pp.py
cl_poe_pp.py
entity_pp.py
entity_sensor_pp.py
ieee8023_lag_pp.py
resq_pp.py
snmpifAlias_pp.py
sysDescr_pass.py
vrf_bgpun_pp.py
Cumulus Linux enables all the scripts by default except for bgp4_pp.py, which FRR uses.
Disable SNMP
To disable SNMP, run the nv unset service snmp-server enable command:
cumulus@switch:~$ nv unset service snmp-server enable
cumulus@switch:~$ nv config apply
When you disable SNMP, the FRR service restarts, which might impact traffic.
Example Configuration
The following example configuration:
Enables an SNMP agent to listen on all IPv4 addresses with a community string password.
Sets the trap destination host IP address.
Creates several types of SNMP traps.
cumulus@switch:~$ nv set service snmp-server listening-address all
cumulus@switch:~$ nv set service snmp-server readonly-community tempPassword access any
cumulus@switch:~$ nv set service snmp-server trap-destination 1.1.1.1 community-password mypassword version 2c
cumulus@switch:~$ nv set service snmp-server trap-link-up check-frequency 15
cumulus@switch:~$ nv set service snmp-server trap-link-down check-frequency 10
cumulus@switch:~$ nv set service snmp-server trap-snmp-auth-failures
cumulus@switch:~$ nv config apply
Edit the /etc/snmp/snmpd.conf file and apply the following configuration (add every line starting with a +):
SNMP traps are alert notification messages from SNMP agents to the SNMP manager. These messages generate whenever any failure or fault occurs in a monitored device or service. An SNMPv3 inform is an acknowledged SNMPv3 trap.
You configure the following for SNMPv3 trap and inform messages:
The trap destination IP address; the VRF name is optional.
The authentication type and password. The encryption type and password are optional.
The engine ID and username pair for the Cumulus Linux switch sending the traps. You can find the engine ID at the end of the /var/lib/snmp/snmpd.conf file, labeled oldEngineID. Configure this same engine ID and username (with authentication and encryption passwords) on the trap daemon receiving the trap so that it can validate the received trap. The inform keyword specifies an inform message, where the SNMP agent waits for an acknowledgement.
Generate Event Notification Traps
The Net-SNMP agent provides a method to generate SNMP trap events using the Distributed Management (DisMan) Event MIB for various system events, including:
Link up/down.
Exceeding the temperature sensor threshold, CPU load, or memory threshold.
Other SNMP MIBs.
To enable specific types of traps, create the following configurations in /etc/snmp/snmpd.conf.
Define Access Credentials
Although the switch sends the traps to an SNMPv2c receiver, the DisMan service still requires an SNMPv3 username for authorization. Starting with Net-SNMP 5.3, snmptrapd no longer accepts all traps by default. You must configure snmptrapd with authorized SNMPv1 and SNMPv2c community strings, SNMPv3 users, or both. snmptrapd drops unauthorized traps and informs.
If not already on the system, install the snmptrapd Debian package with the sudo apt-get install snmptrapd command before you configure the username.
Define Trap Receivers
The following configuration defines the trap receiver IP address for SNMPv1 and SNMPv2c traps. For SNMP versions 1 and 2c, you must set at least one SNMP trap destination IP address; multiple destinations can exist. Removing all settings disables SNMP traps. The default version is 2c. To send traps through a non-default VRF table, you must include a VRF name with the IP address.
cumulus@switch:~$ nv set service snmp-server trap-destination localhost vrf rocket community-password mymanagementvrfpassword version 1
cumulus@switch:~$ nv set service snmp-server trap-destination localhost-v6 community-password mynotsosecretpassword version 2c
cumulus@switch:~$ nv config apply
To define the IP address of the notification (or trap) receiver for SNMPv1 or SNMPv2c traps, use the trapsink (SNMPv1) or trap2sink (SNMPv2c) directive. Specifying more than one sink directive generates multiple copies of each notification (in the appropriate formats). You must configure a trap server (for example, snmptrapd) to receive and decode these trap messages. You can configure the address of the trap receiver with a different protocol and port, but most configurations omit these. The default is UDP on the well-known port 162.
Edit the /etc/snmp/snmpd.conf file and configure the trap settings.
The SNMP trap receiving daemon must have usernames, authentication passwords, and encryption passwords created with its own EngineID. You must configure this trap server EngineID in the switch snmpd daemon sending the trap and inform messages.
cumulus@switch:~$ nv set service snmp-server trap-destination localhost username myv3user auth-md5 md5password1 encrypt-aes myaessecret engine-id 0x80001f888070939b14a514da5a00000000 inform
cumulus@switch:~$ nv set service snmp-server trap-destination localhost vrf mgmt username mymgmtvrfusername auth-md5 md5password2 encrypt-aes myaessecret2 engine-id 0x80001f888070939b14a514da5a00000000 inform
cumulus@switch:~$ nv config apply
You can configure SNMPv3 trap and inform messages with the trapsess configuration command. Inform messages are traps that the receiving trap daemon acknowledges. You configure inform messages with the -Ci parameter. You must specify the EngineID of the receiving trap server with the -e field.
When you run client SNMP programs (such as snmpget, snmpwalk, or snmptrap) from the command line, or when you configure snmpd to send a trap (based on snmpd.conf), you can configure a clientaddr in snmpd.conf that allows the SNMP client programs or snmpd (for traps) to source requests from a different source IP address.
For more information about clientaddr, see the snmpd.conf man page.
snmptrap, snmpget, snmpwalk and snmpd itself must be able to bind to this address.
Edit the /etc/snmp/snmpd.conf file and add the clientaddr option. In the following example, spine01 is the client (IP address 192.168.200.21).
NVUE does not provide commands for this configuration.
Monitor Fans, Power Supplies, Temperature and Transformers
An SNMP agent (snmpd) waits for incoming SNMP requests and responds to them. If the agent does not receive any requests, it does not start any actions. However, various commands can configure snmpd to send traps according to preconfigured settings (load, file, proc, disk, or swap commands), or customized monitor directives.
See the snmpd.conf man page for details on the monitor directive.
You can configure snmpd to monitor the operational status of either the Entity MIB or Entity-Sensor MIB by adding the monitor directive to the snmpd.conf file. After you know the OID, you can determine the operational status, which can be a value of ok(1), unavailable(2) or nonoperational(3). Add a configuration like the following example to /etc/snmp/snmpd.conf and adjust the values:
Use the entPhySensorOperStatus integer:
cumulus@switch:~$ sudo nano /etc/snmp/snmpd.conf
...
# without installing extra MIBs, we can check the Fan1 status
# if the Fan1 index is 100011001, monitor this specific OID (-I) every 10 seconds (-r), and define additional information to include in the trap (-o).
monitor -I -r 10 -o 1.3.6.1.2.1.47.1.1.1.1.7.100011001 "Fan1 Not OK" 1.3.6.1.2.1.99.1.1.1.5.100011001 > 1
# Any Entity Status non OK (greater than 1)
monitor -r 10 -o 1.3.6.1.2.1.47.1.1.1.1.7 "Sensor Status Failure" 1.3.6.1.2.1.99.1.1.1.5 > 1
Use the OID name. You can use the OID name if the snmp-mibs-downloader package is on the system (see below).
cumulus@switch:~$ sudo nano /etc/snmp/snmpd.conf
...
# for a specific fan called Fan1 with an index 100011001
monitor -I -r 10 -o entPhysicalName.100011001 "Fan1 Not OK" entPhySensorOperStatus.100011001 > 1
# for any Entity Status not OK ( greater than 1)
monitor -r 10 -o entPhysicalName "Sensor Status Failure" entPhySensorOperStatus > 1
You can find the entPhySensorOperStatus integer by walking the entPhysicalName table.
To get all sensor information, run snmpwalk on the entPhysicalName table. For example:
Cumulus Linux no longer uses the LM-SENSORS MIB to monitor temperature.
Configure Link Up and Link Down Notifications
You can configure the switch to trigger link up and link down notifications when the operational status of the link changes.
The following example commands enable the Disman Event MIB (.1.3.6.1.2.1.88.2.0.1) to monitor the ifTable, checking every 15 seconds for network interfaces that come up and every 10 seconds for network interfaces that go down, and trigger the CumulusLinkUp and CumulusLinkDown named notifications.
The default check frequency is 60 seconds, with a minimum of 5 and a maximum of 300 seconds.
These notifications include the following information.
ifName
ifIndex
ifAdminStatus
ifOperStatus
cumulus@switch:~$ nv set service snmp-server trap-link-down check-frequency 10
cumulus@switch:~$ nv set service snmp-server trap-link-up check-frequency 15
cumulus@switch:~$ nv config apply
Edit the /etc/snmp/snmpd.conf file and configure the trap settings.
NVUE does not provide commands to configure free memory notifications.
Configure Processor Load Notifications
To generate a trap when the CPU load average exceeds a certain threshold, run the following commands. You can only use integers or floating point numbers.
The following example generates a trap when the 1-minute load average reaches 12%, the 5-minute load average reaches 10%, or the 15-minute load average reaches 5%.
cumulus@switch:~$ nv set service snmp-server trap-cpu-load-average one-minute 12 five-minute 10 fifteen-minute 5
cumulus@switch:~$ nv config apply
Edit the /etc/snmp/snmpd.conf file and configure the CPU load settings. To monitor CPU load for 1, 5, or 15 minute intervals, use the load directive with the monitor directive.
To monitor disk utilization for all disks, use the includeAllDisks directive together with the monitor directive. The example code below generates a trap when a disk is 99% full:
To configure the Event MIB tables to monitor the various UCD-SNMP-MIB tables for problems (xxErrFlag column objects) and send a trap, add defaultMonitors yes to the snmpd.conf file and provide a configuration. You must first install the snmp-mibs-downloader Debian package and comment out the mibs line in the /etc/snmp/snmp.conf file (see below). Then add a configuration like the following example:
You can use MIB names instead of OIDs, which greatly improves the readability of the snmpd.conf file. To enable this, install the snmp-mibs-downloader package, which downloads SNMP MIBs to the switch, before you enable traps.
Open /etc/apt/sources.list in a text editor, add the non-free repository, then save the file:
cumulus@switch:~$ sudo nano /etc/apt/sources.list
...
deb http://deb.debian.org/debian buster main non-free
...
Open the /etc/snmp/snmp.conf file to verify that the mibs : line is commented out:
#
# As the snmp packages come without MIB files due to license reasons, loading
# of MIBs is disabled by default. If you added the MIBs you can reenable
# loading them by commenting out the following line.
#mibs :
Open the /etc/default/snmpd file to verify that the export MIBS= line is commented out:
# This file controls the activity of snmpd and snmptrapd
# Don't load any MIBs by default.
# You might comment this lines after you have the MIBs Downloaded.
#export MIBS=
After you confirm the configuration, remove or comment out the non-free repository in /etc/apt/sources.list.
#deb http://ftp.us.debian.org/debian/ buster main non-free
Configure Incoming SNMP Traps
The Net-SNMP trap daemon in /etc/snmp/snmpd.conf receives SNMP traps. You configure how incoming traps are processed in the /etc/snmp/snmptrapd.conf file. With Net-SNMP release 5.3 and later, you must specify who is authorized to send traps and informs to the notification receiver (and what types of processing these are allowed to trigger). You can specify three processing types:
log logs the details of the notification in a specified file, to standard output (or stderr), or through syslog (or similar).
execute passes the details of the trap to a specified handler program, including embedded Perl.
net forwards the trap to another notification receiver.
Typically, you configure all three (log,execute,net) to cover any style of processing for a particular category of notification. You can limit certain notification sources to certain processing only.
authCommunity TYPES COMMUNITY [SOURCE [OID | -v VIEW ]] authorizes traps and SNMPv2c INFORM requests with the community you specify to trigger the types of processing you list. By default, this allows any notification using this community to process. You can use the SOURCE field to specify that the configuration only applies to notifications from particular sources. For more information about specific configuration options within the file, see the snmptrapd.conf(5) man page, which you can open with the man 5 snmptrapd.conf command.
If not already on the system, install the snmptrapd Debian package before you configure incoming traps:
cumulus@switch:~$ sudo apt-get install snmptrapd
Supported MIBs
Below are the MIBs that Cumulus Linux supports, as well as suggested uses for them. The /usr/share/snmp/mibs/Cumulus-Snmp-MIB.txt file defines the overall Cumulus Linux MIB.
Discard counters: Cumulus Linux also includes its own counters MIB, defined in /usr/share/snmp/mibs/Cumulus-Counters-MIB.txt. It has the OID .1.3.6.1.4.1.40310.2.
Cumulus Linux includes its own resource utilization MIB, which is similar to using cl-resource-query. This MIB monitors layer 3 entries by host, route, nexthops, ECMP groups, and layer 2 MAC/BPDU entries. /usr/share/snmp/mibs/Cumulus-Resource-Query-MIB.txt defines this MIB, which has the OID .1.3.6.1.4.1.40310.1.
The dot1dBasePortEntry and dot1dBasePortIfIndex tables in the BRIDGE-MIB and the dot1qBase, dot1qFdbEntry, dot1qTpFdbEntry, dot1qTpFdbStatus, and dot1qVlanStaticName tables in the Q-BRIDGE-MIB. You must uncomment the bridge_pp.py pass_persist script in /etc/snmp/snmpd.conf.
Implementation of the IEEE 8023-LAG-MIB includes the dot3adAggTable and dot3adAggPortListTable tables. To enable this, edit /etc/snmp/snmpd.conf and uncomment or add the following lines:
view systemonly included .1.2.840.10006.300.43
pass_persist .1.2.840.10006.300.43 /usr/share/snmp/ieee8023_lag_pp.py
Note: Cumulus Linux disables the IF-MIB cache by default. The non-caching code path in the IF-MIB treats 64-bit counters like 32-bit counters (a 64-bit counter rolls over after the value increments to a value that extends beyond 32 bits). To enable the counter to reflect traffic statistics using 64-bit counters, remove the -y option from the SNMPDOPTS line in the /etc/default/snmpd file. The example below first shows the original line, commented out, then the modified line without the -y option:
Layer 2 neighbor information from lldpd (you need to enable the SNMP subagent in LLDP). You need to start lldpd with the -x option to enable connectivity to snmpd (AgentX).
Due to licensing restrictions, Cumulus Linux does not install all MIBs. For the MIBs that Cumulus Linux does not install, you must add the “non-free” archive to /etc/apt/sources.list. To see which MIBs are on your switch, run ls /usr/share/snmp/mibs/.
To install more MIBs, install snmp-mibs-downloader, then either remove or comment out the “non-free” repository in /etc/apt/sources.list. Refer to Enable MIB-to-OID Translation.
Use the following commands to troubleshoot potential SNMP issues.
To show a summary of the SNMP configuration settings on the switch:
cumulus@switch:~$ nv show service snmp-server
                     applied         description
-------------------  --------------  ---------------------------------------------------------------------
enable               on              Turn the feature 'on' or 'off'. This feature is disabled by default.
[listening-address]  localhost       Collection of listening addresses
trap-link-down
  check-frequency    60              Link up or link down checking frequency in seconds
trap-link-up
  check-frequency    60              Link up or link down checking frequency in seconds
[username]           testusernoauth  Usernames
[username]           user1
[username]           user2
[username]           user3
[username]           user666
[username]           user999
To show a summary of the SNMP configuration settings in JSON format, run the nv show service snmp-server --output json --applied command.
To show the SNMP trap CPU load average, run the nv show service snmp-server trap-cpu-load-average command.
To show SNMP trap authentication failures, run the nv show service snmp-server trap-snmp-auth-failures command.
To see all the show commands for SNMP troubleshooting, run nv show service snmp-server and press the Tab key:
cumulus@switch:~$ nv show service snmp-server <<press Tab>>
listening-address readonly-community-v6 trap-link-down username
mibs trap-cpu-load-average trap-link-up viewname
readonly-community trap-destination trap-snmp-auth-failures
Single User Mode - Password Recovery
Use single user mode to assist in troubleshooting system boot issues or for password recovery.
To enter single user mode:
Boot the switch, then as soon as you see the GRUB menu, use the arrow keys to select Advanced options for Cumulus Linux GNU/Linux.
Before the GRUB menu appears, the switch goes through the boot cycle. Do not interrupt this autoboot process when you see the following lines; wait until you see the GRUB menu.
...
USB0: Bringing USB2 host out of reset...
Net: eth-0
SF: MX25L6405D with page size 4 KiB, total 8 MiB
Hit any key to stop autoboot: 2
GNU GRUB version 2.02+dfsg1-20
+----------------------------------------------------------------------------+
|*Cumulus Linux GNU/Linux |
| Advanced options for Cumulus Linux GNU/Linux |
| ONIE |
| |
+----------------------------------------------------------------------------+
Select Cumulus Linux GNU/Linux, with Linux 4.19.0-cl-1-amd64 (recovery mode).
GNU GRUB version 2.02+dfsg1-20
+----------------------------------------------------------------------------+
| Cumulus Linux GNU/Linux, with Linux 4.19.0-cl-1-amd64 |
|*Cumulus Linux GNU/Linux, with Linux 4.19.0-cl-1-amd64 (recovery mode) |
| |
+----------------------------------------------------------------------------+
After the system reboots, set a new root password. The root user provides complete control over the switch.
root@switch:~# passwd
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
You can take this opportunity to reset the password for the cumulus account.
root@switch:~# passwd cumulus
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Sync the /etc directory, then reboot the system:
root@switch:~# sync
root@switch:~# reboot -f
Restarting the system.
Resource Diagnostics Using cl-resource-query
You can use the cl-resource-query command to retrieve information about host entries, MAC entries, layer 2 and layer 3 routes, and ECMP routes that are in use. Because Cumulus Linux synchronizes routes between the kernel and the switching silicon, if the required resource pools in hardware fill up, new kernel routes can cause existing routes to move from being fully allocated to being partially allocated. To avoid this, monitor the routes in the hardware to keep them below the ASIC limits.
To monitor the routes in Cumulus Linux hardware, use the cl-resource-query command.
The example below shows cl-resource-query results for an NVIDIA Spectrum-2 switch:
cumulus@switch:~$ sudo cl-resource-query
IPv4 host entries: 0, 0% of maximum value 41360
IPv6 host entries: 0, 0% of maximum value 20680
IPv4 neighbors: 0
IPv6 neighbors: 0
IPv4 route entries: 0, 0% of maximum value 82720
IPv6 route entries: 22, 0% of maximum value 74446
IPv4 Routes: 0
IPv6 Routes: 12
Total Routes: 22, 0% of maximum value 157166
Unicast Adjacency entries: 0, 0% of maximum value 33087
ECMP entries: 0, 0% of maximum value 8571
MAC entries: 38, 0% of maximum value 57903
Total Mcast Routes: 0, 0% of maximum value 1000
Ingress ACL entries: 0
Egress ACL entries: 0
ACL Regions: 4, 1% of maximum value 400
ACL 18B Rules Key: 1, 0% of maximum value 57476
ACL 36B Rules Key: 0, 0% of maximum value 57475
ACL 54B Rules Key: 0, 0% of maximum value 34485
Ingress ACL mac filter table: 0 18B : 0 36B : 0 54B : 0
Ingress ACL ipv4 filter table: 0 18B : 0 36B : 0 54B : 0
Ingress ACL ipv6 filter table: 0 18B : 0 36B : 0 54B : 0
Egress ACL mac filter table: 0 18B : 0 36B : 0 54B : 0
Egress ACL ipv4 filter table: 0 18B : 0 36B : 0 54B : 0
Egress ACL ipv6 filter table: 0 18B : 0 36B : 0 54B : 0
Ingress ACL ipv4 mangle table: 0 18B : 0 36B : 0 54B : 0
Ingress ACL ipv6 mangle table: 0 18B : 0 36B : 0 54B : 0
Ingress PBR ipv4 filter table: 0 18B : 0 36B : 0 54B : 0
Ingress PBR ipv6 filter table: 0 18B : 0 36B : 0 54B : 0
Flow Counters: 2, 0% of maximum value 39430
RIF Basic Counters: 0, 0% of maximum value 7885
RIF Enhanced Counters: 38, 1% of maximum value 2666
Dynamic SNAT entries: 0, 0% of maximum value 1024
Dynamic DNAT entries: 0, 0% of maximum value 1024
Dynamic Config SNAT entries: 0, 0% of maximum value 64
Dynamic Config DNAT entries: 0, 0% of maximum value 64
Ingress ACL and Egress ACL entries show the counts as single-wide entries (not double-wide). For information about ACL entries, see Estimate the Number of ACL Rules.
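To keep hardware resources below the ASIC limits, you can post-process the command output. The following Python sketch is one possible approach, not an NVIDIA tool: the parse_usage and over_threshold helpers are hypothetical names, and the parser assumes the "N, P% of maximum value M" line format shown in the example output above.

```python
import re

# Matches lines such as "MAC entries: 38, 0% of maximum value 57903".
# Lines without a maximum (for example "IPv4 neighbors: 0") are ignored.
LINE_RE = re.compile(
    r"^(?P<name>.+?):\s+(?P<used>\d+),\s+\d+% of maximum value\s+(?P<limit>\d+)\s*$"
)


def parse_usage(output):
    """Return {resource name: (used, limit)} for lines that report a maximum."""
    usage = {}
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            usage[match["name"]] = (int(match["used"]), int(match["limit"]))
    return usage


def over_threshold(usage, percent):
    """Return the sorted names of resources at or above `percent` utilization."""
    return sorted(
        name
        for name, (used, limit) in usage.items()
        if limit and 100.0 * used / limit >= percent
    )


# A few lines copied from the example output above.
SAMPLE = """\
IPv4 neighbors: 0
IPv6 route entries: 22, 0% of maximum value 74446
MAC entries: 38, 0% of maximum value 57903
RIF Enhanced Counters: 38, 1% of maximum value 2666
"""

usage = parse_usage(SAMPLE)
print(usage["MAC entries"])
print(over_threshold(usage, 1.0))
```

The sketch recomputes the percentage from the raw counts instead of reading the rounded percentage in the output, so a threshold such as 1.0 compares against the exact ratio.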
ASIC Monitoring
Cumulus Linux provides an ASIC monitoring tool that collects and distributes data about the state of the ASIC. The monitoring tool polls for data at specific intervals and takes certain actions so that you can identify and respond to problems, such as:
Microbursts that result in longer packet latency
Packet buffer congestion that might lead to packet drops
Network problems with a particular switch, port, or traffic class
You can collect the following type of statistics with the ASIC monitoring tool:
A fine-grained history of queue lengths using histograms maintained by the ASIC
Packet counts per port, priority and size
Dropped packet, pause frame, and ECN-marked packet counts
Buffer congestion occupancy per port, priority and buffer pool, and at input and output ports
Collecting Queue Lengths in Histograms
The NVIDIA Spectrum ASIC provides a mechanism to measure and report egress queue lengths in histograms (a graphical representation of data, which it divides into intervals or bins). You can configure the ASIC to measure up to 64 egress queues. Each queue reports through a histogram with 10 bins, where each bin represents a range of queue lengths.
You configure the histogram with a minimum size boundary (Min) and a histogram size. You then derive the maximum size boundary (Max) by adding the minimum size boundary and the histogram size.
The 10 bins have numbers 0 through 9. Bin 0 represents queue lengths up to the Min specified, including queue length 0. Bin 9 represents queue lengths of Max and above. Bins 1 through 8 represent equal-sized ranges between the Min and Max (by dividing the histogram size by 8).
For example, consider the following histogram queue length ranges, in bytes:
Min = 960
Histogram size = 12288
Max = 13248
Range size = 1536
Bin 0: 0:959
Bin 1: 960:2495
Bin 2: 2496:4031
Bin 3: 4032:5567
Bin 4: 5568:7103
Bin 5: 7104:8639
Bin 6: 8640:10175
Bin 7: 10176:11711
Bin 8: 11712:13247
Bin 9: 13248:*
The following illustration demonstrates a histogram showing how many times the queue length for a port was in the ranges specified by each bin. The example shows that the queue length was between 960 and 2495 bytes 125 times within one second.
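The bin arithmetic described above can be checked with a short Python sketch. The function names histogram_bins and bin_index are illustrative only and are not part of Cumulus Linux; the sketch assumes that the histogram size divides evenly by 8, as it does in the example.

```python
def histogram_bins(min_boundary, histogram_size):
    """Return 10 (low, high) byte ranges; high is None for the open-ended bin 9."""
    range_size = histogram_size // 8          # bins 1 through 8 are equal-sized
    max_boundary = min_boundary + histogram_size
    bins = [(0, min_boundary - 1)]            # bin 0: below the minimum boundary
    for i in range(8):
        low = min_boundary + i * range_size
        bins.append((low, low + range_size - 1))
    bins.append((max_boundary, None))         # bin 9: Max and above
    return bins


def bin_index(queue_length, min_boundary, histogram_size):
    """Return the bin number (0 through 9) that a queue length falls into."""
    if queue_length < min_boundary:
        return 0
    if queue_length >= min_boundary + histogram_size:
        return 9
    return 1 + (queue_length - min_boundary) // (histogram_size // 8)


# Values from the example: Min = 960, histogram size = 12288.
for number, (low, high) in enumerate(histogram_bins(960, 12288)):
    print(f"Bin {number}: {low}:{'*' if high is None else high}")
```

With the example values, the loop prints the same ten ranges as the list above, and bin_index(2000, 960, 12288) returns 1.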
Configure ASIC Monitoring
The asic-monitor service manages the ASIC monitoring tool (systemd manages the asic-monitor service). The asic-monitor service reads the /etc/cumulus/datapath/monitor.conf configuration file to determine what statistics to collect and when to trigger. The service always starts; however, if the configuration file is empty, the service exits.
The monitor.conf configuration file provides the following information:
The type of data to collect.
The switch ports to monitor.
How and when to start reading the ASIC (when the switch reaches a specific queue length or number of dropped packets).
What actions to take (create a snapshot file, send a message to the /var/log/syslog file, or collect more data).
To configure ASIC monitoring, edit the /etc/cumulus/datapath/monitor.conf file and restart the asic-monitor service. The asic-monitor service reads the new configuration file and then runs until you stop it.
The following procedure describes how to monitor queue lengths using a histogram. The monitor collects data every second and writes the results to a snapshot file. When the size of the queue reaches 500 bytes, the system sends a message to the /var/log/syslog file.
To monitor queue lengths using a histogram:
Open the /etc/cumulus/datapath/monitor.conf file in a text editor.
At the end of the file, add the following line to specify the name of the histogram monitor (port group). The example uses histogram_pg; however, you can use any name you choose. You must use the same name with all histogram settings.
monitor.port_group_list = [histogram_pg]
Add the following line to specify the ports you want to monitor. The following example sets swp1 through swp50.
monitor.histogram_pg.port_set = swp1-swp50
Add the following line to set the data type to histogram. This is the data type for histogram monitoring.
monitor.histogram_pg.stat_type = histogram
Add the following line to set the trigger type to timer. The only trigger type available is timer.
monitor.histogram_pg.trigger_type = timer
Add the following line to set the frequency at which data collection starts. In the following example, the frequency is one second.
monitor.histogram_pg.timer = 1s
Add the following line to set the actions you want to take after collecting data. In the following example, the system writes the results of data collection to a snapshot file and sends a message to the /var/log/syslog file.
monitor.histogram_pg.action_list = [snapshot,log]
Add the following line to specify a name and location for the snapshot file. In the following example, the system writes the snapshot to a file called histogram_stats in the /var/lib/cumulus directory and adds a suffix to the file name with the snapshot file count (see the following step).
Add the following line to set the number of snapshots that are taken before the system starts overwriting the earliest snapshot files. In the following example, because the snapshot file count is set to 64, the first snapshot file is named histogram_stats_0 and the 64th snapshot is named histogram_stats_63. When the 65th snapshot is taken, the original snapshot file (histogram_stats_0) is overwritten and the sequence continues until histogram_stats_63 is written. Then, the sequence restarts.
monitor.histogram_pg.snapshot.file_count = 64
Add the following line to include a threshold, which determines how to collect data. Setting a threshold is optional. In the following example, when the size of the queue reaches 500 bytes, the system sends a message to the /var/log/syslog file.
monitor.histogram_pg.log.queue_bytes = 500
Add the following lines to set the size, minimum boundary, and sampling time of the histogram. Adding the histogram size and the minimum boundary size together produces the maximum boundary size. These settings represent the range of queue lengths per bin.
Restarting the asic-monitor service does not disrupt traffic or require you to restart switchd. The switch enables the service by default during boot and the service restarts when you restart switchd.
When collecting data, the switch uses both the CPU and SDK process, which can affect switchd. Snapshots and logs can occupy a lot of disk space if you do not limit their number.
To collect other data, such as all packets per port, buffer congestion, or packet drops due to error, follow the procedure above but change the port group list setting to include the port group name you want to use. For example, to monitor packet drops due to buffer congestion:
Certain settings in the procedure above (such as the histogram size, boundary size, and sampling time) only apply to the histogram monitor. For a description of all ASIC monitor settings, refer to ASIC Monitoring.
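The snapshot file rotation described in the procedure above behaves like a modulo counter. The following Python sketch only illustrates that naming sequence and is not the asic-monitor implementation; the snapshot_name function is hypothetical, and the underscore-joined suffix is inferred from the histogram_stats_0 through histogram_stats_63 example.

```python
def snapshot_name(base, snapshot_number, file_count):
    """Return the file name used for the Nth snapshot (counting from 1).

    After file_count snapshots, the suffix wraps to 0 and the earliest
    file is overwritten.
    """
    if snapshot_number < 1 or file_count < 1:
        raise ValueError("snapshot_number and file_count must be at least 1")
    return f"{base}_{(snapshot_number - 1) % file_count}"


# With a snapshot file count of 64, as in the procedure above:
for n in (1, 64, 65):
    print(n, snapshot_name("histogram_stats", n, 64))
```

Because the suffix wraps, the number of snapshot files on disk never exceeds the configured file count.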
ASIC Monitoring Settings
The following table describes the ASIC monitor settings.
Setting
Description
port_group_list
Specifies the names of the monitors (port groups) you want to use to collect data, such as discards_pg, histogram_pg, all_packet_pg, buffers_pg. You can provide any name you want for the port group; the names above are just examples. You must use the same name for all the settings of a particular port group.
Note: You must specify at least one port group. If the port group list is empty, systemd shuts down the asic-monitor service.
<port_group_name>.port_set
Specifies the range of ports monitored. You can specify GLOBs and comma-separated lists; for example, swp1-swp4,swp8,swp10-swp50.
Example:
monitor.histogram_pg.port_set = swp1-swp50
<port_group_name>.stat_type
Specifies the type of data that the port group collects.
For histograms, specify histogram. For example:
monitor.histogram_pg.stat_type = histogram
For packet drops due to errors, specify packet. For example:
monitor.discards_pg.stat_type = packet
For packet occupancy statistics, specify buffer. For example:
monitor.buffers_pg.stat_type = buffer
For all packets per port, specify packet_all. Example:
monitor.all_packet_pg.stat_type = packet_all
<port_group_name>.cos_list
For histogram monitoring, each CoS (Class of Service) value in the list has its own histogram on each port. The global limit on the number of histograms is an average of one histogram per port.
Example:
monitor.histogram_pg.cos_list = [0]
<port_group_name>.trigger_type
Specifies the type of trigger that initiates data collection. The only option is timer. At least one port group must have a configured timer; otherwise, no data is ever collected.
Example:
monitor.histogram_pg.trigger_type = timer
<port_group_name>.timer
Specifies the frequency at which data collects; for example, a setting of 1s indicates that data collects one time per second. You can set the timer to the following:
1 to 60 seconds: 1s, 2s, and so on up to 60s
1 to 60 minutes: 1m, 2m, and so on up to 60m
1 to 24 hours: 1h, 2h, and so on up to 24h
1 to 7 days: 1d, 2d, and so on up to 7d
Example:
monitor.histogram_pg.timer = 4s
<port_group_name>.action_list
Specifies one or more actions that occur when data collects:
snapshot writes a snapshot of the data collection results to a file. If you specify this action, you must also specify a snapshot file (described below). You can also specify a threshold that initiates the snapshot action.
Note: If an action appears in the action list but does not have the required settings (such as a threshold for the log action), the ASIC monitor stops and reports an error.
<port_group_name>.snapshot.file
Specifies the name for the snapshot file. All snapshots use this name, with a sequential number appended to it. See the snapshot.file_count setting.
Specifies the number of snapshots that can be created before the first snapshot file is overwritten. In the following example, because the snapshot file count is set to 64, the first snapshot file is named histogram_stats_0 and the 64th snapshot is named histogram_stats_63. When the 65th snapshot is taken, the original snapshot file (histogram_stats_0) is overwritten and the sequence restarts.
Example:
monitor.histogram_pg.snapshot.file_count = 64
Note: While more snapshots provide you with more data, they can occupy a lot of disk space on the switch.
<port_group_name>.<action>.queue_bytes
For histogram monitoring.
Specifies a threshold for the histogram monitor. This is the length of the queue in bytes that initiates a specified action (snapshot, log, collect).
Specifies a threshold for the packet drops due to error monitor. This is the number of packet drops due to error that initiates a specified action (snapshot, log, collect).
For monitoring packet drops due to buffer congestion.
Specifies a threshold for the packet drops due to buffer congestion monitor. This is the number of packet drops due to buffer congestion that initiates a specified action (log or collect).
The minimum boundary size for the histogram in bytes. On a Spectrum switch, this number must be a multiple of 96. Adding this number to the size of the histogram produces the maximum boundary size. These values represent the range of queue lengths per bin.
The size of the histogram in bytes. Adding this number and the minimum_bytes_boundary value together produces the maximum boundary size. These values represent the range of queue lengths per bin.
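Before you restart the asic-monitor service, it can help to sanity-check the port_set and timer values described in the settings above. The following Python sketch is a hypothetical helper, not part of Cumulus Linux: expand_port_set and timer_seconds are illustrative names, the port parser assumes simple names with a numeric suffix (such as swp1) and does not handle GLOBs, and the timer limits are the ranges listed for the timer setting.

```python
import re

PORT_RE = re.compile(r"^([A-Za-z]+)(\d+)$")
TIMER_RE = re.compile(r"^(\d+)([smhd])$")
TIMER_LIMITS = {"s": 60, "m": 60, "h": 24, "d": 7}      # maximum value per unit
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def expand_port_set(port_set):
    """Expand a value such as 'swp1-swp4,swp8' into a list of port names."""
    ports = []
    for item in port_set.split(","):
        start, sep, end = item.strip().partition("-")
        first = PORT_RE.match(start)
        if not first:
            raise ValueError(f"unrecognized port: {start!r}")
        if not sep:                                      # single port
            ports.append(start)
            continue
        last = PORT_RE.match(end)
        if (not last or last.group(1) != first.group(1)
                or int(last.group(2)) < int(first.group(2))):
            raise ValueError(f"invalid range: {item!r}")
        prefix = first.group(1)
        ports.extend(
            f"{prefix}{n}"
            for n in range(int(first.group(2)), int(last.group(2)) + 1)
        )
    return ports


def timer_seconds(timer):
    """Validate a timer value such as '4s' or '2m' and return it in seconds."""
    match = TIMER_RE.match(timer)
    if not match:
        raise ValueError(f"unrecognized timer: {timer!r}")
    value, unit = int(match.group(1)), match.group(2)
    if not 1 <= value <= TIMER_LIMITS[unit]:
        raise ValueError(f"timer out of range: {timer!r}")
    return value * UNIT_SECONDS[unit]


print(expand_port_set("swp1-swp4,swp8"))
print(timer_seconds("4s"))
```

Both helpers raise ValueError on input they do not recognize, which makes a typo in the configuration file visible before the service reads it.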
Packet drops on swp1 through swp50 collect every two seconds.
If the number of packet drops is greater than 100, the results write to the /var/lib/cumulus/discard_stats snapshot file and the system sends a message to the /var/log/syslog file.
A collect action triggers the collection of additional information. You can daisy chain multiple monitors (port groups) into a single collect action.
In the following example:
Queue length histograms collect for swp1 through swp50 every second.
The results write to the /var/lib/cumulus/histogram_stats snapshot file.
When the queue length reaches 500 bytes, the system sends a message to the /var/log/syslog file and collects additional data: buffer occupancy and all packets per port.
Buffer occupancy data writes to the /var/lib/cumulus/buffer_stats snapshot file and all packets per port data writes to the /var/lib/cumulus/all_packet_stats snapshot file.
In addition, packet drops on swp1 through swp50 collect every two seconds. If the number of packet drops is greater than 100, the monitor writes the results to the /var/lib/cumulus/discard_stats snapshot file and sends a message to the /var/log/syslog file.
Certain actions require additional settings. For example, if you specify the snapshot action, a snapshot file is also required. If you specify the log action, a log threshold is also required. See action_list for additional settings required for each action.
Example Snapshot File
A snapshot action writes a snapshot of the current state of the ASIC to a file. Because parsing the file and finding the information can be tedious, you can use a third-party analysis tool to analyze the data in the file. The following example shows a snapshot of queue lengths.
A log action writes out the ASIC state to the /var/log/syslog file. In the following example, when the size of the queue reaches 500 bytes, the system sends this message to the /var/log/syslog file:
2018-02-26T20:14:41.560840+00:00 cumulus asic-monitor-module INFO: 2018-02-26 20:14:41.559967: Egress queue(s) greater than 500 bytes in monitor port group histogram_pg.
Monitoring Best Practices
The following monitoring processes are best practices for reviewing and troubleshooting potential issues with Cumulus Linux environments.
This document describes:
Metrics that you can poll from Cumulus Linux and use in trend analysis
Critical log messages that you can monitor for triggered alerts
Trend Analysis Using Metrics
A metric is a quantifiable measure that tracks and assesses the status of a specific infrastructure component. Examples of metrics include bytes on an interface, CPU utilization, and total number of routes.
Metrics are more valuable when you use them for trend analysis.
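As a rough illustration of trend analysis, the following Python sketch converts two samples of an increasing counter (for example, bytes on an interface) into a per-second rate. The function name and the counter reset handling are assumptions for illustration only.

```python
from typing import Optional


def rate_per_second(prev_value: int, prev_time: float,
                    cur_value: int, cur_time: float) -> Optional[float]:
    """Return the average counter increase per second between two samples.

    Returns None when the counter went backwards, which this sketch
    treats as a counter reset (an assumption).
    """
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = cur_value - prev_value
    if delta < 0:
        return None
    return delta / elapsed
```

Storing the rate at each poll interval, rather than the raw counter, makes it easier to compare values over days or weeks.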
Generate Alerts with Triggered Logging
Cumulus Linux typically sends triggered issues to syslog, but can send issues to another log file depending on the feature. rsyslog handles all logging, including local and remote logging. Logs are the best method to use for generating alerts when the system transitions from a stable steady state.
An optimal solution is to send logs to a centralized collector, then create alerts based on critical log messages.
Log Formatting
Most log files in Cumulus Linux use a standard presentation format. For example:
2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%
2017-03-08T06:26:43.569681+00:00 is the timestamp.
leaf01 is the hostname.
sysmonitor is the process that is the source of the message.
Critically high CPU use: 99% is the message.
For brevity and legibility, this section omits the timestamp and hostname from examples.
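The four fields described above can be separated with a short parser. This Python sketch assumes every line follows the example format (timestamp, hostname, process, then message); the names LogRecord and parse_log_line are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# timestamp, hostname, process (up to the first colon), then the message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<hostname>\S+)\s+(?P<process>[^:]+):\s(?P<message>.*)$"
)


class LogRecord(NamedTuple):
    timestamp: str
    hostname: str
    process: str
    message: str


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Split one log line into its fields; return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogRecord(**match.groupdict())
```

Lines that do not follow this layout return None, so a caller can count or inspect them separately.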
Hardware
The smond process provides monitoring for various switch hardware elements. Minimum or maximum values depend on the flags you apply to the basic command. The table below lists the hardware elements and applicable commands and flags.
Hardware Element | Monitoring Commands | Interval Poll
Temperature | smonctl -j, smonctl -j -s TEMP[X] | 10 seconds
Fan | smonctl -j, smonctl -j -s FAN[X] | 10 seconds
PSU | smonctl -j, smonctl -j -s PSU[X] | 10 seconds
PSU Fan | smonctl -j, smonctl -j -s PSU[X]Fan[X] | 10 seconds
PSU Temperature | smonctl -j, smonctl -j -s PSU[X]Temp[X] | 10 seconds
Voltage | smonctl -j, smonctl -j -s Volt[X] | 10 seconds
Front Panel LED | ledmgrd -d, ledmgrd -j | 5 seconds
Not all switch models include a sensor for monitoring power consumption and voltage. See this note for details.
Hardware Logs
Log Location
Log Entries
High temperature
/var/log/syslog
/usr/sbin/smond : : Temp1(Board Sensor near CPU): state changed from UNKNOWN to OK
/usr/sbin/smond : : Temp2(Board Sensor Near Virtual Switch): state changed from UNKNOWN to OK
/usr/sbin/smond : : Temp3(Board Sensor at Front Left Corner): state changed from UNKNOWN to OK
/usr/sbin/smond : : Temp4(Board Sensor at Front Right Corner): state changed from UNKNOWN to OK
/usr/sbin/smond : : Temp5(Board Sensor near Fan): state changed from UNKNOWN to OK
Fan speed issues
/var/log/syslog
/usr/sbin/smond : : Fan1(Fan Tray 1, Fan 1): state changed from UNKNOWN to OK
/usr/sbin/smond : : Fan2(Fan Tray 1, Fan 2): state changed from UNKNOWN to OK
/usr/sbin/smond : : Fan3(Fan Tray 2, Fan 1): state changed from UNKNOWN to OK
/usr/sbin/smond : : Fan4(Fan Tray 2, Fan 2): state changed from UNKNOWN to OK
/usr/sbin/smond : : Fan5(Fan Tray 3, Fan 1): state changed from UNKNOWN to OK
/usr/sbin/smond : : Fan6(Fan Tray 3, Fan 2): state changed from UNKNOWN to OK
PSU failure
/var/log/syslog
/usr/sbin/smond : : PSU1Fan1(PSU1 Fan): state changed from UNKNOWN to OK
/usr/sbin/smond : : PSU2Fan1(PSU2 Fan): state changed from UNKNOWN to BAD
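To alert on hardware problems, you can scan for smond state changes that end in a state other than OK. The Python sketch below is based only on the message wording in the examples above; the function name non_ok_transitions is hypothetical, and treating every non-OK state as noteworthy is an assumption.

```python
import re
from typing import List, Tuple

# Matches entries such as:
#   /usr/sbin/smond : : PSU2Fan1(PSU2 Fan): state changed from UNKNOWN to BAD
SMOND_STATE = re.compile(
    r"smond\s*:\s*:\s*(?P<sensor>\w+)\((?P<description>[^)]*)\): "
    r"state changed from (?P<old>\w+) to (?P<new>\w+)"
)


def non_ok_transitions(log_text: str) -> List[Tuple[str, str, str]]:
    """Return (sensor, old_state, new_state) for transitions not ending in OK."""
    return [
        (m.group("sensor"), m.group("old"), m.group("new"))
        for m in SMOND_STATE.finditer(log_text)
        if m.group("new") != "OK"
    ]
```

Because the function uses finditer, it also works when several entries appear on one line of input.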
System Data
Cumulus Linux includes several ways to monitor system data. In addition, you can receive alerts in high risk situations.
CPU Idle Time
When a CPU reports five high CPU alerts within a span of five minutes, the switch logs an alert.
Short bursts of high CPU can occur during switchd churn or routing protocol startup. Do not set alerts for these short bursts.
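The idea of alerting only on sustained high CPU (five reports within five minutes) can be sketched as a sliding window. This Python example is an illustration for an external alerting tool, not the implementation that Cumulus Linux uses; the class name HighCpuTracker and its parameters are hypothetical.

```python
from collections import deque
from typing import Deque


class HighCpuTracker:
    """Track high-CPU reports and flag when enough occur within a time window."""

    def __init__(self, max_alerts: int = 5, window_seconds: float = 300.0) -> None:
        self.max_alerts = max_alerts
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one high-CPU report at the given time (in seconds).

        Returns True when the window now holds at least max_alerts reports.
        """
        self._events.append(timestamp)
        # Drop reports that are older than the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.max_alerts
```

Short bursts produce only a few reports inside the window, so record keeps returning False and no alert fires.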
System Element | Monitoring Commands | Interval Poll
CPU utilization | sudo cat /proc/stat, top -b -n 1 | 30 seconds
CPU Logs
Log Location
Log Entries
High CPU
/var/log/syslog
sysmonitor: Critically high CPU use: 99%
systemd[1]: Starting Monitor system resources (cpu, memory, disk)…
systemd[1]: Started Monitor system resources (cpu, memory, disk).
sysmonitor: High CPU use: 89%
systemd[1]: Starting Monitor system resources (cpu, memory, disk)…
systemd[1]: Started Monitor system resources (cpu, memory, disk).
sysmonitor: CPU use no longer high: 77%
Cumulus Linux monitors CPU, memory, and disk space with sysmonitor. The configurations for the thresholds are in /etc/cumulus/sysmonitor.conf. For more information, see man sysmonitor.
CPU measure | Thresholds
Use | Alert: 90%, Crit: 95%
Process Load | Alarm: 95%, Crit: 125%
Disk Usage
When monitoring disk utilization, you can exclude tmpfs from monitoring.
System Element | Monitoring Commands | Interval Poll
Disk utilization | /bin/df -x tmpfs | 300 seconds
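A poller that runs the df command above needs to turn its text output into numbers. The following Python sketch assumes the standard six-column df layout (Filesystem, 1K-blocks, Used, Available, Use%, Mounted on); the function name parse_df_use and the sample values in the usage note are illustrative only.

```python
from typing import Dict


def parse_df_use(df_output: str) -> Dict[str, int]:
    """Map each mount point to its Use% value from df output."""
    usage: Dict[str, int] = {}
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Ignore rows without a percentage in the fifth column.
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        mount_point = " ".join(fields[5:])
        usage[mount_point] = int(fields[4].rstrip("%"))
    return usage
```

For example, a row such as "/dev/sda4 6000000 3000000 3000000 50% /" yields an entry for the / mount point with the value 50, which you can compare against your own disk threshold.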
Process Restart
In Cumulus Linux, systemd monitors and restarts processes.
Process Element | Monitoring Commands
View processes that systemd monitors | systemctl status
Layer 1 Protocols and Interfaces
Link and port state interface transitions log to /var/log/syslog and /var/log/switchd.log.
Interface Element | Monitoring Commands
Link state | sudo cat /sys/class/net/[iface]/operstate, nv show interface --view=brief
Link speed | sudo cat /sys/class/net/[iface]/speed, nv show interface --view=brief
Port state | ip link show, nv show interface --view=brief
Bond state | sudo cat /proc/net/bonding/[bond], nv show interface --view=brief
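The link state check reads a single sysfs file, so a polling script can read it directly instead of running cat. This Python sketch is a minimal example; the function name read_operstate and the sysfs_root parameter (included so that you can point the function at a test directory) are assumptions.

```python
from pathlib import Path


def read_operstate(iface: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the operational state string for an interface, for example 'up'."""
    return (Path(sysfs_root) / iface / "operstate").read_text().strip()
```

On a switch, read_operstate("swp1") reads /sys/class/net/swp1/operstate. The call raises FileNotFoundError if the interface does not exist.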
You can obtain interface counters by querying either the hardware or the Linux kernel. The Linux kernel aggregates the output from the hardware.
switchd[5692]: nic.c:213 nic_set_carrier: swp17: setting kernel carrier: down
switchd[5692]: netlink.c:291 libnl: swp1, family 0, ifi 20, oper down
switchd[5692]: nic.c:213 nic_set_carrier: swp1: setting kernel carrier: up
switchd[5692]: netlink.c:291 libnl: swp17, family 0, ifi 20, oper up
Unidirectional link
/var/log/switchd.log /var/log/ptm.log
ptmd[7146]: ptm_bfd.c:2471 Created new session 0x1 with peer 10.255.255.11 port swp1
ptmd[7146]: ptm_bfd.c:2471 Created new session 0x2 with peer fe80::4638:39ff:fe00:5b port swp1
ptmd[7146]: ptm_bfd.c:2471 Session 0x1 down to peer 10.255.255.11, Reason 8
ptmd[7146]: ptm_bfd.c:2471 Detect timeout on session 0x1 with peer 10.255.255.11, in state 1
Bond Negotiation Working
/var/log/syslog
kernel: [85412.763193] bonding: bond0 is being created…
kernel: [85412.770014] bond0: Enslaving swp2 as a backup interface with an up link
kernel: [85412.775216] bond0: Enslaving swp1 as a backup interface with an up link
kernel: [85412.797393] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
kernel: [85412.799425] IPv6: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Bond Negotiation Failing
/var/log/syslog
kernel: [85412.763193] bonding: bond0 is being created…
kernel: [85412.770014] bond0: Enslaving swp2 as a backup interface with an up link
kernel: [85412.775216] bond0: Enslaving swp1 as a backup interface with an up link
kernel: [85412.797393] IPv6: ADDRCONF(NETDEV_UP): bond0: link is not ready
MLAG peerlink negotiation Working
/var/log/syslog
lldpd[998]: error while receiving frame on swp50: Network is down
lldpd[998]: error while receiving frame on swp49: Network is down
kernel: [76174.262893] peerlink: Setting ad_actor_system to 44:38:39:00:00:11
kernel: [76174.264205] 8021q: adding VLAN 0 to HW filter on device peerlink
mstpd: one_clag_cmd: setting (1) peer link: peerlink
mstpd: one_clag_cmd: setting (1) clag state: up
mstpd: one_clag_cmd: setting system-mac 44:38:39:ff:40:94
mstpd: one_clag_cmd: setting clag-role secondary
/var/log/clagd.log
clagd[14003]: Cleanup is executing.
clagd[14003]: Cannot open file “/tmp/pre-clagd.q7XiO
clagd[14003]: Cleanup is finished
clagd[14003]: Beginning execution of clagd version 1
clagd[14003]: Invoked with: /usr/sbin/clagd --daemon
clagd[14003]: Role is now secondary
clagd[14003]: HealthCheck: role via backup is second
clagd[14003]: HealthCheck: backup active
clagd[14003]: Initial config loaded
clagd[14003]: The peer switch is active.
clagd[14003]: Initial data sync from peer done.
clagd[14003]: Initial handshake done.
clagd[14003]: Initial data sync to peer done.
MLAG peerlink negotiation Failing
/var/log/syslog
lldpd[998]: error while receiving frame on swp50: Network is down
lldpd[998]: error while receiving frame on swp49: Network is down
kernel: [76174.262893] peerlink: Setting ad_actor_system to 44:38:39:00:00:11
kernel: [76174.264205] 8021q: adding VLAN 0 to HW filter on device peerlink
mstpd: one_clag_cmd: setting (1) peer link: peerlink
mstpd: one_clag_cmd: setting (1) clag state: down
mstpd: one_clag_cmd: setting system-mac 44:38:39:ff:40:94
mstpd: one_clag_cmd: setting clag-role secondary
/var/log/clagd.log
clagd[26916]: Cleanup is executing.
clagd[26916]: Cannot open file “/tmp/pre-clagd.6M527vvGX0/brbatch” for reading: No such file or directory
clagd[26916]: Cleanup is finished
clagd[26916]: Beginning execution of clagd version 1.3.0
clagd[26916]: Invoked with: /usr/sbin/clagd --daemon 169.254.1.2 peerlink.4094 44:38:39:FF:01:01 --priority 1000 --backupIp 10.0.0.2
clagd[26916]: Role is now secondary
clagd[26916]: Initial config loaded
MLAG port negotiation Working
/var/log/syslog
kernel: [77419.112195] bonding: server01 is being created…
lldpd[998]: error while receiving frame on swp1: Network is down
kernel: [77419.122707] 8021q: adding VLAN 0 to HW filter on device swp1
kernel: [77419.126408] server01: Enslaving swp1 as a backup interface with a down link
kernel: [77419.177175] server01: Setting ad_actor_system to 44:38:39:ff:40:94
kernel: [77419.190874] server01: Warning: No 802.3ad response from the link partner for any adapters in the bond
kernel: [77419.191448] IPv6: ADDRCONF(NETDEV_UP): server01: link is not ready
kernel: [77419.191452] 8021q: adding VLAN 0 to HW filter on device server01
kernel: [77419.192060] server01: link status definitely up for interface swp1, 1000 Mbps full duplex
kernel: [77419.192065] server01: now running without any active interface!
kernel: [77421.491811] IPv6: ADDRCONF(NETDEV_CHANGE): server01: link becomes ready
mstpd: one_clag_cmd: setting (1) mac 44:38:39:00:00:17 <server01, None>
/var/log/clagd.log
clagd[14003]: server01 is now dual connected.
MLAG port negotiation Failing
/var/log/syslog
kernel: [79290.290999] bonding: server01 is being created…
kernel: [79290.299645] 8021q: adding VLAN 0 to HW filter on device swp1
kernel: [79290.301790] server01: Enslaving swp1 as a backup interface with a down link
kernel: [79290.358294] server01: Setting ad_actor_system to 44:38:39:ff:40:94
kernel: [79290.373590] server01: Warning: No 802.3ad response from the link partner for any adapters in the bond
kernel: [79290.374024] IPv6: ADDRCONF(NETDEV_UP): server01: link is not ready
kernel: [79290.374028] 8021q: adding VLAN 0 to HW filter on device server01
kernel: [79290.375033] server01: link status definitely up for interface swp1, 1000 Mbps full duplex
kernel: [79290.375037] server01: now running without any active interface!
/var/log/clagd.log
clagd[14291]: Conflict (server01): matching clag-id (1) not configured on peer…
clagd[14291]: Conflict cleared (server01): matching clag-id (1) detected on peer
MLAG port negotiation Flapping
/var/log/syslog
mstpd: one_clag_cmd: setting (0) mac 00:00:00:00:00:00 <server01, None>
mstpd: one_clag_cmd: setting (1) mac 44:38:39:00:00:03 <server01, None>
/var/log/clagd.log
clagd[14291]: server01 is no longer dual connected
clagd[14291]: server01 is now dual connected.
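One way to detect a flapping MLAG port is to count how often each bond loses its dual-connected state in /var/log/clagd.log. The Python sketch below matches only the two clagd messages shown above; the function name count_dual_connect_losses is hypothetical, and the number of losses that counts as flapping is for you to decide.

```python
import re
from collections import Counter
from typing import Iterable

# Matches "server01 is now dual connected" and
# "server01 is no longer dual connected".
DUAL_CONNECTED = re.compile(
    r"clagd\[\d+\]: (?P<bond>\S+) is (?P<state>now|no longer) dual connected"
)


def count_dual_connect_losses(lines: Iterable[str]) -> Counter:
    """Count 'no longer dual connected' events per bond."""
    losses: Counter = Counter()
    for line in lines:
        match = DUAL_CONNECTED.search(line)
        if match and match.group("state") == "no longer":
            losses[match.group("bond")] += 1
    return losses
```

A bond with several losses inside one polling interval is a candidate for a flapping alert.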
PTM uses LLDP information to compare against a topology.dot file that describes the network. It has built-in alerting capabilities. Use PTM on the switch instead of polling LLDP information regularly. You can install PTM from the Cumulus Linux GitHub repository.
Spanning tree is a protocol that prevents loops in a layer 2 infrastructure. In a stable state, the spanning tree protocol converges. Monitor the Topology Change Notifications (TCN) in STP to identify when new BPDUs arrive.
Interface Counter Element | Monitoring Commands | Interval Poll
STP TCN Transitions | mstpctl showbridge json, mstpctl showport json | 60 seconds
MLAG peer state | clagctl status, clagd -j, sudo cat /var/log/clagd.log | 60 seconds
MLAG peer MACs | clagctl dumppeermacs, clagctl dumpourmacs | 300 seconds
Layer 2 Logs
Log Location
Log Entries
Spanning Tree Working
/var/log/syslog
kernel: [1653877.190724] device swp1 entered promiscuous mode
kernel: [1653877.190796] device swp2 entered promiscuous mode
mstpd: create_br: Add bridge bridge
mstpd: clag_set_sys_mac_br: set bridge mac 00:00:00:00:00:00
mstpd: create_if: Add iface swp1 as port#2 to bridge bridge
mstpd: set_if_up: Port swp1 : up
mstpd: create_if: Add iface swp2 as port#1 to bridge bridge
mstpd: set_if_up: Port swp2 : up
mstpd: set_br_up: Set bridge bridge up
mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering blocking state(Disabled)
mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Disabled)
mstpd: MSTP_OUT_flush_all_fids: bridge:swp1:0 Flushing forwarding database
mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database
mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering learning state(Designated)
mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering learning state(Designated)
sudo: pam_unix(sudo:session): session closed for user root
mstpd: MSTP_OUT_set_state: bridge:swp1:0 entering forwarding state(Designated)
mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering forwarding state(Designated)
mstpd: MSTP_OUT_flush_all_fids: bridge:swp2:0 Flushing forwarding database
mstpd: MSTP_OUT_flush_all_fids: bridge:swp1:0 Flushing forwarding database
When FRR boots up for the first time, there is a different log file for each activated daemon. If you edit the log file (for example, through vtysh or frr.conf), the integrated configuration sends all logs to the same file.
To send FRR logs to syslog, apply the configuration log syslog in vtysh.
BGP
When monitoring BGP, check if BGP peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.
Monitoring the routing table provides trending on the size of the infrastructure. This is useful when you integrate with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.
BGP Element | Monitoring Commands | Interval Poll
BGP peer failure | sudo vtysh -c "show ip bgp summary json" | 60 seconds
BGP route table | sudo vtysh -c "show ip bgp json" | 600 seconds
BGP Logs
Log Location
Log Entries
BGP peer down
/var/log/syslog /var/log/frr/*.log
bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes
bgpd[3000]: %ADJCHANGE: neighbor swp1 Down BGP Notification send
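Because the peer transition is what matters, an alerting script can watch for %ADJCHANGE messages instead of polling peer state. This Python sketch is based on the Down example above; matching an Up state, and the function name parse_adjchange, are assumptions for illustration.

```python
import re
from typing import Optional, Tuple

# The peer is the token after "neighbor"; the state is the first
# standalone "Up" or "Down" that follows it (Up is assumed).
ADJCHANGE = re.compile(
    r"bgpd\[\d+\]: %ADJCHANGE: neighbor (?P<peer>\S+).*?\b(?P<state>Up|Down)\b"
)


def parse_adjchange(line: str) -> Optional[Tuple[str, str]]:
    """Return (peer, state) for an %ADJCHANGE line, or None for other lines."""
    match = ADJCHANGE.search(line)
    if match is None:
        return None
    return match.group("peer"), match.group("state")
```

A collector can alert when the returned state is Down and clear the alert on the next Up for the same peer.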
OSPF
When monitoring OSPF, check if OSPF peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.
Monitoring the routing table provides trending on the size of the infrastructure. This is useful when you integrate with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.
OSPF Element | Monitoring Commands | Interval Poll
OSPF protocol peer failure | sudo vtysh -c "show ip ospf neighbor all json", cl-ospf summary show json | 60 seconds
OSPF link state database | sudo vtysh -c "show ip ospf database" | 600 seconds
Route and Host Entries
Route Element | Monitoring Commands | Interval Poll
Host Entries | cl-resource-query, cl-resource-query -k | 600 seconds
Route Entries | cl-resource-query, cl-resource-query -k | 600 seconds
Routing Logs
Layer 3 Logs
Log Location
Log Entries
Routing protocol process crash
/var/log/syslog
frrouting[1824]: Starting FRRouting daemons (prio:10):. zebra. bgpd.
bgpd[1847]: BGPd 1.0.0+cl3u7 starting: vty@2605, bgp@:179
zebra[1840]: client 12 says hello and bids fair to announce only bgp routes
watchfrr[1853]: watchfrr 1.0.0+cl3u7 watching [zebra bgpd], mode [phased zebra restart]
watchfrr[1853]: bgpd state -> up : connect succeeded
watchfrr[1853]: bgpd state -> down : read returned EOF
cumulus-core: Running cl-support for core files bgpd.3030.1470341944.core.core_helper
core_check.sh[4992]: Please send /var/support/cl_support__spine01_20160804_201905.tar.xz to Cumulus support
watchfrr[1853]: Forked background command [pid 6665]: /usr/sbin/service frr restart bgpd
watchfrr[1853]: watchfrr 0.99.24+cl3u2 watching [zebra bgpd ospfd], mode [phased zebra restart]
watchfrr[1853]: zebra state -> up : connect succeeded
watchfrr[1853]: bgpd state -> up : connect succeeded
watchfrr[1853]: watchfrr: Notifying Systemd we are up and running
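A routing process crash shows up in the log as a watchfrr state change to down. The following Python sketch extracts those state changes; it relies only on the "state ->" wording in the example above, and the function name daemon_transitions is hypothetical.

```python
import re
from typing import List, Tuple

# Matches entries such as:
#   watchfrr[1853]: bgpd state -> down : read returned EOF
WATCHFRR_STATE = re.compile(
    r"watchfrr\[\d+\]: (?P<daemon>\w+) state -> (?P<state>\w+) :"
)


def daemon_transitions(log_text: str) -> List[Tuple[str, str]]:
    """Return (daemon, state) pairs in the order they appear in the log."""
    return [
        (m.group("daemon"), m.group("state"))
        for m in WATCHFRR_STATE.finditer(log_text)
    ]
```

Filtering the result for the down state gives the list of daemons that watchfrr saw fail.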
Logging
The table below describes the various log files.
Logging Element
Monitoring Commands
Log Location
syslog
Catch all log file. Identifies memory leaks and CPU spikes.
/var/log/syslog
switchd functionality
Hardware Abstraction Layer (HAL).
/var/log/switchd.log
Routing daemons
FRR zebra daemon details.
/var/log/daemon.log
Routing protocol
The log file is configurable in FRR. When FRR first boots, it uses the non-integrated configuration so each routing protocol has its own log file. After booting up, FRR switches over to using the integrated configuration, so that all logs go to a single place.
To edit the location of the log files, use the log file command. By default, Cumulus Linux does not send FRR logs to syslog. Use the log syslog command to send logs through rsyslog and into /var/log/syslog.
Note: To write syslog debug messages to the log file, you must run the log syslog debug command to configure FRR with syslog severity 7 (debug); otherwise, when you issue a debug command such as debug bgp neighbor-events, no output logs to /var/log/frr/frr.log.
However, when you manually define a log target with the log file /var/log/frr/debug.log command, FRR automatically defaults to severity 7 (debug) logging and the output logs to /var/log/frr/frr.log.
Run the following command to confirm that the NTP process is working correctly and that the switch clock is in sync with NTP:
cumulus@switch:~$ /usr/bin/ntpq -p
Device Management
Device Access Logs
Access Logs
Log Location
Log Entries
User Authentication and Remote Login
/var/log/syslog
sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2: RSA 38:e6:3b:cc:04:ac:41:5e:c9:e3:93:9d:cc:9e:48:25
sshd[31830]: pam_unix(sshd:session): session opened for user cumulus by (uid=0)
Device Super User Command Logs
Super User Command Logs
Log Location
Log Entries
Executing commands using sudo
/var/log/syslog
sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; COMMAND=/tmp/script_9938.sh -v
sudo: pam_unix(sudo:session): session opened for user root by (uid=0)
sudo: pam_unix(sudo:session): session closed for user root
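For auditing, you can pull the user and the command out of each sudo entry. This Python sketch assumes the field layout of the example above (TTY, PWD, USER, then COMMAND); the function name parse_sudo_command is hypothetical.

```python
import re
from typing import Dict, Optional

SUDO_COMMAND = re.compile(
    r"sudo:\s*(?P<user>\S+?)\s*:\s*TTY=(?P<tty>\S+) ; PWD=(?P<pwd>\S+) ; "
    r"USER=(?P<run_as>\S+) ; COMMAND=(?P<command>.*)$"
)


def parse_sudo_command(line: str) -> Optional[Dict[str, str]]:
    """Return the fields of a sudo command entry, or None for other lines."""
    match = SUDO_COMMAND.search(line.strip())
    if match is None:
        return None
    return match.groupdict()
```

The session opened and session closed entries do not match the pattern, so the function returns None for them.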
switchd Log Message Reference
The following table lists the log messages generated by switchd, organized by severity, then message text. These messages appear in /var/log/switchd.log.
Severity
Message Text
Explanation
Recommended Action
CRITICAL
_port_group_config_values_get: hal_list_get failed on [str]
List create failed.
File a ticket with Cumulus Support.
CRITICAL
_range_limits_get: start linux interface name buffer is NULL
Invalid parameter.
File a ticket with Cumulus Support.
CRITICAL
_range_limits_get: end linux interface name buffer is NULL
Invalid parameter.
File a ticket with Cumulus Support.
CRITICAL
_range_limits_get: [str]-[str] not recognized
Invalid port set configuration.
Check QoS configuration file.
CRITICAL
_range_limits_get: port range [str] not recognized
Invalid port set configuration.
Check QoS configuration file.
CRITICAL
_port_group_ports_set: hal_list_get failed on [str]
Port set list create failed.
Check QoS configuration file.
CRITICAL
_port_group_name_list_get: hal_list_get failed on [str]
List create failed.
Check QoS configuration file.
CRITICAL
_port_group_range_translate: _get_range_limits failed on [str]
Invalid port set configuration.
Check QoS configuration file.
CRITICAL
_priority_group_config_get: hal_list_get failed on [str]
Configuration list create failed.
File a ticket with Cumulus Support.
CRITICAL
hal_list_get: list string [str] contains more elements than the maximum allowed ([int])
List capacity exceeded.
File a ticket with Cumulus Support.
CRITICAL
hal_sh_datapath_file_read: could not load config file [str]
Could not load the back end QoS configuration file.
Check backend QoS configuration file.
CRITICAL
Unable to reallocate [int] bytes of memory
Memory allocation failed.
File a ticket with Cumulus Support.
CRITICAL
No backends found.
No back ends found.
File a ticket with Cumulus Support.
CRITICAL
License: email is longer than [int] characters
Email length exceeds maximum.
Modify email address.
CRITICAL
License: license data is longer than [int]
License data exceeds maximum.
Check license.
CRITICAL
License: Invalid format
Invalid license format.
Check license.
CRITICAL
No license file.
No license file found.
Check license file.
CRITICAL
The Cumulus Linux license appears to be invalid. This WILL NOT affect your system operations at the moment. Future versions will enforce fully valid licenses on the system. Please contact licensing@cumulusnetworks.com at your convenience so we can validate and assist you with this licensing issue.
Invalid license.
Check license.
CRITICAL
No license file.
No license file found.
Check license file.
CRITICAL
Incomplete license.
Incomplete license.
Check license.
CRITICAL
License is expired!
License is expired.
Renew license.
CRITICAL
unable to get tap_name for port [uint]
Port config failed: no port name.
File a ticket with Cumulus Support.
CRITICAL
Voluntary restart by timestamp check requested
Voluntary switchd restart.
None.
CRITICAL
Couldn’t write ready file [str]
Could not mark switchd startup complete.
File a ticket with Cumulus Support.
CRITICAL
Could not open [str] to record error type
Could not report restart reason.
File a ticket with Cumulus Support.
CRITICAL
Error setting signal handlers.
Signal handler initialization failed.
File a ticket with Cumulus Support.
CRITICAL
No license to run switchd!
No switchd license is installed.
Install switchd license.
CRITICAL
daemon call failed with rv [int]
switchd could not be daemonized.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t write pid file [str]
Could not write out the process ID.
File a ticket with Cumulus Support.
CRITICAL
Switchd fs init failed.
Failed to initialize the switchd file system.
File a ticket with Cumulus Support.
CRITICAL
Switchd config failed.
Could not load the switchd configuration file.
Check switchd configuration file.
CRITICAL
Netlink init failed.
Netlink initialization failed.
File a ticket with Cumulus Support.
CRITICAL
HAL init failed.
HAL initialization failed.
File a ticket with Cumulus Support.
CRITICAL
NIC init failed.
NIC initialization failed.
File a ticket with Cumulus Support.
CRITICAL
Port init failed.
Port initialization failed.
File a ticket with Cumulus Support.
CRITICAL
Bridges init failed.
Bridges initialization failed.
File a ticket with Cumulus Support.
CRITICAL
Bonds init failed.
Bonds initialization failed.
File a ticket with Cumulus Support.
CRITICAL
Logical networks init failed.
Logical networks initialization failed.
File a ticket with Cumulus Support.
CRITICAL
Interface list init failed.
Interface list initialization failed.
File a ticket with Cumulus Support.
CRITICAL
Switchd fs mount failed.
Could not mount switchd file system.
File a ticket with Cumulus Support.
CRITICAL
Failed to add route [str]
Failure in VRF route leak feature. This message notifies that a route entry could not be properly added to one of the software tables.
File a ticket with Cumulus Support.
CRITICAL
MAC address [str] couldn’t be added to or retrieved from hash
Relates to merging MAC tables. The message notifies that an entry expected in a MAC address software table is not found therein.
This should never be seen. File a ticket with Cumulus Support.
CRITICAL
Failed to add route [str]
Add MPLS transit LSP to a software table failed.
File a ticket with Cumulus Support.
CRITICAL
[str]: hal port list malloc failed
Memory exhausted.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to [str]
Adding the port to a software table failed.
File a ticket with Cumulus Support.
CRITICAL
Failed to add hal mroute for [str]
Failed to add multicast route to a software table.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to [str]
Adding the port to a software table failed.
File a ticket with Cumulus Support.
CRITICAL
Failed to add grp [str] to mroute
A multicast route for a group could not be added to a software hash table.
File a ticket with Cumulus Support.
CRITICAL
Maximum number of bonds exceeded, max is [int]
Maximum number of bonds exceeded.
Reduce the number of bonds on the switch.
CRITICAL
Maximum number of slaves per bond exceeded, max is [int]
Maximum number of bond members per bond exceeded.
Reduce the number of bond members configured for a bond.
CRITICAL
rtnl slave state get failed for bond:[int] port: [int]
Could not get the bond member state from Netlink.
File a ticket with Cumulus Support.
CRITICAL
Failed to add route [str]
Failed to add route to a software table.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to bridge [int]
Failed to add a port to the bridge.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to grp [str]
Failed to add a port to the MDB group.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to bridge [int]
Failed to add an MDB group to the bridge.
File a ticket with Cumulus Support.
CRITICAL
Failed to add bridge [int] to mdb
Failed to add a given bridge to MDB.
File a ticket with Cumulus Support.
CRITICAL
Failed to add port [str] in grp [str], bridge [int]
Failed to add a port to a group for the specific bridge.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to bridge [int]
Failed to add a port to the bridge.
File a ticket with Cumulus Support.
CRITICAL
Failed to add port [str] in grp [str]
Failed to add a port to the MDB group.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to bridge [int]
Failed to add a port to the bridge.
File a ticket with Cumulus Support.
CRITICAL
Failed to add bridge [str] to mdb
Failed to add a given bridge to MDB.
File a ticket with Cumulus Support.
CRITICAL
Failed to add [str] to bridge [int]
Failed to add a port to the bridge.
File a ticket with Cumulus Support.
CRITICAL
arptables: Memory allocation for rules failed,malloc: [str]
ACL out of memory resource.
File a ticket with Cumulus Support.
CRITICAL
Failed to create kernel bridge
L2: Bridge hash table add failed.
File a ticket with Cumulus Support.
CRITICAL
Open of /dev/net/tun failed: [str]
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
TUNSETIFF failed: [str]
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
SIOCGIFHWADDR failed: [str]
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
SIOCSIFHWADDR failed: [str]
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
TUNSETPERSIST failed: [str]
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
TUNSETOFFLOAD failed: [str]
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t create tuntap ioctl socket.
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t get netdev flags.
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t Set netdev flags.
Failed to create a net device.
File a ticket with Cumulus Support.
CRITICAL
[str]: rtnl_link_alloc failed for family [int]
Failed to create filters.
File a ticket with Cumulus Support.
CRITICAL
[str]: rtnl_neigh_alloc failed for family [int]
Failed to create filters.
File a ticket with Cumulus Support.
CRITICAL
Failed to blacklist interface [int]
Failed to block interfaces.
Check block interfaces.
CRITICAL
Failed to blacklist interface [int]
Failed to block interfaces.
Check block interfaces.
CRITICAL
Failed to add [str], ifindex [int] to sw_intfs
Failed to create interfaces.
Recreate the interface.
CRITICAL
Failed to delete ifindex [int] from sw_intfs
Failed to create interfaces.
Recreate the interface.
CRITICAL
[str]: could not load interface config
Syncing database failed between kernel and switchd.
File a ticket with Cumulus Support.
CRITICAL
bogus filesystem path: [str]
Failed to add a file in SFS (simple file system).
File a ticket with Cumulus Support.
CRITICAL
Need file spec
Failed to add a file in SFS (simple file system).
File a ticket with Cumulus Support.
CRITICAL
can’t replace existing directory with file: [str]
Failed to add a file in SFS (simple file system).
File a ticket with Cumulus Support.
CRITICAL
filesystem already initialized
Failed to initialize in SFS (simple file system).
File a ticket with Cumulus Support.
CRITICAL
filesystem hash table alloc failed
Failed to allocate hash table in SFS init.
File a ticket with Cumulus Support.
CRITICAL
filesystem mount failed
Failed to mount SFS in switchd init.
File a ticket with Cumulus Support.
CRITICAL
filesystem new failed
Failed to mount SFS in switchd init.
File a ticket with Cumulus Support.
CRITICAL
bogus filesystem path: [str]
Failed to delete filesystem in SFS.
File a ticket with Cumulus Support.
CRITICAL
pthread_create failed: [str]
Failed to create a thread in NIC init.
File a ticket with Cumulus Support.
CRITICAL
pthread_detach failed: [str]
Failed to detach a thread in NIC init.
File a ticket with Cumulus Support.
CRITICAL
TX Ring allocation failed: [str]
Failed to allocate packet buffer in NIC init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t increase netlink rbuf size: [str]
Failed to init buffer size in Netlink socket.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t increase netlink wbuf size: [str]
Failed to init buffer size in Netlink socket.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t allocate netlink socket.
Failed to create a Netlink socket.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t connect netlink socket: [str]
Failed to create a Netlink socket.
File a ticket with Cumulus Support.
CRITICAL
nl_resync_route failed for cache [int]: [str]
Failed to create a resync router function in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t set bufsize for manager netlink socket.
Failed to reinitialize buffer size in Netlink socket.
File a ticket with Cumulus Support.
CRITICAL
invalid cache mngrinfo.
Failed to create a resync in Netlink idle callback.
File a ticket with Cumulus Support.
CRITICAL
[str]: failed to close socket: [str]
Failed to configure FD in Netlink socket.
File a ticket with Cumulus Support.
CRITICAL
[str]: nl_cache_mngr_data_ready failed: [str]
Failed to configure FD in Netlink socket.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t allocate netlink socket.
Failed to create a socket in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t allocate netlink socket.
Failed to create a socket in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t allocate manager netlink socket.
Failed to create a socket in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t create cache manager: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t set bufsize for manager netlink socket.
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add link cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add link cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add route cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add mdb cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t alloc neigh cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add mroute cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t alloc tcqdisc cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add tcqdisc cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t alloc tcclass cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add tcclass cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t alloc tccls cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add tccls cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t alloc tcact cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add tcact cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t alloc rule cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add rule cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t add neigh cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t alloc netconf cache: [str]
Failed to allocate a cache in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Couldn’t initialize genl/port interface
Failed to initialize port interface in Netlink init.
File a ticket with Cumulus Support.
CRITICAL
Failed to create kernel bridge
Failed to allocate a hash entry in bridge sync.
File a ticket with Cumulus Support.
CRITICAL
Port msg [str] failure: err [int]
Failed to configure PORT FEC parameter.
File a ticket with Cumulus Support.
CRITICAL
Port msg [str] reply failure: err [int]
Failed to configure PORT FEC parameter.
File a ticket with Cumulus Support.
CRITICAL
Failed run recvmsg_default on port socket, err [int], [str]
Failed to initialize PORT receiving messages.
File a ticket with Cumulus Support.
CRITICAL
vlan stats send failure: err [int]
Failed to configure status in VLAN.
File a ticket with Cumulus Support.
CRITICAL
mroute hitbits send failure: err [int]
Failed to configure hit bit status in MCAST router.
File a ticket with Cumulus Support.
CRITICAL
Port send stats failure: err [int]
Failed to configure status in PORT.
File a ticket with Cumulus Support.
CRITICAL
Port send settings failure: err [int]
Failed to configure settings in PORT.
File a ticket with Cumulus Support.
CRITICAL
Port send carrier failure: err [int]
Failed to configure carrier in PORT.
File a ticket with Cumulus Support.
CRITICAL
ifindex [int] already registered for port ops
Failed to register interface options in PORT.
File a ticket with Cumulus Support.
CRITICAL
ifindex [int] not registered for port ops
Failed to unregister interface options in PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to allocate port hash table
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to allocate port socket
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to genl connect to port socket
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to allocate port sync socket
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to genl connect to port socket
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to set genl port sync socket to non-blocking
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to resolve port ops, err [int]
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to resolve port multicast group
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to register port ops, err [int]
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to add port group membership, err [int]
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
Failed to modify port socket notify cb, err [int]
Failed to initialize PORT.
File a ticket with Cumulus Support.
CRITICAL
[str]:[int]: [str][str]Assertion [str] failed
Failed to find the function name in switchd.
File a ticket with Cumulus Support.
ERROR
priority group [int] headroom count [int] exceeds the maximum value [int]
Port headroom buffers exceed ASIC limit.
File a ticket with Cumulus Support.
ERROR
shared buffer type [int] not recognized
Invalid buffer type.
File a ticket with Cumulus Support.
ERROR
sx_api_cos_prio_to_ieeeprio_set failed: [str]
IEEE priority map configuration write failed.
File a ticket with Cumulus Support.
ERROR
_hal_mlx_packet_2_switch: priority field [int] not supported
Packet priority field not supported for source.
Check QoS configuration file.
ERROR
priority field [int] not supported
Packet priority field not supported for remark.
Check QoS configuration file.
ERROR
cos list length [int] is longer than maximum value [int]
ECN/RED configuration: list is too long.
Check QoS configuration file.
ERROR
hash params get failed: [str]
ASIC ECMP hash seed configuration read failed.
File a ticket with Cumulus Support.
ERROR
hash params set failed: [str]
ASIC ECMP hash seed configuration write failed.
File a ticket with Cumulus Support.
ERROR
hal_sh_datapath_pfc_set: PFC configuration not supported on the CPU port
Failed to complete the CSV command in Prescriptive Topology Manager.
File a ticket with Cumulus Support.
ERROR
Cannot allocate csv for msg
Failed to read the CSV command in Prescriptive Topology Manager.
File a ticket with Cumulus Support.
ERROR
[str]: Could not allocate context
Failed to decode the CSV command in Prescriptive Topology Manager.
File a ticket with Cumulus Support.
ERROR
STP mode_set failed for port [int]: [str]
Failed to set spanning tree mode in port [uint], error msg [str]. Forwarding behavior would be impacted by this failure.
File a ticket and contact Cumulus Support.
ERROR
failed to set port [int] vlan_ingress_filter enable
Failed to set VLAN ingress filter for port [uint], error msg [str].
File a ticket and contact Cumulus Support.
ERROR
failed to set FDB polling interval swid [uint]: [str]
Failed to set FDB polling interval for Mellanox SDK switchd id [int], error msg [str]. Failure to do this impacts MAC address learning behavior.
File a ticket and contact Cumulus Support.
ERROR
failed to set FDB notify_params swid [uint]: [str]
Failed to set FDB MAC address learning notification in Mellanox SDK for switch id [uint], error msg [str]. This error impacts the capability of the switch to learn MAC address.
File a ticket and contact Cumulus Support.
ERROR
failed to create trap group [uint] trap id [uint] swid [uint] group_attr.prio : [int] error: [str]
Failed to create the trap groups in the Mellanox SDK. Trap groups are used for policing trap IDs, which are used to punt control packets to the OS stack. This failure impacts packet forwarding.
File a ticket and contact Cumulus Support.
ERROR
failed to open host ifc group [uint] trap id [uint] swid [uint] error [str]
Failed to retrieve the file descriptor of the current open channel to the Mellanox SDK, for ifc group [uint] trap ID [uint] swid [uint], error msg [str]. The error is not recoverable.
File a ticket and contact Cumulus Support.
ERROR
failed to obtain group [uint] FD for polling
Failed to retrieve the FD for a trap group [id]. The error is not recoverable.
File a ticket and contact Cumulus Support.
ERROR
failed to define trap [uint] group [uint] swid [uint] error: [str]
Failed to set trap ID [uint], trap group [uint], switch ID [uint], for user defined trap, error msg [str].
File a ticket and contact Cumulus Support.
ERROR
failed to set trap [uint] group [uint] swid [uint] error: [str]
Failed to set trap ID [uint], trap group [uint], switch ID [uint], for user defined trap, error msg [str].
File a ticket and contact Cumulus Support.
ERROR
failed to register trap [uint] swid [uint] error: [str]
Failed to register trap ID [uint] in switch ID [uint] in Mellanox SDK, error msg [str].
File a ticket and contact Cumulus Support.
ERROR
trap_id [uint] was not installed
Trap ID [uint] was not installed in the Mellanox SDK. This would impact packet forwarding from the switch ASIC to the control plane.
File a ticket and contact Cumulus Support.
ERROR
trap_id [uint] was not installed
Trap ID [uint] was not installed in the Mellanox SDK. This would impact packet forwarding from the switch ASIC to the control plane.
File a ticket and contact Cumulus Support.
ERROR
dflt_trap_parsing_depth get failed: [str]
Failed to retrieve the Mellanox Spectrum chip parsing depth from Mellanox SDK, error msg [str]. Possibly the parsing depth has not been set correctly. This would impact hardware packet forwarding.
File a ticket and contact Cumulus Support.
ERROR
new_depth [uint] failed: [str]
Failed to set the packet parsing depth [uint] in Mellanox SDK, error msg [str]. This failure impacts hardware packet forwarding.
File a ticket and contact Cumulus Support.
ERROR
failed to set trap [uint] group [uint] swid [uint] action [uint] error: [str]
Failed to set trap ID [uint], trap group [uint], switch ID [uint], trap action. Failure would lead to the respective control packet not reaching the CPU.
File a ticket and contact Cumulus Support.
ERROR
[str] failed to convert trap policer attributes
Failed to get the policer unit for policer group name [str]. The policer can be metered in units of packets or bytes.
File a ticket and contact Cumulus Support.
ERROR
[str] failed to create policer: [str]
Failed to create policer for policer group [str], error msg [str]. Failure to set policer would impact packet forwarding from hardware data path to CPU.
File a ticket and contact Cumulus Support.
ERROR
[str] sw_rate_limiter set failed: [str]
Failed to set the software rate limiter for policy group [str] in Mellanox SDK, error msg [str]. This failure could impact rate limiting for packets forwarded to CPU.
File a ticket and contact Cumulus Support.
ERROR
group [str] failed to edit policer: [str]
Failed to modify policer for policer group [str], error msg [str]. Failure to set policer would impact packet forwarding from hardware data path to CPU.
File a ticket and contact Cumulus Support.
ERROR
unknown trap group [uint]
A trap group ID [uint] unknown to the Mellanox SDK is being used to configure the Mellanox SDK policer. This is an internal configuration error.
File a ticket and contact Cumulus Support.
ERROR
group [str] failed to bind policer [uint64]: [str]
Policer group [str] with policer ID [uint64] failed to bind in the Mellanox SDK, error msg [str]. This error would impact policing of packets being forwarded from hardware to CPU.
File a ticket and contact Cumulus Support.
ERROR
group [str] failed to unbind policer [uint64]: [str]
Policer group [str] with policer ID [uint64] failed to unbind in the Mellanox SDK, error msg [str]. This error would impact policing of packets being forwarded from hardware to CPU.
File a ticket and contact Cumulus Support.
ERROR
unsupported type [uint]
Failed to create a trap counter type as the trap counter type [uint] does not match one of the well-defined ones in the Mellanox SDK. This is an internal configuration error.
File a ticket and contact Cumulus Support.
ERROR
unsupported type [uint]
Failed to create a trap counter type as the trap counter type [uint] does not match one of the well-defined ones in Mellanox SDK. This is an internal configuration error.
File a ticket and contact Cumulus Support.
ERROR
type [uint] failed: [str]
Failed to retrieve the host IFC counter for the counter type [int] from the Mellanox SDK, error msg [str].
File a ticket and contact Cumulus Support.
ERROR
unknown meter_type [uint]
Incorrect policer unit [uint] used to find out the policer group meter unit. The policer can be metered in units of packets or bytes. This is an internal error.
File a ticket and contact Cumulus Support.
ERROR
unrecognized lid [hex]
Failed to retrieve the interface key from logical port id [uint].
File a ticket and contact Cumulus Support.
ERROR
[str] unexpected duplicate key [uint]
Failed to add an interface [str] vport with internal VLAN ID [uint] in external VLAN vport hash table because of duplicate entry [uint].
File a ticket and contact Cumulus Support.
ERROR
[str] int_vid [uint] ext_vid [uint]: [str]
Failed to create a virtual port from logical port, interface [str], internal vlan id [uint] and external vlan id [uint], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
Unexpected duplicate vport_lid [hex] for [str]
Failed to add vport logical interface id [uint], interface [str], in vlan vport hash table because of duplicate entry [uint]
File a ticket and contact Cumulus Support.
ERROR
delete failed for [str] int_vid [uint] ext_vid [uint]: [str]
Failed to delete a virtual port from logical port, interface [str], internal vlan id [uint] and external vlan id [uint], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
[str] vrid not found for table [uint]
virtual router id not found in virtual id table [id] in software
Failed to delete a virtual port from logical port, internal vlan id [uint], virtual port logical if [uint] and external vlan id [uint], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
[str] vrid not found for table [uint]
virtual router id not found in virtual id table [id] in software
File a ticket and contact Cumulus Support.
ERROR
port [int] ext_vlan [int] already exists
port [uint] and external vlan [uint] already exists in the e2i table
File a ticket and contact Cumulus Support.
ERROR
[str] int_vlan [int] already assigned to [str]
interface [str] with internal vlan [uint] is already assigned to interface [str] in the e2i table. This is an internal configuration error
File a ticket and contact Cumulus Support.
ERROR
failed to get base bond for [str]
Failed to get the bond interface for interface [str]
File a ticket and contact Cumulus Support.
ERROR
failed to add to interface ht s[str]
Failed to add interface [str] to the ifp hash table because an entry already exists
File a ticket and contact Cumulus Support.
ERROR
[str] old_int_vlan [int] inconsistent
interface [str] with vlan [uint] is inconsistent in the e2i table
File a ticket and contact Cumulus Support.
ERROR
[str] new_int_vlan [int] already assigned to [str]
interface [str] with internal vlan [uint] is already assigned to interface [str] in the e2i table. This is an internal configuration error
File a ticket and contact Cumulus Support.
ERROR
UC flood block [uint] failed for [str] vlan [uint]: [str]
unicast flood block [uint] failed for interface [str] for vlan id [uint], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
learn mode [uint] failed for [str] vlan [uint]: [str]
learn mode [uint] failed for interface [str] and internal vlan [uint], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
error processing bridge vlan information
error processing bridge vlan information
File a ticket and contact Cumulus Support.
ERROR
bond_mbrs_vlan_port_set failed for bond: [int]
failed to set vlan for bond members for bond id [uint]
File a ticket and contact Cumulus Support.
ERROR
unsupported interface type: [uint]
unsupported interface type [uint]
File a ticket and contact Cumulus Support.
ERROR
cannot find STG for bridge_vlan [uint] vid [uint]
cannot find the spanning tree group for bridge vlan [uint] and vlan id [uint]
File a ticket and contact Cumulus Support.
ERROR
flood_mode_set failed for swid [int] vid [int]
Flood mode could not be set for unregistered multicast in swid [uint] vlan id [uint]
File a ticket and contact Cumulus Support.
ERROR
vlans set failed for [str]: [str]
setting of vlan failed for interface [str], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
qinq mode set failed for [str]: [str]
failed to set qinq mode for interface [str], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
qinq mode set failed for [str]: [str]
failed to set qinq mode for bond for interface [str], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
unsupported interface type: [uint]
unsupported interface type [uint]
File a ticket and contact Cumulus Support.
ERROR
bond id [uint] not fully created
bond id [uint] creation is not complete
File a ticket and contact Cumulus Support.
ERROR
cannot find bridge vlan for bridge: [int]
unable to find bridge vlan for the bridge id [uint]
File a ticket and contact Cumulus Support.
ERROR
cannot find bond vlan for bond
cannot find bond vlan for the bond
File a ticket and contact Cumulus Support.
ERROR
cannot allocate vlan for bond interface
Failed to allocate vlan for bond interface
File a ticket and contact Cumulus Support.
ERROR
cannot allocate vlan for sub-interface
Failed to allocate vlan for sub interface
File a ticket and contact Cumulus Support.
ERROR
gre tunnel decap entry creation failed : [str]
Failed to create decapsulation entry in Mellanox SDK. Decapsulation of GRE packets would not be operational, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
gre tunnel decap destroy failed : [str]
Failed to delete the GRE decapsulation entry in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
gre tunnel curr decap entry delete failed : [str]
Failed to delete the GRE decapsulation entry in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
gre tunnel new decap entry update failed : [str]
Failed to create decapsulation entry in Mellanox SDK. Decapsulation of GRE packets would not be operational, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
failed to make logical gre key
Failed to form the logical GRE key from the interface information provided
File a ticket and contact Cumulus Support.
ERROR
failed to make gre decap key
Failed to form the logical decap GRE key from the information provided
File a ticket and contact Cumulus Support.
ERROR
failed to make overlay key from underlay key
Failed to create overlay gre key from underlay information
File a ticket and contact Cumulus Support.
ERROR
unable to find gre entry for tunnel id ([hex]
Failed to find gre entry from tunnel id in the gre tunnel key hash table, using tunnel id [uint]
File a ticket and contact Cumulus Support.
ERROR
duplicate entry in overlay ht : ifindex ([int]
Unable to add a duplicate gre entry with ifindex [uint] in the gre overlay hash table. A duplicate config is being attempted
File a ticket and contact Cumulus Support.
ERROR
failed to create overlay rif : ifindex : [int] tunnel type [uint] key [uint]
Unable to create an overlay router interface with ifindex [uint], tunnel type [uint] and tunnel key [uint]
File a ticket and contact Cumulus Support.
ERROR
gre tunnel creation failed: [str] :
Failed to create tunnel id for GRE in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
invalid argument
GRE update is being called with invalid GRE information, no operation would be performed
File a ticket and contact Cumulus Support.
ERROR
gre tunnel ([hex]) update failed: [str] :
Failed to update tunnel id [hex] for GRE in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
gre tunnel destroy failed: [str]
Failed to delete tunnel id for GRE in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
loopback rif for ifindex ([int]) : [str]
Failed to add the loopback router interface, interface ifindex [uint] in Mellanox SDK, error [str]
File a ticket and contact Cumulus Support.
ERROR
ifindex ([int]) overlay rif ([int]) : [str]
Failed to delete the loopback router interface, interface ifindex [uint], overlay router interface [uint] in Mellanox SDK, error [str]
File a ticket and contact Cumulus Support.
ERROR
cannot allocate bridge vlan for bridge id [int]
Failed to allocate bridge vlan for bridge id [uint]
Check The Cumulus Linux Configuration guide
ERROR
flood_mode_set failed for swid [int] vid [int]
Flood mode could not be set for unregistered multicast in swid [uint] vlan id [uint]
File a ticket and contact Cumulus Support.
ERROR
cannot allocate ln_vlan [uint] for bridge_id [int]
Failed to allocate vlan [uint] for bridge id [uint]
Check The Cumulus Linux Configuration guide
ERROR
flood_mode_set failed for swid [int] vid [int]
Flood mode could not be set for unregistered multicast in swid [uint] vlan id [uint]
File a ticket and contact Cumulus Support.
ERROR
vlan [uint] not yet allocated
vlan [uint] does not exist for the bridge and is not allocated
Check The Cumulus Linux Configuration guide
ERROR
[str] bridge_id [uint] vlan [uint] port [hex] failed: [str]
failed to add a unicast mac address [str] on bridge id [uint] vlan [uint] port [uint]. This could be an internal error or could be because of configuration error
File a ticket and contact Cumulus Support.
ERROR
[str] bridge_id [uint] vlan [uint] port [hex] failed: [str]
failed to delete a unicast mac address [str] on bridge id [uint] vlan [uint] port [uint]. This could be an internal error or could be because of configuration error
failed to delete a number [uint] of unicast mac address, error msg [str]. This could be an internal error or could be because of configuration error
File a ticket and contact Cumulus Support.
ERROR
num_macs [uint] learn set failed: [str]
failed to add [uint] unicast mac addresses, error msg [str]. This could be because of resource exhaustion
File a ticket and contact Cumulus Support.
ERROR
num_macs [uint] delete failed: [str]
failed to delete a number [uint] of unicast mac address, error message [str]. This could be an internal error or could be because of configuration error
File a ticket and contact Cumulus Support.
ERROR
age_time set failed [str] on swid [uint]
Failed to set fdb ageing time. This would cause the mac addresses in the FDB not to age out
File a ticket and contact Cumulus Support.
ERROR
cannot find vlan for brmac [str] vfid [uint]
vlan [uint] does not exist for the bridge and so could not find the vlan for bridge mac address
Check The Cumulus Linux Configuration guide
ERROR
vfid not set for vlan [uint]
failed to return a translated vlan id for vlan [uint]
File a ticket and contact Cumulus Support.
ERROR
num_macs [uint] delete failed: [str]
failed to delete a number [uint] of unicast mac address, error msg [str]. This could be an internal error or could be because of configuration error
File a ticket and contact Cumulus Support.
ERROR
bridge_vlan [uint] expected swid [uint] but found [uint]
bridge vlan id [uint] expected switch id [uint], but the switch id found for the vlan is [uint]
Check The Cumulus Linux Configuration guide
ERROR
num_macs [uint] delete failed: [str]
failed to delete a number [uint] of unicast mac address, error msg [str]. This could be an internal error or could be because of configuration error
File a ticket and contact Cumulus Support.
ERROR
bridge_vlan [uint] expected swid [uint] but found [uint]
bridge vlan id [uint] expected switch id [uint], but the switch id found for the vlan is [uint]
Check The Cumulus Linux Configuration guide
ERROR
get failed: [str]
Failed to get fdb unicast mac address from Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
num_macs [uint] delete failed: [str]
failed to delete a unicast mac address [str] on bridge id [uint] vlan [uint] port [uint]. This could be an internal error or could be because of configuration error
File a ticket and contact Cumulus Support.
ERROR
internal vlans exhausted
total number of internal vlans is exhausted. No more vlans can be added
Check The Cumulus Linux Configuration guide
ERROR
identity map failed for vlan [uint]: [str]
Failed to map the forwarding id to the vlan id [uint] in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
learn mode_failed for vlan [uint]: [str]
failed to set the learning mode for vlan id [uint], error message [str]
File a ticket and contact Cumulus Support.
ERROR
failed to get members for vlan [uint]: [str]
Failed to get member ports for vlan [uint], error message [str]
File a ticket and contact Cumulus Support.
ERROR
vlan [uint] is not an L3 vlan
vlan [uint] entry does not represent an L3 interface
Check The Cumulus Linux Configuration guide
ERROR
unsupported interface type: [uint]
unsupported interface type [uint]
File a ticket and contact Cumulus Support.
ERROR
[hex] int_vlan [uint] failed: [str]
failed to set the internal vlan [uint] for the logical port id [uint], error msg [str]
Check The Cumulus Linux Configuration guide
ERROR
[hex] pvid [uint] failed: [str]
failed to set the logical interface [uint] to vlan id [uint], error msg [str]
Check The Cumulus Linux Configuration guide
ERROR
[hex] int_vlan [uint] failed: [str]
failed to unset the internal vlan [uint] for the logical port id [uint], error msg [str]
Check The Cumulus Linux Configuration guide
ERROR
[hex] revert pvid: [str]
failed to delete the logical interface [uint] to vlan id [uint], error msg [str]
Check The Cumulus Linux Configuration guide
ERROR
unsupported interface type: [uint]
unsupported interface type [uint]
File a ticket and contact Cumulus Support.
ERROR
unsupported interface type: [uint]
unsupported interface type [uint]
File a ticket and contact Cumulus Support.
ERROR
failed for lid [hex] int_vlan [uint] STG [uint]: [str]
Failed to set the spanning tree group for logical interface [uint], internal vlan [uint] spanning tree group [uint], error msg [str]
File a ticket and contact Cumulus Support.
ERROR
unsupported if_type [uint]
unsupported interface type [uint]
File a ticket and contact Cumulus Support.
ERROR
port [str] not established
port interface [str] has not been established yet
File a ticket and contact Cumulus Support.
ERROR
failed for [str] lid [hex]: [str]
Failed to set the port, logical id [uint], interface [str], to accept the frame type, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
list allocation failed
failed to allocate memory for the ports
File a ticket and contact Cumulus Support.
ERROR
STGs exhausted
Total number of spanning tree groups is exhausted. Consult the configuration manual
Check The Cumulus Linux Configuration guide
ERROR
MSTP instance set failed for STG [int]: [str]
Failed to set the MSTP instance for the spanning tree group [uint] in Mellanox SDK, error msg [str]
Check The Cumulus Linux Configuration guide
ERROR
failed to delete STG [uint]: [str]
Failed to delete the MSTP instance for the spanning tree group [uint] in Mellanox SDK, error msg [str]
Check The Cumulus Linux Configuration guide
ERROR
failed to add vlan [int] to STG [int]: [str]
Failed to add vlan [uint] to spanning tree group [uint] in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
failed to remove vlan [int] from STG [int]: [str]
Failed to remove vlan [uint] from spanning tree group [uint] in Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
vlan [uint] not yet created
vlan [uint] does not exist and is not allocated
Check The Cumulus Linux Configuration guide
ERROR
STG [int] not yet created
spanning tree group id [uint] is not created
File a ticket and contact Cumulus Support.
ERROR
Duplicate vfid [uint]
Failed to add virtual forwarding id [uint] in hash table because of a duplicate entry
File a ticket and contact Cumulus Support.
ERROR
fdb_uc_mac_addr_get failed: [str]
Failed to get fdb unicast mac address from Mellanox SDK, error msg [str]
File a ticket and contact Cumulus Support.
ERROR
failed to allocate mac_list
Failed to allocate mac address list
Check The Cumulus Linux Configuration guide
ERROR
init set failed: [str]
Initialization of router module in Mellanox SDK failed.
File a ticket and contact Cumulus Support.
ERROR
hash params set failed: [str]
Initialization of router ecmp hash module in Mellanox SDK failed.
File a ticket and contact Cumulus Support.
ERROR
router #[uint] set failed: [str]
Initialization of virtual router id [uint] failed, because of error [str] in Mellanox SDK.
Error removing isolated port [int] from [int]. Error: [str]
Removing isolated VPN port from the SDK failed.
File a ticket with Cumulus Support.
WARNING
Error adding isolated port [int] to [int]. Error: [str]
Adding isolated VPN port to the SDK failed.
File a ticket with Cumulus Support.
WARNING
Error removing isolated port [int] from [int].
Moving isolated VPN port from the SDK failed.
File a ticket with Cumulus Support.
WARNING
[str]: failed to push port settings to hal. err = [int]
table_id could not be set for a port.
File a ticket with Cumulus Support.
WARNING
[str] not found in grp [str], bridge [int]
A port was not found in the given group for the specific bridge.
File a ticket with Cumulus Support.
WARNING
grp [str] not found in bridge [int]
During deletion, an MDB group was not found for a specific bridge.
File a ticket with Cumulus Support.
WARNING
lid 0x%x cannot be both SPAN source
ACL: SPAN source and target cannot be the same.
Remove the rule.
WARNING
CPU not supported as mirror port
ACL: Mirror target port cannot be CPU.
Remove these SPAN rules.
WARNING
table [str] [str] chain [str] L2 header field match not supported with IPv6 key
ACL: Specified match unsupported.
Remove the rule.
WARNING
table [str] [str] chain [str] IP TTL not supported with MAC+IPv4 key
ACL: Unsupported match.
Remove the rule.
WARNING
table [str] [str] chain [str] requires hardware IPv6 rule format but platform does not support MAC+IPv6 key combination
ACL: Unsupported match.
Remove the rule.
WARNING
table [str] [str] chain [str] requires hardware IPv4 rule but platform does not support IPv4 key with
ACL: Unsupported match.
Remove the rule.
WARNING
logical network type not supported
ACL: Unsupported interface type in internal VXLAN rules.
File a ticket with Cumulus Support.
WARNING
Detected excessive moves of mac address [str] on bridge [str], last seen on [str] and [str].
L2: Too many MAC moves seen.
Check network topology for loops or intrusion.
WARNING
Memory allocation failed
Memory exhausted.
File a ticket with Cumulus Support.
WARNING
Can’t open configuration file [str]: [int]
Failed to read configuration file in SFS.
File a ticket with Cumulus Support.
WARNING
tx failed with count [int], start %p
Failed to transfer packets in NIC.
File a ticket with Cumulus Support.
WARNING
Detected excessive moves of mac address [str] on bridge [str],
Moved MAC addresses over threshold.
WARNING
Unsupported command [str]
Wrong FEC command.
File a ticket with Cumulus Support.
WARNING
genl_talk returned error for ifindex [int] ([str])
Failed to read cached settings in PORT.
File a ticket with Cumulus Support.
WARNING
new tag_state [str] mismatches with [str] for [str] int_vlan [uint]
The new configured tag state [str] mismatches with the old tag state [str] for the internal VLAN [id].
File a ticket and contact Cumulus Support.
WARNING
verbosity level for SDK module [uint] not present
incorrect verbosity level for the Mellanox SDK module is being configured. This is an internal error.
File a ticket and contact Cumulus Support.
WARNING
legacy SX2 nexthop route type [uint] not handled.
for Legacy Mellanox SX2 chip, next hop of route type [uint] not handled.
File a ticket and contact Cumulus Support.
WARNING
hash_table_delete of clone parent from id_ht %p failed.
Failed to delete an ECMP clone id from the clone id hash table, because the entry does not exist.
File a ticket and contact Cumulus Support.
WARNING
[str]: no parent for [str]
Missing parent interface.
File a ticket with Cumulus Support.
WARNING
[str]: no parent for [str]
Missing parent interface.
File a ticket with Cumulus Support.
FRRouting Log Message Reference
The following table lists the HIGH severity ERROR log messages generated by FRR. These messages appear in /var/log/frr/frr.log.
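The message numbers in the table appear to pack the category into the high-order bits: 16777217 is 2^24 + 1 (Babel) and 33554433 is 2 x 2^24 + 1 (BGP). The Python sketch below decodes a message number on that assumption and extracts numbers from log lines. The 24-bit split, the `[EC <number>]` tag format, and the sample log line are assumptions made for illustration, not something this guide states; verify them against your own /var/log/frr/frr.log before relying on them.

```python
import re

# Assumption: high-order bits = category, low 24 bits = index within category.
# Inferred from the table values (Babel starts at 2**24 + 1, BGP at 2 * 2**24 + 1).
INDEX_BITS = 24
CATEGORIES = {1: "Babel", 2: "BGP"}  # only the categories listed in this excerpt

# Assumption: the message number is logged as "[EC <number>]".
EC_TAG = re.compile(r"\[EC (\d+)\]")


def decode_message_number(number):
    """Split a message number into (category name, index within category)."""
    category_id = number >> INDEX_BITS
    index = number & ((1 << INDEX_BITS) - 1)
    return CATEGORIES.get(category_id, f"unknown({category_id})"), index


def message_numbers(lines):
    """Yield every message number tagged in the given log lines."""
    for line in lines:
        for match in EC_TAG.finditer(line):
            yield int(match.group(1))


if __name__ == "__main__":
    # Hypothetical log line, for illustration only.
    sample = ["bgpd[1234]: [EC 33554434] example message"]
    for number in message_numbers(sample):
        print(number, decode_message_number(number))
```

To scan a real log, pass an open file object to `message_numbers` in place of the sample list, then look up each decoded number in the table that follows.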
Category
Severity
Message #
Message Text
Explanation
Recommended Action
Babel
HIGH
16777217
BABEL Memory Errors
Babel has failed to allocate memory. The system is about to run out of memory.
Find the process that is causing memory shortages and remediate that process. Restart FRR.
Babel
HIGH
16777218
BABEL Packet Error
Babel has detected a packet encode/decode problem.
Collect the relevant log files and report the issue for troubleshooting.
Babel
HIGH
16777219
BABEL Configuration Error
Babel has detected a configuration error of some sort.
Ensure that the configuration is correct.
Babel
HIGH
16777220
BABEL Route Error
Babel has detected a routing error and is in an inconsistent state.
Gather data to report the issue for troubleshooting. Restart FRR.
BGP
HIGH
33554433
BGP attribute flag is incorrect
BGP attribute flag is set to the wrong value (Optional/Transitive/Partial).
Determine the source of the attribute and determine why the attribute flag has been set incorrectly.
BGP
HIGH
33554434
BGP attribute length is incorrect
BGP attribute length is incorrect.
Determine the source of the attribute and determine why the attribute length has been set incorrectly.
BGP
HIGH
33554435
BGP attribute origin value invalid
BGP attribute origin value is invalid.
Determine the source of the attribute and determine why the origin attribute has been set incorrectly.
BGP
HIGH
33554436
BGP as path is invalid
BGP AS path has been malformed.
Determine the source of the update and determine why the AS path has been set incorrectly.
BGP
HIGH
33554437
BGP as path first as is invalid
BGP update has invalid first AS in AS path.
Determine the source of the update and determine why the AS path first AS value has been set incorrectly.
BGP
HIGH
33554439
BGP PMSI tunnel attribute type is invalid
BGP update has invalid type for PMSI tunnel.
Determine the source of the update and determine why the PMSI tunnel attribute type has been set incorrectly.
BGP
HIGH
33554440
BGP PMSI tunnel attribute length is invalid
BGP update has invalid length for PMSI tunnel.
Determine the source of the update and determine why the PMSI tunnel attribute length has been set incorrectly.
BGP
HIGH
33554442
BGP peergroup operated on in error
BGP operating on peer-group instead of peers included.
Ensure the configuration doesn’t contain peer-groups contained within peer-groups.
BGP
HIGH
33554443
BGP failed to delete peer structure
BGP was unable to delete the peer structure when the address-family was removed.
Determine if all expected peers are removed and restart FRR if not. This is most likely a bug.
BGP
HIGH
33554444
BGP failed to get table chunk memory
BGP unable to get chunk memory for table manager.
Ensure there is adequate memory on the device to support the table requirements.
BGP
HIGH
33554445
BGP received MACIP with invalid IP addr len
BGP received MACIP with invalid IP address length from Zebra.
Verify the MACIP entries inserted in Zebra are correct. This is most likely a bug.
BGP
HIGH
33554446
BGP received invalid label manager message
BGP received an invalid label manager message from the label manager.
Label manager sent an invalid message to BGP for the wrong protocol instance. This is most likely a bug.
BGP
HIGH
33554447
BGP unable to allocate memory for JSON output
BGP attempted to generate JSON output and was unable to allocate the memory required.
Ensure that the device has adequate memory to support the required functions.
BGP
HIGH
33554448
BGP update had attributes too long to send
BGP attempted to send an update but the attributes were too long to fit.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554449
BGP update group creation failed
BGP attempted to create an update group but was unable to do so.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554450
BGP error creating update packet
BGP attempted to create an update packet but was unable to do so.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554451
BGP error receiving open packet
BGP received an open from a peer that was invalid.
Determine the sending peer and correct its invalid open packet.
BGP
HIGH
33554452
BGP error sending to peer
BGP attempted to respond to open from a peer and failed.
BGP attempted to respond to an open and could not send the packet. Check the local IP address for the source.
BGP
HIGH
33554453
BGP error receiving from peer
BGP received an update from a peer but the status was incorrect.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554454
BGP error receiving update packet
BGP received an invalid update packet.
Determine the source of the update and resolve the invalid update being sent.
BGP
HIGH
33554455
BGP error due to capability not enabled
BGP attempted a function that did not have the capability enabled.
Enable the capability if this functionality is desired.
BGP
HIGH
33554456
BGP error receiving notify message
BGP unable to process the notification message.
BGP notify received while in a stopped state. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554457
BGP error receiving keepalive packet
BGP unable to process a keepalive packet.
BGP keepalive received while in a stopped state. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554458
BGP error receiving route refresh message
BGP unable to process route refresh message.
BGP route refresh received while in a stopped state. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554459
BGP error capability message
BGP unable to process received capability.
BGP capability message received while in a stopped state. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554460
BGP error with nexthop update
BGP unable to process nexthop update.
BGP received the nexthop update but the nexthop is not reachable in this BGP instance. Report the problem for troubleshooting.
BGP
HIGH
33554461
Failure to apply label
BGP attempted to apply a label but could not do so.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554462
Multipath specified is invalid
BGP was started with an invalid ECMP/multipath value.
Correct the ECMP/multipath value supplied when starting the BGP daemon.
BGP
HIGH
33554463
Failure to process a packet
BGP attempted to process a received packet but could not do so.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554464
Failure to connect to peer
BGP attempted to send open to a peer but couldn’t connect.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554465
BGP FSM issue
BGP neighbor transition problem.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554466
BGP VNI creation issue
BGP could not create a new VNI.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554467
BGP default instance missing
BGP could not find default instance.
Define a default instance of BGP; some features require its existence.
BGP
HIGH
33554468
BGP remote VTEP invalid
BGP remote VTEP is invalid and cannot be used.
Correct the remote VTEP configuration or resolve the source of the problem.
BGP
HIGH
33554469
BGP ES route error
BGP ES route incorrect as it learned both local and remote routes.
Correct the configuration, or otherwise address the problem, so that the same route is not learned both locally and remotely.
BGP
HIGH
33554470
BGP EVPN route delete error
BGP attempted to delete an EVPN route and failed.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554471
BGP EVPN install/uninstall error
BGP attempted to install or uninstall an EVPN prefix and failed.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554472
BGP EVPN route received with invalid contents
BGP received an EVPN route with invalid contents.
Determine the source of the EVPN route and resolve whatever is causing the invalid content.
BGP
HIGH
33554473
BGP EVPN route create error
BGP attempted to create an EVPN route and failed.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554474
BGP EVPN ES entry create error
BGP attempted to create an EVPN ES entry and failed.
This is most likely a bug. If the problem persists, report it for troubleshooting.
BGP
HIGH
33554475
BGP config multi-instance issue
BGP configuration attempting multiple instances without enabling the feature.
Correct the configuration so that BGP multiple-instance is enabled if desired.
BGP
HIGH
33554476
BGP AS configuration issue
BGP configuration attempted for a different AS than is currently configured.
Correct the configuration so that the correct BGP AS number is used.
BGP
HIGH
33554477
BGP EVPN AS and process name mismatch
BGP configuration has AS and process name mismatch.
Correct the configuration so that the BGP AS number and instance name are consistent.
BGP
HIGH
33554478
BGP Flowspec packet processing error
The BGP flowspec subsystem has detected an error in the sending or receiving of a packet.
Gather log files from both sides of the peering relationship and report the issue for troubleshooting.
BGP
HIGH
33554479
BGP Flowspec Installation/removal Error
The BGP flowspec subsystem has detected that there was a failure for installation/removal/modification of Flowspec from the dataplane.
Gather log files from the router and report the issue for troubleshooting. Restart FRR.
EIGRP
HIGH
50331649
EIGRP Packet Error
EIGRP has a packet that does not correctly decode or encode.
Gather log files from both sides of the neighbor relationship and report the issue for troubleshooting.
EIGRP
HIGH
50331650
EIGRP Configuration Error
EIGRP has detected a configuration error.
Correct the configuration issue. If it still persists, report the issue for troubleshooting.
General
HIGH
100663297
Failure to raise or lower privileges
FRR attempted to raise or lower its privileges and was unable to do so.
Ensure that you are running FRR as the frr user and that the user has sufficient privileges to properly access root privileges.
General
HIGH
100663298
VRF Failure on Start
Upon startup, FRR failed to properly initialize and start up the VRF subsystem.
Ensure that there is sufficient memory to start processes, then restart FRR.
General
HIGH
100663299
Socket Error
When attempting to access a socket, a system error occurred and FRR was unable to properly complete the request.
Ensure that there are sufficient system resources available and ensure that the frr user has sufficient permissions to work.
General
HIGH
100663303
System Call Error
FRR has detected an error from using a vital system call and has probably already exited.
Ensure permissions are correct for FRR users and groups. Additionally, check that sufficient system resources are available.
General
HIGH
100663304
VTY Subsystem Error
FRR has detected a problem with the specified configuration file.
Ensure the configuration file exists and has the correct permissions for operations. Additionally, ensure that all configuration lines are correct.
General
HIGH
100663305
SNMP Subsystem Error
FRR has detected a problem with the SNMP library it uses. A callback from this subsystem has indicated some error.
Examine the callback message and ensure SNMP is properly set up and working.
General
HIGH
100663306
Interface Subsystem Error
FRR has detected a problem with interface data from the kernel, as it deviates from what is expected through normal Netlink messaging.
Open an issue with all relevant log files and restart FRR.
General
HIGH
100663307
NameSpace Subsystem Error
FRR has detected a problem with namespace data from the kernel, as it deviates from what is expected through normal kernel messaging.
Open an issue with all relevant log files and restart FRR.
General
HIGH
4043309068
A necessary work queue does not exist.
A necessary work queue does not exist.
Notify a developer.
General
HIGH
100663308
Developmental Escape Error
FRR has detected an issue where new development has not properly updated all code paths.
Open an issue with all relevant log files.
General
HIGH
100663309
ZMQ Subsystem Error
FRR has detected an issue with the ZeroMQ subsystem and ZeroMQ is not working properly now.
Open an issue with all relevant log files and restart FRR.
General
HIGH
100663310
Feature or system unavailable
FRR was not compiled with support for a particular feature or it is not available on the current platform.
Recompile FRR with the feature enabled or find out what platforms support the feature.
General
HIGH
4043309071
IRDP message length mismatch
The length encoded in the IP TLV does not match the length of the packet received.
Notify a developer.
General
HIGH
4043309073
Dataplane installation failure
Installation of routes to the underlying dataplane failed.
Check all configuration parameters for correctness.
General
HIGH
4043309075
Netlink backend not available
FRR was not compiled with support for Netlink. Any operations that require Netlink will fail.
Recompile FRR with Netlink or install a package that supports this feature.
General
HIGH
4043309076
Protocol Buffers backend not available
FRR was not compiled with support for protocol buffers. Any operations that require protobuf will fail.
Recompile FRR with protobuf support or install a package that supports this feature.
General
HIGH
4043309087
Cannot set receive buffer size
The socket receive buffer size could not be set in the kernel.
Ignore this error.
General
HIGH
4043309089
Receive buffer overrun
The kernel’s buffer for a socket has been overrun, rendering the socket invalid.
Zebra will restart itself. Notify a developer if this issue shows up frequently.
General
HIGH
4043309091
Received unexpected response from kernel
Received unexpected response from the kernel via Netlink.
Notify a developer.
General
HIGH
4043309094
String could not be parsed as IP prefix
There was an attempt to parse a string as an IPv4 or IPv6 prefix, but the string could not be parsed and this operation failed.
Notify a developer.
General
HIGH
268435457
WATCHFRR Connection Error
WATCHFRR has detected a connectivity issue with one of the FRR daemons.
Ensure that FRR is still running. If it isn’t, report the issue for troubleshooting.
ISIS
HIGH
67108865
ISIS Packet Error
ISIS has detected an error with a packet from a peer.
Gather log information and report the issue for troubleshooting. Restart FRR.
ISIS
HIGH
67108866
ISIS Configuration Error
ISIS has detected an error within the configuration for the router.
Ensure configuration is correct.
OSPF
HIGH
134217729
Failure to process a packet
OSPF attempted to process a received packet but could not do so.
This is most likely a bug. If the problem persists, report it for troubleshooting.
OSPF
HIGH
134217730
Failure to process Router LSA
OSPF attempted to process a router LSA, but there was an advertising ID mismatch with the link ID.
Check the OSPF network configuration for any configuration issue. If the problem persists, report it for troubleshooting.
OSPF
HIGH
134217731
OSPF Domain Corruption
OSPF attempted to process a router LSA, but there was an advertising ID mismatch with the link ID.
Check OSPF network database for a corrupted LSA. If the problem persists, shut down the OSPF domain and report the problem for troubleshooting.
OSPF
HIGH
134217732
OSPF Initialization failure
OSPF failed to initialize the OSPF default instance.
Ensure there is adequate memory on the device. If the problem persists, report it for troubleshooting.
OSPF
HIGH
134217733
OSPF SR Invalid DB
OSPF segment routing database is invalid.
This is most likely a bug. If the problem persists, report it for troubleshooting.
OSPF
HIGH
134217734
OSPF SR hash node creation failed
OSPF segment routing node creation failed.
This is most likely a bug. If the problem persists, report it for troubleshooting.
OSPF
HIGH
134217735
OSPF SR Invalid lsa id
OSPF segment routing invalid LSA ID.
Restart the OSPF instance. If the problem persists, report it for troubleshooting.
OSPF
HIGH
134217736
OSPF SR Invalid Algorithm
OSPF segment routing invalid algorithm.
This is most likely a bug. If the problem persists, report it for troubleshooting.
PIM
HIGH
184549377
PIM MSDP Packet Error
PIM has received a packet from a peer that does not correctly decode.
Check the MSDP peer and ensure it is correctly working.
PIM
HIGH
184549378
PIM Configuration Error
PIM has detected a configuration error.
Ensure the configuration is correct and apply the correct configuration.
RIP
HIGH
201326593
RIP Packet Error
RIP has detected a packet encode/decode issue.
Gather log files from both sides and open an issue.
Zebra
HIGH
4043309057
Error reading response from label manager
Zebra could not read the ZAPI header from the label manager.
Wait for the error to resolve on its own. If it does not resolve, restart Zebra.
Zebra
HIGH
4043309058
Label manager could not find ZAPI client
Zebra was unable to find a ZAPI client matching the given protocol and instance number.
Ensure that clients that use the label manager are properly configured and running.
Zebra
HIGH
4043309059
Zebra could not relay label manager response
Zebra found the client and instance to relay the label manager response or request, but was unable to do so, possibly because the connection was closed.
Ensure that clients that use the label manager are properly configured and running.
Zebra
HIGH
100663300
ZAPI Error
A version mismatch has been detected between Zebra and a client protocol.
Two different versions of FRR have been installed and the install is not properly set up. Completely stop FRR, remove it from the system and reinstall. Typically, only developers should see this issue.
Zebra
HIGH
4043309061
Mismatch between ZAPI instance and encoded message instance
While relaying a request to the external label manager, Zebra noticed that the instance number encoded in the message did not match the client instance number.
Notify a developer.
Zebra
HIGH
100663301
ZAPI Error
The ZAPI subsystem has detected an encoding issue between Zebra and a client protocol.
Restart FRR.
Zebra
HIGH
100663302
ZAPI Error
The ZAPI subsystem has detected a socket error between Zebra and a client.
Restart FRR.
Zebra
HIGH
4043309064
Zebra label manager used all available labels
Zebra is unable to assign additional label chunks because it has exhausted its assigned label range.
Make the label range bigger and restart Zebra.
Zebra
HIGH
4043309065
Daemon mismatch when releasing label chunks
Zebra noticed a mismatch between a label chunk and a protocol daemon number or instance when releasing unused label chunks.
Ignore this error.
Zebra
HIGH
4043309066
Zebra did not free any label chunks
Zebra’s chunk cleanup procedure ran but no label chunks were released.
Ignore this error.
Zebra
HIGH
4043309067
Dataplane returned invalid status code
The underlying dataplane responded to a Zebra message or other interaction with an unrecognized, unknown, or invalid status code.
Notify a developer.
Zebra
HIGH
4043309069
Failed to add FEC for MPLS client
A client requested a label binding for a new FEC but Zebra was unable to add the FEC to its internal table.
Notify a developer.
Zebra
HIGH
4043309070
Failed to remove FEC for MPLS client
Zebra was unable to find and remove an FEC in its internal table.
Notify a developer.
Zebra
HIGH
4043309072
Attempted to perform nexthop update for unknown address family
Zebra attempted to perform a nexthop update for unknown address family.
Notify a developer.
Zebra
HIGH
4043309074
Zebra table lookup failed
Zebra attempted to look up a table for a particular address family and a subsequent address family but didn’t find anything.
If you entered a command to trigger this error, make sure you entered the arguments correctly. Check your configuration file for any potential errors. If these look correct, notify a developer.
Zebra
HIGH
4043309077
Table manager used all available IDs
Zebra’s table manager used up all IDs available to it and can’t assign any more.
Reconfigure Zebra with a larger range of table IDs.
Zebra
HIGH
4043309078
Daemon mismatch when releasing table chunks
Zebra noticed a mismatch between a table ID chunk and a protocol daemon number or instance when releasing unused table chunks.
Ignore this error.
Zebra
HIGH
4043309079
Zebra did not free any table chunks
Zebra’s table chunk cleanup procedure ran but no table chunks were released.
Ignore this error.
Zebra
HIGH
4043309080
Address family specifier unrecognized
Zebra attempted to process information from somewhere that included an address family specifier but did not recognize the provided specifier.
Ensure that your configuration is correct. If it is, notify a developer.
Zebra
HIGH
4043309081
Incorrect protocol for table manager client
Zebra’s table manager only accepts connections from daemons managing dynamic routing protocols, but received a connection attempt from a daemon that does not meet this criterion.
Notify a developer.
Zebra
HIGH
4043309082
Mismatch between message and client protocol and/or instance
Zebra detected a mismatch between a client’s protocol and/or instance numbers versus those stored in a message transiting its socket.
Notify a developer.
Zebra
HIGH
4043309083
Label manager unable to assign label chunk
Zebra’s label manager was unable to assign a label chunk to a client.
Ensure that Zebra has a sufficient label range available and that there is not a range collision.
Zebra
HIGH
4043309084
Label request from unidentified client
Zebra’s label manager received a label request from an unidentified client.
Notify a developer.
Zebra
HIGH
4043309085
Table manager unable to assign table chunk
Zebra’s table manager was unable to assign a table chunk to a client.
Ensure that Zebra has a sufficient table ID range available and that there is not a range collision.
Zebra
HIGH
4043309086
Table request from unidentified client
Zebra’s table manager received a table request from an unidentified client.
Notify a developer.
Zebra
HIGH
4043309088
Unknown Netlink message type
Zebra received a Netlink message with an unrecognized type field.
Verify that you are running the latest version of FRR to ensure kernel compatibility. If the problem persists, notify a developer.
Zebra
HIGH
4043309090
Netlink message length mismatch
Zebra received a Netlink message with incorrect length fields.
Notify a developer.
Zebra
HIGH
4043309092
Bad sequence number in Netlink message
Zebra received a Netlink message with a bad sequence number.
Notify a developer.
Zebra
HIGH
4043309093
Multipath number was out of valid range
The multipath number specified to Zebra must be in the appropriate range.
Provide a multipath number that is within its accepted range.
Zebra
HIGH
4043309095
Failed to add MAC address to interface
Zebra attempted to assign a MAC address to a VXLAN interface but failed.
Notify a developer.
Zebra
HIGH
4043309096
Failed to delete VNI
Zebra attempted to delete a VNI entry and failed.
Notify a developer.
Zebra
HIGH
4043309097
Adding remote VTEP failed
Zebra attempted to add a remote VTEP and failed.
Notify a developer.
Zebra
HIGH
4043309098
Adding VNI failed
Zebra attempted to add a VNI hash to an interface and failed.
Notify a developer.
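The message numbers in the table above are large decimal values, but they follow a pattern when you split them into their top 8 bits and low 24 bits. For example, every Babel entry has 1 in the top 8 bits and every BGP entry has 2. The following Python sketch performs that split and pulls message numbers out of a log line. Reading the top 8 bits as a block identifier is an inference from the values listed in this table, and the `[EC <number>]` tag format is an assumption about how FRR writes these numbers to /var/log/frr/frr.log. The function names are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: coded FRR log lines carry a tag of the form "[EC <number>]".
EC_TAG = re.compile(r"\[EC (\d+)\]")


def split_message_number(number: int) -> Tuple[int, int]:
    """Return (top 8 bits, low 24 bits) of a 32-bit message number."""
    return number >> 24, number & 0xFFFFFF


def message_numbers_in(line: str) -> List[int]:
    """Extract every tagged message number from a single log line."""
    return [int(value) for value in EC_TAG.findall(line)]


# Message numbers copied from the table above.
for number in (16777217, 33554433, 100663297, 4043309057):
    block, index = split_message_number(number)
    print(f"{number} -> block 0x{block:02X}, index {index}")
```

With this split, 33554433 becomes block 2, index 1, and 4043309057 becomes block 241 (0xF1), index 1. The Category column does not always line up with the block: 4043309068 appears under General but shares its block with the Zebra entries, and 100663300 appears under Zebra but shares its block with the General entries. Use the table, not the split, as the reference for category.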
Try It Pre-built Demos
This documentation includes pre-built Try It demos for certain Cumulus Linux features. The Try It demos run a simulation in NVIDIA Air, a cloud-hosted platform that works exactly like a real-world production deployment. All the Try It demos use the NVIDIA Cumulus Linux reference topology.
To access a demo, click the Try It tab in a Configuration Example section of the documentation. Acknowledge the captcha and select the Launch Simulation button.
NVIDIA Air starts building the simulation and boots the nodes.
The simulation can take a few minutes to build and might display a grey screen before loading.
Run Commands
When the simulation is ready, you can log into the leaf and spine switches. The switches are pre-configured with the configuration commands shown in the documentation. You can run any Cumulus Linux commands to learn more about the feature and configure additional settings.
Launch in AIR
If you want to save the simulation or extend the run time, click LAUNCH IN AIR to access the network simulation platform. From this platform, you can run additional pre-built demos and even build your own simulations. Refer to the NVIDIA Air User Guide.
Reference Topology
The Cumulus Linux documentation uses this reference topology for configuration examples.
Cumulus Linux in a Virtual Environment
Cumulus Linux in a virtual environment enables you to become familiar with NVIDIA networking technology, learn and test Cumulus Linux in your own environment, and create a digital twin of your IT infrastructure to validate configurations, features, and automation code.
Virtual Environments
NVIDIA provides these virtual environments:
NVIDIA Air is a cloud hosted platform that works exactly like a real world production deployment. You can access NVIDIA Air here and reference the NVIDIA Air User Guide for help. To get a jumpstart on your network configuration, you can go to the Demo Marketplace and run one of the pre-built feature demos, such as EVPN symmetric routing, EVPN centralized routing, EVPN multihoming, and more.
Cumulus VX is a free virtual appliance with the Cumulus Linux operating system. You can install Cumulus VX on a supported hypervisor and configure the VMs with the reference topology or create your own topology. See Cumulus VX.
Cumulus Linux in a virtual environment contains the same Cumulus Linux operating system as NVIDIA Ethernet switches and contains the same software features. You have the full data plane functionality through the Linux kernel, as well as layer 2 VLANs and both VXLAN bridging and VXLAN routing capabilities.
Unsupported Features in a Virtual Environment
Due to hardware specific implementations, virtual environments do not support certain Cumulus Linux features.