RX Error Counters and Slow Throughput Performance
Issue
A switch might show slow throughput while its RX error counters, and possibly its TX error counters, keep incrementing. These error counters appear in the output of several commands:
cl-netstat
ip -s link show
ethtool -S
If you want to monitor the output of these commands and see the statistics live as they change, use the Linux watch command. For more information on using this command, refer to its man page (man watch).
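In a script, where an interactive watch session is not practical, you can sample a counter twice and print the difference. The following is a minimal sketch, not a Cumulus Linux tool: the function name rx_err_delta is hypothetical, and it assumes the counter is available as a plain number in a file, such as the standard Linux sysfs path /sys/class/net/<interface>/statistics/rx_errors.

```shell
# rx_err_delta FILE [SECONDS]
# Hypothetical helper: read a numeric counter from FILE, wait SECONDS
# (default 5), read it again, and print how much the counter grew.
rx_err_delta() {
    counter_file=$1
    interval=${2:-5}
    before=$(cat "$counter_file") || return 1
    sleep "$interval"
    after=$(cat "$counter_file") || return 1
    echo $(( after - before ))
}
```

For example, rx_err_delta /sys/class/net/swp1/statistics/rx_errors 10 prints the number of RX errors counted during a 10 second window. A nonzero result means the counter is still incrementing.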
cl-netstat Shows RX_ERR
You can see RX error counters in the output of cl-netstat as RX_ERR, as shown below.
cumulus@switch$ cl-netstat
Kernel Interface table
Iface MTU Met RX_OK RX_ERR RX_DRP RX_OVR TX_OK TX_ERR TX_DRP TX_OVR Flg
---------------------------------------------------------------------------------------------
eth0 1500 0 7361728 0 0 0 2030188 0 0 0 BMRU
lo 16436 0 173 0 0 0 173 0 0 0 LRU
swp1 9000 0 7669976333 15682741 1439 0 3035723493 0 0 0 BMRU
swp2 9000 0 3023667770 10728822 978 0 9840616134 0 0 0 BMRU
swp3 9000 0 24315580462 14877988 1307 0 80763548753 0 0 0 BMRU
swp4 9000 0 13869960451 8452232 897 0 7477191326 0 0 0 BMRU
<Output is truncated>
For additional information on how to use the cl-netstat command, read the user guide.
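On a switch with many ports, it can help to list only the interfaces whose RX_ERR column is nonzero. The following is a minimal sketch that assumes the column layout shown in the sample output above, where the interface name is the first field and RX_ERR is the fifth. The function name rx_err_ports is hypothetical.

```shell
# rx_err_ports: read cl-netstat output on stdin and print
# "<interface> <RX_ERR>" for every row with a nonzero RX_ERR.
# Assumes Iface is field 1 and RX_ERR is field 5, as in the sample output.
rx_err_ports() {
    # The numeric test skips the title, header, and separator lines.
    awk '$5 ~ /^[0-9]+$/ && $5 > 0 { print $1, $5 }'
}
```

You would use it as cl-netstat | rx_err_ports. If your release prints the columns in a different order, adjust the field numbers.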
ip -s link show Shows RX errors
You can see RX error counters in the output of ip -s link show, as shown below.
cumulus@switch$ ip -s link show swp5
7: swp5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc pfifo_fast state UP mode DEFAULT qlen 500
link/ether 08:9e:01:ce:e0:6c brd ff:ff:ff:ff:ff:ff
RX: bytes packets errors dropped overrun mcast
8552309 71000 1899 1 0 63108
TX: bytes packets errors dropped carrier collsns
1940799 15779 0 0 0 0
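To pull only the RX error count out of this output, for example to log it over time, you can read the third value on the line that follows the RX: header. The following is a minimal sketch that assumes the two-line RX layout shown above. The function name rx_errors_from_ip is hypothetical.

```shell
# rx_errors_from_ip: read "ip -s link show <interface>" output on stdin
# and print the RX errors value.
# Assumes the values line directly follows the "RX:" header line and
# that "errors" is the third column, as in the sample output.
rx_errors_from_ip() {
    awk '$1 == "RX:" { getline; print $3; exit }'
}
```

You would use it as ip -s link show swp5 | rx_errors_from_ip.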
ethtool -S Shows HwIfInErrors
You can see RX error counters in the output of ethtool -S <interface> as HwIfInErrors, as shown below.
cumulus@switch$ ethtool -S swp1
NIC statistics:
HwIfInOctets: 51883086875273
HwIfInUcastPkts: 7669711571
HwIfInBcastPkts: 0
HwIfInMcastPkts: 264791
HwIfOutOctets: 10590370555531
HwIfOutUcastPkts: 3035458717
HwIfOutMcastPkts: 264792
HwIfOutBcastPkts: 0
HwIfInDiscards: 1439
HwIfInL3Drops: 0
HwIfInBufferDrops: 1439
HwIfInAclDrops: 115
HwIfInDot3LengthErrors: 0
HwIfInErrors: 15682741
SoftInErrors: 0
SoftInDrops: 0
SoftInFrameErrors: 0
HwIfOutDiscards: 0
HwIfOutErrors: 0
HwIfOutQDrops: 0
HwIfOutNonQDrops: 0
SoftOutErrors: 0
SoftOutDrops: 0
SoftOutTxFifoFull: 0
HwIfOutQLen: 0
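Because ethtool -S prints many counters, you might want to extract only HwIfInErrors. The following is a minimal sketch that assumes the "name: value" layout shown above. The function name hw_in_errors is hypothetical.

```shell
# hw_in_errors: read "ethtool -S <interface>" output on stdin and print
# the HwIfInErrors value. Assumes one "name: value" pair per line.
hw_in_errors() {
    awk '$1 == "HwIfInErrors:" { print $2 }'
}
```

You would use it as ethtool -S swp1 | hw_in_errors.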
Environment
- Cumulus Linux, all versions
Overview
Cause of the Errors
RX_ERR or HwIfInErrors counts indicate that some Ethernet data frames became corrupted somewhere along the transmission path, typically because of a bad cable or transceiver. The cyclic redundancy check (CRC) algorithm used in the frame check sequence (FCS) calculation can detect these errors.
When the switch receives a frame, it runs its own checksum on the frame and compares the resulting CRC value to the value in the Ethernet frame. If they are not equal, it means some bits got corrupted and thus the switch counts these as RX errors. In half-duplex mode, some FCS errors might be normal. In full-duplex mode, FCS errors are not normal.
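The compare step can be illustrated in a shell with the POSIX cksum utility. This is only a sketch of the principle: cksum does not perform the same computation as the Ethernet FCS, and the switch hardware does this check itself. The function name frame_ok and the sample payload are hypothetical.

```shell
# frame_ok PAYLOAD EXPECTED_CRC
# Illustration only: recompute a checksum over PAYLOAD with cksum and
# compare it to the value that arrived with the data. Succeeds when the
# two match, and fails when the payload changed in transit.
frame_ok() {
    actual=$(printf '%s' "$1" | cksum | awk '{ print $1 }')
    [ "$actual" = "$2" ]
}

# The sender computes the checksum once and sends it with the payload.
sent_crc=$(printf '%s' 'example payload' | cksum | awk '{ print $1 }')
```

Calling frame_ok 'example payload' "$sent_crc" succeeds, while frame_ok 'exbmple payload' "$sent_crc" fails because one character changed. The failing case corresponds to a frame that the switch counts as an RX error.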
Propagation of the Errors
When a platform detects an FCS error, what it does with the Ethernet frame depends on the switching mode you configured: cut-through or store and forward. In cut-through mode, the frame with the FCS error might propagate to the next switch. In store and forward mode, the switch discards the frame with the FCS error.
Cut-through Switching Mode
Cut-through mode minimizes latency through the switch by beginning the forwarding process before the switch receives the entire packet from the upstream sender. Data might begin transmitting while the inbound interface is still receiving it, which minimizes the time the switch holds the packet and thus the propagation delay. The disadvantage is that data frames with FCS errors might propagate to the next hop because transmission out of the switch begins before the switch detects the FCS error. Because the next hop switch begins receiving this packet with no indication of a problem, it might also begin transmitting on its outbound interface before detecting the FCS error, thereby propagating the error even further.
Store and Forward Switching Mode
As the name implies, store and forward waits until the switch receives and validates the entire packet before starting the transmit process on the outbound interface. This allows the switch to verify that the received packet is valid before sending it onward, but it increases latency by holding each packet longer in buffers in the switch. It might also increase buffer utilization by having each packet utilize the resources for a longer period of time. If you configure store and forward, the platform is able to detect FCS errors before beginning transmission, and thus can discard the frame and not propagate the errors to the next hop.
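To get a feel for the latency cost, note that a store and forward switch must receive the whole frame before it transmits, so each hop holds the frame for at least the time the frame takes to arrive on the link. The following is a rough sketch of that arithmetic. The function name serialization_delay_us is hypothetical, the 10 Gb/s link speed in the usage example is an assumption rather than a value from this article, and the calculation ignores Ethernet header and preamble overhead.

```shell
# serialization_delay_us FRAME_BYTES LINK_GBPS
# Print the time, in microseconds, needed to receive FRAME_BYTES on a
# link running at LINK_GBPS gigabits per second.
# bits / (Gb/s * 1e9) seconds  ==  bits / (Gb/s * 1000) microseconds
serialization_delay_us() {
    awk -v bytes="$1" -v gbps="$2" \
        'BEGIN { printf "%.2f\n", bytes * 8 / (gbps * 1000) }'
}
```

For example, serialization_delay_us 9000 10 prints 7.20, so a 9000-byte frame on an assumed 10 Gb/s link adds at least about 7.2 microseconds per store and forward hop. Actual latency also depends on the ASIC and on queueing.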
Resolution
Replace the Bad Components
The frame corruption occurs because of a bad component somewhere in the data path, such as a cable or transceiver. Trace the RX errors upstream across all the hops in the end-to-end data path.
You can use lldpctl to trace the ports upstream, hop by hop. Here is example output:
cumulus@switch$ lldpctl
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface: eth0, via: LLDP, RID: 1, Time: 0 day, 23:36:08
Chassis:
ChassisID: mac 6c:64:1a:00:2f:54
SysName: backbone
SysDescr: Cumulus Linux version 2.5.2 running on cel kennisis
MgmtIP: 192.168.1.5
Capability: Bridge, on
Capability: Router, on
Port:
PortID: ifname swp21
PortDescr: swp21
-------------------------------------------------------------------------------
Interface: swp1, via: LLDP, RID: 5, Time: 0 day, 05:51:40
Chassis:
ChassisID: mac 08:9e:01:ce:e4:0c
SysName: sw23
SysDescr: Cumulus Linux version 2.5.2 running on quanta ly2
MgmtIP: 192.168.2.30
Capability: Bridge, off
Capability: Router, on
Port:
PortID: ifname swp7
PortDescr: swp7
-------------------------------------------------------------------------------
After you identify the source point, try replacing the cable or transceiver to eliminate the component that introduces the data corruption.
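When you trace several hops, a one-line summary per neighbor is easier to follow than the full lldpctl output. The following is a minimal sketch that assumes the default text layout shown in the example output above. The function name lldp_neighbors is hypothetical.

```shell
# lldp_neighbors: read lldpctl output on stdin and print
# "<local interface> <neighbor SysName> <neighbor port>" per neighbor.
# Assumes the Interface:, SysName:, and "PortID: ifname <port>" lines
# appear in the order shown in the example output.
lldp_neighbors() {
    awk '
        $1 == "Interface:" { iface = $2; sub(/,$/, "", iface) }
        $1 == "SysName:"   { sys = $2 }
        $1 == "PortID:"    { print iface, sys, $3 }
    '
}
```

You would use it as lldpctl | lldp_neighbors. For the example output, the swp1 entry becomes swp1 sw23 swp7, which tells you to check port swp7 on sw23 next.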
Change the Switching Mode
While cut-through forwarding decreases latency and buffer consumption, one of its disadvantages is that the switch does not verify packets as valid before they begin transmission on the outbound interface. Thus, the switch might begin forwarding a frame out of the output interface before it detects the FCS error.
By changing from cut-through to store and forward mode of forwarding operation, Cumulus Linux verifies each packet as correct before the forwarding process begins, limiting the reach of any corrupt packets. This verification comes at the cost of potential increased latency and buffer consumption.
You need to configure these switches in the data path:
- The switches upstream from the switch with the RX errors (that is, the previous switches in the data path) to eliminate the RX errors on the switch in question
- The switch showing the RX errors to prevent it from propagating the errors to the downstream switch (that is, the next switch in the data path)
To change the forwarding behavior from cut-through to store and forward on switches with Broadcom ASICs:
- Open the traffic.conf file in an editor:
cumulus@switch$ sudo vi /etc/cumulus/datapath/traffic.conf
- Search for the following lines in the traffic.conf file:
# To enable cut-through forwarding
cut_through_enable = true
- Modify the value of cut_through_enable to false:
# To enable cut-through forwarding
cut_through_enable = false
- To let the change in forwarding mode take effect, restart switchd. Note that restarting the switchd daemon is minimally disruptive.
cumulus@switch$ sudo service switchd restart
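If you need to make the same edit on several switches, you can script it instead of using an editor. The following is a minimal sketch: the function name disable_cut_through is hypothetical, and it assumes the setting appears at the start of a line as cut_through_enable = true. It keeps a copy of the original file with a .bak suffix and does not restart switchd.

```shell
# disable_cut_through [FILE]
# Hypothetical helper: set cut_through_enable to false in FILE
# (default /etc/cumulus/datapath/traffic.conf) and keep the original
# contents in FILE.bak. Requires write access to FILE.
disable_cut_through() {
    conf=${1:-/etc/cumulus/datapath/traffic.conf}
    cp "$conf" "$conf.bak" || return 1
    sed 's/^cut_through_enable[[:space:]]*=[[:space:]]*true/cut_through_enable = false/' \
        "$conf.bak" > "$conf"
}
```

Run the function from a shell that has write access to the file, confirm the result with grep cut_through_enable /etc/cumulus/datapath/traffic.conf, and then restart switchd as described in the last step.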
Considerations
While these instructions discuss how to change the forwarding mode on a Cumulus Linux switch, the default setting of cut-through is the recommended value in almost every circumstance. If you make this change on a switch for testing purposes, continue to monitor its performance.