RDMA over Converged Ethernet - RoCE
RoCE enables you to write to compute or storage elements using RDMA over an Ethernet network instead of using host CPUs. RoCE relies on ECN and PFC to operate. Cumulus Linux supports features that can enable lossless Ethernet for RoCE environments.
While Cumulus Linux can support RoCE environments, the end hosts must support the RoCE protocol.
RoCE helps you obtain a converged network, where all services run over the Ethernet infrastructure, including Infiniband applications.
Default RoCE Configuration
The following table shows the default RoCE configuration for lossy and lossless mode.
Configuration | Lossy Mode | Lossless Mode |
---|---|---|
Port trust mode | YES | YES |
Port switch priority to traffic class mapping
|
YES | YES |
Port ETS:
|
YES | YES |
Port ECN absolute threshold is 1501500 bytes for traffic class 3 (RoCE) | YES | YES |
LLDP and Application TLV (RoCE) (UDP, Protocol:4791, Priority: 3) |
YES | YES |
Enable PFC on switch priority 3 (RoCE) | NO | YES |
Switch priority 3 allocated to RoCE lossless traffic pool | NO | YES |
RoCE lossless (with PFC and ECN)
RoCE uses the Infiniband (IB) Protocol over converged Ethernet. The IB global route header rides directly on top of the Ethernet header. The lossless Ethernet layer handles congestion hop by hop.
To enable RoCE lossless:
cumulus@switch:~$ nv set qos roce
cumulus@switch:~$ nv config apply
NVUE defaults to RoCE lossless. The command nv set qos roce
and nv set qos roce mode lossless
are equivalent.
If you enable roce mode lossy
, configuring nv set qos roce
without a mode
does not change the RoCE mode. To change to lossless, you must configure lossless mode with the nv set qos roce mode lossless
command.
Edit the /etc/cumulus/switchd.d/qos.conf
file to set the traffic.roce_mode
parameter to 3, then reload switchd
.
cumulus@switch:~$ sudo cat /etc/cumulus/switchd.d/qos.conf
...
traffic.roce_mode = 3
...
cumulus@switch:~$ sudo systemctl reload switchd.service
Link pause is another way to provide lossless ethernet; however, PFC is the preferred method. PFC allows more granular control by pausing the traffic flow for a given CoS group instead of the entire link.
RoCE lossy (with ECN)
RoCEv2 requires flow control for lossless Ethernet. RoCEv2 uses the Infiniband (IB) Transport Protocol over UDP. The IB transport protocol includes an end-to-end reliable delivery mechanism and has its own sender notification mechanism.
RoCEv2 congestion management uses RFC 3168 to signal congestion experienced to the receiver. The receiver generates an RoCEv2 congestion notification packet directed to the source of the packet.
To enable RoCE lossy:
cumulus@switch:~$ nv set qos roce mode lossy
cumulus@switch:~$ nv config apply
Edit the /etc/cumulus/switchd.d/qos.conf
file to set the traffic.roce_mode
parameter to 1, then reload switchd
.
cumulus@switch:~$ sudo cat /etc/cumulus/switchd.d/qos.conf
...
traffic.roce_mode = 1
...
cumulus@switch:~$ sudo systemctl reload switchd.service
Single Shared Buffer Pool
By default, Cumulus Linux separates lossy and lossless traffic into different dedicated buffer pools on both ingress and egress. You can configure the switch to combine lossy and lossless traffic on the same buffer pool for better load absorption.
To enable single shared buffer pool mode:
cumulus@switch:~$ nv set qos roce mode lossless-single-ipool
cumulus@switch:~$ nv config apply
To disable single shared buffer pool mode and use the default mode (lossless), run the nv unset qos roce mode lossless-single-ipool
command.
Edit the /etc/cumulus/switchd.d/qos.conf
file to set the traffic.roce_mode
parameter to 4, then reload switchd
.
cumulus@switch:~$ sudo cat /etc/cumulus/switchd.d/qos.conf
...
traffic.roce_mode = 4
...
cumulus@switch:~$ sudo systemctl reload switchd.service
To disable single shared buffer pool mode and use the default mode (lossless), set the traffic.roce_mode
parameter to 3.
Remove RoCE Configuration
To remove RoCE configuration:
cumulus@switch:~$ nv unset qos roce
cumulus@switch:~$ nv config apply
Edit the etc/cumulus/switchd.d/qos.conf
file to set the traffic.roce_mode
parameter to 0, then reload switchd
.
cumulus@switch:~$ sudo cat etc/cumulus/switchd.d/qos.conf
...
traffic.roce_mode = 0
...
cumulus@switch:~$ sudo systemctl reload switchd.service
Verify RoCE Configuration
You can verify RoCE configuration with NVUE nv show
commands.
To show detailed information about the configured buffers, utilization and DSCP markings, run the nv show qos roce
command:
cumulus@switch:mgmt:~$ nv show qos roce
operational applied
------------------ ----------- --------
operational applied
------------------ --------------------- ---------------------
enable on
mode lossless-single-ipool lossless-single-ipool
pfc
pfc-priority 3
rx-enabled enabled
tx-enabled enabled
cable-length 100
congestion-control
congestion-mode ECN
enabled-tc 0,3
min-threshold 146.48 KB
max-threshold 1.43 MB
probability 100
trust
trust-mode pcp,dscp
lldp-app-tlv
priority -1
protocol-id -1
selector Non-UDP
RoCE PCP/DSCP->SP mapping configurations
===========================================
pcp dscp switch-prio
- --- ----------------------- -----------
0 0 0,1,2,3,4,5,6,7 0
1 1 8,9,10,11,12,13,14,15 1
2 2 16,17,18,19,20,21,22,23 2
3 3 24,25,26,27,28,29,30,31 3
4 4 32,33,34,35,36,37,38,39 4
5 5 40,41,42,43,44,45,46,47 5
6 6 48,49,50,51,52,53,54,55 6
7 7 56,57,58,59,60,61,62,63 7
RoCE SP->TC mapping and ETS configurations
=============================================
switch-prio traffic-class scheduler-weight
- ----------- ------------- ----------------
0 0 0 DWRR-50%
1 1 0 DWRR-50%
2 2 0 DWRR-50%
3 3 3 DWRR-50%
4 4 0 DWRR-50%
5 5 0 DWRR-50%
6 6 6 strict-priority
7 7 0 DWRR-50%
RoCE pool config
===================
name mode size switch-priorities traffic-class
- --------------------- ------- ---- ----------------- -------------
0 lossy-default-ingress Dynamic 50% 0,1,2,4,5,6,7 -
1 roce-reserved-ingress Dynamic 50% 3 -
2 lossy-default-egress Dynamic 50% - 0,6
3 roce-reserved-egress Dynamic inf - 3
Exception List
=================
No Data
To show detailed RoCE information about a single interface, run the nv show interface <interface> qos roce status
command.
cumulus@switch:mgmt:~$ nv show interface swp16 qos roce status
operational applied description
------------------ ------------- ------- ---------------------------------------------------
congestion-control
congestion-mode ecn, absolute Congestion config mode
enabled-tc 0,3 Congestion config enabled Traffic Class
max-threshold 1.43 MB Congestion config max-threshold
min-threshold 153.00 KB Congestion config min-threshold
probability 100
lldp-app-tlv
priority 3
protocol-id 4791
selector UDP
pfc
pfc-priority 3 switch-prio on which PFC is enabled
rx-enabled yes PFC Rx Enabled status
tx-enabled yes PFC Tx Enabled status
trust
trust-mode pcp,dscp Trust Setting on the port for packet classification
mode lossless Roce Mode
RoCE PCP/DSCP->SP mapping configurations
===========================================
pcp dscp switch-prio
---- --- ---- -----------
cnp 6 48 6
roce 3 26 3
RoCE SP->TC mapping and ETS configurations
=============================================
switch-prio traffic-class scheduler-weight
---- ----------- ------------- ----------------
cnp 6 6 strict priority
roce 3 3 dwrr-50%
RoCE Pool Status
===================
name mode pool-id switch-priorities traffic-class size current-usage max-usage
-- --------------------- ------- ------- ----------------- ------------- -------- ------------- ---------
0 lossy-default-ingress DYNAMIC 2 0,1,2,4,5,6,7 - 15.16 MB 0 Bytes 16.00 MB
1 roce-reserved-ingress DYNAMIC 3 3 - 15.16 MB 7.30 MB 7.90 MB
2 lossy-default-egress DYNAMIC 13 - 0,6 15.16 MB 0 Bytes 16.01 MB
3 roce-reserved-egress DYNAMIC 14 - 3 inf 7.29 MB 13.47 MB
To show detailed information about current buffer utilization as well as historic RoCE byte and packet counts, run the nv show interface <interface> qos roce counters
command:
cumulus@switch:mgmt:~$ nv show interface swp16 qos roce counters
operational applied description
----------------------------- ------------ ------- ------------------------------------------------------
rx-stats
rx-non-roce-stats
buffer-max-usage 144 Bytes Max Ingress Pool-buffer usage for non-RoCE traffic
buffer-usage 0 Bytes Current Ingress Pool-buffer usage for non-RoCE traffic
no-buffer-discard 55 Rx buffer discards for non-RoCE traffic
non-roce-bytes 56.52 MB non-roce rx bytes
non-roce-packets 462975 non-roce rx packets
pg-max-usage 144 Bytes Max PG-buffer usage for non-RoCE traffic
pg-usage 0 Bytes Current PG-buffer usage for non-RoCE traffic
rx-pfc-stats
pause-duration 0 Rx PFC pause duration for RoCE traffic
pause-packets 0 Rx PFC pause packets for RoCE traffic
rx-roce-stats
buffer-max-usage 0 Bytes Max Ingress Pool-buffer usage for RoCE traffic
buffer-usage 0 Bytes Current Ingress Pool-buffer usage for RoCE traffic
no-buffer-discard 0 Rx buffer discards for RoCE traffic
pg-max-usage 0 Bytes Max PG-buffer usage for RoCE traffic
pg-usage 0 Bytes Current PG-buffer usage for RoCE traffic
roce-bytes 0 Bytes Rx RoCE Bytes
roce-packets 0 Rx RoCE Packets
tx-stats
tx-cnp-stats
buffer-max-usage 16.02 MB Max Egress Pool-buffer usage for CNP traffic
buffer-usage 0 Bytes Current Egress Pool-buffer usage for CNP traffic
cnp-bytes 0 Bytes Tx CNP Packet Bytes
cnp-packets 0 Tx CNP Packets
tc-max-usage 0 Bytes Max TC-buffer usage for CNP traffic
tc-usage 0 Bytes Current TC-buffer usage for CNP traffic
unicast-no-buffer-discard 0 Tx buffer discards for CNP traffic
tx-ecn-stats
ecn-marked-packets 693777677344 Tx ECN marked packets
tx-pfc-stats
pause-duration 0 Tx PFC pause duration for RoCE traffic
pause-packets 0 Tx PFC pause packets for RoCE traffic
tx-roce-stats
buffer-max-usage 13.47 MB Max Egress Pool-buffer usage for RoCE traffic
buffer-usage 7.29 MB Current Egress Pool-buffer usage for RoCE traffic
roce-bytes 92824.38 GB Tx RoCE Packet bytes
roce-packets 803785675319 Tx RoCE Packets
tc-max-usage 16.02 MB Max TC-buffer usage for RoCE traffic
tc-usage 7.29 MB Current TC-buffer usage for RoCE traffic
unicast-no-buffer-discard 663060754115 Tx buffer discards for RoCE traffic
To reset the counters in the nv show interface <interface> qos roce
command output, run the nv action clear interface <interface> qos roce counters
command.
Change RoCE Configuration
You can adjust RoCE settings using NVUE after you enable RoCE. To change the memory allocation for RoCE lossless mode to 60 percent:
cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy memory-percent 40
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossless memory-percent 60
cumulus@switch:mgmt:~$ nv config apply
To change the memory allocation of the RoCE lossy traffic pool to 60 percent and remap switch priority 4 to RoCE lossy traffic:
cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy switch-priority 0-3,5-7
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossy memory-percent 60
cumulus@switch:mgmt:~$ nv set qos traffic-pool default-lossy memory-percent 40
cumulus@switch:mgmt:~$ nv set qos traffic-pool roce-lossy switch-priority 4
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 4 traffic-class 3
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 3 traffic-class 0
cumulus@switch:mgmt:~$ nv set qos mapping default-global trust both
cumulus@switch:mgmt:~$ nv set qos mapping default-global dscp 26 switch-priority 4
cumulus@switch:mgmt:~$ nv config apply
To change the RoCE lossless switch priority from switch priority 3 to switch priority 2:
cumulus@switch:mgmt:~$ nv set qos pfc default-global switch-priority 2
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 2 traffic-class 3
cumulus@switch:mgmt:~$ nv set qos egress-queue-mapping default-global switch-priority 3 traffic-class 0
cumulus@switch:mgmt:~$ nv set qos mapping default-global trust both
cumulus@switch:mgmt:~$ nv set qos mapping default-global dscp 26 switch-priority 2
Related Information
- RoCE introduction - roceinitiative.org