Monitoring System Hardware

You can monitor system hardware with the following commands and utilities:

NVUE
decode-syseeprom
smond
sensors
watchdog

NVUE Commands

You can run NVUE commands to monitor your system hardware.

Command	Description
`nv show system health`	Shows information about the health of the switch including the status of the ASIC, hardware, and process and describes any issues. Run this command to check real-time health metrics and view historical health data. To show system health information for a specific configuration revision, run the `nv show system health --rev <revision-id>` command. To show system health information in `json` format, run the `nv show system health -o json` command.
`nv show platform`	Shows platform hardware information on the switch, such as the model and manufacturer, memory, serial number and system MAC address.
`nv show platform firmware`	Shows switch firmware information, such as the name (BIOS or SSD), part number, and firmware source.
`nv show platform environment fan`	Shows information about the fans on the switch, such as the minimum, maximum and current speed, the fan state, and the fan direction.
`nv show platform environment led`	Shows information about the LEDs on the switch, such as the LED name and color.
`nv show platform environment psu`	Shows information about the PSUs on the switch, such as the PSU name and state.
`nv show platform environment temperature`	Shows information about the sensors on the switch, such as the critical, maximum, minimum and current temperature and the current state of the sensor.
`nv show platform environment voltage`	Shows the list of voltage sensors on the switch. Note: On the SN3700 and SN3700c switch, the `nv show platform environment voltage` command output shows a failed state for the PSU-n-12V-RAIL-OUT sensors. This is a known hardware limitation that cannot be corrected by the PSU vendor.
`nv show platform inventory`	Shows the switch inventory, which includes fan and PSU hardware version, model, serial number, state, and type. For information about a specific fan or PSU, run the `nv show platform inventory <inventory-name>` command.

The following example shows the nv show health command output when the health of the switch is not good:

cumulus@switch:~$ nv show system health
            operational  applied
----------  -----------  -------
status      Not OK          
status-led  amber

Health issues
================
    Component           Status information                                           
    ------------------  -------------------------------------------------------------
    forwarding          active (running) since Tue 2025-09-30 14:36:55 UTC; 1 day 21h ago
    hw-management       inactive                                                         
    hw-management-sync  inactive                                                         
    hw-management-tc    inactive                                                         
    mft                 inactive                                                         
    process             Not OK

The following example shows the nv show platform command output:

cumulus@switch:~$ nv show platform
               operational      
-------------  -----------------
system-mac     44:38:39:22:01:b1                               
manufacturer   Cumulus                                         
cpu            x86_64 QEMU Virtual CPU version 2.5+ x1         
memory         1.67 GB                                         
disk-size      n/a                                             
port-layout    n/a                                             
part-number    5.15.0                                          
serial-number  44:38:39:22:01:7a                               
asic-model     n/a                                             
system-uuid    51c411e8-43d2-4e60-a7e7-e068aa04b7f9            
system-type    VX

The following example shows the nv show platform environment fan command output. The airflow direction must be the same for all fans. If Cumulus Linux detects that the fan airflow direction is not uniform, it logs a message in the var/log/syslog file.

cumulus@switch:~$ nv show platform environment fan
Name      Fan State  Current Speed (RPM)  Max Speed  Min Speed  Fan Direction
--------  ---------  -------------------  ---------  ---------  -------------
FAN1/1    ok         6000                 29000      2500       F2B         
FAN1/2    ok         6000                 29000      2500       F2B         
FAN2/1    ok         6000                 29000      2500       F2B         
FAN2/2    ok         6000                 29000      2500       F2B         
FAN3/1    ok         6000                 29000      2500       F2B         
FAN3/2    ok         6000                 29000      2500       F2B         
PSU1/FAN  ok         6000                 29000      2500       F2B         
PSU2/FAN  ok         6000                 29000      2500       F2B

If the airflow direction for all fans is not in the same (front to back or back to front), cooling is suboptimal for the switch, rack, and even the entire data center.
During thermal overload or if you physically remove a fan tray while the switch is powered on, the switch reboots and none of the interfaces come up until you power cycle the switch with the fan tray reinserted or when the environmental temperature corrects. You can detect this condition with the following log message:

switch determine-reset[8801]: determine-reset-reason INFO: Platform api indicates reboot cause Thermal Overload: ASIC

decode-syseeprom Command

Use the decode-syseeprom command to retrieve information about the switch EEPROM. If the EEPROM is writable, you can set values on the EEPROM.

The following is example decode-syseeprom command output. The output is different on different switches:

cumulus@switch:~$ decode-syseeprom
TlvInfo Header:
   Id String:    TlvInfo
   Version:      1
   Total Length: 69
TLV Name             Code Len Value
-------------------- ---- --- -----
Vendor Name          0x2D  16 Cumulus Networks
Product Name         0x21   2 VX
Device Version       0x26   1 3
Part Number          0x22   5 5.15
MAC Addresses        0x2A   2 55
Base MAC Address     0x24   6 44:38:39:22:01:7A
Serial Number        0x23  17 44:38:39:22:01:7a
CRC-32               0xFE   4 0xF305A73F
(checksum valid)

The decode-syseeprom command includes the following options:

Option	Description
`-h`, `-help`	Displays the help message and exits.
`-a`	Prints the base MAC address for switch interfaces.
`-r`	Prints the number of MAC addresses allocated for the switch interfaces.
`-s`	Sets the EEPROM content (if the EEPROM is writable). You can provide arguments in the command line in a comma separated list in the form `<field>=<value>`. `.` `,` and `=` are not allowed in field names and values. Any field not specified defaults to the current value. NVIDIA Spectrum switches do not support this option.
`-j`, `--json`	Displays JSON output.
`-t <target>`	Prints the target EEPROM information (board, psu2, psu1).
`--serial`, `-e`	Prints the device serial number.
`-m`	Prints the base MAC address for the management interfaces.
`--init`	Clears and initializes the board EEPROM cache.

Run the sudo dmidecode command to retrieve hardware configuration information populated in the BIOS.

smond

The smond service monitors system units like power supply and fan, updates the corresponding LEDs, and logs the change in state. The cpld registers detect changes in system unit state. smond utilizes these registers to read all sources, which determines the health of the unit and updates the system LEDs.

Run the sudo smonctl command to display sensor information for the various system units:

cumulus@switch:~$ sudo smonctl
Fan1      (Fan Tray 1, Fan 1                     ):  OK
Fan2      (Fan Tray 1, Fan 2                     ):  OK
Fan3      (Fan Tray 2, Fan 1                     ):  OK
Fan4      (Fan Tray 2, Fan 2                     ):  OK
Fan5      (Fan Tray 3, Fan 1                     ):  OK
Fan6      (Fan Tray 3, Fan 2                     ):  OK
PSU1                                              :  OK
PSU2                                              :  OK
PSU1Fan1  (PSU1 Fan                              ):  OK
PSU1Temp1 (PSU1 Temp Sensor                      ):  OK
PSU2Fan1  (PSU2 Fan                              ):  OK
PSU2Temp1 (PSU2 Temp Sensor                      ):  OK
Temp1     (Board Sensor near CPU                 ):  OK
Temp2     (Board Sensor Near Virtual Switch      ):  OK
Temp3     (Board Sensor at Front Left Corner     ):  OK
Temp4     (Board Sensor at Front Right Corner    ):  OK
Temp5     (Board Sensor near Fan                 ):  OK

When the switch is not powered on, smonctl shows the PSU status as BAD instead of POWERED OFF or NOT DETECTED. This is a known limitation.

The smonctl command includes the following options:

Option	Description
`-s <sensor>`, `--sensor <sensor>`	Displays data for the specified sensor.
`-v`, `--verbose`	Displays detailed hardware sensors data.

The following command example shows information about FAN6 on the switch:

cumulus@switch:~$ smonctl -s FAN6 -v
Fan6      (Fan Tray 3, Fan 2                     ):  OK

For more information, read man smond and man smonctl.

sensors Command

Run the sensors command to monitor the health of your switch hardware, such as power, temperature and fan speeds. This command executes lm-sensors.

Even though you can use the sensors command to monitor the health of your switch hardware, NVIDIA recommends you use the smond daemon to monitor hardware health. See smond Daemon above.

For example:

cumulus@switch:~$ sensors
cumulus_vx_cpld-isa-0000
Adapter: ISA adapter
fan1:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
fan2:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
fan3:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
fan4:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
fan5:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
fan6:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
fan7:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
fan8:        6000 RPM  (min = 2500 RPM, max = 29000 RPM)
temp1:        +25.0°C  (low  =  +5.0°C, high = +80.0°C)
                       (crit low =  +0.0°C, crit = +85.0°C)
temp2:        +25.0°C  (low  =  +5.0°C, high = +80.0°C)
                       (crit low =  +0.0°C, crit = +85.0°C)
temp3:        +25.0°C  (low  =  +5.0°C, high = +80.0°C)
                       (crit low =  +0.0°C, crit = +85.0°C)
temp4:        +25.0°C  (low  =  +5.0°C, high = +80.0°C)
                       (crit low =  +0.0°C, crit = +85.0°C)
temp5:        +25.0°C  (low  =  +5.0°C, high = +80.0°C)
                       (crit low =  +0.0°C, crit = +85.0°C)
temp6:        +25.0°C  (low  =  +5.0°C, high = +80.0°C)
                       (crit low =  +0.0°C, crit = +85.0°C)
temp7:        +25.0°C  (low  =  +5.0°C, high = +80.0°C)
                       (crit low =  +0.0°C, crit = +85.0°C)

Output from the sensors command varies depending upon the switch.
If you only plug in one PSU, the fan is at maximum speed.

The following table shows the sensors command options.

Option	Description
`-c --config-file`	Specify a configuration file; use `-` after `-c` to read the configuration file from `stdin`; by default, `sensors` references the configuration file in `/etc/sensors.d/`.
`-s --set`	Execute set statements in the configuration file (root only); `sensors -s` runs one time at boot and applies all the settings to the boot drivers.
`-f --fahrenheit`	Show temperatures in degrees Fahrenheit.
`-A --no-adapter` `-A --bus-list`	Do not show the adapter for each chip. Generate bus statements for `sensors.conf`.
`-u`	Generate raw output.
`-j`	Generate json output.
`-v`	Show the program version.

Hardware Watchdog

Cumulus Linux includes a simplified version of the wd_keepalive(8) daemon instead of the one in the standard watchdog Debian package. wd_keepalive writes to a file called /dev/watchdog periodically (at least one time per minute) to prevent the switch from resetting. Each write delays the reboot time by another minute. After one minute of inactivity, where wd_keepalive does not write to /dev/watchdog, the switch resets itself.

Cumulus Linux enables the watchdog by default, which starts when you boot the switch (before switchd starts).

To disable the watchdog, disable and stop the wd_keepalive service:

cumulus@switch:~$ sudo systemctl disable wd_keepalive ; systemctl stop wd_keepalive

You can modify the settings for the watchdog, such as the timeout and the scheduler priority, in the /etc/watchdog.conf configuration file.

cumulus@switch:~$ sudo nano /etc/watchdog.conf
watchdog-device	= /dev/watchdog
# Set the hardware watchdog timeout in seconds
watchdog-timeout = 30
# Kick the hardware watchdog every 'interval' seconds
interval = 5
# Log a status message every (interval * logtick) seconds.  Requires
# --verbose option to enable.
logtick = 240
# Run the daemon using default scheduler SCHED_OTHER with slightly
# elevated process priority.  See man setpriority(2).
realtime = no
priority = -2

Health Monitoring Reference

The following table summarizes the events that Cumulus Linux monitors for system health:

Event Category	Component Name	Severity	Event Description	Threshold/Condition	States
Temperature Sensor	Temp1-Temp8	WARNING	Temperature sensor state is HIGH	temp ≥ max_hyst but < max	HIGH
Temperature Sensor	Temp1-Temp8	WARNING	Temperature sensor state is LOW	temp ≤ min but > lcrit	LOW
Temperature Sensor	Temp1-Temp8	CRITICAL	Temperature sensor state is CRITICAL	temp ≥ max	CRITICAL
Temperature Sensor	Temp1-Temp8	CRITICAL	Temperature sensor state is LCRITICAL	temp ≤ lcrit	LCRITICAL
Temperature Sensor	Temp1-Temp8	ERROR	Temperature sensor state is BAD	Sensor data outside limits or read failure	BAD
Temperature Sensor	Temp1-Temp8	INFO	Temperature sensor state is ABSENT	Sensor not present in system	ABSENT
Temperature Sensor	Temp1-Temp8	INFO	Temperature sensor state is OK	temp within normal range	OK
ASIC Temperature	ASIC1	WARNING	ASIC temperature is too hot	Temperature exceeds threshold	Not OK
Fan Status	Fan1-FanN	WARNING	Fan speed is out of range	Speed not within min-max range	BAD
Fan Status	Fan1-FanN	WARNING	Fan is not working	Fan status check failed	BAD
Fan Status	Fan1-FanN	ERROR	Failed to get fan speed data	Unable to read fan metrics	BAD
Fan Status	Fan1-FanN	INFO	Fan is missing	Fan hardware not detected	ABSENT
Fan Status	Fan1-FanN	CRITICAL	Fan direction mismatch	Mix of B2F and F2B fans	BAD
Fan Status	Fan1-FanN	INFO	Fan state is OK	Fan operating normally	OK
Power Supply	PSU1, PSU2	ERROR	PSU state is BAD	PSU failure or malfunction	BAD
Power Supply	PSU1, PSU2	INFO	PSU is ABSENT	PSU not installed or detected	ABSENT
Power Supply	PSU1, PSU2	INFO	PSU state is OK	PSU operating normally	OK
PSU Fan	PSU1Fan1, PSU2Fan1	WARNING	PSU fan failure	Fan in PSU module failed	BAD
PSU Temperature	PSU1Temp1, PSU2Temp1	WARNING	PSU temperature out of range	Temperature sensor in PSU reporting issues	HIGH/CRITICAL
CPU Utilization	CpuStatus	ALERT	High CPU use	80% ≤ CPU < 95%	Alert
CPU Utilization	CpuStatus	CRITICAL	Critically high CPU use	CPU ≥ 95%	Critical
CPU Utilization	CpuStatus	INFO	CPU use no longer high	CPU < 80% (recovered)	OK
CPU Status	cpu	WARNING	CPU status is Not OK	Critically high CPU in last 5 min	Not OK
Memory Usage	MemoryStatus	ALERT	Low free memory	90% ≤ Memory < 95%	Alert
Memory Usage	MemoryStatus	CRITICAL	Critically low free memory	Memory ≥ 95%	Critical
Memory Usage	MemoryStatus	INFO	Free memory no longer low	Memory < 90% (recovered)	OK
Root Filesystem	disk	ALERT	Filesystem nearly full	90% ≤ Disk < 95%	Alert
Root Filesystem	disk	CRITICAL	Filesystem critically full	Disk ≥ 95%	Critical
/var/log Partition	var-log	ALERT	/var/log partition nearly full	/var/log usage ≥ 90%	Alert
System Load	LoadAverage	ALERT	High load average	95% ≤ Load < 125% (5-min avg per core)	Alert
System Load	LoadAverage	CRITICAL	Critically high load average	Load ≥ 125% (5-min avg per core)	Critical
ASIC Status	ASIC	ERROR	ASIC state is Not OK	ASIC not detected via lspci	Not OK
ASIC Status	ASIC	CRITICAL	ASIC thermal reset detected	System rebooted due to ASIC thermal shutdown	Not OK (Thermal Reset)
ASIC Status	ASIC	INFO	ASIC state is OK	ASIC detected and functioning	OK
Service Status	switchd	ERROR	switchd service not active	switchd is inactive or failed	inactive/failed
Service Status	frr	ERROR	FRR routing service not active	frr service is inactive	inactive
Service Status	nvued	ERROR	NVUE daemon not active	nvued is inactive or failed	inactive/failed
Service Status	lldpd, smond, etc.	WARNING	Service not active	Service is inactive	inactive
Transceiver Temp	swp1-swpN	WARNING	Transceiver temperature high alarm	Module temp high alarm/warning ON	Not OK
Transceiver Temp	swp1-swpN	WARNING	Transceiver temperature low alarm	Module temp low alarm/warning ON	Not OK
Transceiver Status	transceiver	INFO	Transceiver status is OK	All transceivers operating normally	OK

Monitoring System Hardware

NVUE Commands

decode-syseeprom Command

smond

sensors Command

Hardware Watchdog

Health Monitoring Reference

Products

Solutions

Learn

Monitoring System Hardware

NVUE Commands

decode-syseeprom Command

smond

sensors Command

Hardware Watchdog

Health Monitoring Reference

Related Information