Monitoring Best Practices
The following monitoring processes are best practices for reviewing and troubleshooting potential issues with Cumulus Linux environments.
This document describes:
- Metrics that you can poll from Cumulus Linux and use in trend analysis
- Critical log messages that you can monitor for triggered alerts
Trend Analysis Using Metrics
A metric is a quantifiable measure that tracks and assesses the status of a specific infrastructure component. Examples of metrics include bytes on an interface, CPU utilization, and total number of routes.
Metrics are more valuable when you use them for trend analysis.
Generate Alerts with Triggered Logging
Cumulus Linux typically sends triggered issues to syslog, but can send issues to another log file depending on the feature. rsyslog handles all logging, including local and remote logging. Logs are the best method to use for generating alerts when the system transitions from a stable steady state.
Sending logs to a centralized collector and then creating alerts based on critical logs is an optimal solution.
Log Formatting
Most log files in Cumulus Linux use a standard presentation format. For example:
2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%
- 2017-03-08T06:26:43.569681+00:00 is the timestamp.
- leaf01 is the hostname.
- sysmonitor is the process that is the source of the message.
- Critically high CPU use: 99% is the message.
For brevity and legibility, this section omits the timestamp and hostname from examples.
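As a minimal sketch, assuming log lines follow the timestamp, hostname, process, message layout described above, the following Python splits a line into those four fields. The `LogRecord` type and `parse_log_line` function are hypothetical names used for illustration; the optional `[pid]` suffix is handled because later examples in this document include it.

```python
import re
from typing import NamedTuple, Optional


class LogRecord(NamedTuple):
    timestamp: str
    hostname: str
    process: str
    message: str


# timestamp, hostname, process name (optionally followed by [pid]), message
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<hostname>\S+)\s+"
    r"(?P<process>[^\s:\[]+)(?:\[\d+\])?:\s*(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[LogRecord]:
    """Return the four fields of a log line, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return LogRecord(**match.groupdict())
```

A collector can then filter on `process` and `message` without reparsing the whole line.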
Hardware
NVUE provides commands to monitor various switch hardware elements.
Command | Description |
---|---|
nv show platform environment fan | Shows information about the fans on the switch, such as the minimum, maximum, and current speed, the fan state, and the fan direction. |
nv show platform environment led | Shows information about the LEDs on the switch, such as the LED name and color. |
nv show platform environment psu | Shows information about the PSUs on the switch, such as the PSU name and state. |
nv show platform environment temperature | Shows information about the sensors on the switch, such as the critical, maximum, minimum, and current temperature and the current state of the sensor. |
nv show platform environment voltage | Shows the list of voltage sensors on the switch. |
nv show platform inventory | Shows the switch inventory, which includes fan and PSU hardware version, model, serial number, state, and type. For information about a specific fan or PSU, run the nv show platform inventory <inventory-name> command. |
The following example shows the nv show platform environment fan command output. The airflow direction must be the same for all fans. If Cumulus Linux detects that the fan airflow direction is not uniform, it logs a message in the /var/log/syslog file.
cumulus@switch:~$ nv show platform environment fan
Name Fan State Current Speed (RPM) Max Speed Min Speed Fan Direction
-------- --------- ------------------- --------- --------- -------------
FAN1/1 ok 6000 29000 2500 F2B
FAN1/2 ok 6000 29000 2500 F2B
FAN2/1 ok 6000 29000 2500 F2B
FAN2/2 ok 6000 29000 2500 F2B
FAN3/1 ok 6000 29000 2500 F2B
FAN3/2 ok 6000 29000 2500 F2B
PSU1/FAN ok 6000 29000 2500 F2B
PSU2/FAN ok 6000 29000 2500 F2B
If the airflow direction is not the same for all fans (front to back or back to front), cooling is suboptimal for the switch, rack, and even the entire data center.
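The uniform airflow check can be sketched in Python as follows. This is an illustration only: it assumes the text layout of the fan table shown above, with the direction as the last column, and assumes that F2B and B2F are the only direction values. The function names are hypothetical.

```python
from collections import Counter


def fan_directions(table_text):
    """Count fans per airflow direction, reading the last column of each row."""
    counts = Counter()
    for row in table_text.splitlines():
        fields = row.split()
        # Header and separator rows do not end in a direction value, so skip them.
        if len(fields) >= 6 and fields[-1] in ("F2B", "B2F"):
            counts[fields[-1]] += 1
    return counts


def airflow_is_uniform(table_text):
    """Return True when every fan row reports the same direction."""
    return len(fan_directions(table_text)) <= 1
```

If `airflow_is_uniform` returns False, `fan_directions` shows how many fans report each direction.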
The smond process provides monitoring for various switch hardware elements. Minimum or maximum values depend on the flags you apply to the basic command. The table below lists the hardware elements and applicable commands and flags.
Hardware Element | Monitoring Commands | Interval Poll |
---|---|---|
Temperature | smonctl -j, smonctl -j -s TEMP[X] | 10 seconds |
Fan | smonctl -j, smonctl -j -s FAN[X] | 10 seconds |
PSU | smonctl -j, smonctl -j -s PSU[X] | 10 seconds |
PSU Fan | smonctl -j, smonctl -j -s PSU[X]Fan[X] | 10 seconds |
PSU Temperature | smonctl -j, smonctl -j -s PSU[X]Temp[X] | 10 seconds |
Voltage | smonctl -j, smonctl -j -s Volt[X] | 10 seconds |
Front Panel LED | ledmgrd -d, ledmgrd -j | 5 seconds |
Not all switch models include a sensor for monitoring power consumption and voltage. See this note for details.
Hardware Logs | Log Location | Log Entries |
---|---|---|
High temperature | /var/log/syslog | /usr/sbin/smond : : Temp1(Board Sensor near CPU): state changed from UNKNOWN to OK |
Fan speed issues | /var/log/syslog | /usr/sbin/smond : : Fan1(Fan Tray 1, Fan 1): state changed from UNKNOWN to OK |
Fan direction issue | /var/log/syslog | /usr/sbin/smond : : Fan direction mismatch: 12 fans B2F; 1 fans F2B! |
PSU failure | /var/log/syslog | /usr/sbin/smond : : PSU1Fan1(PSU1 Fan): state changed from UNKNOWN to OK |
System Data
Cumulus Linux includes several ways to monitor system data. In addition, you can receive alerts in high-risk situations.
CPU Idle Time
When a CPU reports five high CPU alerts within a span of five minutes, the switch logs an alert.
Short bursts of high CPU can occur during switchd churn or routing protocol startup. Do not set alerts for these short bursts.
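To avoid alerting on short bursts in your own tooling, you can apply the same idea as the five alerts in five minutes rule described above. The following Python is a sketch of a sliding-window filter, not the switch's implementation; the class name `BurstFilter` and its defaults are assumptions for illustration.

```python
from collections import deque


class BurstFilter:
    """Raise an alert only when enough events fall inside a time window."""

    def __init__(self, threshold=5, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, timestamp):
        """Record one high-CPU event (timestamp in seconds).

        Returns True when at least `threshold` events, including this one,
        occurred within the last `window_seconds`.
        """
        self._events.append(timestamp)
        # Drop events that are older than the window.
        while self._events and timestamp - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

Feeding the filter one timestamp per high CPU sample suppresses isolated spikes while still catching sustained load.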
System Element | Monitoring Commands | Interval Poll |
---|---|---|
CPU utilization | NVUE: nv show system cpu; Linux: sudo cat /proc/stat, top -b -n 1 | 30 seconds |
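The counters in /proc/stat are cumulative, so a utilization figure needs two samples. As a sketch, assuming the standard Linux layout of the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal), the following Python computes the busy percentage between two polls. The function names are hypothetical.

```python
def cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user..steal; guest time is already included in user time.
            values = [int(v) for v in fields[1:9]]
            idle = values[3] + (values[4] if len(values) > 4 else 0)  # idle + iowait
            return idle, sum(values)
    raise ValueError("no aggregate 'cpu' line found")


def cpu_utilization(before, after):
    """Return the busy percentage between two (idle, total) samples."""
    idle_delta = after[0] - before[0]
    total_delta = after[1] - before[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1 - idle_delta / total_delta)
```

Take one sample per polling interval and pass consecutive samples to `cpu_utilization`.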
CPU Logs | Log Location | Log Entries |
---|---|---|
High CPU | /var/log/syslog | sysmonitor: Critically high CPU use: 99% |
Cumulus Linux monitors CPU, memory, and disk space with sysmonitor. The configurations for the thresholds are in /etc/cumulus/sysmonitor.conf. For more information, see man sysmonitor.
CPU measure | Thresholds |
---|---|
Use | Alert: 90% Crit: 95% |
Process Load | Alarm: 95% Crit: 125% |
Spectrum 1 CPUs can become overloaded at moderate to high network scale. If your Spectrum 1 switch is not able to process CPU-destined traffic or is running continually at high CPU, either reduce the scale of the network where you deploy Spectrum 1 switches or replace the switch with a newer generation switch that offers stronger compute resources.
Disk Usage
To monitor disk utilization, run the NVUE nv show system disk usage command or the Linux sudo df command. The output shows the total storage capacity of each filesystem, the amount of space in use, the amount of free space available, the percentage of the total capacity in use, and the directory or mount point where the filesystem attaches to the system.
cumulus@switch:~$ nv show system disk usage
Mount Point Filesystem Size Used Avail Use%
----------- ---------- -- --------- ---- ----
/ /dev/sda5 5.4G 3.0G 2.2G 58%
/dev udev 2.0G 0 2.0G 0%
/dev/shm tmpfs 2.1G 61M 2.0G 3%
/run tmpfs 411M 38M 374M 10%
/run/lock tmpfs 5.0M 0 5.0M 0%
/tmp tmpfs 2.1G 12K 2.1G 1%
/vagrant vagrant 4.3T 3.1T 1.3T 72%
- To show the disk usage in JSON format, run the nv show system disk usage -o json command.
- To show the disk usage in YAML format, run the nv show system disk usage -o yaml command.
When monitoring disk utilization with the Linux command, you can exclude the tmpfs filesystem with sudo df -x tmpfs.
cumulus@switch:~$ sudo df -x tmpfs
Filesystem 1K-blocks Used Available Use% Mounted on
udev 867272 0 867272 0% /dev
/dev/vda5 5646348 2417272 2921624 46% /
/dev/vdb 354 354 0 100% /mnt/air
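As a rough sketch, the following Python flags filesystems at or above a usage limit by parsing df output in the column layout shown above. The function name and the limit are assumptions for illustration, and the parser does not handle mount points that contain spaces.

```python
def filesystems_over(df_output, limit_percent):
    """Return (mount point, use %) pairs at or above limit_percent."""
    flagged = []
    for line in df_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue
        use = int(fields[4].rstrip("%"))
        if use >= limit_percent:
            flagged.append((fields[5], use))
    return flagged
```

Run against the example output with a limit of 90, the function reports only the /mnt/air mount point.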
Process Restart
In Cumulus Linux, systemd monitors and restarts processes.
To view processes that systemd monitors, run the systemctl status command.
cumulus@switch:~$ systemctl status
● leaf01
State: running
Units: 521 loaded (incl. loaded aliases)
Jobs: 0 queued
Failed: 0 units
Since: Wed 2024-11-13 19:16:28 UTC; 4 weeks 0 days ago
systemd: 252.30-1~deb12u2
CGroup: /
├─1001 bpfilter_umh
├─init.scope
│ └─1 /sbin/init
└─system.slice
├─acpid.service
│ └─850 /usr/sbin/acpid
├─auditd.service
│ └─373 /sbin/auditd
├─cl-system-services.service
│ └─1182 /usr/sbin/cl_system_services -l INFO
├─clagd.service
│ └─2550 /usr/bin/python3 -u /usr/sbin/clagd --daemon linklocal pe>
├─cron.service
│ └─869 /usr/sbin/cron -f -L 38
├─csmgrd.service
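If you collect the systemctl status output with a script, the Failed line in the header is a convenient value to watch. The following Python is a sketch that assumes the header layout shown in the example output; the function name is hypothetical.

```python
import re


def failed_unit_count(status_output):
    """Return the number of failed units from 'systemctl status' output, or None."""
    match = re.search(r"^\s*Failed:\s*(\d+)\s+units?", status_output, re.MULTILINE)
    return int(match.group(1)) if match else None
```

A value greater than zero indicates that at least one unit is in a failed state and needs a closer look.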
Layer 1 Protocols and Interfaces
Link and port state interface transitions log to /var/log/syslog and /var/log/switchd.log.
Interface Element | Monitoring Commands |
---|---|
Link state | NVUE: nv show interface <interface> Linux: sudo cat /sys/class/net/<interface>/operstate |
Link speed | NVUE: nv show interface <interface> Linux: sudo cat /sys/class/net/<interface>/speed |
Port state | NVUE: nv show interface Linux: ip link show |
Bond state | NVUE: nv show interface <bond> Linux: sudo cat /proc/net/bonding/<bond> |
You can obtain interface counters by querying either the hardware or the Linux kernel. The Linux kernel aggregates the output from the hardware.
Interface Counter Element | Monitoring Commands | Interval Poll |
---|---|---|
Interface counters | NVUE: nv show interface <interface> counters; Linux: cat /sys/class/net/<interface>/statistics/<statistic-name>, cl-netstat -j, ethtool -S <interface> | 10 seconds |
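Interface counters are cumulative, so trend analysis usually works on the rate between two polls. The following Python is a minimal sketch that assumes a 10 second polling interval and 64-bit counters; both values, and the function name, are assumptions that you should adjust for your environment.

```python
def counter_rate(previous, current, interval_seconds=10, counter_bits=64):
    """Return the per-second rate between two cumulative counter samples."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    delta = current - previous
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += 2 ** counter_bits
    return delta / interval_seconds
```

For example, a byte counter that moves from 1000 to 6000 across one poll gives a rate of 500 bytes per second. A counter reset (for example, after a reboot) looks like a wrap to this function, so treat an implausibly large rate as a reset.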
Layer 1 Logs | Log Location | Log Entries |
---|---|---|
Link failure/Link flap | /var/log/switchd.log | switchd[5692]: nic.c:213 nic_set_carrier: swp17: setting kernel carrier: down |
Unidirectional link | /var/log/switchd.log | ptmd[7146]: ptm_bfd.c:2471 Created new session 0x1 with peer 10.255.255.11 port swp1 |
Bond Negotiation Working | /var/log/syslog | kernel: [85412.763193] bonding: bond0 is being created… |
Bond Negotiation Failing | /var/log/syslog | kernel: [85412.763193] bonding: bond0 is being created… |
MLAG peerlink negotiation Working | /var/log/syslog | lldpd[998]: error while receiving frame on swp50: Network is down |
MLAG peerlink negotiation Working | /var/log/clagd.log | clagd[14003]: Cleanup is executing. |
MLAG peerlink negotiation Failing | /var/log/syslog | lldpd[998]: error while receiving frame on swp50: Network is down |
MLAG peerlink negotiation Failing | /var/log/clagd.log | clagd[26916]: Cleanup is executing. |
MLAG port negotiation Working | /var/log/syslog | kernel: [77419.112195] bonding: server01 is being created… |
MLAG port negotiation Working | /var/log/clagd.log | clagd[14003]: server01 is now dual connected. |
MLAG port negotiation Failing | /var/log/syslog | kernel: [79290.290999] bonding: server01 is being created… |
MLAG port negotiation Failing | /var/log/clagd.log | clagd[14291]: Conflict (server01): matching clag-id (1) not configured on peer… |
MLAG port negotiation Flapping | /var/log/syslog | mstpd: one_clag_cmd: setting (0) mac 00:00:00:00:00:00 <server01, None> |
MLAG port negotiation Flapping | /var/log/clagd.log | clagd[14291]: server01 is no longer dual connected |
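To turn the link failure entry into a flap indicator, you can count carrier down messages per port. The following Python is a sketch based only on the switchd message format shown in the table; the matching `up` form of the message and the function name are assumptions.

```python
import re
from collections import Counter

# Matches the switchd carrier message shown in the Layer 1 Logs table.
CARRIER = re.compile(
    r"nic_set_carrier: (?P<port>[^\s:]+): setting kernel carrier: (?P<state>up|down)"
)


def carrier_down_counts(lines):
    """Count 'carrier: down' messages per port across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CARRIER.search(line)
        if match and match.group("state") == "down":
            counts[match.group("port")] += 1
    return counts
```

A port with several down events inside a short collection window is a candidate for a link flap alert.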
PTM uses LLDP information to compare against a topology.dot file that describes the network. It has built-in alerting capabilities. Use PTM on the switch instead of polling LLDP information regularly. You can install PTM from the Cumulus Linux GitHub repository.
Consider tracking peering information through PTM. For more information, refer to the Prescriptive Topology Manager documentation.
Neighbor Element | Monitoring Commands | Interval Poll |
---|---|---|
LLDP Neighbor | sudo lldpctl -f json | 300 seconds |
Prescriptive Topology Manager | ptmctl -j | Triggered |
Layer 2 Protocols
Spanning tree is a protocol that prevents loops in a layer 2 infrastructure. In a stable state, the spanning tree protocol converges. Monitor the Topology Change Notifications (TCN) in STP to identify when new BPDUs arrive.
Interface Counter Element | Monitoring Commands | Interval Poll |
---|---|---|
STP TCN Transitions | NVUE: nv show bridge domain <bridge> stp; Linux: mstpctl showbridge json, mstpctl showport | 60 seconds |
MLAG peer state | NVUE: nv show mlag; Linux: clagctl status, sudo clagd -j, sudo cat /var/log/clagd.log | 60 seconds |
MLAG peer MACs | NVUE: nv show mlag; Linux: clagctl dumppeermacs, clagctl dumpourmacs | 300 seconds |
Layer 2 Logs | Log Location | Log Entries |
---|---|---|
Spanning Tree Working | /var/log/syslog | kernel: [1653877.190724] device swp1 entered promiscuous mode |
Spanning Tree Blocking | /var/log/syslog | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Designated) |
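As an illustration, the following Python extracts the port and state from the mstpd state change entry shown above. It is a sketch based on that single example line; the group names, including reading the trailing number as an index, are assumptions and not documented field names.

```python
import re

# Pattern derived from the single mstpd example in the Layer 2 Logs table.
STP_STATE = re.compile(
    r"MSTP_OUT_set_state: (?P<bridge>[^:\s]+):(?P<port>[^:\s]+):(?P<index>\d+) "
    r"entering (?P<state>\w+) state\((?P<role>\w+)\)"
)


def parse_stp_state(line):
    """Return the matched fields as a dict, or None when the line does not match."""
    match = STP_STATE.search(line)
    return match.groupdict() if match else None
```

An alert rule can then trigger when `state` is `blocking` on a port that you expect to forward.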
Layer 3 Protocols
When FRR boots up for the first time, there is a different log file for each activated daemon. If you edit the log file (for example, through vtysh or frr.conf), the integrated configuration sends all logs to the same file.
To send FRR logs to syslog, apply the configuration log syslog in vtysh.
BGP
When monitoring BGP, check if BGP peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.
Monitoring the routing table provides trending on the size of the infrastructure. This is useful when you integrate with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.
BGP Element | Monitoring Commands | Interval Poll |
---|---|---|
BGP peer failure | sudo vtysh -c "show ip bgp summary json" | 60 seconds |
BGP route table | sudo vtysh -c "show ip bgp json" | 600 seconds |
BGP Logs | Log Location | Log Entries |
---|---|---|
BGP peer down | /var/log/syslog | bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes |
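Because the transition matters more than the current state, a collector can alert on the NOTIFICATION entry itself. The following Python is a sketch that parses the message format shown in the table; the `received from` variant and the function name are assumptions added for illustration.

```python
import re

# Pattern derived from the bgpd example in the BGP Logs table.
BGP_NOTIFICATION = re.compile(
    r"%NOTIFICATION: (?P<direction>sent to|received from) neighbor (?P<neighbor>\S+) "
    r"(?P<code>\d+)/(?P<subcode>\d+) \((?P<reason>[^)]*)\)"
)


def parse_bgp_notification(line):
    """Return neighbor, code, subcode, and reason, or None when the line does not match."""
    match = BGP_NOTIFICATION.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    fields["subcode"] = int(fields["subcode"])
    return fields
```

Grouping alerts by `neighbor` and `reason` makes it easier to separate a single failed peer from a wider event.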
OSPF
When monitoring OSPF, check if OSPF peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.
Monitoring the routing table provides trending on the size of the infrastructure. This is useful when you integrate with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.
OSPF Element | Monitoring Commands | Interval Poll |
---|---|---|
OSPF protocol peer failure | sudo vtysh -c "show ip ospf neighbor all json", cl-ospf summary show json | 60 seconds |
OSPF link state database | sudo vtysh -c "show ip ospf database" | 600 seconds |
Route and Host Entries
Route Element | Monitoring Commands | Interval Poll |
---|---|---|
Host Entries | cl-resource-query, cl-resource-query -k | 600 seconds |
Route Entries | cl-resource-query, cl-resource-query -k | 600 seconds |
Routing Logs
Layer 3 Logs | Log Location | Log Entries |
---|---|---|
Routing protocol process crash | /var/log/syslog | frrouting[1824]: Starting FRRouting daemons (prio:10):. zebra. bgpd. |
Logging
The table below describes the various log files.
Logging Element | Description | Log Location |
---|---|---|
syslog | Catch all log file. Identifies memory leaks and CPU spikes. | /var/log/syslog |
switchd functionality | Hardware Abstraction Layer (HAL). | /var/log/switchd.log |
Routing daemons | FRR zebra daemon details. | /var/log/daemon.log |
Routing protocol | The log file is configurable in FRR. When FRR first boots, it uses the non-integrated configuration so each routing protocol has its own log file. After booting up, FRR switches over to using the integrated configuration, so that all logs go to a single place. To edit the location of the log files, use the log file command. To send logs through rsyslog and into /var/log/syslog, use the log syslog command. Note: To write syslog debug messages to the log file, you must run the log syslog debug command to configure FRR with syslog severity 7 (debug); otherwise, when you issue a debug command such as debug bgp neighbor-events, no output logs to /var/log/frr/frr.log. However, when you manually define a log target with the log file /var/log/frr/debug.log command, FRR automatically defaults to severity 7 (debug) logging and the output logs to /var/log/frr/frr.log. | /var/log/frr/zebra.log |
Device Management
Device Access Logs
Access Logs | Log Location | Log Entries |
---|---|---|
User Authentication and Remote Login | /var/log/syslog | sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2: RSA 38:e6:3b:cc:04:ac:41:5e:c9:e3:93:9d:cc:9e:48:25 |
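To audit remote logins from a collector, you can extract the user and source address from the sshd entry. The following Python is a sketch based on the example message above; the function name is hypothetical, and it matches successful logins only.

```python
import re

# Pattern derived from the sshd example in the Access Logs table.
SSH_ACCEPTED = re.compile(
    r"sshd\[\d+\]: Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<address>\S+) port (?P<port>\d+)"
)


def parse_ssh_login(line):
    """Return method, user, address, and port for an accepted login, or None."""
    match = SSH_ACCEPTED.search(line)
    return match.groupdict() if match else None
```

Comparing `address` against a list of expected management hosts is one way to highlight unexpected logins.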
Device Super User Command Logs
Super User Command Logs | Log Location | Log Entries |
---|---|---|
Executing commands using sudo | /var/log/syslog | sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; COMMAND=/tmp/script_9938.sh -v |
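The sudo entry is a list of KEY=value pairs, which makes it straightforward to parse for auditing. The following Python is a sketch based on the example message above; the function name and the `invoking_user` key are hypothetical.

```python
def parse_sudo_entry(message):
    """Split a sudo log message into the invoking user and its KEY=value fields."""
    _, found, rest = message.partition("sudo:")
    if not found:
        return None
    user, sep, details = rest.strip().partition(":")
    if not sep:
        return None
    # COMMAND is last and can itself contain ';' or '=', so split it off first.
    head, sep, command = details.partition("COMMAND=")
    if not sep:
        return None
    entry = {"invoking_user": user.strip()}
    for part in head.split(";"):
        key, eq, value = part.strip().partition("=")
        if eq:
            entry[key] = value
    entry["COMMAND"] = command.strip()
    return entry
```

An audit rule can then alert when `USER` is root and `COMMAND` is outside an approved list.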