Monitoring Best Practices
The following monitoring processes are considered best practices for reviewing and troubleshooting potential issues with Cumulus Linux environments. In addition, several of the more common issues have been listed, with potential solutions included.
This document describes:
- Metrics that you can poll from Cumulus Linux and use in trend analysis
- Critical log messages that you can monitor for triggered alerts
Trend Analysis Using Metrics
A metric is a quantifiable measure that is used to track and assess the status of a specific infrastructure component. It is a check collected over time. Examples of metrics include bytes on an interface, CPU utilization, and total number of routes.
Metrics are more valuable when used for trend analysis.
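As a rough sketch of trend analysis, the example below fits a least-squares slope to a series of polled samples, which gives the average change per second. The function name `trend_per_second` and the sample route counts are hypothetical and only illustrate the idea.

```python
from typing import Sequence, Tuple


def trend_per_second(samples: Sequence[Tuple[float, float]]) -> float:
    """Return the least-squares slope of (timestamp_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("timestamps must not all be identical")
    return numerator / denominator


# Hypothetical route counts polled every 600 seconds.
route_counts = [(0, 1000), (600, 1030), (1200, 1060), (1800, 1090)]
growth = trend_per_second(route_counts)  # routes added per second
```

A slope that stays positive over a long period, such as a steadily growing route count, is the kind of signal that a single poll cannot show.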
Generate Alerts with Triggered Logging
Triggered issues are normally sent to syslog, but can go to another log file depending on the feature. In Cumulus Linux, rsyslog handles all logging, including local and remote logging. Logs are the best method to use for generating alerts when the system transitions from a stable steady state.
Sending logs to a centralized collector, then creating alerts based on critical logs is an optimal solution for alerting.
Log Formatting
Most log files in Cumulus Linux use a standard presentation format. For example, consider this syslog entry:
2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%
- 2017-03-08T06:26:43.569681+00:00 is the timestamp.
- leaf01 is the hostname.
- sysmonitor is the process that is the source of the message.
- Critically high CPU use: 99% is the message.
For brevity and legibility, the timestamp and hostname have been omitted from the examples in this chapter.
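The format described above can be split into its four fields with a short parser. This is a minimal sketch: the helper name `parse_log_line` is hypothetical, and the pattern assumes that the timestamp, hostname, and process name are each a single token, as in the sample entry.

```python
import re
from datetime import datetime

# timestamp, hostname, process (optionally with [pid]), then the message.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\S+) (?P<hostname>\S+) (?P<process>[^:\s]+): (?P<message>.*)$"
)


def parse_log_line(line):
    """Return a dict of the four fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # The timestamp in the sample entry is ISO 8601 with a UTC offset.
    fields["timestamp"] = datetime.fromisoformat(fields["timestamp"])
    return fields


entry = parse_log_line(
    "2017-03-08T06:26:43.569681+00:00 leaf01 sysmonitor: Critically high CPU use: 99%"
)
```

Once the fields are separated, a collector can filter on the process and message without depending on the hostname or timestamp.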
Hardware
The smond process provides monitoring functionality for various switch hardware elements. Minimum or maximum values are output depending on the flags applied to the basic command. The hardware elements and applicable commands and flags are listed in the table below.
Hardware Element | Monitoring Commands | Interval Poll |
---|---|---|
Temperature | cumulus@switch:~$ smonctl -j | 10 seconds |
Fan | cumulus@switch:~$ smonctl -j | 10 seconds |
PSU | cumulus@switch:~$ smonctl -j | 10 seconds |
PSU Fan | cumulus@switch:~$ smonctl -j | 10 seconds |
PSU Temperature | cumulus@switch:~$ smonctl -j | 10 seconds |
Voltage | cumulus@switch:~$ smonctl -j | 10 seconds |
Front Panel LED | cumulus@switch:~$ ledmgrd -d You can also run | 5 seconds |
Not all switch models include a sensor for monitoring power consumption and voltage. See this note for details.
Hardware Logs | Log Location | Log Entries |
---|---|---|
High temperature | /var/log/syslog | /usr/sbin/smond : : Temp1(Board Sensor near CPU): state changed from UNKNOWN to OK |
Fan speed issues | /var/log/syslog | /usr/sbin/smond : : Fan1(Fan Tray 1, Fan 1): state changed from UNKNOWN to OK |
PSU failure | /var/log/syslog | /usr/sbin/smond : : PSU1Fan1(PSU1 Fan): state changed from UNKNOWN to OK |
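The smond entries above share a "state changed from X to Y" shape, so one pattern can turn them into structured transitions. The sketch below is illustrative only: the helper names are hypothetical, and treating every state other than OK as worth an alert is an assumption rather than documented smond behavior.

```python
import re

# Matches entries such as:
#   Temp1(Board Sensor near CPU): state changed from UNKNOWN to OK
SMOND_TRANSITION = re.compile(
    r"(?P<sensor>\w+)\((?P<description>[^)]*)\): "
    r"state changed from (?P<previous>\w+) to (?P<current>\w+)"
)


def parse_smond_transition(message):
    """Return the sensor, description, and states, or None if no match."""
    match = SMOND_TRANSITION.search(message)
    return match.groupdict() if match else None


def needs_alert(transition):
    """Assumption: any state other than OK should raise an alert."""
    return transition["current"] != "OK"


sample = "/usr/sbin/smond : : Fan1(Fan Tray 1, Fan 1): state changed from UNKNOWN to OK"
transition = parse_smond_transition(sample)
```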
System Data
Cumulus Linux includes a number of ways to monitor various aspects of system data. In addition, alerts are issued in high-risk situations.
CPU Idle Time
When a CPU reports five high CPU alerts within a span of five minutes, an alert is logged.
Short bursts of high CPU can occur during switchd churn or routing protocol startup. Do not set alerts for these short bursts.
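The rule above, five alerts inside five minutes, amounts to a sliding-window count. The sketch below shows one way to apply the same idea in your own alerting pipeline so that isolated bursts are ignored. It is not the sysmonitor implementation, and the class name `BurstFilter` is hypothetical.

```python
from collections import deque


class BurstFilter:
    """Report True only when `threshold` events fall inside `window_seconds`."""

    def __init__(self, threshold=5, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()

    def record(self, timestamp):
        """Record one high-CPU event (epoch seconds); return True to alert."""
        self.events.append(timestamp)
        # Drop events that have aged out of the window. The event just
        # appended always remains, so the deque is never empty here.
        while timestamp - self.events[0] > self.window_seconds:
            self.events.popleft()
        return len(self.events) >= self.threshold


# Five events one minute apart trigger on the fifth event.
burst = BurstFilter()
decisions = [burst.record(t) for t in (0, 60, 120, 180, 240)]
```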
System Element | Monitoring Commands | Interval Poll |
---|---|---|
CPU utilization | cumulus@switch:~$ cat /proc/stat | 30 seconds |
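The counters in /proc/stat are cumulative, so utilization has to be derived from the difference between two polls. The following sketch assumes the standard Linux field order on the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical.

```python
def read_cpu_times(path="/proc/stat"):
    """Return the cumulative counters from the aggregate 'cpu' line."""
    with open(path) as stat_file:
        for line in stat_file:
            if line.startswith("cpu "):
                # Keep the first eight fields only: the guest fields that
                # follow are already included in user and nice.
                return [int(value) for value in line.split()[1:9]]
    raise RuntimeError("no aggregate cpu line found")


def cpu_utilization(before, after):
    """Return the percentage of non-idle time between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Treat both idle (index 3) and iowait (index 4) as idle time.
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return 100.0 * (total - idle) / total


# Hypothetical samples taken 30 seconds apart.
first = [100, 0, 50, 800, 50, 0, 0, 0]
second = [110, 0, 60, 870, 60, 0, 0, 0]
busy_percent = cpu_utilization(first, second)
```

On a switch, `read_cpu_times()` would be called once per poll and each result compared with the previous one.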
CPU Logs | Log Location | Log Entries |
---|---|---|
High CPU | /var/log/syslog | sysmonitor: Critically high CPU use: 99% |
Cumulus Linux 3.0 and later monitors CPU, memory, and disk space via sysmonitor. The configurations for the thresholds are stored in /etc/cumulus/sysmonitor.conf. More information is available with man sysmonitor.
CPU measure | Thresholds |
---|---|
Use | Alert: 90% Crit: 95% |
Process Load | Alarm: 95% Crit: 125% |
Disk Usage
When monitoring disk utilization, you can exclude tmpfs from monitoring.
System Element | Monitoring Commands | Interval Poll |
---|---|---|
Disk utilization | cumulus@switch:~$ /bin/df -x tmpfs | 300 seconds |
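A poller can turn the df output into per-mount percentages and flag the mounts that cross a limit. In this sketch the helper names and the 90 percent limit are hypothetical, and the `-P` flag (POSIX output format) is added as an assumption so that each filesystem stays on one line.

```python
import subprocess


def parse_df(output):
    """Map each mount point to its used percentage from df output."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        try:
            usage[fields[5]] = int(fields[4].rstrip("%"))
        except ValueError:
            continue  # some pseudo filesystems report "-" instead of a number
    return usage


def over_threshold(usage, limit_percent=90):
    """Return only the mounts at or above the limit."""
    return {mount: pct for mount, pct in usage.items() if pct >= limit_percent}


def poll_disk_usage():
    """Run df with tmpfs excluded and return the parsed usage."""
    result = subprocess.run(
        ["/bin/df", "-P", "-x", "tmpfs"],
        capture_output=True, text=True, check=True,
    )
    return parse_df(result.stdout)
```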
Process Restart
In Cumulus Linux, systemd is responsible for monitoring and restarting processes.
Process Element | Monitoring Commands |
---|---|
View processes monitored by systemd | cumulus@switch:~$ systemctl status |
Layer 1 Protocols and Interfaces
Link and port state interface transitions are logged to /var/log/syslog and /var/log/switchd.log.
Interface Element | Monitoring Commands |
---|---|
Link state | cumulus@switch:~$ cat /sys/class/net/[iface]/operstate |
Link speed | cumulus@switch:~$ cat /sys/class/net/[iface]/speed |
Port state | cumulus@switch:~$ ip link show |
Bond state | cumulus@switch:~$ cat /proc/net/bonding/[bond] |
Interface counters are obtained by querying either the hardware or the Linux kernel. The two outputs should align, but the Linux kernel aggregates the output from the hardware.
Interface Counter Element | Monitoring Commands | Interval Poll |
---|---|---|
Interface counters | cumulus@switch:~$ cat /sys/class/net/[iface]/statistics/[stat_name] | 10 seconds |
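Because the kernel counters only ever increase, a poller usually converts two readings into a per-second rate. The sketch below reads a counter from the sysfs path shown in the table; the function names are hypothetical, and treating a decreasing counter as a reset (rather than a wrap) is an assumption.

```python
from pathlib import Path


def read_counter(iface, stat_name, root="/sys/class/net"):
    """Read one cumulative counter, for example rx_bytes, for an interface."""
    path = Path(root) / iface / "statistics" / stat_name
    return int(path.read_text().strip())


def counter_rate(previous, current, interval_seconds):
    """Return the per-second rate, or None if the counter went backwards."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    delta = current - previous
    if delta < 0:
        # Assumption: a lower value means the counter was reset (for
        # example after a reboot), so this sample is skipped.
        return None
    return delta / interval_seconds


# Hypothetical readings taken 10 seconds apart.
rate = counter_rate(1000, 6000, 10)
```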
Layer 1 Logs | Log Location | Log Entries |
---|---|---|
Link failure/Link flap | /var/log/switchd.log | switchd[5692]: nic.c:213 nic_set_carrier: swp17: setting kernel carrier: down |
Unidirectional link | /var/log/switchd.log | ptmd[7146]: ptm_bfd.c:2471 Created new session 0x1 with peer 10.255.255.11 port swp1 |
Bond Negotiation Working | /var/log/syslog | kernel: [85412.763193] bonding: bond0 is being created… |
Bond Negotiation Failing | /var/log/syslog | kernel: [85412.763193] bonding: bond0 is being created… |
MLAG peerlink negotiation Working | /var/log/syslog | lldpd[998]: error while receiving frame on swp50: Network is down |
MLAG peerlink negotiation Working | /var/log/clagd.log | clagd[14003]: Cleanup is executing. |
MLAG peerlink negotiation Failing | /var/log/syslog | lldpd[998]: error while receiving frame on swp50: Network is down |
MLAG peerlink negotiation Failing | /var/log/clagd.log | clagd[26916]: Cleanup is executing. |
MLAG port negotiation Working | /var/log/syslog | kernel: [77419.112195] bonding: server01 is being created… |
MLAG port negotiation Working | /var/log/clagd.log | clagd[14003]: server01 is now dual connected. |
MLAG port negotiation Failing | /var/log/syslog | kernel: [79290.290999] bonding: server01 is being created… |
MLAG port negotiation Failing | /var/log/clagd.log | clagd[14291]: Conflict (server01): matching clag-id (1) not configured on peer… |
MLAG port negotiation Flapping | /var/log/syslog | mstpd: one_clag_cmd: setting (0) mac 00:00:00:00:00:00 <server01, None> |
MLAG port negotiation Flapping | /var/log/clagd.log | clagd[14291]: server01 is no longer dual connected |
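A link flap shows up as repeated carrier-down entries for the same port in switchd.log. The sketch below counts those entries per interface, using the link failure entry from the table as the pattern. The helper names and the threshold of three are hypothetical, and the pattern assumes other carrier messages follow the same layout as the one shown.

```python
import re
from collections import Counter

# Based on: nic_set_carrier: swp17: setting kernel carrier: down
CARRIER = re.compile(
    r"nic_set_carrier: (?P<iface>[^:\s]+): setting kernel carrier: (?P<state>\w+)"
)


def count_carrier_down(lines):
    """Count carrier-down entries per interface."""
    downs = Counter()
    for line in lines:
        match = CARRIER.search(line)
        if match and match.group("state") == "down":
            downs[match.group("iface")] += 1
    return downs


def flapping_interfaces(lines, min_downs=3):
    """Return interfaces that went down at least `min_downs` times."""
    counts = count_carrier_down(lines)
    return sorted(iface for iface, n in counts.items() if n >= min_downs)


log = ["switchd[5692]: nic.c:213 nic_set_carrier: swp17: setting kernel carrier: down"] * 3
flaps = flapping_interfaces(log)
```

In practice the lines would be limited to a recent time window before counting, so that old failures do not accumulate into a false flap.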
Prescriptive Topology Manager (PTM) uses LLDP information to compare against a topology.dot file that describes the network. It has built-in alerting capabilities, so it is preferable to use PTM on box rather than polling LLDP information regularly. The PTM code is available in the Cumulus Networks GitHub repository. Additional PTM, BFD, and associated logs are documented in the code.
Tracking peering information through PTM is highly recommended. For more information, refer to the Prescriptive Topology Manager documentation.
Neighbor Element | Monitoring Commands | Interval Poll |
---|---|---|
LLDP Neighbor | cumulus@switch:~$ lldpctl -f json | 300 seconds |
Prescriptive Topology Manager | cumulus@switch:~$ ptmctl -j [-d] | Triggered |
Layer 2 Protocols
Spanning tree is a protocol that prevents loops in a layer 2 infrastructure. In a stable state, the spanning tree protocol should stably converge. Monitoring the Topology Change Notifications (TCN) in STP helps identify when new BPDUs are received.
Layer 2 Element | Monitoring Commands | Interval Poll |
---|---|---|
STP TCN Transitions | cumulus@switch:~$ mstpctl showbridge json | 60 seconds |
MLAG peer state | cumulus@switch:~$ clagctl status | 60 seconds |
MLAG peer MACs | cumulus@switch:~$ clagctl dumppeermacs | 300 seconds |
Layer 2 Logs | Log Location | Log Entries |
---|---|---|
Spanning Tree Working | /var/log/syslog | kernel: [1653877.190724] device swp1 entered promiscuous mode |
Spanning Tree Blocking | /var/log/syslog | mstpd: MSTP_OUT_set_state: bridge:swp2:0 entering blocking state(Designated) |
Layer 3 Protocols
When FRRouting boots up for the first time, there is a different log file for each daemon that is activated. If the log file is ever edited (for example, through vtysh or frr.conf), the integrated configuration sends all logs to the same file.
To send FRRouting logs to syslog, apply the configuration log syslog in vtysh.
BGP
When monitoring BGP, check if BGP peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.
Monitoring the routing table provides trending on the size of the infrastructure. This is especially useful when integrated with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.
BGP Element | Monitoring Commands | Interval Poll |
---|---|---|
BGP peer failure | cumulus@switch:~$ sudo vtysh -c "show ip bgp summary json" | 60 seconds |
BGP route table | cumulus@switch:~$ sudo vtysh -c "show ip bgp json" | 600 seconds |
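If you poll the BGP summary rather than watch syslog, comparing consecutive polls surfaces the transitions instead of the steady state. The sketch below is built on an assumption about the JSON layout: each address family key holds a `peers` object whose entries carry a `state` field. Verify this against the output of your FRRouting version. The function names are hypothetical.

```python
import json


def peer_states(summary_json):
    """Map each peer to its state from `show ip bgp summary json` output."""
    data = json.loads(summary_json)
    states = {}
    # Assumed layout:
    #   {"ipv4Unicast": {"peers": {"swp1": {"state": "Established"}}}}
    for family in data.values():
        if isinstance(family, dict):
            for peer, details in family.get("peers", {}).items():
                states[peer] = details.get("state", "unknown")
    return states


def transitions(previous, current):
    """Return {peer: (old_state, new_state)} for peers whose state changed."""
    return {
        peer: (previous.get(peer), state)
        for peer, state in current.items()
        if previous.get(peer) != state
    }
```

Only peers present in the newer poll are compared, so a peer that disappears from the summary entirely would need a separate check.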
BGP Logs | Log Location | Log Entries |
---|---|---|
BGP peer down | /var/log/syslog | bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes |
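The peer-down entry above carries the neighbor, the BGP NOTIFICATION code and subcode, and a readable reason, which is enough to build an alert. This is a minimal sketch with a hypothetical helper name; the "received from" alternative in the pattern is an assumption, because only the "sent to" form appears in the table.

```python
import re

# Based on:
#   %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes
BGP_NOTIFICATION = re.compile(
    r"%NOTIFICATION: (?P<direction>sent to|received from) neighbor "
    r"(?P<neighbor>\S+) (?P<code>\d+)/(?P<subcode>\d+) \((?P<reason>[^)]*)\)"
)


def parse_bgp_notification(message):
    """Return the notification fields, or None if the message does not match."""
    match = BGP_NOTIFICATION.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["code"] = int(fields["code"])
    fields["subcode"] = int(fields["subcode"])
    return fields


event = parse_bgp_notification(
    "bgpd[3000]: %NOTIFICATION: sent to neighbor swp1 4/0 (Hold Timer Expired) 0 bytes"
)
```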
OSPF
When monitoring OSPF, check if OSPF peers are operational. There is not much value in alerting on the current operational state of the peer; monitoring the transition is more valuable, which you can do by monitoring syslog.
Monitoring the routing table provides trending on the size of the infrastructure. This is especially useful when integrated with host-based solutions (such as Routing on the Host) when the routes track with the number of applications available.
OSPF Element | Monitoring Commands | Interval Poll |
---|---|---|
OSPF protocol peer failure | cumulus@switch:~$ sudo vtysh -c "show ip ospf neighbor all json" | 60 seconds |
OSPF link state database | cumulus@switch:~$ sudo vtysh -c "show ip ospf database" | 600 seconds |
Route and Host Entries
Route Element | Monitoring Commands | Interval Poll |
---|---|---|
Host Entries | cumulus@switch:~$ cl-resource-query | 600 seconds |
Route Entries | cumulus@switch:~$ cl-resource-query | 600 seconds |
You can also run the net show system asic command, which is the NCLU command equivalent of cl-resource-query.
Routing Logs
Layer 3 Logs | Log Location | Log Entries |
---|---|---|
Routing protocol process crash | /var/log/syslog | frrouting[1824]: Starting FRRouting daemons (prio:10):. zebra. bgpd. |
Logging
The table below describes the various log files.
Logging Element | Description | Log Location |
---|---|---|
syslog | Catch-all log file. Identifies memory leaks and CPU spikes. | /var/log/syslog |
switchd functionality | Hardware Abstraction Layer (HAL). | /var/log/switchd.log |
Routing daemons | FRRouting zebra daemon details. | /var/log/daemon.log |
Routing protocol | The log file is configurable in FRRouting. When FRRouting first boots, it uses the non-integrated configuration so each routing protocol has its own log file. After booting up, FRRouting switches over to using the integrated configuration, so that all logs go to a single place. To edit the location of the log files, use the log file Note: To write syslog debug messages to the log file, you must run the log syslog debug command to configure FRR with syslog severity 7 (debug); otherwise, when you issue a debug command such as debug bgp neighbor-events, no output is sent to /var/log/frr/frr.log. However, when you manually define a log target with the log file /var/log/frr/debug.log command, FRR automatically defaults to severity 7 (debug) logging and the output is logged to /var/log/frr/frr.log. | /var/log/frr/zebra.log |
Protocols and Services
Run the following command to confirm that the NTP process is working correctly and that the switch clock is in sync with NTP:
cumulus@switch:~$ /usr/bin/ntpq -p
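To check the result automatically, a script can look for the peer that ntpq marks with a leading asterisk, which is the source the clock is currently synchronized to. The sketch below assumes the standard ten-column `ntpq -p` layout (remote, refid, st, t, when, poll, reach, delay, offset, jitter); the function name is hypothetical.

```python
def synced_peer(ntpq_output):
    """Return the selected peer and its offset in ms, or None if unsynced."""
    for line in ntpq_output.splitlines():
        # The first character of each peer row is the tally code;
        # '*' marks the peer selected for synchronization.
        if not line.startswith("*"):
            continue
        fields = line[1:].split()
        if len(fields) >= 10:
            return {"remote": fields[0], "offset_ms": float(fields[8])}
    return None
```

A return value of None means no peer is currently selected, which is a reasonable condition to alert on.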
Device Management
Device Access Logs
Access Logs | Log Location | Log Entries |
---|---|---|
User Authentication and Remote Login | /var/log/syslog | sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2: RSA 38:e6:3b:cc:04:ac:41:5e:c9:e3:93:9d:cc:9e:48:25 |
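Login entries like the one above can be reduced to who logged in, from where, and how. The sketch below is an illustration with a hypothetical helper name; it only matches the "Accepted" form shown in the table.

```python
import re

# Based on:
#   sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2
SSH_ACCEPTED = re.compile(
    r"sshd\[(?P<pid>\d+)\]: Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<source>\S+) port (?P<port>\d+)"
)


def parse_ssh_login(message):
    """Return the login details, or None if the message does not match."""
    match = SSH_ACCEPTED.search(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    fields["port"] = int(fields["port"])
    return fields


login = parse_ssh_login(
    "sshd[31830]: Accepted publickey for cumulus from 192.168.0.254 port 45582 ssh2: "
    "RSA 38:e6:3b:cc:04:ac:41:5e:c9:e3:93:9d:cc:9e:48:25"
)
```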
Device Super User Command Logs
Super User Command Logs | Log Location | Log Entries |
---|---|---|
Executing commands using sudo | /var/log/syslog | sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; COMMAND=/tmp/script_9938.sh -v |
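For auditing, the sudo entry above can be split into the invoking user, the target user, and the command that ran. This is a minimal sketch under the assumption that entries keep the `KEY=value ; KEY=value` layout shown in the table; the helper name is hypothetical.

```python
import re

# Based on:
#   sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; COMMAND=...
SUDO_ENTRY = re.compile(
    r"sudo:\s+(?P<invoker>\S+?)\s*: TTY=(?P<tty>\S+) ; PWD=(?P<pwd>.*?) ; "
    r"USER=(?P<target>\S+) ; COMMAND=(?P<command>.*)$"
)


def parse_sudo_entry(message):
    """Return the sudo fields as a dict, or None if the message does not match."""
    match = SUDO_ENTRY.search(message.strip())
    return match.groupdict() if match else None


audit = parse_sudo_entry(
    "sudo: cumulus: TTY=unknown ; PWD=/home/cumulus ; USER=root ; "
    "COMMAND=/tmp/script_9938.sh -v"
)
```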