Gathering system and hardware metrics with Node Exporter

Jaume Sabater

Part 11 of the Grafana series.

The Prometheus Node Exporter is a fundamental component of any Prometheus monitoring stack. It collects and exposes a wide array of machine-level and operating-system metrics from Linux hosts, forming the backbone of infrastructure monitoring and alerting.

It acts as an agent running on our Proxmox VMs and nodes (not containers), providing metrics such as CPU usage, memory consumption, disk I/O, network traffic, and filesystem capacity.

Installation
#

We will install it from the Debian repository, then manually update the binary from the Prometheus Node Exporter releases page on GitHub.

# Install the packaged exporter together with the ssl-cert helper package
apt-get update
apt-get install prometheus-node-exporter ssl-cert

# Grant the prometheus user read access to the TLS private key
adduser prometheus ssl-cert

Optionally, replace the packaged binary with a more recent version from GitHub:

export NODE_EXPORTER_VERSION="1.10.2"
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz \
     --output-document=/tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
tar --directory=/tmp --extract --gzip --file=/tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cp /tmp/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/node_exporter /usr/bin/prometheus-node-exporter

You will have to repeat this installation in every VM and node, so you may want to automate it using Ansible.
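As a reference, here is a minimal Ansible playbook sketch that mirrors the steps above; the prometheus_nodes host group and the version variable are assumptions to adapt to your inventory:

- name: Install and update Prometheus Node Exporter
  hosts: prometheus_nodes
  become: true
  vars:
    node_exporter_version: "1.10.2"
  tasks:
    - name: Install the Debian packages
      ansible.builtin.apt:
        name:
          - prometheus-node-exporter
          - ssl-cert
        update_cache: true

    - name: Let the prometheus user read the TLS private key
      ansible.builtin.user:
        name: prometheus
        groups: ssl-cert
        append: true

    - name: Download and unpack the upstream release
      ansible.builtin.unarchive:
        src: "https://github.com/prometheus/node_exporter/releases/download/v{{ node_exporter_version }}/node_exporter-{{ node_exporter_version }}.linux-amd64.tar.gz"
        dest: /tmp
        remote_src: true

    - name: Replace the packaged binary
      ansible.builtin.copy:
        src: "/tmp/node_exporter-{{ node_exporter_version }}/node_exporter"
        dest: /usr/bin/prometheus-node-exporter
        mode: "0755"
        remote_src: true
      notify: Restart node exporter

  handlers:
    - name: Restart node exporter
      ansible.builtin.systemd:
        name: prometheus-node-exporter
        state: restarted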

Configuration
#

The configuration of the Node Exporter requires just two files. Begin by editing the configuration file /etc/prometheus/node.yml, which contains the web configuration for the exporter.

# Prometheus Node Exporter configuration
# See: https://prometheus.io/docs/prometheus/latest/configuration/https/

# Allow TLS connections
tls_server_config:
  cert_file: /etc/ssl/certs/localdomain.com.crt
  key_file: /etc/ssl/private/localdomain.com.key
  min_version: TLS12

# Enable HTTP/2 support, which is only supported with TLS
http_server_config:
  http2: true

The localdomain.com certificate is a wildcard certificate for the local domain of the cluster, managed internally via PowerDNS, and issued via Let’s Encrypt. Adapt it to your scenario.

And let’s finish with the /etc/default/prometheus-node-exporter file, which contains the command-line arguments passed to the binary:

# Set the command-line arguments to pass to the server.
ARGS="--web.config.file=/etc/prometheus/node.yml \
      --collector.systemd \
      --collector.processes"

At this point, the Debian package has already enabled and started the service, so all you need to do is restart the daemon, then check its status.

systemctl restart prometheus-node-exporter
systemctl status prometheus-node-exporter

Prometheus
#

Let’s now switch to the LXC holding the Prometheus server. We need to configure it with all the targets where the Node Exporter is being run.

As a first step, if you have not already, create the directories for the file-based discovery configuration files:

mkdir --parents --mode=0755 /etc/prometheus/file_sd_configs

The file-based service discovery is a mechanism that allows Prometheus to automatically discover and manage scrape targets by reading the necessary information from files on disk. Prometheus periodically checks these files for changes, and upon detecting modifications, it updates its list of targets without requiring a restart or reload.

This approach is particularly useful for integrating custom or third-party service discovery systems, as it enables external processes, such as configuration management tools, cron jobs, or dedicated sidecar programs, to generate the target files.

The default refresh interval for file-based discovery is 5 minutes, but this can be configured using the refresh_interval parameter in the Prometheus configuration.
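For example, to have Prometheus re-read the target files every minute instead, the job definition shown below could set the interval explicitly (a sketch of the relevant keys only):

file_sd_configs:
  - files:
      - file_sd_configs/node_exporter.yml
    refresh_interval: 1m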

Example /etc/prometheus/file_sd_configs/node_exporter.yml file:

- targets:
  - 'mongodb1.localdomain.com:9100'
  - 'mongodb2.localdomain.com:9100'
  - 'nfs1.localdomain.com:9100'
  - 'nfs2.localdomain.com:9100'
  - 'postgresql1.localdomain.com:9100'
  - 'postgresql2.localdomain.com:9100'
  labels:
    group: 'qemu'

We are now ready to edit the /etc/prometheus/prometheus.yml configuration file of our Prometheus server to configure the new job under the scrape_configs key:

scrape_configs:

  - job_name: 'node_exporter'
    scrape_interval: 15s
    scheme: https
    tls_config:
      ca_file: /etc/ssl/certs/ISRG_Root_X1.pem
      insecure_skip_verify: false
    file_sd_configs:
      - files:
        - file_sd_configs/node_exporter.yml
    relabel_configs:
      - source_labels: [__address__]
        regex: '(\w+)\.localdomain\.com:.*'
        target_label: host
        replacement: '$1'

Note the following aspects of this configuration file:

  • scrape_interval: 15s, to increase how often the target is scraped (its default value is 1m).
  • scheme: https, to encrypt communications between the Prometheus server and the Node Exporter daemons.
  • insecure_skip_verify: false, the default value, to force Prometheus to perform a full TLS certificate validation.
  • relabel_configs, to derive the host label from the instance address. This is done in all the jobs so that we can filter all metrics belonging to a host, regardless of the exporter that brought them in.
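Once the configuration is saved, you can validate it and apply it without a full restart (assuming the Debian prometheus package, whose unit supports reloading):

promtool check config /etc/prometheus/prometheus.yml
systemctl reload prometheus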

Available metrics
#

This configuration would provide the following metrics in Prometheus, among many others:

  • Total CPU time spent waiting for I/O operations: node_cpu_seconds_total{mode="iowait"}.
  • Total physical memory: node_memory_MemTotal_bytes.
  • Estimated memory available for new workloads: node_memory_MemAvailable_bytes.
  • Memory used for I/O buffering and caching: node_memory_Buffers_bytes, node_memory_Cached_bytes.
  • Bytes read/written to disk: node_disk_read_bytes_total, node_disk_written_bytes_total.
  • Number of completed read/write operations: node_disk_reads_completed_total, node_disk_writes_completed_total.
  • Number of available bytes in disk: node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}.
  • Bytes received/transmitted per interface: node_network_receive_bytes_total, node_network_transmit_bytes_total.
  • System load averages: node_load1, node_load5, node_load15.
  • Timestamp of the last boot: node_boot_time_seconds.
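These counters and gauges become most useful once combined in PromQL. As an illustration, here are two common queries built only from the metrics listed above (sketches to adapt to your needs):

# Percentage of physical memory in use, per host
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)

# Per-CPU time spent waiting on I/O, averaged over the last 5 minutes
rate(node_cpu_seconds_total{mode="iowait"}[5m])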

When configuring alerts, the following expression would let us know that a VM has rebooted during the last 10 minutes:

(
  (pve_uptime_seconds{job="pve-cluster", id=~"qemu/.+"} < 300)
  and
  (pve_uptime_seconds{job="pve-cluster", id=~"qemu/.+"} offset 10m > 600)
  and on(id)
  pve_guest_info{job="pve-cluster", template="0"}
) > 0
or
(
  pve_uptime_seconds{job="pve-cluster", id=~"qemu/.+"} * 0
  and on(id)
  pve_guest_info{job="pve-cluster", template="0"}
)
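Note that this expression relies on metrics from the Proxmox VE exporter (job pve-cluster), not the Node Exporter. A simpler check built purely on Node Exporter data could use the boot timestamp listed above; the 10-minute window is an assumption to adjust:

# Host booted less than 10 minutes ago
time() - node_boot_time_seconds < 600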

Do not forget to allow traffic to port 9100 of your guests in your firewall.

You can use curl to test connectivity and obtain a list of the available metrics:

curl -k https://webapp1.localdomain.com:9100/metrics
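The -k flag skips certificate verification, which is fine for a quick connectivity test. To validate the certificate chain exactly as Prometheus will, point curl at the same CA bundle instead:

curl --cacert /etc/ssl/certs/ISRG_Root_X1.pem https://webapp1.localdomain.com:9100/metrics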

Collectors
#

As per the README file in its GitHub repository, the Node Exporter provides a large number of collectors. These are the collectors that are enabled by default on Linux:

  • arp: Exposes ARP statistics from /proc/net/arp
  • bcache: Exposes bcache statistics from /sys/fs/bcache/
  • bonding: Exposes network bonding statistics from /sys/class/net/*/bonding/
  • boottime: Exposes system boot time derived from /proc/stat
  • conntrack: Exposes conntrack statistics (connections, memory usage)
  • cpu: Exposes CPU usage metrics from /proc/stat
  • cpufreq: Exposes CPU frequency information
  • diskstats: Exposes disk I/O statistics from /proc/diskstats
  • edac: Exposes error detection and correction (EDAC) statistics
  • entropy: Exposes available entropy from /proc/sys/kernel/random/entropy_avail
  • exec: Exposes process execution statistics
  • fibrechannel: Exposes Fibre Channel device statistics
  • filesystem: Exposes filesystem usage (space, inodes) from /proc/mounts and /proc/self/mountstats
  • hwmon: Exposes hardware monitoring metrics from /sys/class/hwmon/
  • infiniband: Exposes InfiniBand statistics from /sys/class/infiniband/
  • ipvs: Exposes IPVS connection statistics
  • loadavg: Exposes load average from /proc/loadavg
  • logind: Exposes user session counts from logind
  • mdadm: Exposes RAID status from /proc/mdstat
  • meminfo: Exposes memory usage from /proc/meminfo
  • netclass: Exposes network interface info from /sys/class/net/
  • netdev: Exposes network device statistics from /proc/net/dev
  • netstat: Exposes network statistics from /proc/net/netstat, /proc/net/snmp, etc.
  • nfs: Exposes NFS client statistics from /proc/net/rpc/nfs
  • nfsd: Exposes NFS server statistics from /proc/net/rpc/nfsd
  • os: Exposes OS-level information
  • powersupplyclass: Exposes power supply state from /sys/class/power_supply/
  • schedstat: Exposes scheduler statistics from /proc/schedstat
  • sockstat: Exposes socket statistics from /proc/net/sockstat
  • stat: Exposes various system statistics from /proc/stat
  • textfile: Exposes metrics read from a text file (directory configured via flag)
  • thermal_zone: Exposes thermal zone info from /sys/class/thermal/
  • time: Exposes the current system time
  • timex: Exposes time synchronization status from adjtimex()
  • uname: Exposes system information from uname()
  • vmstat: Exposes virtual memory statistics from /proc/vmstat
  • xfs: Exposes XFS filesystem statistics from /proc/fs/xfs/stat
  • zfs: Exposes ZFS performance and status metrics
  • zswap: Exposes zswap statistics from /sys/module/zswap/parameters/

In the example above, we enabled two additional collectors, systemd and processes, but the list of collectors disabled by default is much more extensive. These are the remaining collectors available on Linux:

  • buddyinfo: Exposes memory fragmentation stats from /proc/buddyinfo
  • cgroups: Exposes cgroup summary (active/enabled)
  • cpu_vulnerabilities: Exposes CPU vulnerability info (Spectre, Meltdown) from /sys/devices/system/cpu/vulnerabilities/
  • drm: Exposes GPU metrics via DRM (e.g., amdgpu)
  • drbd: Exposes DRBD (Distributed Replicated Block Device) stats
  • ethtool: Exposes network driver stats via ethtool -S
  • interrupts: Exposes detailed interrupt statistics
  • kernel_hung: Exposes hung task count from /proc/sys/kernel/hung_task_detect_count
  • ksmd: Exposes Kernel Same-page Merging stats
  • lnstat: Exposes netlink statistics from /proc/net/stat/
  • meminfo_numa: Exposes NUMA memory stats from /sys/devices/system/node/
  • mountstats: Exposes detailed NFS client stats from /proc/self/mountstats
  • network_route: Exposes routing table as metrics
  • pcidevice: Exposes PCI device info and link status
  • perf: Exposes performance counter metrics (kernel-dependent)
  • processes: Exposes aggregate process stats from /proc
  • qdisc: Exposes queuing discipline statistics
  • slabinfo: Exposes kernel slab allocator stats from /proc/slabinfo
  • softirqs: Exposes softirq statistics from /proc/softirqs
  • sysctl: Exposes arbitrary sysctl values
  • swap: Exposes swap usage from /proc/swaps
  • systemd: Exposes systemd service and unit status
  • tcpstat: Exposes TCP connection state from /proc/net/tcp
  • wifi: Exposes WiFi device and station stats
  • xfrm: Exposes IPsec statistics from /proc/net/xfrm_stat
  • zoneinfo: Exposes NUMA memory zone metrics

All these collectors can be controlled via command-line flags in our /etc/default/prometheus-node-exporter file:

  • To enable a collector, use --collector.<name>.
  • To disable a collector, use --no-collector.<name>.
  • To disable all default collectors, use --collector.disable-defaults, then enable the ones you want one by one.

For example, to enable only CPU and memory collectors, you would configure your /etc/default/prometheus-node-exporter file as follows:

# Set the command-line arguments to pass to the server.
ARGS="--web.config.file=/etc/prometheus/node.yml \
      --collector.disable-defaults \
      --collector.cpu \
      --collector.meminfo"

You need to restart the exporter for these changes to take effect:

systemctl restart prometheus-node-exporter.service
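To double-check which collectors ended up active, you can inspect the node_scrape_collector_success metric that the exporter exposes about itself:

curl -sk https://localhost:9100/metrics | grep node_scrape_collector_success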