The Proxmox VE Exporter for Prometheus collects metrics from the cluster, its nodes and its guests. Specifically, it supports the following metrics for guests (LXCs and VMs):
- CPU usage: pve_cpu_usage_ratio and pve_cpu_usage_limit.
- RAM usage: pve_memory_usage_bytes and pve_memory_size_bytes (used/total).
- Disk space: pve_disk_usage_bytes and pve_disk_size_bytes (used/total for the root image).
- Network I/O: pve_network_transmit_bytes and pve_network_receive_bytes (cumulative since start).
- Disk I/O: pve_disk_write_bytes and pve_disk_read_bytes (cumulative since start).
- Status: pve_up (whether the guest is running).
- Uptime: pve_uptime_seconds.
- Metadata: pve_guest_info (name, node, type, tags).
The PVE Exporter uses the Proxmox VE API to collect data, which is pulled by our Prometheus server, as usual. Therefore, this exporter can sit anywhere in the cluster:
- Alongside the existing Prometheus server, provided it has enough resources.
- In a dedicated, small LXC.
- In each node of the cluster.
For optimal performance, the exporter should be as close as possible (in terms of network hops) to the nodes. This is because it executes a lot of requests to obtain all the data it needs.
Installation #
We will be installing the exporter on each node. Therefore, the Prometheus server will contact the exporter on each node, which will gather the required metrics by sending a number of requests to the PVE API over localhost. This reduces the amount of traffic over the network and speeds things up noticeably.
Incidentally, because our Prometheus server only had access to the private network of the guests and had to go through an HTTP proxy to access the Internet, we will be adding a public IP address to the LXC where it runs. This will reduce the number of network hops requests have to go through even further.
Wherever you decide to install the PVE Exporter, make sure you have a recent enough version of Python. At the time of writing, the PVE Exporter requires Python 3.9+, so any Debian GNU/Linux release from Bullseye onwards will do.
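If python3 is already installed (it usually is on a Proxmox VE node), you can quickly confirm the version:
python3 --version   # should report Python 3.9 or newer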
These are the steps we will follow:
- Create a prometheus user.
- Install a virtual environment for the application.
- Install prometheus-pve-exporter in the virtual environment.
Let’s start by making sure we have installed the necessary packages:
apt-get update
apt-get install --yes ca-certificates curl python3 \
python3-pip python3-venv ssl-cert
Now let’s create the user and add it to the ssl-cert group, so it can read the TLS certificate used for encrypted communication between the exporter and Prometheus:
adduser --system --disabled-login --comment "Prometheus daemon" \
--home /var/lib/prometheus --groups ssl-cert prometheus
We are now ready to create the virtual environment:
python3 -m venv /opt/prometheus-pve-exporter
source /opt/prometheus-pve-exporter/bin/activate
And, finally, we can install the prometheus-pve-exporter into the virtual environment:
pip install prometheus-pve-exporter
deactivate
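You can confirm the installed package and its version using the pip binary inside the virtual environment:
/opt/prometheus-pve-exporter/bin/pip show prometheus-pve-exporter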
So that the prometheus user can operate without issues, the /opt/prometheus-pve-exporter folder and all files and subdirectories have to be readable by everyone. This should already be the case, as the default UMASK is 022, as defined in the /etc/login.defs file. You can check it out with the following command:
grep ^UMASK /etc/login.defs
Optionally, you can use the find command to make sure:
find /opt/prometheus-pve-exporter ! -perm -o=r
If the above command returns no results, you are ready to start the configuration steps.
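If it does return entries, one way to fix it is to make everything under the virtual environment world-readable (and directories traversable):
chmod -R o+rX /opt/prometheus-pve-exporter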
API token #
We need to create a Proxmox VE user and an API token so that the Prometheus PVE Exporter can request the data it needs to generate the metrics. The easiest way to do this is via the terminal on any of the nodes:
pveum user add prometheus@pve --comment "Prometheus Monitoring"
pveum user token add prometheus@pve monitoring --privsep=0
This will output a token. Save it in some vault, such as Proton Pass or Vaultwarden, and use it later on, when configuring the /etc/prometheus/pve_exporter.yml file.
| Key | Value |
|---|---|
| full-tokenid | prometheus@pve!monitoring |
| info | {"privsep":"0"} |
| value | xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx |
Finally, grant the PVEAuditor role (read-only) to the user:
pveum acl modify / --users prometheus@pve --roles PVEAuditor
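To double-check the result, you can list the effective permissions of the new user (the output format varies slightly between PVE versions):
pveum user permissions prometheus@pve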
Configuration #
So that we have a consistent way to deploy Prometheus exporters across our cluster, we will mimic the way they are packaged in Debian: a file with the command-line arguments under /etc/default/ and a main configuration file under /etc/prometheus/.
Command line #
Let’s start with the command-line arguments file at /etc/default/prometheus-pve-exporter:
# Environment variables for prometheus-pve-exporter
# See: https://github.com/prometheus-pve/prometheus-pve-exporter
#
# Bind to all interfaces.
# Disable the config collector for our large deployment (100+ guests),
# as it makes one API call per guest.
#
ARGS="--config.file='/etc/prometheus/pve_exporter.yml' \
--web.listen-address='[::]:9221' \
--collector.status \
--collector.version \
--collector.node \
--collector.cluster \
--collector.resources \
--collector.replication \
--no-collector.config \
--server.keyfile='/etc/ssl/private/localdomain.com.key' \
--server.certfile='/etc/ssl/certs/localdomain.com.crt'"
The certificate in the example is a wildcard certificate for the local domain of the cluster, managed internally via PowerDNS, and issued via Let’s Encrypt. Adapt it to your scenario.
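A quick sanity check, using the example paths above, is to confirm that the prometheus user can actually read the private key through its ssl-cert group membership:
runuser -u prometheus -- head --bytes=1 /etc/ssl/private/localdomain.com.key > /dev/null \
  && echo "key readable"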
Main options #
Given that this will probably be the first Prometheus-related package installed on the node, you will need to create the configuration directory first:
mkdir --parents --mode=0755 /etc/prometheus
You can now create the main configuration file at /etc/prometheus/pve_exporter.yml, with the following content:
# Prometheus PVE Exporter configuration
# See: https://github.com/prometheus-pve/prometheus-pve-exporter
default:
user: prometheus@pve
token_name: "monitoring"
token_value: "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
verify_ssl: false
Since the exporter connects to the API over localhost, there is no point in verifying the TLS certificate of the connection.
Do not forget to restrict permissions on the /etc/prometheus/pve_exporter.yml file, given that it contains sensitive information:
chown prometheus:prometheus /etc/prometheus/pve_exporter.yml
chmod 600 /etc/prometheus/pve_exporter.yml
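Before wiring things into systemd, you can give the exporter a quick manual run as the prometheus user (a minimal invocation; the listen address and TLS flags from the /etc/default file are omitted here):
runuser -u prometheus -- /opt/prometheus-pve-exporter/bin/pve_exporter \
  --config.file=/etc/prometheus/pve_exporter.yml
In another terminal, curl http://localhost:9221/pve?module=default should return metrics over plain HTTP. Stop the exporter with Ctrl+C when done.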
Systemd #
Let’s end with the systemd service file at /etc/systemd/system/prometheus-pve-exporter.service:
[Unit]
Description=Prometheus exporter for Proxmox VE
Documentation=https://github.com/prometheus-pve/prometheus-pve-exporter
After=network.target
[Service]
Type=simple
User=prometheus
Group=prometheus
EnvironmentFile=/etc/default/prometheus-pve-exporter
ExecStart=/opt/prometheus-pve-exporter/bin/pve_exporter $ARGS
# Security hardening
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
ReadWritePaths=/var/log
# Restart policy
Restart=on-failure
RestartSec=5s
# Resource limits
LimitNOFILE=8192
[Install]
WantedBy=multi-user.target
Starting up #
Our first moment of truth: let's enable and start the service:
systemctl daemon-reload
systemctl enable --now prometheus-pve-exporter.service
You can check the status of the service with the following command:
systemctl status prometheus-pve-exporter.service
And you can monitor the logs produced by the service with this command:
journalctl --follow --unit=prometheus-pve-exporter.service
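You can also confirm that the exporter is listening on port 9221:
ss --tcp --listening --numeric --processes | grep 9221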
Do not forget to adjust the rules of your firewall so that the LXC with the Prometheus server can access the nodes with the PVE Exporter on port 9221. Remember that firewall rules at the datacentre level apply to all nodes.
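As an example, and assuming the Prometheus LXC has the address 10.0.0.10 (a placeholder; adjust the source address to your setup), a datacentre-level rule could be added via pvesh like this:
pvesh create /cluster/firewall/rules --action ACCEPT --type in \
  --proto tcp --dport 9221 --source 10.0.0.10 --enable 1 \
  --comment "Prometheus -> PVE Exporter"
Equivalently, you can add the rule under Datacenter > Firewall in the web UI.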
You can now test the exporter from the terminal of the node:
curl "https://proxmox1.localdomain.com:9221/pve?module=default&cluster=1&node=1"
The list of available metrics will be useful when configuring Alert Rules in Grafana, so you may want to save it:
curl "https://proxmox1.localdomain.com:9221/pve?module=default&cluster=1&node=1" > ~/pve_exporter_metrics.txt
The value of the module parameter in the query string corresponds to the configuration key we used in the main configuration file, i.e., default.
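For example, to get a quick overview of the guest-level series in the saved file, you can filter it by some of the metric names listed at the beginning of this article:
grep --extended-regexp '^pve_(up|uptime_seconds|guest_info)' ~/pve_exporter_metrics.txt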
Network traffic #
Because of how the Proxmox VE API is designed, the PVE Exporter needs to perform a fairly large number of calls to gather all the information it requires. Here are some statistics, for reference.
Let's say we have 10 nodes and 200 guests, the scrape interval is set to 30 seconds, and we are using the exporter's default cache duration of 29 seconds, without enabling the config collector.
| Job type | Endpoint (GET /api2/json) | Calls | Metrics gathered |
|---|---|---|---|
| Cluster | /cluster/resources | 1 | Nodes, VMs, LXCs, storage, and metrics for guests |
| Cluster | /nodes | 1 | List of nodes and HA state information |
| Node | /nodes/{node}/status | 10 | Node status |
| Node | /nodes/{node}/storage | 10 | Storage information per node |
| Node | /nodes/{node}/disks/list | 10 | Physical disk information and SMART status |
| Node | /nodes/{node}/certificates/info | 10 | TLS certificate expiration information |
| Node | /nodes/{node}/version | 10 | PVE version information per node |
| Node | /nodes/{node}/replication | 10 | VM/LXC replication status |
This totals 62 calls per scrape: 2 cluster-level calls plus 6 node-level endpoints × 10 nodes.
If we were to enable the config collector, assuming we had 180 LXCs and 20 VMs, we would be adding:
| Job type | Endpoint (GET /api2/json) | Calls | Observations |
|---|---|---|---|
| Guest | /nodes/{node}/qemu | 10 | Lists all VMs on each node |
| Guest | /nodes/{node}/lxc | 10 | Lists all LXCs on each node |
| Guest | /nodes/{node}/qemu/{vmid}/config | 20 | Individual VM configuration |
| Guest | /nodes/{node}/lxc/{ctid}/config | 180 | Individual LXC configuration |
This totals 220 additional calls: 2 listing endpoints × 10 nodes, plus one configuration call per guest (20 VMs + 180 LXCs).
The config collector provides an additional metric: pve_onboot_status. This shows whether each VM/LXC is configured to start automatically on node boot. You need to consider whether this information is relevant enough in your case to justify an additional 220 HTTP requests to the PVE API.
Prometheus #
We will add two scrape configurations to the /etc/prometheus/prometheus.yml file of our Prometheus server. The first one to gather cluster-level metrics:
scrape_configs:
- job_name: 'pve-cluster'
static_configs:
- targets:
- proxmox1.localdomain.com
metrics_path: /pve
params:
module: [default]
cluster: ['1']
node: ['0']
And the second one, which goes under the same scrape_configs list, to gather node-level metrics:
scrape_configs:
- job_name: 'pve-nodes'
file_sd_configs:
- files:
- file_sd_configs/pve_exporter.yml
metrics_path: /pve
params:
module: [default]
cluster: ['0']
node: ['1']
And our /etc/prometheus/file_sd_configs/pve_exporter.yml would have the following content:
- targets:
- 'proxmox1.localdomain.com:9221'
- 'proxmox2.localdomain.com:9221'
- 'proxmox3.localdomain.com:9221'
- 'proxmox4.localdomain.com:9221'
- 'proxmox5.localdomain.com:9221'
- 'proxmox6.localdomain.com:9221'
- 'proxmox7.localdomain.com:9221'
- 'proxmox8.localdomain.com:9221'
- 'proxmox9.localdomain.com:9221'
- 'proxmox10.localdomain.com:9221'
labels:
group: 'pve'
You can skip the node-level metrics if you do not use replication.
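If you have promtool available on the Prometheus host (it ships with Prometheus), you can validate the resulting configuration before reloading:
promtool check config /etc/prometheus/prometheus.yml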
We need to ask Prometheus to reload its configuration file for the changes to take effect:
systemctl reload prometheus
This configuration efficiently scrapes cluster and guest metrics once and node-specific metrics once per node, which is the recommended approach for large clusters. All guest (VM/LXC) metrics are included automatically in the cluster-wide scrape, so you will get metrics for all your containers without needing to scrape them individually.
Grafana dashboard #
The Prometheus PVE Exporter repository provides a link to a Proxmox via Prometheus Grafana dashboard, which we can use as a starting point. Follow these simple steps:
- Visit the Grafana Dashboards website.
- Download the dashboard in JSON format using the Download JSON button.
- In your Grafana, navigate to the Dashboards menu option.
- Use the New > Import button and choose the downloaded JSON file.
- Change the default name if you wish, assign a folder and select the Prometheus data source. Click Import.
- Once loaded, optionally, click on Settings and add some tags to it.
You are set.