Metrics — Prometheus + Grafana + Alertmanager

Web UIs:

  • Grafana: https://grafana.home.helix9.org (Authentik OIDC login)
  • Prometheus: https://prometheus.home.helix9.org (LAN + Authentik forward-auth)
  • Alertmanager: https://alerts.home.helix9.org (LAN + Authentik forward-auth)

Host: metrics / metrics.home.lab | IP: 10.69.20.78 | VLAN: 20 (SERVERS) | VMID: 278


Overview

Single Rocky Linux 10 LXC running three Podman containers grouped in a single pod (metrics.pod). The pod gives them a shared network namespace, so containers reach each other on localhost and host-side port-forwards are bound at the pod level (not the container level — that detail matters, see Troubleshooting → "alerts dashboard suddenly down after restart").

clients ─► Traefik ─► Grafana ─► Prometheus ─► node_exporter (every LXC, :9100)
                                 └─► Alertmanager ─► hookshot ─► Matrix room

Prometheus scrapes node_exporter on every LXC in the mgmt, servers, and dmz inventory groups (loop driven by inventory), plus Prometheus and Alertmanager themselves. Alert rules route through Alertmanager → hookshot generic webhook → Matrix.

Grafana authenticates via Authentik OIDC. Group grafana-admins → Admin role, grafana-editors → Editor, fallback Viewer.


Infrastructure

LXC Container

Setting       Value
Node          pve02
VMID          278
IP            10.69.20.78/24
Gateway       10.69.20.1
CPU           2 cores
RAM           4096 MB
Swap          1024 MB
Disk          16 GB
Template      rockylinux-10-default
Unprivileged  yes

Ansible

  • Playbook: playbooks/metrics.yml
  • Role: roles/metrics/
  • Host vars: inventory/host_vars/metrics/{vars,vault}.yml
  • Runtime: Podman Quadlet — 1 pod + 3 containers
  • Images: docker.io/prom/prometheus:latest, docker.io/grafana/grafana:latest, docker.io/prom/alertmanager:latest
  • Data dir: /srv/metrics/{prometheus,grafana,alertmanager}

Quadlets

Files generated under /etc/containers/systemd/:

  • metrics.pod → systemd unit metrics-pod.service
  • prometheus.container → prometheus.service
  • alertmanager.container → alertmanager.service
  • grafana.container → grafana.service

systemctl status metrics-pod prometheus grafana alertmanager
systemctl restart prometheus
podman pod ps # 1 pod, 4 containers (3 + infra)
podman ps
podman logs prometheus

Published Ports

Port  Service       Notes
9090  Prometheus    API + UI; published by pod
3000  Grafana       UI; published by pod
9093  Alertmanager  API + UI; published by pod

Inside the pod everything is localhost — Grafana datasource is http://localhost:9090, Prometheus's alertmanager target is localhost:9093, etc.


Scrape Targets

Generated from inventory by prometheus.yml.j2. Initial jobs:

- job_name: node # node_exporter on every LXC
- job_name: prometheus # self-scrape
- job_name: alertmanager # self-scrape

To add another job, edit roles/metrics/templates/prometheus.yml.j2 and append, e.g.:

- job_name: traefik
  static_configs:
    - targets: ["{{ hostvars['traefik'].ansible_host }}:8082"]

- job_name: technitium
  metrics_path: /api/metrics
  static_configs:
    - targets: ["{{ hostvars['technitium'].ansible_host }}:5380"]

Re-run ansible-playbook playbooks/metrics.yml — handler hits POST /-/reload, no restart needed.


Authentik OIDC Provider (manual one-time setup)

Authentik admin → Applications → Providers → Create → OAuth2/OpenID Provider:

  • Name: grafana
  • Client ID: grafana (must match metrics_grafana_oidc_client_id in role defaults — Authentik auto-generates a random one, edit it back to grafana)
  • Client Secret: copy → vault_grafana_oidc_client_secret
  • Redirect URI: https://grafana.home.helix9.org/login/generic_oauth
  • Signing key: default

Applications → Applications → Create:

  • Name: Grafana
  • Slug: grafana
  • Provider: grafana
  • Policy engine mode: any

Directory → Groups → Create grafana-admins, grafana-editors. Add users.

Save secret to vault:

ansible-vault edit inventory/host_vars/metrics/vault.yml

Redeploy: ansible-playbook playbooks/metrics.yml.

Group → Grafana role mapping is set in grafana.env.j2 via GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH:

  • member of grafana-admins → Admin
  • member of grafana-editors → Editor
  • otherwise → Viewer
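
The precedence in the list above (admins win over editors, everything else falls back to Viewer) can be mirrored in a few lines of shell. This is only an illustrative stand-in: the real mapping is an expression inside grafana.env.j2 whose exact text is not shown here, and the function name grafana_role_for_groups is hypothetical.

```shell
# Illustrative mirror of the group -> role precedence described above.
# Not the actual Grafana configuration; the function name is made up.
grafana_role_for_groups() {
  local role="Viewer" group
  for group in "$@"; do
    case "$group" in
      grafana-admins)  role="Admin"; break ;;  # Admin always wins, stop looking
      grafana-editors) role="Editor" ;;        # keep scanning in case of admins
    esac
  done
  printf '%s\n' "$role"
}

grafana_role_for_groups grafana-editors grafana-admins   # prints: Admin
```

A user in both groups ends up as Admin, which is the behavior to verify after changing group membership in Authentik.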

Alert Routing → Matrix

Alertmanager has one receiver: matrix → hookshot generic webhook URL stored in vault_alertmanager_hookshot_webhook_url. Hookshot bridges to a Matrix room.

Create the webhook in hookshot (one-time):

  1. Invite hookshot bot to your alerts room.
  2. Make the bot a Moderator (Element → room settings → Roles & permissions → power level 50+).
  3. In room: !hookshot webhook alertmanager
  4. Bot replies with the URL — paste into vault.

Vault var:

vault_alertmanager_hookshot_webhook_url: "http://10.69.70.40:9000/webhook/<token>"

The hookshot bot must have a permissions: block in its config granting your MXID admin level — see roles/hookshot defaults hookshot_admin_mxids.

Cooldown (repeat_interval)

alertmanager.yml.j2:

route:
  group_wait: 30s       # bundle bursts
  group_interval: 5m    # how often to send updates while still firing
  repeat_interval: 4h   # how often to re-send unchanged firing alert

If you keep waiting for a re-send and don't see one, you're in repeat_interval. Reduce to 1h or 2h for a tighter feedback loop. To force-send while testing: systemctl restart alertmanager (resets in-memory notify state).

Initial alert rules

alert-rules.yml.j2:

  • NodeDown (5m no scrape) — critical
  • NodeHighMemoryUsage (>90% for 15m) — warning
  • NodeRootDiskFillingUp (<10% free for 30m) — warning
  • NodeHighLoad (load5 > 2x cores for 15m) — warning

Add per-service rules incrementally so the first paging round isn't 50 noisy alerts.


Silencing Alerts

Four ways depending on intent:

1. One-off / temporary (Alertmanager UI)

Best for: planned maintenance, transient noise, "it'll be back tomorrow".

https://alerts.home.helix9.org → Silences → New Silence:

  • Matcher: instance="10.69.70.50:9100" or alertname="NodeDown" (or both)
  • Duration: hours or days
  • Comment: explain why
  • Create

Auto-expires. Listed under Silences tab; can be deleted early.

API equivalent:

amtool silence add alertname=NodeDown instance=10.69.70.50:9100 \
--duration=24h --comment="rebuilding LXC" --author=marko \
--alertmanager.url=http://10.69.20.78:9093
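
Deleting silences early can also be done from the command line. The following is a minimal sketch, assuming amtool is on the PATH of the machine you run it from; the function name expire_silences is hypothetical, and AM_URL reuses the address from the command above.

```shell
# Sketch: expire silences early with amtool instead of using the UI.
# AM_URL is the Alertmanager address used elsewhere in this document.
AM_URL="http://10.69.20.78:9093"

expire_silences() {
  # Optional arguments are matchers, e.g. alertname=NodeDown
  local ids
  ids=$(amtool silence query -q --alertmanager.url="$AM_URL" "$@") || return 1
  if [ -z "$ids" ]; then
    echo "no matching silences"
    return 0
  fi
  # Word-splitting of $ids is intended: one silence ID per argument.
  amtool silence expire --alertmanager.url="$AM_URL" $ids
}
```

Calling expire_silences alertname=NodeDown would expire only silences matching that alert; calling it without arguments expires every active silence, so use that form with care.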

2. Exclude permanently in the rule (code)

Best for: a host that is gone but inventory still lists it, or a known-flaky check you don't care about.

Edit roles/metrics/templates/alert-rules.yml.j2:

- alert: NodeDown
  expr: up{job="node", instance!="10.69.70.50:9100"} == 0
  ...

Multiple:

expr: up{job="node", instance!~"10\\.69\\.70\\.50:9100|10\\.69\\.70\\.10:9100"} == 0

Redeploy. Cleaner than a permanent silence — surfaces in code review, not hidden in UI.
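
Hand-escaping the dots in the multi-host regex is easy to get wrong. As a rough sketch, a small helper can build the matcher from a list of host:port targets; the function name build_instance_exclusion is hypothetical and not part of the role.

```shell
# Sketch: build an instance!~"..." matcher with dots escaped as \\.
# (the double backslash is what the unquoted YAML expr line needs).
build_instance_exclusion() {
  local joined="" target escaped
  for target in "$@"; do
    # Replace every "." with two backslashes followed by "."
    escaped=$(printf '%s' "$target" | sed 's/\./\\\\./g')
    joined="${joined:+$joined|}$escaped"
  done
  printf 'instance!~"%s"\n' "$joined"
}

build_instance_exclusion 10.69.70.50:9100 10.69.70.10:9100
# prints: instance!~"10\\.69\\.70\\.50:9100|10\\.69\\.70\\.10:9100"
```

The printed matcher can be pasted between the braces of the up{...} selector in alert-rules.yml.j2.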

3. Stop scraping the host entirely

Best for: retired hosts, dev sandboxes you don't care about.

  • Remove from inventory/hosts.yml, or
  • Set enable_monitoring: false in its host_vars, or
  • Add an explicit drop in prometheus.yml.j2 scrape config.

After redeploy, the host disappears from /targets and no rules can fire on it.

4. Inhibit one alert via another (advanced)

alertmanager.yml.j2 already has:

inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, instance]

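Reads as: when a critical alert fires, suppress warnings that share the same alertname and instance. Because alertname is listed in equal, this does not suppress a differently named alert such as NodeHighLoad while NodeDown fires; to avoid "host down" + "host high load" duplicates, equal would need to be [instance] only.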


Firewall (VyOS)

The metrics LXC is in SERVERS. To talk to other zones it traverses VyOS.

Required rules in SERVERS-SCAN chain (used for SERVERS → DMZ/HOMELAB/IOT/GUEST/TRUSTED):

Rule  Purpose
200   Prometheus scrape: 10.69.20.78 → */9100 tcp
210   Alertmanager → hookshot: 10.69.20.78 → 10.69.70.40/9000 tcp

Required rule in SERVERS-MGMT:

Rule  Purpose
50    Prometheus scrape MGMT: 10.69.20.78 → */9100 tcp

Cross-zone matrix from SERVERS:

To zone                                 Chain
MGMT                                    SERVERS-MGMT
DMZ / HOMELAB / IOT / GUEST / TRUSTED   SERVERS-SCAN

If a target shows context deadline exceeded and the host is up, suspect a missing firewall rule.


Credentials

Item                          Where
Grafana local admin (admin)   vault_grafana_admin_password (fallback if OIDC is down)
Grafana OIDC client_secret    vault_grafana_oidc_client_secret
Alertmanager → hookshot URL   vault_alertmanager_hookshot_webhook_url

All in inventory/host_vars/metrics/vault.yml (ansible-vault encrypted).


Operations

Reload config without restart

curl -X POST http://10.69.20.78:9090/-/reload
curl -X POST http://10.69.20.78:9093/-/reload

The Ansible role calls these via handlers when the rendered config changes — usually no manual step needed.
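
When reloading by hand it helps to confirm that the reload was accepted, since a rejected config does not produce a 200 response. A minimal sketch, assuming curl is available; the function name reload_and_check is hypothetical.

```shell
# Sketch: POST to /-/reload and fail loudly unless the service answers 200.
reload_and_check() {
  # $1: base URL, e.g. http://10.69.20.78:9090
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' -X POST "$1/-/reload")
  if [ "$code" = "200" ]; then
    echo "reloaded: $1"
  else
    # 000 means curl could not connect at all
    echo "reload failed for $1 (HTTP $code)" >&2
    return 1
  fi
}
```

Typical use would be reload_and_check http://10.69.20.78:9090 followed by reload_and_check http://10.69.20.78:9093; on failure, podman logs for the affected container usually shows the config error.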

Restart the stack

systemctl restart prometheus etc. is safe — pod port-forwards stay intact (this is exactly why we run them in a pod).

Restart the pod itself only if you actually need to recreate the netns:

systemctl restart metrics-pod
# containers will follow because they have Pod=metrics.pod

Update images

Quadlets carry AutoUpdate=registry. The podman-auto-update.timer (enabled by roles/podman) pulls new :latest images daily and restarts containers. Manual:

podman auto-update
# or single container:
podman pull docker.io/prom/prometheus:latest
systemctl restart prometheus

Add a Grafana dashboard

UI: Dashboards → New → Import → paste an ID from grafana.com → select Prometheus datasource → Import. Recommended starters:

  • Node Exporter Full (1860)
  • Alertmanager (9578) — note: requires instance=alertmanager:9093 selected (not "All") for the single-stat panels to render
  • Prometheus 2.0 Stats (3662)

For permanent dashboards, drop JSON files into /srv/metrics/grafana/provisioning/dashboards/ (host-side) and add a provider yaml under /etc/grafana/provisioning/dashboards/.

Wipe TSDB

systemctl stop prometheus
rm -rf /srv/metrics/prometheus/data/*
systemctl start prometheus

Alertmanager / Grafana data dirs follow the same pattern.

Force-send queued alerts (testing)

systemctl restart alertmanager

Wipes in-memory notify-state, so still-firing alerts re-send immediately on next eval cycle. Only use during testing — in production it spams the alert channel.


Troubleshooting

"alerts dashboard suddenly down after restart"

After systemctl restart alertmanager, connections to the UI or scrape endpoint are refused or hang.

Cause: netavark race when a container detaches+rejoins a shared network. Old DNAT rules persist, new rules conflict.

Fix (now that we use a pod): shouldn't happen — ports are bound at pod level, not per-container. If it does:

systemctl stop prometheus alertmanager grafana
nft delete table inet netavark
systemctl restart metrics-pod
sleep 2
systemctl start prometheus alertmanager grafana
nft list table inet netavark | grep -E 'dport (9090|9093|3000) dnat' # want 3 lines

The pod's [Service] ExecStartPre=-/usr/sbin/nft delete table inet netavark clears stale rules whenever the pod (re)starts.

Prometheus target shows connection refused

node_exporter not running on that host. Check:

ssh <host> systemctl status node_exporter

If absent → run monitoring role on it (enable_monitoring: true in its group_vars + ansible-playbook playbooks/site.yml --limit <host>).

Prometheus target shows context deadline exceeded

Network blocked. Check VyOS firewall rules for SERVERS → that zone on port 9100. See Firewall section.

Alertmanager logs connect: connection timed out on webhook

Same as above — VyOS blocking SERVERS → DMZ/wherever on the webhook port (typically 10.69.70.40:9000 for hookshot). Add a rule to SERVERS-SCAN.
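
The three symptoms above map to two causes, which can be captured in a small triage helper. This is a sketch only: the function name diagnose_scrape_error is hypothetical, and the matching relies on the error strings quoted in the headings above.

```shell
# Sketch: map an error string from /targets or the Alertmanager log
# to the most likely cause described in this section.
diagnose_scrape_error() {
  # $1: the error text as shown by Prometheus or Alertmanager
  case "$1" in
    *"connection refused"*)
      echo "host reachable, nothing listening: check the service on the target" ;;
    *"context deadline exceeded"*|*"connection timed out"*)
      echo "packets dropped: check VyOS rules from SERVERS to the target zone" ;;
    *)
      echo "unrecognised error: inspect logs manually" ;;
  esac
}

diagnose_scrape_error "Get http://10.69.70.50:9100/metrics: context deadline exceeded"
# prints: packets dropped: check VyOS rules from SERVERS to the target zone
```

The distinction matters because a refused connection proves the firewall path is open, while a timeout says nothing about whether the exporter is running.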

"hookshot bot ignores !hookshot webhook ..."

  • Check permissions: block exists in hookshot config (grep permissions /srv/hookshot/config.yml).
  • Your MXID listed in hookshot_admin_mxids.
  • Bot has Moderator power level (PL ≥ 50) in the room.
  • Hookshot was restarted after config change.

Grafana login → "Client ID Error"

Authentik provider's auto-generated client_id ≠ grafana. Edit the provider in Authentik admin and set Client ID to literally grafana (or update metrics_grafana_oidc_client_id to match what Authentik generated).

Grafana login → "Failed to resolve application"

Authentik has the Provider but no Application bound. Create Application, slug = grafana, link to provider, set policy engine mode = any.

Alert fires in Prometheus but no Matrix message

  1. Check Alertmanager UI at https://alerts.home.helix9.org — is the alert there?
  2. If yes, is it under the matrix receiver? podman logs alertmanager | grep webhook
  3. Webhook delivery error? Likely firewall (see above).
  4. No error but no Matrix message: webhook URL in vault might be stale (room recreated, hookshot bot kicked, etc.). Re-create webhook with !hookshot webhook ... and update vault.
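
To isolate the Alertmanager → hookshot → Matrix hop from Prometheus, a synthetic alert can be posted straight to the Alertmanager v2 API. The sketch below assumes curl is available and that port 9093 is reachable from where you run it; the function names and the alert labels (ManualTest, manual-test) are made up for the test.

```shell
# Sketch: push a synthetic alert directly into Alertmanager.
AM_URL="http://10.69.20.78:9093"

test_alert_payload() {
  # $1: alert name. Produces a minimal JSON array for POST /api/v2/alerts.
  printf '[{"labels":{"alertname":"%s","severity":"warning","instance":"manual-test"}}]\n' "$1"
}

send_test_alert() {
  # $1: optional alert name (defaults to ManualTest)
  curl -s -X POST "$AM_URL/api/v2/alerts" \
    -H 'Content-Type: application/json' \
    -d "$(test_alert_payload "${1:-ManualTest}")"
}
```

After send_test_alert, the alert should appear in the Alertmanager UI and, once group_wait has passed, in the Matrix room. If it shows up in the UI but not in Matrix, the problem is downstream of Alertmanager. Since no end time is set, the alert resolves on its own after Alertmanager's resolve timeout.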

"I'm waiting for the alert to re-send and it doesn't"

repeat_interval: 4h cooldown. Either wait, lower the interval in alertmanager.yml.j2, or systemctl restart alertmanager to reset notify state.

Containers can't talk to each other

Inside the pod, everything is localhost. If you wrote http://prometheus:9090 in a config, change it to http://localhost:9090 — the pod doesn't have container DNS for sibling names.


Tips

  • Test one alert end-to-end before adding many. Add a vector(1) rule labelled severity=warning, watch it travel: Prometheus alerts page → Alertmanager → Matrix room. If any hop fails, fix that before scaling rules.

  • Use amtool for muting during deploys.

    podman exec alertmanager amtool silence add alertname=~".+" \
    --duration=2h --comment="deploy in progress"

  • Group thoughtfully. Default groups by [alertname, severity] — works for most. If you start getting 50-alert pages, group by instance instead.

  • Don't trust :latest forever. When podman-auto-update brings a breaking change (e.g., Grafana 12 → 13 OIDC behavior), pin a specific tag in metrics_grafana_version until you've validated.

  • Watch your own monitoring. Prometheus self-scrape plus an up{job="prometheus"} == 0 rule is a chicken-and-egg classic: if Prometheus is down, no one alerts. Run a separate cheap check (uptime-kuma) against https://prometheus.home.helix9.org/-/ready.

  • For Grafana dashboards from the community, set the instance template variable explicitly rather than "All" — many community dashboards' single-stat panels break with multi-value variables.

  • Don't use HTTPS endpoints in prometheus.yml.j2 scrape targets unless you've configured tls_config. Stick to internal IPs over plain HTTP for scrape; reserve HTTPS for the human-facing Traefik routes.
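
The end-to-end test from the first tip can be sketched as a throwaway rule file. The helper below only writes the rule to a path you choose, so it can be validated before being copied into alert-rules.yml.j2; the function name write_test_rule, the group name pipeline-test, and the alert name PipelineTest are all hypothetical.

```shell
# Sketch: write an always-firing test rule to a file for validation.
write_test_rule() {
  # $1: destination file for the throwaway rule group
  cat > "$1" <<'EOF'
groups:
  - name: pipeline-test
    rules:
      - alert: PipelineTest
        expr: vector(1)          # always returns one sample, so always fires
        labels:
          severity: warning
        annotations:
          summary: "Synthetic alert to test Prometheus -> Alertmanager -> Matrix"
EOF
}

# Example: write_test_rule /tmp/pipeline-test.yml
# The file can then be checked with "promtool check rules <file>" wherever
# promtool is available (it ships in the prom/prometheus image).
```

Remove the rule again once the message has arrived in Matrix, otherwise it keeps firing and re-sends every repeat_interval.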


Known Issues / Caveats

First-time chown

The role chowns data dirs to in-image UIDs (Prometheus/Alertmanager 65534, Grafana 472). If you bind-mount a directory created elsewhere by hand, re-run the role.

Bootstrap chicken-and-egg with Authentik

If Authentik is down, Grafana OIDC fails. The local admin password (vault_grafana_admin_password) still works at /login for emergency access.

:latest image lock

AutoUpdate=registry only updates when the registry digest for the tag changes. Grafana, Prometheus, and Alertmanager all move :latest frequently, so expect roughly monthly rolls. Pin a specific tag for stability if needed.

Pod restart kills all containers

systemctl restart metrics-pod will stop and restart all three containers. Do this only when port-forward state is broken (very rare with the pod model). Normal config changes use single-container restart or /-/reload.