Metrics — Prometheus + Grafana + Alertmanager
Web UIs:
- Grafana: https://grafana.home.helix9.org (Authentik OIDC login)
- Prometheus: https://prometheus.home.helix9.org (LAN + Authentik forward-auth)
- Alertmanager: https://alerts.home.helix9.org (LAN + Authentik forward-auth)
Host: metrics / metrics.home.lab
IP: 10.69.20.78
VLAN: 20 (SERVERS)
VMID: 278
Overview
Single Rocky Linux 10 LXC running three Podman containers grouped in a single pod (metrics.pod). The pod gives them a shared network namespace, so containers reach each other on localhost and host-side port-forwards are bound at the pod level (not the container level — that detail matters, see Troubleshooting → "alerts dashboard suddenly down after restart").
clients ─► Traefik ─► Grafana ─► Prometheus ─► node_exporter (every LXC, :9100)
                                     │
                                     └─► Alertmanager ─► hookshot ─► Matrix room
Prometheus scrapes node_exporter on every LXC in the mgmt, servers, and dmz inventory groups (the loop is driven by inventory), plus Prometheus and Alertmanager themselves. Firing alerts route through Alertmanager → hookshot generic webhook → Matrix.
Grafana authenticates via Authentik OIDC. Group grafana-admins → Admin role, grafana-editors → Editor, fallback Viewer.
Infrastructure
LXC Container
| Setting | Value |
|---|---|
| Node | pve02 |
| VMID | 278 |
| IP | 10.69.20.78/24 |
| Gateway | 10.69.20.1 |
| CPU | 2 cores |
| RAM | 4096 MB |
| Swap | 1024 MB |
| Disk | 16 GB |
| Template | rockylinux-10-default |
| Unprivileged | yes |
Ansible
- Playbook: playbooks/metrics.yml
- Role: roles/metrics/
- Host vars: inventory/host_vars/metrics/{vars,vault}.yml
- Runtime: Podman Quadlet, 1 pod + 3 containers
- Images: docker.io/prom/prometheus:latest, docker.io/grafana/grafana:latest, docker.io/prom/alertmanager:latest
- Data dir: /srv/metrics/{prometheus,grafana,alertmanager}
Quadlets
Files generated under /etc/containers/systemd/:
- metrics.pod → systemd unit metrics-pod.service
- prometheus.container → prometheus.service
- alertmanager.container → alertmanager.service
- grafana.container → grafana.service
systemctl status metrics-pod prometheus grafana alertmanager
systemctl restart prometheus
podman pod ps # 1 pod, 4 containers (3 + infra)
podman ps
podman logs prometheus
Published Ports
| Port | Service | Notes |
|---|---|---|
| 9090 | Prometheus | API + UI; published by pod |
| 3000 | Grafana | UI; published by pod |
| 9093 | Alertmanager | API + UI; published by pod |
Inside the pod everything is localhost — Grafana datasource is http://localhost:9090, Prometheus's alertmanager target is localhost:9093, etc.
Scrape Targets
Generated from inventory by prometheus.yml.j2. Initial jobs:
- job_name: node # node_exporter on every LXC
- job_name: prometheus # self-scrape
- job_name: alertmanager # self-scrape
To add another job, edit roles/metrics/templates/prometheus.yml.j2 and append, e.g.:
- job_name: traefik
  static_configs:
    - targets: ["{{ hostvars['traefik'].ansible_host }}:8082"]
- job_name: technitium
  metrics_path: /api/metrics
  static_configs:
    - targets: ["{{ hostvars['technitium'].ansible_host }}:5380"]
Re-run ansible-playbook playbooks/metrics.yml — handler hits POST /-/reload, no restart needed.
Authentik OIDC Provider (manual one-time setup)
Authentik admin → Applications → Providers → Create → OAuth2/OpenID Provider:
- Name: grafana
- Client ID: grafana (must match metrics_grafana_oidc_client_id in role defaults; Authentik auto-generates a random one, so edit it back to grafana)
- Client Secret: copy → vault_grafana_oidc_client_secret
- Redirect URI: https://grafana.home.helix9.org/login/generic_oauth
- Signing key: default
Applications → Applications → Create:
- Name: Grafana
- Slug: grafana
- Provider: grafana
- Policy engine mode: any
Directory → Groups → Create grafana-admins, grafana-editors. Add users.
Save secret to vault:
ansible-vault edit inventory/host_vars/metrics/vault.yml
Redeploy: ansible-playbook playbooks/metrics.yml.
Group → Grafana role mapping is set in grafana.env.j2 via GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH:
- member of grafana-admins → Admin
- member of grafana-editors → Editor
- otherwise → Viewer
Alert Routing → Matrix
Alertmanager has one receiver: matrix → hookshot generic webhook URL stored in vault_alertmanager_hookshot_webhook_url. Hookshot bridges to a Matrix room.
Create the webhook in hookshot (one-time):
- Invite hookshot bot to your alerts room.
- Make the bot a Moderator (Element → room settings → Roles & permissions → power level 50+).
- In room: !hookshot webhook alertmanager
- Bot replies with the URL; paste it into vault.
Vault var:
vault_alertmanager_hookshot_webhook_url: "http://10.69.70.40:9000/webhook/<token>"
The hookshot bot must have a permissions: block in its config granting your MXID admin level — see roles/hookshot defaults hookshot_admin_mxids.
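Before pointing Alertmanager at the new URL, it can help to confirm that a plain POST reaches the room. The following is a minimal sketch, assuming hookshot's generic webhook renders the `text` field of a JSON body as the message and that `curl` is installed on the metrics LXC. The helper name `hookshot_ping` is hypothetical, and `<token>` stays whatever the bot returned.

```shell
# hookshot_ping URL [MESSAGE]
# POSTs a one-line JSON payload to a hookshot generic webhook.
# Hypothetical helper, not part of the role. MESSAGE must not contain
# double quotes or backslashes, since it is pasted into the JSON as-is.
# The ( ) body runs in a subshell, so url/msg stay local to the call.
hookshot_ping() (
  url=$1
  msg=${2:-"test message from metrics LXC"}
  # --fail turns HTTP 4xx/5xx answers into a non-zero exit status
  curl --silent --show-error --fail --max-time 10 \
    -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"$msg\"}" \
    "$url"
)

# Usage (on the metrics LXC):
#   hookshot_ping "http://10.69.70.40:9000/webhook/<token>" "hello from curl"
```

If the message shows up in the room, the bot, the token, and the firewall path are fine, and anything still missing is on the Alertmanager side.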
Cooldown (repeat_interval)
alertmanager.yml.j2:
route:
  group_wait: 30s       # bundle bursts
  group_interval: 5m    # how often to send updates while still firing
  repeat_interval: 4h   # how often to re-send unchanged firing alert
If you keep waiting for a re-send and don't see one, you're in repeat_interval. Reduce to 1h or 2h for a tighter feedback loop. To force-send while testing: systemctl restart alertmanager (resets in-memory notify state).
Initial alert rules
alert-rules.yml.j2:
- NodeDown (5m no scrape): critical
- NodeHighMemoryUsage (>90% for 15m): warning
- NodeRootDiskFillingUp (<10% free for 30m): warning
- NodeHighLoad (load5 > 2x cores for 15m): warning
Add per-service rules incrementally so the first paging round isn't 50 noisy alerts.
Silencing Alerts
Four ways depending on intent:
1. One-off / temporary (Alertmanager UI)
Best for: planned maintenance, transient noise, "it'll be back tomorrow".
https://alerts.home.helix9.org → Silences → New Silence:
- Matcher: instance="10.69.70.50:9100" or alertname="NodeDown" (or both)
- Duration: hours or days
- Comment: explain why
- Create
Auto-expires. Listed under Silences tab; can be deleted early.
CLI equivalent:
amtool silence add alertname=NodeDown instance=10.69.70.50:9100 \
--duration=24h --comment="rebuilding LXC" --author=marko \
--alertmanager.url=http://10.69.20.78:9093
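Silences created this way can also be listed and ended early from the same CLI. The following is a small sketch, assuming `amtool` is installed where you run it (otherwise prefix the calls with `podman exec alertmanager` and use `http://localhost:9093`). The helper names and the `AM_URL` variable are hypothetical.

```shell
# Alertmanager base URL; override by exporting AM_URL before calling.
AM_URL=${AM_URL:-http://10.69.20.78:9093}

# List active silences, one per line, including their IDs.
silence_list() {
  amtool silence query --alertmanager.url="$AM_URL"
}

# Expire one silence early, given its ID as printed by silence_list.
silence_expire() {
  [ -n "$1" ] || { echo "usage: silence_expire <silence-id>" >&2; return 2; }
  amtool silence expire "$1" --alertmanager.url="$AM_URL"
}

# Usage:
#   silence_list
#   silence_expire <silence-id>
```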
2. Exclude permanently in the rule (code)
Best for: a host that is gone but inventory still lists it, or a known-flaky check you don't care about.
Edit roles/metrics/templates/alert-rules.yml.j2:
- alert: NodeDown
  expr: up{job="node", instance!="10.69.70.50:9100"} == 0
  ...
Multiple:
expr: up{job="node", instance!~"10\\.69\\.70\\.50:9100|10\\.69\\.70\\.10:9100"} == 0
Redeploy. Cleaner than a permanent silence — surfaces in code review, not hidden in UI.
3. Stop scraping the host entirely
Best for: retired hosts, dev sandboxes you don't care about.
- Remove from inventory/hosts.yml, or
- Set enable_monitoring: false in its host_vars, or
- Add an explicit drop in prometheus.yml.j2 scrape config.
After redeploy, the host disappears from /targets and no rules can fire on it.
4. Inhibit one alert via another (advanced)
alertmanager.yml.j2 already has:
inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, instance]
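Reads as: when a critical fires, suppress warnings that share the same alertname and instance. Because alertname is listed in equal, a critical NodeDown does not suppress a warning with a different alertname such as NodeHighLoad; drop alertname from equal if you want "host down" to silence "host high load" on the same instance.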
Firewall (VyOS)
The metrics LXC is in SERVERS. To talk to other zones it traverses VyOS.
Required rules in SERVERS-SCAN chain (used for SERVERS → DMZ/HOMELAB/IOT/GUEST/TRUSTED):
| Rule | Purpose |
|---|---|
| 200 | Prometheus scrape — 10.69.20.78 → */9100 tcp |
| 210 | Alertmanager → hookshot — 10.69.20.78 → 10.69.70.40/9000 tcp |
Required rule in SERVERS-MGMT:
| Rule | Purpose |
|---|---|
| 50 | Prometheus scrape MGMT — 10.69.20.78 → */9100 tcp |
Cross-zone matrix from SERVERS:
| To zone | Chain |
|---|---|
| MGMT | SERVERS-MGMT |
| DMZ / HOMELAB / IOT / GUEST / TRUSTED | SERVERS-SCAN |
If a target shows context deadline exceeded and the host is up, suspect missing firewall rule.
Credentials
| Item | Where |
|---|---|
| Grafana local admin (admin) | vault_grafana_admin_password (fallback if OIDC is down) |
| Grafana OIDC client_secret | vault_grafana_oidc_client_secret |
| Alertmanager → hookshot URL | vault_alertmanager_hookshot_webhook_url |
All in inventory/host_vars/metrics/vault.yml (ansible-vault encrypted).
Operations
Reload config without restart
curl -X POST http://10.69.20.78:9090/-/reload
curl -X POST http://10.69.20.78:9093/-/reload
The Ansible role calls these via handlers when the rendered config changes — usually no manual step needed.
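When a reload is triggered by hand, checking the file first keeps a typo from reaching the running process. The following is a minimal sketch, assuming the containers are named prometheus and alertmanager as in the Quadlets and that the configs sit at the images' default paths inside the containers. The paths and the helper names are assumptions, not taken from the role.

```shell
# In-container config paths: image defaults, assumed here. Override via env.
PROM_CFG=${PROM_CFG:-/etc/prometheus/prometheus.yml}
AM_CFG=${AM_CFG:-/etc/alertmanager/alertmanager.yml}

# Validate with promtool inside the container; reload only if the check passes.
reload_prometheus() {
  podman exec prometheus promtool check config "$PROM_CFG" &&
    curl --silent --show-error --fail -X POST http://10.69.20.78:9090/-/reload
}

# Same idea for Alertmanager, using amtool's config check.
reload_alertmanager() {
  podman exec alertmanager amtool check-config "$AM_CFG" &&
    curl --silent --show-error --fail -X POST http://10.69.20.78:9093/-/reload
}

# Usage (as root on the metrics LXC):
#   reload_prometheus
#   reload_alertmanager
```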
Restart the stack
systemctl restart prometheus etc. is safe — pod port-forwards stay intact (this is exactly why we run them in a pod).
Restart the pod itself only if you actually need to recreate the netns:
systemctl restart metrics-pod
# containers will follow because they have Pod=metrics.pod
Update images
Quadlets carry AutoUpdate=registry. The podman-auto-update.timer (enabled by roles/podman) pulls new :latest images daily and restarts containers. Manual:
podman auto-update
# or single container:
podman pull docker.io/prom/prometheus:latest
systemctl restart prometheus
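To see what the timer would change before it runs, `podman auto-update` has a `--dry-run` mode. A small sketch follows; the wrapper name `pending_updates` is hypothetical, and the exact output columns may differ between Podman versions.

```shell
# Hypothetical wrapper: report what auto-update would do without pulling
# images or restarting any unit.
pending_updates() {
  podman auto-update --dry-run
}

# Usage (as root on the metrics LXC):
#   pending_updates
```

Units whose image has a newer digest in the registry should be reported as pending rather than updated.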
Add a Grafana dashboard
UI: Dashboards → New → Import → paste an ID from grafana.com → select Prometheus datasource → Import. Recommended starters:
- Node Exporter Full (1860)
- Alertmanager (9578). Note: requires instance=alertmanager:9093 selected (not "All") for the single-stat panels to render
- Prometheus 2.0 Stats (3662)
For permanent dashboards, drop JSON files into /srv/metrics/grafana/provisioning/dashboards/ (host-side) and add a provider yaml under /etc/grafana/provisioning/dashboards/.
Wipe TSDB
systemctl stop prometheus
rm -rf /srv/metrics/prometheus/data/*
systemctl start prometheus
Alertmanager / Grafana data dirs follow the same pattern.
Force-send queued alerts (testing)
systemctl restart alertmanager
Wipes in-memory notify-state, so still-firing alerts re-send immediately on next eval cycle. Only use during testing — in production it spams the alert channel.
Troubleshooting
"alerts dashboard suddenly down after restart"
After systemctl restart alertmanager, the UI or the scrape returns connection refused or hangs.
Cause: netavark race when a container detaches+rejoins a shared network. Old DNAT rules persist, new rules conflict.
Fix (now that we use a pod): shouldn't happen — ports are bound at pod level, not per-container. If it does:
systemctl stop prometheus alertmanager grafana
nft delete table inet netavark
systemctl restart metrics-pod
sleep 2
systemctl start prometheus alertmanager grafana
nft list table inet netavark | grep -E 'dport (9090|9093|3000) dnat' # want 3 lines
The pod's [Service] ExecStartPre=-/usr/sbin/nft delete table inet netavark clears stale rules whenever the pod (re)starts.
Prometheus target shows connection refused
node_exporter not running on that host. Check:
ssh <host> systemctl status node_exporter
If absent → run monitoring role on it (enable_monitoring: true in its group_vars + ansible-playbook playbooks/site.yml --limit <host>).
Prometheus target shows context deadline exceeded
Network blocked. Check VyOS firewall rules for SERVERS → that zone on port 9100. See Firewall section.
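The two failure modes above can be told apart from the metrics LXC without waiting for the next scrape. The following is a rough sketch using curl's exit status, where 7 means the connection failed outright (typically refused) and 28 means it timed out. The helper name `probe_exporter` and the timeout values are hypothetical choices.

```shell
# probe_exporter HOST [PORT]
# Hypothetical helper, run from the metrics LXC. Maps curl's exit status to
# the refused/timeout cases described in the troubleshooting entries.
# The ( ) body runs in a subshell, so variables stay local and `exit`
# only leaves the function.
probe_exporter() (
  host=$1
  port=${2:-9100}
  rc=0
  curl --silent --output /dev/null --connect-timeout 5 --max-time 10 \
    "http://$host:$port/metrics" || rc=$?
  case $rc in
    0)  echo "ok: $host:$port answered" ;;
    7)  echo "refused: could not connect to $host:$port (exporter down?)" ;;
    28) echo "timeout: no reply from $host:$port (firewall?)" ;;
    *)  echo "other: curl exit $rc for $host:$port" ;;
  esac
  exit "$rc"
)

# Usage:
#   probe_exporter 10.69.70.50
```

A timeout points at the VyOS chains listed in the Firewall section, while a refusal points at node_exporter on the target.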
Alertmanager logs connect: connection timed out on webhook
Same as above — VyOS blocking SERVERS → DMZ/wherever on the webhook port (typically 10.69.70.40:9000 for hookshot). Add a rule to SERVERS-SCAN.
"hookshot bot ignores !hookshot webhook ..."
- Check permissions: block exists in hookshot config (grep permissions /srv/hookshot/config.yml).
- Your MXID listed in hookshot_admin_mxids.
- Bot has Moderator power level (PL ≥ 50) in the room.
- Hookshot was restarted after config change.
Grafana login → "Client ID Error"
Authentik provider's auto-generated client_id ≠ grafana. Edit the provider in Authentik admin and set Client ID to literally grafana (or update metrics_grafana_oidc_client_id to match what Authentik generated).
Grafana login → "Failed to resolve application"
Authentik has the Provider but no Application bound. Create Application, slug = grafana, link to provider, set policy engine mode = any.
Alert fires in Prometheus but no Matrix message
- Check Alertmanager UI at https://alerts.home.helix9.org: is the alert there?
- If yes, is it under the matrix receiver? podman logs alertmanager | grep webhook
- Webhook delivery error? Likely firewall (see above).
- No error but no Matrix message: webhook URL in vault might be stale (room recreated, hookshot bot kicked, etc.). Re-create webhook with !hookshot webhook ... and update vault.
"I'm waiting for the alert to re-send and it doesn't"
repeat_interval: 4h cooldown. Either wait, lower the interval in alertmanager.yml.j2, or systemctl restart alertmanager to reset notify state.
Containers can't talk to each other
Inside the pod, everything is localhost. If you wrote http://prometheus:9090 in a config, change it to http://localhost:9090 — the pod doesn't have container DNS for sibling names.
Tips
- Test one alert end-to-end before adding many. Add a vector(1) rule labelled severity=warning, watch it travel: Prometheus alerts page → Alertmanager → Matrix room. If any hop fails, fix that before scaling rules.
- Use amtool for muting during deploys.
  podman exec alertmanager amtool silence add alertname=~".+" \
    --duration=2h --comment="deploy in progress" \
    --alertmanager.url=http://localhost:9093
- Group thoughtfully. The default grouping by [alertname, severity] works for most setups. If you start getting 50-alert pages, group by instance instead.
- Don't trust :latest forever. When podman-auto-update brings a breaking change (e.g., Grafana 12 → 13 OIDC behavior), pin a specific tag in metrics_grafana_version until you've validated.
- Watch your own monitoring. Prometheus self-scrape plus an up{job="prometheus"} == 0 rule is a chicken-and-egg classic: if Prometheus is down, no one alerts. Run a separate cheap check (uptime-kuma) against https://prometheus.home.helix9.org/-/ready.
- For Grafana dashboards from the community, set the instance template variable explicitly rather than "All"; many community dashboards' single-stat panels break with multi-value variables.
- Don't use HTTPS endpoints in prometheus.yml.j2 scrape targets unless you've configured tls_config. Stick to internal IPs over plain HTTP for scrape; reserve HTTPS for the human-facing Traefik routes.
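For the end-to-end test in the first tip, a throwaway rule might look like the one printed below. This is a sketch only: the group name, alert name, and summary text are hypothetical, and the helper simply prints the YAML so it can be pasted into roles/metrics/templates/alert-rules.yml.j2. If that file already has a top-level groups: key, paste only the list entry.

```shell
# Hypothetical helper: prints a rule group whose expression vector(1) is
# always true, so the alert fires on the first evaluation after loading.
print_test_rule() {
  cat <<'EOF'
groups:
  - name: pipeline-test
    rules:
      - alert: PipelineTest
        expr: vector(1)
        labels:
          severity: warning
        annotations:
          summary: "Test alert, safe to ignore"
EOF
}

# Usage:
#   print_test_rule
```

Remove the rule and redeploy once the message has reached the Matrix room, otherwise it re-sends every repeat_interval.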
Known Issues / Caveats
First-time chown
The role chowns data dirs to in-image UIDs (Prometheus/Alertmanager 65534, Grafana 472). If you bind-mount a directory created elsewhere by hand, re-run the role.
Bootstrap chicken-and-egg with Authentik
If Authentik is down, Grafana OIDC fails. The local admin password (vault_grafana_admin_password) still works at /login for emergency access.
:latest image lock
AutoUpdate=registry only updates if the registry digest changes. Grafana, Prometheus, Alertmanager all push :latest quickly — expect monthly rolls. Pin a specific tag for stability if needed.
Pod restart kills all containers
systemctl restart metrics-pod will stop and restart all three containers. Do this only when port-forward state is broken (very rare with the pod model). Normal config changes use single-container restart or /-/reload.