Metrics — Prometheus + Grafana + Alertmanager

Web UIs:

Grafana: https://grafana.home.helix9.org (Authentik OIDC login)
Prometheus: https://prometheus.home.helix9.org (LAN + Authentik forward-auth)
Alertmanager: https://alerts.home.helix9.org (LAN + Authentik forward-auth)

Host: metrics / metrics.home.lab
IP: 10.69.20.78
VLAN: 20 (SERVERS)
VMID: 278

Overview

Single Rocky Linux 10 LXC running three Podman containers grouped in one pod (metrics.pod). The pod gives them a shared network namespace, so containers reach each other on localhost and host-side port-forwards are bound at the pod level rather than the container level. That detail matters; see Troubleshooting → "alerts dashboard suddenly down after restart".

clients ─► Traefik ─► Grafana ─► Prometheus ─► node_exporter (every LXC, :9100)
                                      │
                                      └─► Alertmanager ─► hookshot ─► Matrix room

Prometheus scrapes node_exporter on every LXC in mgmt, servers, and dmz groups (loop driven by inventory), plus Prometheus and Alertmanager themselves. Alert rules route through Alertmanager → hookshot generic webhook → Matrix.

Grafana authenticates via Authentik OIDC. Group grafana-admins → Admin role, grafana-editors → Editor, fallback Viewer.

Infrastructure

LXC Container

Setting	Value
Node	pve02
VMID	278
IP	10.69.20.78/24
Gateway	10.69.20.1
CPU	2 cores
RAM	4096 MB
Swap	1024 MB
Disk	16 GB
Template	rockylinux-10-default
Unprivileged	yes

Ansible

Playbook: playbooks/metrics.yml
Role: roles/metrics/
Host vars: inventory/host_vars/metrics/{vars,vault}.yml
Runtime: Podman Quadlet — 1 pod + 3 containers
Images: docker.io/prom/prometheus:latest, docker.io/grafana/grafana:latest, docker.io/prom/alertmanager:latest
Data dir: /srv/metrics/{prometheus,grafana,alertmanager}

Quadlets

Files generated under /etc/containers/systemd/:

metrics.pod → systemd unit metrics-pod.service
prometheus.container → prometheus.service
alertmanager.container → alertmanager.service
grafana.container → grafana.service

systemctl status metrics-pod prometheus grafana alertmanager
systemctl restart prometheus
podman pod ps                # 1 pod, 4 containers (3 + infra)
podman ps
podman logs prometheus

Published Ports

Port	Service	Notes
9090	Prometheus	API + UI; published by pod
3000	Grafana	UI; published by pod
9093	Alertmanager	API + UI; published by pod

Inside the pod everything is localhost — Grafana datasource is http://localhost:9090, Prometheus's alertmanager target is localhost:9093, etc.

Scrape Targets

Generated from inventory by prometheus.yml.j2. Initial jobs:

- job_name: node           # node_exporter on every LXC
- job_name: prometheus     # self-scrape
- job_name: alertmanager   # self-scrape

To add another job, edit roles/metrics/templates/prometheus.yml.j2 and append, e.g.:

  - job_name: traefik
    static_configs:
      - targets: ["{{ hostvars['traefik'].ansible_host }}:8082"]

  - job_name: technitium
    metrics_path: /api/metrics
    static_configs:
      - targets: ["{{ hostvars['technitium'].ansible_host }}:5380"]

Re-run ansible-playbook playbooks/metrics.yml — handler hits POST /-/reload, no restart needed.

Authentik OIDC Provider (manual one-time setup)

Authentik admin → Applications → Providers → Create → OAuth2/OpenID Provider:

Name: grafana
Client ID: grafana (must match metrics_grafana_oidc_client_id in role defaults — Authentik auto-generates a random one, edit it back to grafana)
Client Secret: copy → vault_grafana_oidc_client_secret
Redirect URI: https://grafana.home.helix9.org/login/generic_oauth
Signing key: default

Applications → Applications → Create:

Name: Grafana
Slug: grafana
Provider: grafana
Policy engine mode: any

Directory → Groups → Create grafana-admins, grafana-editors. Add users.

Save secret to vault:

ansible-vault edit inventory/host_vars/metrics/vault.yml

Redeploy: ansible-playbook playbooks/metrics.yml.

Group → Grafana role mapping is set in grafana.env.j2 via GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH:

member of grafana-admins → Admin
member of grafana-editors → Editor
otherwise → Viewer
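
The mapping above is a first-match rule over the user's group list. The following Python sketch models it so edge cases (for example, a user in both groups) are easy to reason about. The function name resolve_grafana_role is hypothetical, and the sketch assumes the admin check is evaluated first, as the list order suggests; the real evaluation happens inside Grafana through the JMESPath expression in grafana.env.j2.

```python
def resolve_grafana_role(groups):
    """Return the Grafana role for a list of Authentik group names.

    Assumption: admins are checked before editors, and anyone in
    neither group falls back to Viewer.
    """
    if "grafana-admins" in groups:
        return "Admin"
    if "grafana-editors" in groups:
        return "Editor"
    return "Viewer"
```

Under that assumption, a user who belongs to both groups resolves to Admin, and a user with no matching group resolves to Viewer.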

Alert Routing → Matrix

Alertmanager has one receiver: matrix → hookshot generic webhook URL stored in vault_alertmanager_hookshot_webhook_url. Hookshot bridges to a Matrix room.

Create the webhook in hookshot (one-time):

Invite hookshot bot to your alerts room.
Make the bot a Moderator (Element → room settings → Roles & permissions → power level 50+).
In room: !hookshot webhook alertmanager
Bot replies with the URL — paste into vault.

Vault var:

vault_alertmanager_hookshot_webhook_url: "http://10.69.70.40:9000/webhook/<token>"

The hookshot bot must have a permissions: block in its config granting your MXID admin level — see roles/hookshot defaults hookshot_admin_mxids.

Cooldown (`repeat_interval`)

alertmanager.yml.j2:

route:
  group_wait: 30s         # bundle bursts
  group_interval: 5m      # how often to send updates while still firing
  repeat_interval: 4h     # how often to re-send unchanged firing alert

If you are waiting for a re-send and none arrives, the alert is still inside its repeat_interval window. Reduce the value to 1h or 2h for a tighter feedback loop. To force a re-send while testing, run systemctl restart alertmanager (resets in-memory notify state).
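
To estimate when the next re-send is due, the cooldown can be modelled as simple date arithmetic. This is a simplified sketch with hypothetical function names: Alertmanager only re-evaluates a group on its group_interval ticks, so the real notification may arrive up to one group_interval later than the time computed here.

```python
from datetime import datetime, timedelta


def earliest_resend(last_sent, repeat_interval=timedelta(hours=4)):
    """Earliest time an unchanged, still-firing alert can be re-sent."""
    return last_sent + repeat_interval


def in_cooldown(last_sent, now, repeat_interval=timedelta(hours=4)):
    """True while `now` is still inside the repeat_interval window."""
    return now < earliest_resend(last_sent, repeat_interval)
```

For a notification sent at 12:00 with the 4h default, in_cooldown stays true until 16:00; with repeat_interval lowered to 1h it turns false at 13:00.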

Initial alert rules

alert-rules.yml.j2:

NodeDown (5m no scrape) — critical
NodeHighMemoryUsage (>90% for 15m) — warning
NodeRootDiskFillingUp (<10% free for 30m) — warning
NodeHighLoad (load5 > 2x cores for 15m) — warning
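
Each rule above pairs a threshold with a hold time (the "for" part), so a brief spike does not fire. The sketch below models that hold-time behaviour in Python; is_firing and the sample layout are hypothetical, and Prometheus itself tracks this through its pending and firing states at each rule evaluation.

```python
def is_firing(samples, for_seconds):
    """Decide whether a condition has held long enough to fire.

    samples: list of (unix_ts, condition_true) tuples in ascending
    time order. Returns True when the condition has been true without
    interruption for at least `for_seconds`, measured back from the
    most recent sample.
    """
    if not samples or not samples[-1][1]:
        return False
    latest_ts = samples[-1][0]
    streak_start = latest_ts
    # Walk backwards until the first sample where the condition was false.
    for ts, ok in reversed(samples):
        if not ok:
            break
        streak_start = ts
    return latest_ts - streak_start >= for_seconds
```

For NodeHighLoad on this 2-core LXC, the condition per sample would be load5 > 4, and for_seconds would be 900 (15m).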

Add per-service rules incrementally so the first paging round isn't 50 noisy alerts.

Silencing Alerts

Four ways, depending on intent:

1. One-off / temporary (Alertmanager UI)

Best for: planned maintenance, transient noise, "it'll be back tomorrow".

https://alerts.home.helix9.org → Silences → New Silence:

Matcher: instance="10.69.70.50:9100" or alertname="NodeDown" (or both)
Duration: hours or days
Comment: explain why
Create

Auto-expires. Listed under Silences tab; can be deleted early.

CLI equivalent (amtool):

amtool silence add alertname=NodeDown instance=10.69.70.50:9100 \
  --duration=24h --comment="rebuilding LXC" --author=marko \
  --alertmanager.url=http://10.69.20.78:9093
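
The same silence can also be created over Alertmanager's HTTP API (POST /api/v2/silences), which is handy from scripts on hosts without amtool. A minimal sketch using only the standard library; build_silence and post_silence are hypothetical helper names, and only exact-match matchers are handled.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_silence(matchers, hours, comment, author, now=None):
    """Build the JSON body for POST /api/v2/silences.

    matchers: dict of label name -> exact value (no regex in this sketch).
    """
    start = now or datetime.now(timezone.utc)
    end = start + timedelta(hours=hours)
    return {
        "matchers": [
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": end.isoformat(),
        "createdBy": author,
        "comment": comment,
    }


def post_silence(body, base_url="http://10.69.20.78:9093"):
    """Send the silence and return the ID Alertmanager assigns to it."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/silences",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on a non-2xx response.
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["silenceID"]
```

Usage would look like post_silence(build_silence({"alertname": "NodeDown", "instance": "10.69.70.50:9100"}, 24, "rebuilding LXC", "marko")).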

2. Exclude permanently in the rule (code)

Best for: a host that is gone but inventory still lists it, or a known-flaky check you don't care about.

Edit roles/metrics/templates/alert-rules.yml.j2:

- alert: NodeDown
  expr: up{job="node", instance!="10.69.70.50:9100"} == 0
  ...

Multiple:

expr: up{job="node", instance!~"10\\.69\\.70\\.50:9100|10\\.69\\.70\\.10:9100"} == 0
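
Hand-escaping the dots gets error-prone once the list grows. The following sketch generates the matcher from a list of targets; instance_exclusion is a hypothetical helper, and it assumes Python 3.7+ (where re.escape leaves the colon untouched) and a double-quoted PromQL string, which is why each backslash is doubled.

```python
import re


def instance_exclusion(instances):
    """Return an `instance!~"..."` matcher that excludes the given targets.

    re.escape protects the dots; backslashes are then doubled because
    the regex sits inside a double-quoted PromQL string.
    """
    pattern = "|".join(re.escape(instance) for instance in instances)
    return 'instance!~"' + pattern.replace("\\", "\\\\") + '"'
```

Printing instance_exclusion(["10.69.70.50:9100", "10.69.70.10:9100"]) gives the same matcher as the hand-written expression above, ready to paste into the template.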

Redeploy. Cleaner than a permanent silence — surfaces in code review, not hidden in UI.

3. Stop scraping the host entirely

Best for: retired hosts, dev sandboxes you don't care about.

Remove from inventory/hosts.yml, or
Set enable_monitoring: false in its host_vars, or
Add an explicit drop in prometheus.yml.j2 scrape config.

After redeploy, the host disappears from /targets and no rules can fire on it.

4. Inhibit one alert via another (advanced)

alertmanager.yml.j2 already has:

inhibit_rules:
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [alertname, instance]

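Reads as: when a critical alert fires, suppress warnings that share the same alertname and instance. Because alertname is part of equal, this only deduplicates the warning and critical tiers of the same alert; it does not suppress NodeHighLoad while NodeDown fires for that host, since their alertnames differ. Covering that case would need equal: [instance] alone.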
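
To see exactly which pairs the rule suppresses, the matching can be modelled in a few lines of Python. This is a simplified sketch, not Alertmanager's implementation; is_inhibited and the DiskFull alert name are hypothetical.

```python
def is_inhibited(target, firing, equal=("alertname", "instance")):
    """Return True if `target` (a label dict) is suppressed by the rule.

    A warning is inhibited when some firing alert is critical and
    carries identical values for every label listed in `equal`.
    """
    if target.get("severity") != "warning":
        return False
    return any(
        source.get("severity") == "critical"
        and all(source.get(label) == target.get(label) for label in equal)
        for source in firing
    )


node_down = {"alertname": "NodeDown", "severity": "critical", "instance": "10.69.70.50:9100"}
high_load = {"alertname": "NodeHighLoad", "severity": "warning", "instance": "10.69.70.50:9100"}
disk_warn = {"alertname": "DiskFull", "severity": "warning", "instance": "10.69.70.50:9100"}
disk_crit = {"alertname": "DiskFull", "severity": "critical", "instance": "10.69.70.50:9100"}
```

With equal as configured, is_inhibited(disk_warn, [disk_crit]) is true, while is_inhibited(high_load, [node_down]) is false because the alertnames differ. Passing equal=("instance",) makes the second case true as well.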

Firewall (VyOS)

The metrics LXC is in SERVERS. To talk to other zones it traverses VyOS.

Required rules in SERVERS-SCAN chain (used for SERVERS → DMZ/HOMELAB/IOT/GUEST/TRUSTED):

Rule	Purpose
200	Prometheus scrape — `10.69.20.78 → */9100 tcp`
210	Alertmanager → hookshot — `10.69.20.78 → 10.69.70.40/9000 tcp`

Required rule in SERVERS-MGMT:

Rule	Purpose
50	Prometheus scrape MGMT — `10.69.20.78 → */9100 tcp`

Cross-zone matrix from SERVERS:

To zone	Chain
MGMT	SERVERS-MGMT
DMZ / HOMELAB / IOT / GUEST / TRUSTED	SERVERS-SCAN

If a target shows context deadline exceeded and the host is up, suspect a missing firewall rule.

Credentials

Item	Where
Grafana local admin (`admin`)	`vault_grafana_admin_password` — fallback if OIDC down
Grafana OIDC client_secret	`vault_grafana_oidc_client_secret`
Alertmanager → hookshot URL	`vault_alertmanager_hookshot_webhook_url`

All in inventory/host_vars/metrics/vault.yml (ansible-vault encrypted).

Operations

Reload config without restart

curl -X POST http://10.69.20.78:9090/-/reload
curl -X POST http://10.69.20.78:9093/-/reload

The Ansible role calls these via handlers when the rendered config changes — usually no manual step needed.

Restart the stack

systemctl restart prometheus etc. is safe — pod port-forwards stay intact (this is exactly why we run them in a pod).

Restart the pod itself only if you actually need to recreate the netns:

systemctl restart metrics-pod
# containers will follow because they have Pod=metrics.pod

Update images

Quadlets carry AutoUpdate=registry. The podman-auto-update.timer (enabled by roles/podman) pulls new :latest images daily and restarts containers. Manual:

podman auto-update
# or single container:
podman pull docker.io/prom/prometheus:latest
systemctl restart prometheus

Add a Grafana dashboard

UI: Dashboards → New → Import → paste an ID from grafana.com → select Prometheus datasource → Import. Recommended starters:

Node Exporter Full (1860)
Alertmanager (9578) — note: requires instance=alertmanager:9093 selected (not "All") for the single-stat panels to render
Prometheus 2.0 Stats (3662)

For permanent dashboards, drop JSON files into /srv/metrics/grafana/provisioning/dashboards/ (host-side) and add a provider yaml under /etc/grafana/provisioning/dashboards/.

Wipe TSDB

systemctl stop prometheus
rm -rf /srv/metrics/prometheus/data/*
systemctl start prometheus

Alertmanager / Grafana data dirs follow the same pattern.

Force-send queued alerts (testing)

systemctl restart alertmanager

Wipes in-memory notify-state, so still-firing alerts re-send immediately on next eval cycle. Only use during testing — in production it spams the alert channel.

Troubleshooting

"alerts dashboard suddenly down after restart"

After systemctl restart alertmanager, the UI or scrape returns connection refused or hangs.

Cause: netavark race when a container detaches+rejoins a shared network. Old DNAT rules persist, new rules conflict.

Fix (now that we use a pod): shouldn't happen — ports are bound at pod level, not per-container. If it does:

systemctl stop prometheus alertmanager grafana
nft delete table inet netavark
systemctl restart metrics-pod
sleep 2
systemctl start prometheus alertmanager grafana
nft list table inet netavark | grep -E 'dport (9090|9093|3000) dnat'   # want 3 lines

The pod's [Service] ExecStartPre=-/usr/sbin/nft delete table inet netavark clears stale rules whenever the pod (re)starts.

Prometheus target shows `connection refused`

node_exporter not running on that host. Check:

ssh <host> systemctl status node_exporter

If absent → run monitoring role on it (enable_monitoring: true in its group_vars + ansible-playbook playbooks/site.yml --limit <host>).

Prometheus target shows `context deadline exceeded`

Network blocked. Check VyOS firewall rules for SERVERS → that zone on port 9100. See Firewall section.
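
The two failure modes above can be triaged in bulk from Prometheus's /api/v1/targets endpoint, which reports health and lastError for every active target. A minimal sketch using only the standard library; the helper names and hint texts are hypothetical, and the base URL assumes the published port on the metrics LXC.

```python
import json
import urllib.request

# Substring of lastError -> likely cause, following the sections above.
HINTS = {
    "connection refused": "node_exporter is probably not running on the host",
    "context deadline exceeded": "traffic is probably blocked; check the VyOS rules",
}


def diagnose(last_error):
    """Map a scrape error string to a likely cause."""
    for needle, hint in HINTS.items():
        if needle in last_error:
            return hint
    return "unrecognised error; inspect the target manually"


def unhealthy_targets(api_response):
    """Yield (instance, hint) for every active target that is not up."""
    for target in api_response["data"]["activeTargets"]:
        if target["health"] != "up":
            instance = target["labels"].get("instance", target["scrapeUrl"])
            yield instance, diagnose(target["lastError"])


def fetch_targets(base_url="http://10.69.20.78:9090"):
    """Fetch and decode the targets listing from Prometheus."""
    with urllib.request.urlopen(base_url + "/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

Running `for instance, hint in unhealthy_targets(fetch_targets()): print(instance, hint)` from a host that can reach port 9090 lists each failing target with a first guess at the cause.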

Alertmanager logs `connect: connection timed out` on webhook

Same as above — VyOS blocking SERVERS → DMZ/wherever on the webhook port (typically 10.69.70.40:9000 for hookshot). Add a rule to SERVERS-SCAN.

"hookshot bot ignores `!hookshot webhook ...`"

Check that a permissions: block exists in the hookshot config (grep permissions /srv/hookshot/config.yml).
Your MXID is listed in hookshot_admin_mxids.
The bot has Moderator power level (PL ≥ 50) in the room.
Hookshot was restarted after the config change.

Grafana login → "Client ID Error"

Authentik provider's auto-generated client_id ≠ grafana. Edit the provider in Authentik admin and set Client ID to literally grafana (or update metrics_grafana_oidc_client_id to match what Authentik generated).

Grafana login → "Failed to resolve application"

Authentik has the Provider but no Application bound. Create Application, slug = grafana, link to provider, set policy engine mode = any.

Alert fires in Prometheus but no Matrix message

Check Alertmanager UI at https://alerts.home.helix9.org — is the alert there?
If yes, is it under the matrix receiver? podman logs alertmanager | grep webhook
Webhook delivery error? Likely firewall (see above).
No error but no Matrix message: webhook URL in vault might be stale (room recreated, hookshot bot kicked, etc.). Re-create webhook with !hookshot webhook ... and update vault.

"I'm waiting for the alert to re-send and it doesn't"

repeat_interval: 4h cooldown. Either wait, lower the interval in alertmanager.yml.j2, or systemctl restart alertmanager to reset notify state.

Containers can't talk to each other

Inside the pod, everything is localhost. If you wrote http://prometheus:9090 in a config, change it to http://localhost:9090 — the pod doesn't have container DNS for sibling names.

Tips

Test one alert end-to-end before adding many. Add a vector(1) rule labelled severity=warning, watch it travel: Prometheus alerts page → Alertmanager → Matrix room. If any hop fails, fix that before scaling rules.

Use amtool for muting during deploys.

podman exec alertmanager amtool silence add alertname=~".+" \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h --comment="deploy in progress"

Group thoughtfully. Default groups by [alertname, severity] — works for most. If you start getting 50-alert pages, group by instance instead.
Don't trust :latest forever. When podman-auto-update brings a breaking change (e.g., Grafana 12 → 13 OIDC behavior), pin a specific tag in metrics_grafana_version until you've validated.
Watch your own monitoring. Prometheus self-scrape plus an up{job="prometheus"} == 0 rule is a chicken-and-egg classic: if Prometheus is down, nothing evaluates the rule, so no one is alerted. Run a separate cheap check (uptime-kuma) against https://prometheus.home.helix9.org/-/ready.
For Grafana dashboards from the community, set the instance template variable explicitly rather than "All" — many community dashboards' single-stat panels break with multi-value variables.
Don't use HTTPS endpoints in prometheus.yml.j2 scrape targets unless you've configured tls_config. Stick to internal IPs over plain HTTP for scrape; reserve HTTPS for the human-facing Traefik routes.
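
For the end-to-end test in the first tip, the Alertmanager → hookshot → Matrix half can also be exercised on its own by posting a synthetic alert to Alertmanager's HTTP API (POST /api/v2/alerts). This sketch skips the Prometheus hop, so it complements the vector(1) rule rather than replacing it. The helper names, the PipelineTest alert name, and the label values are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(now=None, minutes=5):
    """Build a one-alert payload that resolves itself after `minutes`."""
    start = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "PipelineTest",
            "severity": "warning",
            "instance": "manual",
        },
        "annotations": {"summary": "Manual end-to-end test alert"},
        "startsAt": start.isoformat(),
        "endsAt": (start + timedelta(minutes=minutes)).isoformat(),
    }]


def post_alerts(alerts, base_url="http://10.69.20.78:9093"):
    """Send the alerts to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on a non-2xx response.
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

After post_alerts(build_test_alert()), the alert should show up in the Alertmanager UI and reach the Matrix room once group_wait has elapsed.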

Known Issues / Caveats

First-time chown

The role chowns data dirs to in-image UIDs (Prometheus/Alertmanager 65534, Grafana 472). If you bind-mount a directory created elsewhere by hand, re-run the role.

Bootstrap chicken-and-egg with Authentik

If Authentik is down, Grafana OIDC fails. The local admin password (vault_grafana_admin_password) still works at /login for emergency access.

`:latest` image lock

AutoUpdate=registry only updates if the registry digest changes. Grafana, Prometheus, and Alertmanager all publish new :latest images frequently, so expect roughly monthly rolls. Pin a specific tag for stability if needed.

Pod restart kills all containers

systemctl restart metrics-pod will stop and restart all three containers. Do this only when port-forward state is broken (very rare with the pod model). Normal config changes use single-container restart or /-/reload.
