Logs — VictoriaLogs + Vector

Web UI: https://logs.home.helix9.org/select/vmui/ (LAN + Authentik forward-auth)

Host: logs / logs.home.helix9.org | IP: 10.69.20.79 | VLAN: 20 (SERVERS) | VMID: 279

Overview

Central log aggregation for every managed LXC. A single Rocky Linux 10 LXC runs VictoriaLogs (single Go binary). Every other LXC runs Vector as a systemd service that tails journald and pushes to VictoriaLogs over the Elasticsearch bulk API.

[LXC: app-a]──┐
[LXC: app-b]──┼─► Vector (journald) ──HTTP──► logs:9428 (VictoriaLogs) ──► VMUI / Grafana
[LXC: app-c]──┘                                       │
                                                      └── /var/lib/victorialogs (30d retention)

VictoriaLogs is also scraped by Prometheus for self-metrics and provisioned as a second datasource in Grafana via the victoriametrics-logs-datasource plugin.

Why VictoriaLogs

- Single Go binary, ~35 MB RAM at our scale (no JVM, no Elasticsearch).
- Per-token + columnar index → sub-second full-text queries on millions of rows.
- ~10-20x lighter than Loki for our workload; 30d of logs fits in megabytes thanks to columnar compression.
- Native Prometheus /metrics; pairs cleanly with the existing stack.

Vector is the shipper because: small static binary, journald source built-in, runs on every LXC regardless of distro, and supports the VictoriaLogs ES API.

Infrastructure

LXC Container

Setting	Value
Node	pve02
VMID	279
IP	10.69.20.79/24
Gateway	10.69.20.1
CPU	1 core
RAM	1024 MB
Swap	512 MB
Disk	8 GB
Template	rockylinux-10-default
Unprivileged	yes

Ansible

Playbook: playbooks/logs.yml
Roles:
- roles/victorialogs/ — server (binary + systemd unit)
- roles/log_shipper/ — Vector on every LXC
Host vars: inventory/host_vars/logs/vars.yml
Runtime: native binaries via systemd (no containers)
Data dir: /var/lib/victorialogs

Published Ports

Port	Service	Notes
9428	VictoriaLogs HTTP	Ingest API + VMUI + `/metrics`

The same port is the sink target for Vector on every LXC.

Ingest Path

Vector config (/etc/vector/vector.yaml, rendered from roles/log_shipper/templates/vector.yaml.j2):

sources:
  journal:
    type: journald
    current_boot_only: true

transforms:
  enrich:
    type: remap
    inputs: [journal]
    source: |
      .host = "{{ inventory_hostname }}"
      unit = to_string!(._SYSTEMD_UNIT || "unknown")
      if match(unit, r'^[0-9a-f]{16,}\.(service|scope|mount)$') {
        unit = "transient"
      }
      .service = unit
      .vlan_group = "<group>"   # servers / mgmt / dmz

sinks:
  victorialogs:
    type: elasticsearch
    inputs: [enrich]   # a sink needs inputs; feed it the enriched events
    endpoints: ["http://10.69.20.79:9428/insert/elasticsearch/"]
    api_version: v8
    mode: bulk
    healthcheck:
      enabled: false   # VL does not implement /_cluster/health
    bulk:
      index: vector-logs
    query:
      _msg_field: "message"
      _time_field: "timestamp"
      _stream_fields: "host,service,vlan_group"

Stream fields determine how VictoriaLogs groups log entries into streams. They are chosen so that each (host, service, vlan_group) triple is one stream, which keeps filtering fast and cardinality low.

The transient collapse hides systemd's hex-named transient units (e.g. session scopes), which would otherwise bloat the service field with one-off values.
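
To check a unit name against the collapse pattern before touching the template, the same regex can be exercised in a shell. This is a minimal sketch: `collapse_unit` is a hypothetical helper name, and it uses POSIX extended regex via `grep -E` rather than VRL, so it only approximates what the transform does.

```shell
# Hypothetical helper: mirrors the enrich transform's regex with grep -E.
collapse_unit() {
  unit="${1:-unknown}"
  if printf '%s\n' "$unit" | grep -Eq '^[0-9a-f]{16,}\.(service|scope|mount)$'; then
    echo "transient"
  else
    echo "$unit"
  fi
}

collapse_unit "sshd.service"                 # → sshd.service
collapse_unit "0123456789abcdef0123.scope"   # → transient
collapse_unit "deadbeef.scope"               # → deadbeef.scope (only 8 hex chars)
```

Names with fewer than 16 hex characters are deliberately left alone, so short legitimate unit names are not collapsed.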

Query Language — LogsQL

VMUI accepts LogsQL. Quick examples:

Query	Meaning
`*`	everything
`_stream:{host="podman"}`	one host
`_stream:{service="sshd.service"}`	sshd across the fleet
`_stream:{vlan_group="dmz"}`	every DMZ host
`error OR fail`	full-text across all streams
`_stream:{host="metrics"} error _time:1h`	combine: stream + text + time
`_time:5m`	last 5 minutes
`_stream:{host="logs"} | stats by (service) count()`	grouped counts
`_stream:{host=~"pve.*"}`	regex on stream label
`_msg:"connection refused"`	exact phrase in the message body

_time: accepts 5m, 1h, 2026-05-12T00:00:00Z, or [t1, t2] ranges.
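
The same queries can also be run outside VMUI against the HTTP query endpoint. A minimal sketch, assuming the `/select/logsql/query` endpoint on the server address from this page; `vl_query`, `VL_URL`, and the `CURL` override are illustrative names, not part of the deployment.

```shell
VL_URL="${VL_URL:-http://10.69.20.79:9428}"
CURL="${CURL:-curl}"   # override (e.g. CURL=echo) to print the call instead of sending it

# vl_query <LogsQL> [limit]: prints matching entries, one JSON object per line.
vl_query() {
  "$CURL" -s "${VL_URL}/select/logsql/query" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-10}"
}

# Example calls (left commented so sourcing this file sends nothing):
# vl_query '_stream:{host="metrics"} error _time:1h'
# vl_query '_stream:{host="logs"} | stats by (service) count()' 50
```

Passing the query through `--data-urlencode` avoids having to escape braces, quotes, and pipes by hand.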

Web UI

https://logs.home.helix9.org/select/vmui/

- Query tab: LogsQL with stream-field auto-complete on the left.
- Overview tab: stream and field facets with row counts.
- Live tail: top-right toggle, streams new events matching the current query.

Auth: local-or-auth middleware in Traefik (LAN bypass + Authentik forward-auth for external access).

Grafana Integration

A second datasource, VictoriaLogs, is provisioned alongside Prometheus in the Grafana instance on the metrics LXC.

Plugin: victoriametrics-logs-datasource (auto-installed via GF_INSTALL_PLUGINS env).
Datasource URL: http://10.69.20.79:9428
Provisioning file: roles/metrics/templates/grafana-datasource-logs.yml.j2

Grafana → Explore → datasource picker → VictoriaLogs. LogsQL syntax, same as VMUI.

Self-Monitoring

Prometheus scrapes 10.69.20.79:9428/metrics. Notable series:

Series	Use
`vl_rows_ingested_total`	ingest rate
`vl_storage_data_size_bytes`	on-disk size
`vl_streams_created_total`	new streams (label cardinality)
`vl_http_request_errors_total`	ingest/query errors
`up{job="victorialogs"}`	liveness

Worth adding alert rules later (none yet) — e.g. up == 0 for 5 min, or sudden drop in vl_rows_ingested_total indicating shippers stopped.
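
The two alerts suggested above could start from something like the following. It is a sketch only: the group name, alert names, thresholds, severity labels, and the scratch output path are all assumptions, and the zero-ingest expression is a simple stand-in for "sudden drop". The snippet writes the rules to a temporary file so they can be reviewed (for example with `promtool check rules`) before anything is added to the Prometheus config.

```shell
# Write candidate Prometheus alert rules to a scratch file for review.
RULES_FILE="${RULES_FILE:-/tmp/victorialogs-alerts.yml}"   # hypothetical path

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: victorialogs            # hypothetical group name
    rules:
      - alert: VictoriaLogsDown
        expr: up{job="victorialogs"} == 0
        for: 5m
        labels:
          severity: critical
      - alert: VictoriaLogsNoIngest
        # No rows ingested at all: shippers have probably stopped.
        expr: sum(rate(vl_rows_ingested_total[5m])) == 0
        for: 15m
        labels:
          severity: warning
EOF

echo "wrote $RULES_FILE"
```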

Retention

Retention is set by the -retentionPeriod=30d flag in the systemd unit. VictoriaLogs stores data in per-day partitions and automatically drops partitions that fall outside the retention window.

To change: edit victorialogs_retention in roles/victorialogs/defaults/main.yml (or override per-host in inventory/host_vars/logs/vars.yml) and redeploy:

ansible-playbook playbooks/logs.yml --tags victorialogs

Roughly 10 MB/day of raw input compresses to well under 5 MB/day on disk. The 8 GB rootfs is overkill and could be shrunk if we want.
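
As a worked upper bound using only the figures on this page (5 MB/day treated as a ceiling, 30-day retention, 8 GB rootfs), integer shell arithmetic gives:

```shell
# Upper-bound disk estimate for log data (integer MB).
RETENTION_DAYS=30
MB_PER_DAY_ON_DISK=5      # "well under 5 MB/day", so this is a ceiling
ROOTFS_MB=$((8 * 1024))   # 8 GB rootfs

MAX_LOGS_MB=$((RETENTION_DAYS * MB_PER_DAY_ON_DISK))
PCT_OF_ROOTFS=$((MAX_LOGS_MB * 100 / ROOTFS_MB))

echo "max log data: ${MAX_LOGS_MB} MB (~${PCT_OF_ROOTFS}% of rootfs)"
# → max log data: 150 MB (~1% of rootfs)
```

The estimate covers log data only; the OS and binaries on the same rootfs are not included.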

Operations

Check server

ssh logs systemctl status victorialogs
ssh logs curl -s localhost:9428/health        # → OK
ssh logs du -sh /var/lib/victorialogs

Check a shipper

ssh <host> systemctl status vector
ssh <host> journalctl -u vector -n 50

If the Vector logs show `Healthcheck disabled.` and `Starting journalctl.`, it is shipping normally. Elasticsearch sink errors mean either that VictoriaLogs is unreachable or that the /_cluster/health healthcheck got re-enabled (it must stay off).

Restart shipper on one host

ssh <host> systemctl restart vector

Restart server

ssh logs systemctl restart victorialogs

This is safe: Vector buffers events locally during the restart, retries, and resumes shipping once the server is back.

Roll out a Vector config change

ansible-playbook playbooks/logs.yml

Renders new vector.yaml everywhere, restarts Vector on each host via handler.

Add a new field to the stream

Edit roles/log_shipper/templates/vector.yaml.j2 — extend the enrich transform, then add the new field name to _stream_fields in the sink's query: block. Re-run the playbook. Note: new fields only appear on new data; old entries keep old shape.

Wipe storage

ssh logs systemctl stop victorialogs
ssh logs 'rm -rf /var/lib/victorialogs/*'   # quoted so the glob expands on the server, not locally
ssh logs systemctl start victorialogs

Bump VictoriaLogs version

Edit victorialogs_version in inventory/host_vars/logs/vars.yml (overrides the role default). Re-run:

ansible-playbook playbooks/logs.yml --limit logs

The role's version check re-downloads only when needed, and the handler restarts the service. Confirm that the release exists at https://github.com/VictoriaMetrics/VictoriaLogs/releases; the artifact pattern is victoria-logs-linux-amd64-vX.Y.Z.tar.gz.

Bump Vector version

Edit vector_version in roles/log_shipper/defaults/main.yml. Re-run playbooks/logs.yml. The role downloads from packages.timber.io/vector/<ver>/... (musl static build).

Troubleshooting

VMUI returns 403

You hit the local-or-auth chain and its local-only middleware denied your IP. Confirm that your source IP is inside traefik_local_networks (10.69.0.0/16 or 192.168.178.0/24). If it is denied, you get a flat 403 (no redirect to Authentik) because chained middlewares stop on the first failure.

From outside the LAN, you are redirected through Authentik instead.

If DNS for logs.home.helix9.org doesn't yet resolve, the request never reaches Traefik's logs router and you may hit a wildcard fallback. Run playbooks/vyos_dns.yml to register the subdomain.

Vector logs `Healthcheck failed. Unexpected status: 400 Bad Request`

VictoriaLogs doesn't implement /_cluster/health, which Vector's ES sink probes by default. Confirm healthcheck: { enabled: false } is set in vector.yaml. Re-render via the playbook if missing.

VictoriaLogs log spam `unsupported path requested: /_cluster/health` (server side)

Same root cause as above, observed from VictoriaLogs' point of view. Fix in the shipper, not the server.

Logs from one host missing

ssh <host> systemctl status vector
ssh <host> journalctl -u vector -n 100

Common causes:

- Vector not installed (host not in lxc group? enable_logging: false?).
- Sink unreachable (firewall blocking <host> → 10.69.20.79:9428).
- Vector user not in systemd-journal group → no journald read access.
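
The three causes can be checked in one pass on the affected host. A rough sketch, assuming the service runs as a user named `vector` and that the `/health` endpoint used elsewhere on this page is reachable from the host when the firewall allows it; `in_group`, `check_shipper`, and `SINK_URL` are illustrative names.

```shell
SINK_URL="${SINK_URL:-http://10.69.20.79:9428}"

# in_group <space-separated group list> <group>: succeeds if the group is listed.
in_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints one line per problem found; prints nothing if all three checks pass.
check_shipper() {
  systemctl is-active --quiet vector || echo "vector is not running"
  curl -s -m 5 "${SINK_URL}/health" >/dev/null || echo "sink unreachable"
  in_group "$(id -nG vector 2>/dev/null)" systemd-journal \
    || echo "vector user is not in systemd-journal"
}

# Run on the host: check_shipper
```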

Hex-soup unit names appearing again

This means a new transient unit pattern is not matched by the regex in the enrich transform. Extend the pattern in vector.yaml.j2:

if match(unit, r'^[0-9a-f]{16,}\.(service|scope|mount)$') { ... }

…adding the new suffix (e.g. slice, timer) if needed.

Stream cardinality explodes

If vl_streams_created_total rate ramps up, something is putting a high-cardinality value into a stream field (e.g. a PID or session ID into service). Bring it back to a small enumerable set. Each unique (host, service, vlan_group) combo is one stream.
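
To find which host is minting new service values, a LogsQL stats query can count distinct service values per host. A sketch, assuming the `/select/logsql/query` endpoint and the LogsQL `count_uniq` stats function; `services_per_host_query`, `VL_URL`, and the default one-hour window are illustrative choices.

```shell
VL_URL="${VL_URL:-http://10.69.20.79:9428}"

# Build the LogsQL text: distinct service values per host over a time window.
services_per_host_query() {
  printf '_time:%s | stats by (host) count_uniq(service) as services | sort by (services desc)\n' "${1:-1h}"
}

# Run it against the server (left commented so sourcing this file sends nothing):
# curl -s "${VL_URL}/select/logsql/query" \
#   --data-urlencode "query=$(services_per_host_query 1h)"
```

A host whose count keeps climbing between runs is the likely source of the new streams.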

Ansible SSH fails — "agent refused operation" or "Connection closed"

ssh-agent has a hardware key (FIDO/SK) loaded. The ssh client offers it first, the agent refuses to sign (touch required), and MaxAuthTries is exhausted before id_ed25519_ansible is offered. The fix is baked into ansible.cfg:

ssh_args = ... -o IdentitiesOnly=yes -o IdentityAgent=none

If you still see it, confirm those flags are present.

Credentials

None — VictoriaLogs has no auth of its own. Access control is purely network (LAN bypass) + Authentik forward-auth in Traefik for external traffic. The ingest endpoint on 10.69.20.79:9428 is open to anything inside the VLAN mesh, which is fine for now; VyOS limits cross-zone traffic.

If you ever need auth: drop vmauth in front, or use Traefik's basic-auth middleware on a separate ingest hostname.

Known Issues / Caveats

Old transient unit names already stored

The hex-name collapse only applies to new ingest. Old events keep their original service value until 30-day retention drops them. The Stream fields panel in VMUI will keep showing those values until they age out.

No alerting on logs yet

vmalert isn't wired up. Could be added later — Alertmanager is already running in the metrics LXC, so adding vmalert pointing at VictoriaLogs and forwarding to Alertmanager is a small lift.

Single-node, no HA

OSS VictoriaLogs is single-binary; no replication, no clustering. At our volume and SLA that's fine. Back up /var/lib/victorialogs via Proxmox Backup if log retention is precious.

Vector buffers in memory

data_dir: /var/lib/vector is set, but the ES sink uses an in-memory queue by default. A long VictoriaLogs outage will drop events on Vector restart. For larger setups, switch the sink to disk buffer (buffer.type: disk).

Healthcheck must stay off

Vector's ES sink calls GET /_cluster/health which VictoriaLogs answers 400. Vector treats that as a hard failure at startup → Vector won't start. We disable the healthcheck explicitly. Don't remove that line.
