Logs — VictoriaLogs + Vector

Web UI: https://logs.home.helix9.org/select/vmui/ (LAN + Authentik forward-auth)

Host: logs / logs.home.helix9.org | IP: 10.69.20.79 | VLAN: 20 (SERVERS) | VMID: 279


Overview

Central log aggregation for every managed LXC. A single Rocky Linux 10 LXC runs VictoriaLogs (single Go binary). Every other LXC runs Vector as a systemd service that tails journald and pushes to VictoriaLogs over the Elasticsearch bulk API.

[LXC: app-a]──┐
[LXC: app-b]──┼─► Vector (journald) ──HTTP──► logs:9428 (VictoriaLogs) ──► VMUI / Grafana
[LXC: app-c]──┘                                 │
                                                └── /var/lib/victorialogs (30d retention)

VictoriaLogs is also scraped by Prometheus for self-metrics and provisioned as a second datasource in Grafana via the victoriametrics-logs-datasource plugin.


Why VictoriaLogs

  • Single Go binary, ~35 MB RAM at our scale (no JVM, no Elasticsearch).
  • Per-token + columnar index → sub-second full-text queries on millions of rows.
  • ~10-20x lighter than Loki for our workload; 30d of logs fits in under ~150 MB (at under 5 MB/day on disk) thanks to columnar compression.
  • Native Prometheus /metrics; pairs cleanly with the existing stack.

Vector is the shipper because: small static binary, journald source built-in, runs on every LXC regardless of distro, and supports the VictoriaLogs ES API.


Infrastructure

LXC Container

Setting        Value
Node           pve02
VMID           279
IP             10.69.20.79/24
Gateway        10.69.20.1
CPU            1 core
RAM            1024 MB
Swap           512 MB
Disk           8 GB
Template       rockylinux-10-default
Unprivileged   yes

Ansible

  • Playbook: playbooks/logs.yml
  • Roles:
    • roles/victorialogs/ — server (binary + systemd unit)
    • roles/log_shipper/ — Vector on every LXC
  • Host vars: inventory/host_vars/logs/vars.yml
  • Runtime: native binaries via systemd (no containers)
  • Data dir: /var/lib/victorialogs

Published Ports

Port   Service             Notes
9428   VictoriaLogs HTTP   Ingest API + VMUI + /metrics

VictoriaLogs is also Vector's sink target on every LXC.


Ingest Path

Vector config (/etc/vector/vector.yaml, rendered from roles/log_shipper/templates/vector.yaml.j2):

sources:
  journal:
    type: journald
    current_boot_only: true

transforms:
  enrich:
    type: remap
    inputs: [journal]
    source: |
      .host = "{{ inventory_hostname }}"
      unit = to_string!(._SYSTEMD_UNIT || "unknown")
      if match(unit, r'^[0-9a-f]{16,}\.(service|scope|mount)$') {
        unit = "transient"
      }
      .service = unit
      .vlan_group = "<group>"  # servers / mgmt / dmz

sinks:
  victorialogs:
    type: elasticsearch
    inputs: [enrich]  # sinks need an explicit input; here, the enrich transform
    endpoints: ["http://10.69.20.79:9428/insert/elasticsearch/"]
    api_version: v8
    mode: bulk
    healthcheck:
      enabled: false  # VL does not implement /_cluster/health
    bulk:
      index: vector-logs
    query:
      _msg_field: "message"
      _time_field: "timestamp"
      _stream_fields: "host,service,vlan_group"

Stream fields = how VictoriaLogs partitions data; chosen so each (host, service, vlan_group) triple is one stream — fast filtering, low cardinality.
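
To see how these query parameters shape a stream, a hand-built bulk request can be sent to the same endpoint. The sketch below is only an illustration: json_escape and bulk_payload are hypothetical helper names, the timestamp is left out on the assumption that VictoriaLogs then falls back to ingestion time, and the _bulk suffix is the usual Elasticsearch bulk path under the endpoint Vector already uses.

```shell
# Escape backslashes and double quotes for use inside a JSON string.
# Control characters (newlines, tabs) are not handled in this sketch.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Print a two-line Elasticsearch bulk payload: action line, then the document.
# Arguments: message, host, service, vlan_group (only the message is escaped).
bulk_payload() {
  printf '{"create":{}}\n'
  printf '{"message":"%s","host":"%s","service":"%s","vlan_group":"%s"}\n' \
    "$(json_escape "$1")" "$2" "$3" "$4"
}

# Usage (not run here; sends one test event and creates a manual-test stream):
#   bulk_payload 'manual test' logs manual-test servers |
#     curl -s -X POST -H 'Content-Type: application/json' --data-binary @- \
#       'http://10.69.20.79:9428/insert/elasticsearch/_bulk?_msg_field=message&_stream_fields=host,service,vlan_group'
```

Each distinct (host, service, vlan_group) triple sent this way becomes its own stream, so test values are best kept to a single throwaway service name.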

The transient collapse hides systemd's hex-named scope units (e.g. session scopes) which would otherwise bloat the service field.
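
The collapse rule can be checked against sample unit names before touching the template. This is a rough shell equivalent of the VRL logic, not the transform itself: collapse_unit is a hypothetical helper and grep -E stands in for VRL's match().

```shell
# Mirror the enrich transform: empty unit -> "unknown",
# hex-named service/scope/mount units -> "transient", anything else unchanged.
collapse_unit() {
  unit="${1:-unknown}"
  if printf '%s\n' "$unit" | grep -Eq '^[0-9a-f]{16,}\.(service|scope|mount)$'; then
    unit="transient"
  fi
  printf '%s\n' "$unit"
}

collapse_unit 0123456789abcdef0123.scope   # 20 hex chars: collapsed
collapse_unit sshd.service                 # regular unit: kept as-is
collapse_unit deadbeef.scope               # only 8 hex chars: kept as-is
```

Names with fewer than 16 hex characters pass through untouched, which is why a short hash-like unit can still show up in the service field.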


Query Language — LogsQL

VMUI accepts LogsQL. Quick examples:

Query                                                  Meaning
*                                                      everything
_stream:{host="podman"}                                one host
_stream:{service="sshd.service"}                       sshd across the fleet
_stream:{vlan_group="dmz"}                             every DMZ host
error OR fail                                          full-text across all streams
_stream:{host="metrics"} error _time:1h                combine: stream + text + time
_time:5m                                               last 5 minutes
_stream:{host="logs"} | stats by (service) count()     grouped counts
_stream:{host=~"pve.*"}                                regex on stream label
_msg:"connection refused"                              exact phrase in the message body

_time: accepts 5m, 1h, 2026-05-12T00:00:00Z, or [t1, t2] ranges.
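
The same queries can be run from a shell against the HTTP query endpoint, which is handy for scripting. A minimal sketch, assuming direct LAN access to port 9428: vl_query is a hypothetical wrapper, and the CURL variable is only there so the call can be dry-run with CURL=echo.

```shell
# Server address from the Published Ports table; override VL_URL if needed.
VL_URL="${VL_URL:-http://10.69.20.79:9428}"

# Run a LogsQL query and print the matching entries (one JSON object per line).
# Arguments: LogsQL query, optional row limit (default 10).
vl_query() {
  "${CURL:-curl}" "$VL_URL/select/logsql/query" -s \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-10}"
}

# Usage (not run here):
#   vl_query '_stream:{host="metrics"} error _time:1h' 20
#   CURL=echo vl_query 'error OR fail'      # print the arguments instead of sending
```

Using --data-urlencode keeps braces, quotes and pipes in the query from being mangled by the shell or the URL.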


Web UI

https://logs.home.helix9.org/select/vmui/

  • Query tab — LogsQL with stream-field auto-complete on the left.
  • Overview tab — stream and field facets with row counts.
  • Live tail — top-right toggle, websocket stream of new events matching the current query.

Auth: local-or-auth middleware in Traefik (LAN bypass + Authentik forward-auth for external access).


Grafana Integration

A second datasource — VictoriaLogs — is provisioned in the same metrics LXC's Grafana, alongside Prometheus.

  • Plugin: victoriametrics-logs-datasource (auto-installed via GF_INSTALL_PLUGINS env).
  • Datasource URL: http://10.69.20.79:9428
  • Provisioning file: roles/metrics/templates/grafana-datasource-logs.yml.j2

Grafana → Explore → datasource picker → VictoriaLogs. LogsQL syntax, same as VMUI.


Self-Monitoring

Prometheus scrapes 10.69.20.79:9428/metrics. Notable series:

Series                          Use
vl_rows_ingested_total          ingest rate
vl_storage_data_size_bytes      on-disk size
vl_streams_created_total        new streams (label cardinality)
vl_http_request_errors_total    ingest/query errors
up{job="victorialogs"}          liveness

Worth adding alert rules later (none yet) — e.g. up == 0 for 5 min, or sudden drop in vl_rows_ingested_total indicating shippers stopped.
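
Until alert rules exist, ingest can be spot-checked by hand from the /metrics output. A minimal sketch, assuming the counter is exposed as one or more labelled series without a trailing timestamp; sum_metric is a hypothetical helper.

```shell
# Sum every series of one metric from Prometheus text format on stdin.
# Argument: exact metric name, without labels.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      metric = $1
      brace = index(metric, "{")         # strip the label set, if any
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) total += $NF   # sample value is the last field
    }
    END { printf "%d\n", total }
  '
}

# Usage (not run here): take two readings a minute apart.
#   curl -s http://10.69.20.79:9428/metrics | sum_metric vl_rows_ingested_total
```

If the two readings are identical, no shipper delivered anything in between, which is the same condition a future "sudden drop" alert would watch for.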


Retention

-retentionPeriod=30d flag in the systemd unit. Older data is dropped automatically on the next storage merge pass.

To change: edit victorialogs_retention in roles/victorialogs/defaults/main.yml (or override per-host in inventory/host_vars/logs/vars.yml) and redeploy:

ansible-playbook playbooks/logs.yml --tags victorialogs

Raw input of ~10 MB/day compresses to well under 5 MB/day on disk, so 30 days of retention needs less than 150 MB. The 8 GB rootfs is overkill and could be shrunk if we want.
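
Before changing the retention value it helps to redo the sizing arithmetic. The sketch below uses the 5 MB/day on-disk figure as an upper bound and treats the 8 GB rootfs as 8192 MB; both function names are hypothetical, and the 90d case is only an example value.

```shell
# Estimated on-disk size in MB: daily on-disk volume (MB) times retention (days).
retention_disk_mb() {
  echo $(( $1 * $2 ))
}

# Share of a disk that an estimate occupies, as a whole percent rounded down.
# Arguments: estimate in MB, disk size in MB.
disk_share_pct() {
  echo $(( $1 * 100 / $2 ))
}

retention_disk_mb 5 30     # current setting: 5 MB/day for 30 days
retention_disk_mb 5 90     # same volume with a 90d retention
disk_share_pct 450 8192    # the 90d estimate against the 8 GB rootfs
```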


Operations

Check server

ssh logs systemctl status victorialogs
ssh logs curl -s localhost:9428/health # → OK
ssh logs du -sh /var/lib/victorialogs

Check a shipper

ssh <host> systemctl status vector
ssh <host> journalctl -u vector -n 50

If the Vector logs show "Healthcheck disabled." and "Starting journalctl.", Vector is shipping normally. ES sink errors mean either VictoriaLogs is unreachable or the /_cluster/health healthcheck got re-enabled (it must stay off).

Restart shipper on one host

ssh <host> systemctl restart vector

Restart server

ssh logs systemctl restart victorialogs

Safe — Vector buffers locally during the restart and resumes.

Roll out a Vector config change

ansible-playbook playbooks/logs.yml

Renders new vector.yaml everywhere, restarts Vector on each host via handler.

Add a new field to the stream

Edit roles/log_shipper/templates/vector.yaml.j2 — extend the enrich transform, then add the new field name to _stream_fields in the sink's query: block. Re-run the playbook. Note: new fields only appear on new data; old entries keep old shape.

Wipe storage

ssh logs systemctl stop victorialogs
ssh logs 'rm -rf /var/lib/victorialogs/*'
ssh logs systemctl start victorialogs

Bump VictoriaLogs version

Edit victorialogs_version in inventory/host_vars/logs/vars.yml (overrides the role default). Re-run:

ansible-playbook playbooks/logs.yml --limit logs

Role's version check re-downloads only when needed; handler restarts. Confirm release URL works at https://github.com/VictoriaMetrics/VictoriaLogs/releases — artifact pattern victoria-logs-linux-amd64-vX.Y.Z.tar.gz.

Bump Vector version

Edit vector_version in roles/log_shipper/defaults/main.yml. Re-run playbooks/logs.yml. The role downloads from packages.timber.io/vector/<ver>/... (musl static build).


Troubleshooting

VMUI returns 403

The request hit the local-or-auth chain and its local-only step denied your IP. Confirm your source IP is inside traefik_local_networks (10.69.0.0/16 or 192.168.178.0/24). A denied IP gets a flat 403 (no redirect to Authentik) because chained middlewares stop on the first failure.

Outside LAN → you'll redirect through Authentik instead.

If DNS for logs.home.helix9.org doesn't yet resolve, the request never reaches Traefik's logs router and you may hit a wildcard fallback. Run playbooks/vyos_dns.yml to register the subdomain.
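
For the source-IP check, the membership test against traefik_local_networks can be done by hand. A minimal sketch for IPv4 only, with the two networks hard-coded from the values quoted above; ip_to_int, in_cidr and is_local are hypothetical helpers, and octets with leading zeros are not supported.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 falls inside CIDR block $2 (for example 10.69.0.0/16).
in_cidr() {
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Succeed if the address is in one of the LAN-bypass networks.
is_local() {
  for cidr in 10.69.0.0/16 192.168.178.0/24; do
    in_cidr "$1" "$cidr" && return 0
  done
  return 1
}

# Usage:
#   is_local 10.69.20.79 && echo "LAN bypass applies"
```

If the address Traefik sees is not the one you expect (for example after NAT or a VPN hop), check the router's access log for the real source IP first.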

Vector logs Healthcheck failed. Unexpected status: 400 Bad Request

VictoriaLogs doesn't implement /_cluster/health, which Vector's ES sink probes by default. Confirm healthcheck: { enabled: false } is set in vector.yaml. Re-render via the playbook if missing.

VictoriaLogs log spam unsupported path requested: /_cluster/health (server side)

Same root cause as above, observed from VictoriaLogs' point of view. Fix in the shipper, not the server.

Logs from one host missing

ssh <host> systemctl status vector
ssh <host> journalctl -u vector -n 100

Common causes:

  • Vector not installed (host not in lxc group? enable_logging: false?).
  • Sink unreachable (firewall blocking <host> → 10.69.20.79:9428).
  • Vector user not in systemd-journal group → no journald read access.
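
The three causes above can be walked through in one pass on the affected host. This is a sketch under assumptions: it presumes the service runs as a user named vector (not confirmed by the role), and has_group and check_shipper are hypothetical helpers.

```shell
# Succeed if group $2 appears in the space-separated group list $1.
has_group() {
  case " $1 " in
    *" $2 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run on the host whose logs are missing; prints one line per failed check.
check_shipper() {
  systemctl is-active --quiet vector ||
    echo "vector service is not active"
  has_group "$(id -nG vector 2>/dev/null)" systemd-journal ||
    echo "vector user is not in systemd-journal"
  curl -s -o /dev/null --max-time 5 http://10.69.20.79:9428/health ||
    echo "cannot reach VictoriaLogs on 10.69.20.79:9428"
}

# Usage (not run here):
#   check_shipper
```

No output means all three checks passed, in which case the Vector journal (journalctl -u vector) is the next place to look.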

Hex-soup unit names appearing again

A new transient unit pattern is not matched by the regex in the enrich transform. Extend the pattern in vector.yaml.j2:

if match(unit, r'^[0-9a-f]{16,}\.(service|scope|mount)$') { ... }

…adding the new suffix (e.g. slice, timer) if needed.

Stream cardinality explodes

If vl_streams_created_total rate ramps up, something is putting a high-cardinality value into a stream field (e.g. a PID or session ID into service). Bring it back to a small enumerable set. Each unique (host, service, vlan_group) combo is one stream.
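
To find which host is responsible, counting distinct service values per host is usually enough. A minimal sketch that only builds the LogsQL string; cardinality_query is a hypothetical helper, and the query assumes host and service are stored as the enrich transform sets them.

```shell
# Build a LogsQL query that counts distinct service values per host.
# Argument: time window for the _time filter (default 1h).
cardinality_query() {
  printf '_time:%s | stats by (host) count_uniq(service) as services | sort by (services desc)\n' "${1:-1h}"
}

# Usage (not run here):
#   curl -s http://10.69.20.79:9428/select/logsql/query \
#     --data-urlencode "query=$(cardinality_query 6h)"
```

A host whose count is far above the rest is the one leaking a high-cardinality value into service.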

Ansible SSH fails — "agent refused operation" or "Connection closed"

ssh-agent has a hardware key (FIDO/SK) loaded; the ssh client offers it first, the agent refuses (touch required), and MaxAuthTries is exhausted before id_ed25519_ansible is offered. Fix is baked into ansible.cfg:

ssh_args = ... -o IdentitiesOnly=yes -o IdentityAgent=none

If you still see it, confirm those flags are present.


Credentials

None — VictoriaLogs has no auth of its own. Access control is purely network (LAN bypass) + Authentik forward-auth in Traefik for external traffic. The ingest endpoint on 10.69.20.79:9428 is open to anything inside the VLAN mesh, which is fine for now; VyOS limits cross-zone traffic.

If you ever need auth: drop vmauth in front, or use Traefik's basic-auth middleware on a separate ingest hostname.


Known Issues / Caveats

Old transient unit names already stored

The hex-name collapse only applies to new ingest. Old events keep their original service value until 30-day retention drops them. The Stream fields panel in VMUI will keep showing those values until they age out.

No alerting on logs yet

vmalert isn't wired up. Could be added later — Alertmanager is already running in the metrics LXC, so adding vmalert pointing at VictoriaLogs and forwarding to Alertmanager is a small lift.

Single-node, no HA

OSS VictoriaLogs is single-binary; no replication, no clustering. At our volume and SLA that's fine. Back up /var/lib/victorialogs via Proxmox Backup if log retention is precious.

Vector buffers in memory

data_dir: /var/lib/vector is set, but the ES sink uses an in-memory queue by default. A long VictoriaLogs outage will drop events on Vector restart. For larger setups, switch the sink to disk buffer (buffer.type: disk).

Healthcheck must stay off

Vector's ES sink calls GET /_cluster/health which VictoriaLogs answers 400. Vector treats that as a hard failure at startup → Vector won't start. We disable the healthcheck explicitly. Don't remove that line.