Metrics
Metrics are the first signal of system health, but collecting them at scale is only the start. In my bare-metal Talos Linux cluster, I needed to instrument everything from Longhorn and Postgres to the monitoring stack itself. The challenge isn’t just collecting metrics; it’s also structuring, aggregating, and alerting on them in a way that surfaces meaningful operational insight without overwhelming the system.
Below is a live dashboard showing the ~500k active series at ~20k samples/sec that the described setup is handling.
What follows is everything that went into standing this up.
Goals
The monitoring system was designed to deliver fast, reliable insights while remaining resilient under failures. The key objectives focused on data availability, query responsiveness, timely alerts, and fault tolerance.
- High data availability: 99.9% of metrics should be ingested and accessible within 60 seconds.
- Responsive queries: Ad hoc queries across any dimension should return results quickly enough to feel seamless to the user.
- Timely alerts: All alerts should trigger within five minutes of the underlying condition.
- Single-node fault tolerance: The system should remain fully operational if any single Kubernetes node fails, with no impact on end-user experience.
- Multi-node fault tolerance: The system should tolerate any two non-critical node failures without data loss, except in cases where a failed node is actively scraping metrics, where temporary data gaps are acceptable.
Constraints
The monitoring stack had to operate within a constrained and unreliable compute environment. The Kubernetes cluster consists of 11 worker nodes with relatively small instance sizes (4–12 vCPU, 6–48 GB RAM) and consistently high workload utilization. As a result, observability components must impose minimal additional CPU and memory overhead.
Infrastructure reliability is also a factor. Nodes periodically experience extreme CPU steal time (occasionally exceeding 80%) and sustained I/O wait spikes (sometimes greater than 20 s). These conditions can temporarily degrade or stall individual nodes without necessarily causing full node failures.
Given these characteristics, the design requirement is continuous metric coverage despite partial node impairment. A single node becoming unavailable—or temporarily stalled due to steal or I/O wait—should not create gaps in metric collection. The monitoring system therefore needs to tolerate node loss and degraded nodes while avoiding heavyweight high-availability patterns that would significantly increase cluster resource consumption.
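To make the relationship between stall duration and coverage gaps concrete, here is a rough back-of-the-envelope sketch. It assumes the approximate figures quoted in the introduction (~500k active series, ~20k samples/sec) and treats every series as if it were scraped at one uniform interval, which is a simplification. The function names are hypothetical.

```python
import math


def implied_scrape_interval(active_series, samples_per_sec):
    """Average number of seconds between two samples of the same series."""
    return active_series / samples_per_sec


def missed_scrapes(stall_seconds, interval_seconds):
    """Return (fewest, most) scheduled scrapes that can fall inside a stall.

    The exact count depends on where the stall starts relative to the
    scrape schedule, so this gives a range rather than a single number.
    """
    ratio = stall_seconds / interval_seconds
    return math.floor(ratio), math.ceil(ratio)


interval = implied_scrape_interval(500_000, 20_000)
print(interval)                      # 25.0
print(missed_scrapes(20, interval))  # (0, 1)
print(missed_scrapes(60, interval))  # (2, 3)
```

Under these assumptions, a 20 s I/O wait spike can already cost one scrape of a target, and a one-minute stall costs two or three.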
Predicted Failure Modes
Given the resource constraints and underlying infrastructure instability, several failure modes are expected and must be tolerated by the monitoring architecture.
- Node unavailability exceeding scrape intervals. Metric collection relies on periodic scraping of workloads running on cluster nodes. If a node becomes unavailable or stalls longer than the configured scrape interval, the collector responsible for that target cannot obtain metrics for that cycle. In practice this can occur not only from node failure but also from extreme CPU steal or prolonged I/O wait. The system therefore assumes that short gaps may occur and focuses on preventing single-node failure from causing cluster-wide coverage loss.
- Collector loss or shard interruption. Scraping is performed by collectors running on cluster nodes. If a collector instance fails or becomes stalled, the targets assigned to that shard may temporarily go unscraped until rescheduling or recovery occurs. In a highly constrained cluster, rescheduling latency itself may exceed one or more scrape intervals.
- Retry storms and ingestion pressure. When collectors or network paths recover after a disruption, buffered samples may be retried simultaneously. In a resource-constrained cluster this can create a thundering-herd effect at the ingestion layer, temporarily overwhelming the metrics backend and increasing request latency or failure rates.
- Out-of-order sample generation. Recovery from collector interruption can also produce delayed samples that arrive after newer ones. Large volumes of out-of-order samples significantly increase ingestion cost because the storage layer must perform additional validation and ordering work. Minimizing these conditions is therefore an explicit operational goal.
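The retry-storm failure mode is commonly softened with capped exponential backoff plus random jitter, so that clients recovering at the same moment do not all retry in lockstep. The following is a minimal sketch of that general technique ("full jitter"), not the retry logic of Alloy, Kafka, or Mimir. The function name and default values are hypothetical.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles with each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: choose uniformly between 0 and the ceiling so that
        # many clients retrying at once spread out over the whole window.
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Two clients that failed at the same instant get different schedules.
print(backoff_delays(5, rng=random.Random(1)))
print(backoff_delays(5, rng=random.Random(2)))
```

The cap matters as much as the jitter here: delays that grow without bound would push recovered samples further behind newer ones, which feeds the out-of-order problem described above.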
Design
Several options were evaluated for metrics storage: Prometheus, VictoriaMetrics, Thanos, and Grafana Mimir.
A traditional high-availability Prometheus deployment was rejected because it requires duplicate scrapers and query-time deduplication, which increases CPU overhead. Long-term storage also typically depends on object-store integrations that add operational complexity disproportionate to the cluster’s size.
VictoriaMetrics was considered but not selected due to weaker handling of very high-cardinality workloads and a less mature ecosystem compared to Cortex-derived systems.
This left Thanos and Mimir, both of which provide horizontally scalable, Prometheus-compatible storage backed by object storage. Mimir was selected because its centralized architecture reduces query fan-out and coordination overhead, resulting in lower resource usage and better query efficiency in a resource-constrained cluster.
Metric Collection
Metrics are scraped using Grafana Alloy rather than standalone Prometheus instances. Alloy supports clustered operation with distributed target sharding, so scrape workloads are automatically partitioned across collectors and redistributed if a node becomes unavailable. This provides scrape-layer high availability without the duplicate scrapers, and the matching ingest-time deduplication in Mimir, that an HA pair of Prometheus instances would require.
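The useful property of this kind of sharding is that losing one collector only moves the targets that collector owned; everything else stays where it was. The sketch below illustrates the property with rendezvous (highest-random-weight) hashing as a stand-in. It is not Alloy's actual implementation, and the collector and target names are hypothetical.

```python
import hashlib


def owner(target, collectors):
    """Pick the collector with the highest hash score for this target."""
    def score(collector):
        digest = hashlib.sha256(f"{collector}|{target}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(collectors, key=score)


def assign(targets, collectors):
    """Map every target to exactly one collector."""
    return {target: owner(target, collectors) for target in targets}


collectors = ["alloy-0", "alloy-1", "alloy-2"]
targets = [f"node-{i}:10250" for i in range(11)]

before = assign(targets, collectors)
# Simulate the node hosting alloy-1 becoming unavailable.
after = assign(targets, [c for c in collectors if c != "alloy-1"])

moved = [t for t in targets if before[t] != after[t]]
# Only targets that alloy-1 used to own have been reassigned.
print(all(before[t] == "alloy-1" for t in moved))
```

Because each target has exactly one owner at any time, no sample is written twice, which is why no deduplication step is needed downstream.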
Alloy was also selected because the observability stack was expected to expand beyond metrics. Since Alloy natively supports metrics, logs, traces, and more within a single configuration, adopting it early avoided introducing a separate log ingestion system later.
Scraping Implementation
Implementing Alloy meant scraping kubelet endpoints, which requires a ClusterRole patch granting read access to each node's metrics. The following snippet shows the minimal RBAC change applied to allow Alloy to collect node-level metrics:
{
  "op": "add",
  "path": "/rules/-",
  "value": {
    "apiGroups": [""],
    "resources": ["nodes/metrics"],
    "verbs": ["get"]
  }
}
This was applied in a simple patchesJson6902 section via Kustomize.
Kustomize was selected to render the Helm chart because it makes it easy to roll multiple discrete Alloy configuration files into one ConfigMap, which the Alloy configuration reloader then reads.
With this in place, Alloy can automatically discover nodes and wire them into Prometheus-style scraping for kubelet metrics. A sample configuration illustrating node discovery and kubelet scraping looks like this:
discovery.kubernetes "nodes" {
  role = "node"
}

prometheus.scrape "kubelet_metrics_cadvisor" {
  targets           = discovery.kubernetes.nodes.targets
  forward_to        = [prometheus.remote_write.default.receiver]
  bearer_token_file = "/var/run/secrets/kubernetes.io/serviceaccount/token"
  metrics_path      = "/metrics/cadvisor"
}
For pod- and service-level metrics, Alloy simplifies the process considerably.
Workloads exposing ServiceMonitors and PodMonitors are automatically detected and scraped via the prometheus.operator.servicemonitors and prometheus.operator.podmonitors components, respectively, which reduces operational overhead and avoids the need for custom scrape configs for each service.
Kafka Implementation for Mimir Ingestion
The ingestion pipeline relies on Kafka as a durable buffer between Mimir's distributor and ingester within the write path.
As mentioned previously, the available nodes aren’t well suited to the high disk throughput that Kafka requires. To mitigate this, I tainted and labeled three nodes with workload/kafka so that Strimzi schedules the Kafka brokers only on them. Reserving these nodes for Kafka prevents interference from other workloads, which is critical because disk I/O is the bottleneck: under heavy load, I’ve observed 20–30 s of iowait, which is catastrophic for Kafka performance. Additionally, as described in my article about implementing pod priority classes, I marked these workloads as high-priority so that they are the last to be evicted.
Disk capacity is also constrained. Each Kafka PVC is ~200 GB, and retention is limited to 3 hours. This aligns with Mimir’s fast ingestion rate: messages that aren’t consumed within that window would have limited value, while this configuration still leaves headroom for node failures or recovery scenarios.
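A quick calculation shows how much room this leaves. The sketch below assumes the ~20k samples/sec rate quoted in the introduction stays constant, treats 200 GB as 200 × 10^9 bytes, and takes the worst case where a single broker stores a copy of every sample. The on-disk size of a sample in Kafka is not measured here, so the result is a budget rather than a usage figure. The function name is hypothetical.

```python
def retention_budget(samples_per_sec, retention_hours, pvc_bytes):
    """Return (samples written during retention, byte budget per sample)."""
    samples = samples_per_sec * retention_hours * 3600
    return samples, pvc_bytes / samples


samples, budget = retention_budget(20_000, 3, 200 * 10**9)
print(samples)           # 216000000
print(round(budget, 1))  # 925.9
```

In other words, one PVC would only fill within the retention window if each sample cost more than roughly 900 bytes on disk, which is where the headroom for failure and recovery scenarios comes from.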
Mimir Performance Tuning
Mimir’s default ingestion zone configuration sets three ingesters per zone, but with only 11 worker nodes, three of which are reserved for Kafka, this would have forced two ingesters onto the same node. That node would then concentrate I/O and memory pressure and become a critical point of failure. To avoid this, I reduced the number of ingesters per zone to two, so the ingesters can be spread across separate nodes for better distribution and reliability.
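The placement problem is a pigeonhole argument, sketched below. The zone count of three is an assumption (the text above only gives the per-zone figure), and the function name is hypothetical. The eight schedulable nodes are the 11 workers minus the three reserved for Kafka.

```python
import math


def min_worst_case_per_node(zones, ingesters_per_zone, nodes):
    """Smallest possible 'most ingesters on any one node' for a layout.

    Even a perfect scheduler cannot do better than this bound.
    """
    return math.ceil(zones * ingesters_per_zone / nodes)


schedulable_nodes = 11 - 3  # workers minus the Kafka-reserved nodes

print(min_worst_case_per_node(3, 3, schedulable_nodes))  # 2
print(min_worst_case_per_node(3, 2, schedulable_nodes))  # 1
```

With three ingesters per zone, nine ingesters have to share eight nodes, so at least one node carries two. Dropping to two per zone gives six ingesters, and each can have a node to itself.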
The default CPU and memory requests and limits set by Mimir were also far too high for my hardware and budget. After testing, I reduced both RAM and CPU allocations by roughly 90%, which significantly lowers cluster resource pressure without impacting operational correctness. The trade-off is slower ingestion: writes now take approximately 1–2 s at p99 and 100 ms at p50, compared to Mimir’s recommended 50 ms at p99.
In practice, this is acceptable. Disk I/O remains the primary bottleneck, and Kafka plus Mimir’s ingester batching ensures that write queues don’t back up. The system remains fully operational and reliable, with performance tuned to the cluster’s real-world constraints rather than vendor defaults.
Alerting
Prometheus alerts are defined in Kubernetes as PrometheusRule CRDs, and Mimir ships with them enabled in its meta-monitoring configuration. Alloy automatically detects these rules and installs them using mimir.rules.kubernetes, while Mimir’s Alertmanager handles delivery. In this setup, alerts are configured to surface in Discord, providing an external record of firing events. Having alerts appear in an external system improves the user experience by capturing a specific timestamp for each firing—so there’s no need to retroactively ask whether an alert occurred.
Conclusion
Despite running in a constrained and occasionally unstable environment, this architecture has performed exactly as intended and has consistently met the availability, latency, and resilience goals outlined earlier. In practice, the system reliably ingests and serves hundreds of thousands of active series while remaining tolerant to node stalls, transient failures, and ingestion pressure. Additionally, it has since grown to accept metrics from Grafana Tempo, handling the additional load without any difference in quality of service.
Below is the main operational dashboard, showing overall system health live: