Logging
Logs are a core component of observability. They provide detailed insight into application behavior, system events, and failure modes that are often difficult to understand through metrics or traces alone.
In the course of operating a bare-metal Kubernetes cluster, I needed a reliable way to collect and centralize logs from both cluster infrastructure and running workloads. This required designing a logging pipeline that could ingest logs from multiple nodes, process them consistently, and store them in a system suitable for querying and analysis.
This article describes the approach I used to implement log ingestion for that environment, along with the design considerations and trade-offs involved.
Goals
The logging solution needed to meet several practical requirements for operating a Kubernetes cluster in production:
- Automatic coverage: Logs should be collected from every pod running in the cluster without requiring additional configuration per application. New workloads should automatically have their logs ingested as they are deployed.
- Observable logging pipeline: The logging system itself should expose metrics and health signals so that failures, backlogs, or degraded performance can be detected and investigated.
- Resilience to temporary outages: If the logging backend becomes temporarily unavailable, logs should, where feasible, be buffered and delivered once connectivity is restored rather than lost during the outage window.
- Human-friendly querying: Logs should include Kubernetes metadata so they can be queried and filtered by dimensions such as namespace, pod, and container. This allows operators to quickly narrow investigations to relevant workloads or environments.
Constraints
The cluster environment imposed a few important constraints that influenced the logging design. They are fundamentally the same as those outlined in my article about implementing a metrics pipeline. To reiterate, low node headroom is the primary concern, followed by high iowait times.
Technology Selection
Several logging backends are common in Kubernetes environments, including ELK/EFK, Loki, and VictoriaLogs.
Elasticsearch was ruled out due to its high resource requirements and an indexing model designed for broad log search, which is unnecessary for typical Kubernetes queries that already narrow results by metadata and time.
Between Loki and VictoriaLogs, I opted for Loki. I was already deeply familiar with how it works under the hood, in particular indexes versus structured metadata and unpacking at query time, from my earlier work enabling queryable logs in a Docker-based environment.
The log collector is Grafana Alloy. Alloy was already deployed in the cluster to handle metrics scraping and forwarding, so extending it to also tail Kubernetes container logs was a natural choice. Using the same agent for both metrics and logs keeps the operational surface area small and avoids introducing an additional node-level component solely for log collection.
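As a rough illustration of what such a collection pipeline can look like, here is a minimal sketch rather than the exact configuration used in this cluster. The component labels (`pods`, `pod_logs`, `default`), the Loki push URL, and the retry values are assumptions chosen for the example. It discovers every pod, copies namespace, pod, and container metadata into log labels, and retries delivery with backoff when the backend is unreachable.

```alloy
// Discover every pod in the cluster, so new workloads are picked up
// without any per-application configuration.
discovery.kubernetes "pods" {
  role = "pod"
}

// Copy Kubernetes discovery metadata into labels that can be queried in Loki.
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }

  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
}

// Tail container logs for the discovered pods through the Kubernetes API.
loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.write.default.receiver]
}

// Ship logs to Loki. The URL is a hypothetical in-cluster address.
loki.write "default" {
  endpoint {
    url = "http://loki-gateway.logging.svc/loki/api/v1/push"

    // Retry failed pushes with exponential backoff so that short backend
    // outages do not immediately result in dropped logs (assumed values).
    min_backoff_period  = "500ms"
    max_backoff_period  = "5m"
    max_backoff_retries = 20
  }
}
```

Retries only cover outages that fit within the backoff budget. Batches that exhaust their retries are dropped, which is why the resilience goal is qualified with "where feasible".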
Implementation
One useful characteristic of Grafana Alloy is that its configuration can be hot-reloaded. Changes to log processing behavior can therefore be applied without restarting the agent across the cluster. This allows adjustments to be made quickly when investigating production issues.
In practice, this allows pods to run with debug logging enabled while the debug lines are simply dropped at collection time, using a configuration such as:
    // Assumes an earlier stage has already set `level` as a label.
    stage.match {
      selector = "{level=\"debug\"}"
      action   = "drop"
    }
If an issue occurs, the configuration can be adjusted so that those logs are temporarily retained and shipped to the backend, without requiring additional time for pod restarts to produce the desired debug logs.
The trade-off is that this filtering happens after the log line has already been read by the collector. Debug lines therefore still consume network resources in the collector pathway even though they are normally discarded. This cost was deemed acceptable because of the reduction in MTTR it affords when investigating issues that might only surface in debug logs.
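To make the temporary retention concrete, the drop rule can be narrowed so that debug lines from one workload pass through while everything else is still filtered. The following is a sketch under several assumptions: applications log JSON with a `level` field, a `namespace` label is already attached to each log entry, a `loki.write` component labelled `default` is defined elsewhere, and `checkout` is a hypothetical namespace under investigation.

```alloy
loki.process "filter" {
  forward_to = [loki.write.default.receiver]

  // Extract the `level` field from JSON log lines (assumed log format).
  stage.json {
    expressions = { level = "level" }
  }

  // Promote the extracted value to a label so stage.match can select on it.
  stage.labels {
    values = { level = "" }
  }

  // Drop debug lines everywhere except the namespace being investigated.
  stage.match {
    selector = "{level=\"debug\", namespace!=\"checkout\"}"
    action   = "drop"
  }
}
```

Once the investigation is over, reverting the selector to `{level=\"debug\"}` and reloading the configuration restores the default behaviour without restarting any pods.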
Monitoring / Alerting
Monitoring and alerting for the logging stack required very little additional configuration because of the previously mentioned Mimir and Alloy metrics pipeline, which was already in place.
To recap, Alloy is responsible for discovering PrometheusRule resources in the cluster and loading their alerting and recording rules. This mechanism was originally implemented to support metrics ingestion for Grafana Mimir, allowing rule definitions to be managed through standard Kubernetes resources.
The Loki Helm chart includes a set of pre-defined monitoring rules that cover common operational concerns such as ingestion failures, component restarts, and storage backpressure. When monitoring is enabled in the chart, these rules are exposed automatically as PrometheusRule resources.
Because Alloy was already configured to discover these resources cluster-wide, the Loki alerts were automatically loaded into the existing monitoring pipeline without requiring any additional configuration. As a result, the logging stack immediately benefited from a baseline set of operational alerts with essentially zero additional integration work.
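For reference, cluster-wide rule discovery of this kind can be expressed with a single Alloy component. This is a sketch only: the component label and the ruler address are assumptions, not values taken from this cluster.

```alloy
// Discover PrometheusRule resources and sync their alerting and recording
// rules to the Mimir ruler. With no namespace or label selectors set,
// rules from all namespaces are included.
mimir.rules.kubernetes "default" {
  address = "http://mimir-ruler.monitoring.svc:8080" // hypothetical address
}
```

Because no selectors are configured, any PrometheusRule resources created by the Loki chart are picked up in the same way as rules from other workloads.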
Conclusion
Building a reliable logging pipeline for a bare-metal Kubernetes cluster requires careful attention to operational constraints, resource limitations, and maintainability. By selecting Loki as the backend and leveraging Grafana Alloy for log collection, it was possible to achieve automatic, cluster-wide log ingestion without adding operational complexity or requiring changes to application workloads.
The combination of metadata-aware log processing, hot-reloadable collector configuration, and integration with the existing metrics and alerting system provides a robust, observable solution. This approach ensures that logs are available when needed for troubleshooting, while minimizing resource overhead and administrative effort.
Overall, the solution demonstrates that with careful design and thoughtful tool selection, a production-grade logging stack can be both lightweight and fully integrated into existing Kubernetes observability workflows.