Prometheus Query Language (PromQL) enables precise extraction and manipulation of time-series data for monitoring and alerting. Beyond basic selection and aggregation, advanced use cases require a deeper understanding of operators, functions, and query design patterns. This article examines techniques to construct complex queries, optimize performance, and address specific analytical challenges.
Aggregation Operators and Vector Matching
Aggregation operators (sum, avg, max) collapse multiple time series into fewer series by grouping on label dimensions. While basic usage involves by or without clauses, advanced scenarios require explicit control over vector matching.
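For example, to estimate the 90th percentile latency per service and then summarize it across services while preserving the cluster label:
quantile by (cluster) (0.9, histogram_quantile(0.9, sum by (le, service, cluster) (rate(http_request_duration_seconds_bucket[5m]))))
Here, histogram_quantile computes per-service latency quantiles, which are then aggregated across services using quantile while retaining the cluster label. The inner sum must keep the le label, because histogram_quantile needs the bucket boundaries, and a by clause can only be attached to an aggregation operator, not to a function such as histogram_quantile. Note that the result is a quantile of per-service quantiles, not the true 90th percentile of all requests in the cluster.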
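To make the estimation step less opaque, here is a minimal Python sketch of how a quantile can be interpolated from cumulative histogram buckets. It is a simplification for illustration, not Prometheus's implementation: it assumes 0 < q <= 1, positive bucket bounds, a final +Inf bucket, and the sample bucket values are invented.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 < q <= 1, positive finite bounds,
    a final +Inf bucket, and non-decreasing cumulative counts.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]          # the +Inf bucket counts every observation
    if total == 0:
        return float("nan")
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound: report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Hypothetical cumulative counts for le = 0.1, 0.5, 1.0, +Inf.
sample_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
p50 = histogram_quantile(0.5, sample_buckets)
p99 = histogram_quantile(0.99, sample_buckets)
```

With these buckets, the 0.99 rank lands in the +Inf bucket, so the estimate is capped at the highest finite bound. This is why bucket boundaries should bracket the latencies you care about.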
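When combining metrics with different labels, use group_left or group_right to declare which side of the operation is the "many" side of a many-to-one match; the labels listed in parentheses are copied from the "one" side:
container_memory_usage_bytes{container="app"} * on (pod) group_left(node) kube_pod_info
Because kube_pod_info has a constant value of 1, the multiplication leaves the memory usage unchanged, while group_left(node) explicitly allows the right-hand vector (kube_pod_info) to contribute the node label. If pod names can repeat across namespaces, match on (namespace, pod) instead.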
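The many-to-one match can be pictured as a keyed join. The following Python sketch is an illustration only: the function name mul_group_left and the sample pods and nodes are hypothetical, and metric-name handling is ignored.

```python
def mul_group_left(many, one, on, extra):
    """Multiply each `many` sample by its single match in `one`,
    copying the `extra` labels from the `one` side."""
    index = {}
    for labels, value in one:
        key = tuple(labels.get(name) for name in on)
        if key in index:
            # A many-to-one match requires a unique series on the "one" side.
            raise ValueError(f"duplicate match on the 'one' side for {key}")
        index[key] = (labels, value)
    result = []
    for labels, value in many:
        key = tuple(labels.get(name) for name in on)
        if key not in index:
            continue  # unmatched series are dropped from the result
        one_labels, one_value = index[key]
        merged = dict(labels)
        for name in extra:
            if name in one_labels:
                merged[name] = one_labels[name]
        result.append((merged, value * one_value))
    return result

# Hypothetical samples: (labels, value).
memory = [
    ({"pod": "app-1", "container": "app"}, 512.0),
    ({"pod": "app-1", "container": "sidecar"}, 64.0),
    ({"pod": "app-2", "container": "app"}, 256.0),
]
pod_info = [
    ({"pod": "app-1", "node": "node-a"}, 1.0),
    ({"pod": "app-2", "node": "node-b"}, 1.0),
]
joined = mul_group_left(memory, pod_info, on=["pod"], extra=["node"])
```

Both containers of app-1 match the same pod_info series, which is exactly the many-to-one situation that group_left permits.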
Mathematical Functions and Transformations
PromQL supports arithmetic, logarithmic, and exponential functions to transform data. For instance, ln and exp are useful for analyzing exponential growth patterns:
ln(increase(http_requests_total[1h]))
This calculates the natural logarithm of the number of requests added over the last hour, linearizing multiplicative trends for easier comparison.
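A quick numeric check shows why the logarithm linearizes multiplicative growth. The values below are invented: a request increase that doubles every hour.

```python
import math

# Hypothetical hourly request increases that double every hour.
increases = [100 * 2 ** hour for hour in range(5)]

# Taking the logarithm turns the doubling into a constant additive step.
logs = [math.log(v) for v in increases]
steps = [b - a for a, b in zip(logs, logs[1:])]
```

Every entry in steps equals ln 2 (up to floating-point error), so the transformed series plots as a straight line.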
The clamp_min and clamp_max functions constrain metric values to specified ranges. To ensure disk usage percentages are interpreted correctly:
clamp_min(clamp_max(node_filesystem_usage_percent, 100), 0)
This query bounds values to the range 0 to 100, handling outliers caused by filesystem mounting artifacts. Newer Prometheus releases also provide clamp(v, min, max), which applies both bounds in one call.
Filtering and Logical Operations
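Filtering time series based on dynamic thresholds or label values requires combining comparison operators with logical operators. For example, to identify services with error rates exceeding 5% and request rates below 100 requests per minute:
(sum by (service) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (service) (rate(http_requests_total[5m]))) > 0.05
and
sum by (service) (rate(http_requests_total[5m])) * 60 < 100
The and operator keeps a left-hand series only when a series with the same label set exists on the right, so both conditions must be met. Because binary operators match on full label sets, both sides are aggregated to the service label first; without that, the status label on the numerator would restrict the division to 5xx series only. Since rate returns a per-second value, the request rate is multiplied by 60 before it is compared against the per-minute threshold.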
To exclude time series with specific label combinations, use unless:
up{job="api"} == 1
unless
(up{env="staging"} or up{instance=~"backup.*"})
This returns healthy api job instances (the == 1 filter drops targets whose last scrape failed), excluding those in staging or matching the backup.* instance pattern.
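The set semantics of and and unless can be sketched in a few lines of Python. This is an illustration under simplifying assumptions: series are matched on their complete label sets, metric names are ignored, and the sample targets are hypothetical.

```python
def label_key(labels):
    """Hashable identity of a series: its full label set."""
    return frozenset(labels.items())

def vector_and(left, right):
    """Keep left samples whose label set also appears on the right."""
    right_keys = {label_key(labels) for labels, _ in right}
    return [(labels, v) for labels, v in left if label_key(labels) in right_keys]

def vector_unless(left, right):
    """Keep left samples whose label set does NOT appear on the right."""
    right_keys = {label_key(labels) for labels, _ in right}
    return [(labels, v) for labels, v in left if label_key(labels) not in right_keys]

# Hypothetical `up` samples for the api job.
up_api = [
    ({"job": "api", "env": "prod", "instance": "web-1"}, 1.0),
    ({"job": "api", "env": "staging", "instance": "web-2"}, 1.0),
    ({"job": "api", "env": "prod", "instance": "backup-1"}, 1.0),
]
# Series that are in staging or whose instance name starts with "backup".
excluded = [
    s for s in up_api
    if s[0]["env"] == "staging" or s[0]["instance"].startswith("backup")
]
kept = vector_unless(up_api, excluded)
dropped = vector_and(up_api, excluded)
```

Values on the right-hand side never reach the result; only the presence of a matching label set matters.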
The bool modifier converts comparison results to 1 (true) or 0 (false), enabling arithmetic operations on boolean outcomes:
sum(rate(http_errors_total[5m]) > bool 0) by (service)
This counts, for each service, the number of series with a non-zero error rate.
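The difference between a filtering comparison and one using bool is easiest to see side by side. The sketch below uses hypothetical helper names and invented error rates; it mimics the behavior rather than reproducing Prometheus code.

```python
def greater_than(vector, threshold, as_bool=False):
    """Comparison against a scalar: filter by default, 0/1 values with as_bool."""
    if as_bool:
        return [(labels, 1.0 if value > threshold else 0.0) for labels, value in vector]
    return [(labels, value) for labels, value in vector if value > threshold]

def sum_by(vector, label):
    """Sum sample values grouped by a single label."""
    totals = {}
    for labels, value in vector:
        totals[labels[label]] = totals.get(labels[label], 0.0) + value
    return totals

# Hypothetical per-instance error rates.
error_rates = [
    ({"service": "cart", "instance": "a"}, 0.4),
    ({"service": "cart", "instance": "b"}, 0.0),
    ({"service": "search", "instance": "a"}, 0.0),
]
filtered = sum_by(greater_than(error_rates, 0), "service")
counted = sum_by(greater_than(error_rates, 0, as_bool=True), "service")
```

Without bool, the search service disappears from the result and cart reports the summed rate; with bool, every service stays present and the sum becomes a count of failing series.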
Subqueries and Nested Evaluation
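Subqueries evaluate an inner expression repeatedly over a specified time range at a chosen resolution, producing a range vector that functions such as max_over_time can consume. They are particularly useful for computing aggregations over sliding windows. For example, to determine the peak 5-minute average request latency over the past hour:
max_over_time(
  (
    avg(rate(http_request_duration_seconds_sum[5m]))
    /
    avg(rate(http_request_duration_seconds_count[5m]))
  )[1h:5m]
)
The subquery [1h:5m] evaluates the 5-minute average latency at 5-minute intervals across the last hour; the parentheses make it apply to the whole ratio rather than only to the denominator. The outer max_over_time identifies the peak value. To cover a full day, widen the range, for example [1d:5m].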
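Subqueries can significantly increase computational load. To mitigate this, choose a subquery resolution (the value after the colon) no finer than necessary, and align it with the query step interval where possible:
quantile_over_time(0.95, latency_seconds[1h:10s] @ end())
Here, the subquery evaluates at 10-second resolution; a coarser resolution would produce fewer data points to process. If the resolution is omitted, the global evaluation interval is used. The @ end() modifier pins the evaluation to the end of the query range.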
Rate vs. Irate: Handling Counter Resets
The rate and irate functions calculate the per-second rate of increase of counters, but differ in how they handle sparse or volatile data.
rate computes the average per-second increase over the entire lookback window, smoothing transient spikes. Use it for alerting on sustained increases:
rate(node_network_receive_bytes_total[2m]) > 1e6
irate uses the last two data points within the lookback window, capturing sudden changes. It is suited for debugging real-time traffic:
irate(node_network_receive_bytes_total[5m]) > 5e6
When counters reset (e.g., due to process restarts), both functions automatically handle decreases by interpreting them as counter resets. However, irate may produce misleading results if the reset occurs between the last two samples, because its entire result then rests on that single pair.
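The contrast can be sketched in Python. This is a deliberately simplified model: the function names are hypothetical, the samples are invented, and the real rate additionally extrapolates towards the window boundaries, which is omitted here.

```python
def increase_with_resets(samples):
    """Total counter increase over (timestamp, value) samples,
    treating any drop in value as a reset to zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

def simple_rate(samples):
    """Average per-second increase across all samples (no extrapolation)."""
    duration = samples[-1][0] - samples[0][0]
    return increase_with_resets(samples) / duration

def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    return simple_rate(samples[-2:])

# Hypothetical counter scraped every 15s; it resets between t=30 and t=45.
window = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]
avg_rate = simple_rate(window)       # uses all four intervals
instant_rate = simple_irate(window)  # uses only the (45, 60) interval

# If the reset falls between the last two samples, irate sees only that pair.
reset_at_end = [(0, 100.0), (15, 130.0), (30, 5.0)]
irate_after_reset = simple_irate(reset_at_end)
```

In the second series the steady rate was 2 per second before the restart, but the irate-style result is 5/15, because the only information left is the counter value accumulated since the reset.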
Predictive Functions and Forecasting
The predict_linear function forecasts future metric values using simple linear regression. To estimate when a disk will be full:
predict_linear(node_filesystem_avail_bytes[1h], 3600 * 4) <= 0
This matches when the last hour’s trend predicts that available space will reach zero within four hours. The function fits a least-squares line to the samples in the range and extrapolates it.
Note that predict_linear assumes trends remain constant. Abrupt changes in usage patterns (e.g., log rotation) can invalidate predictions.
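The regression behind such a forecast can be reproduced with ordinary least squares. The sketch below is a Python illustration, not Prometheus's code: it assumes at least two samples with distinct timestamps, and the disk figures are invented.

```python
def predict_linear(samples, eval_time, seconds_ahead):
    """Fit a least-squares line to (timestamp, value) samples and
    return its value `seconds_ahead` seconds after `eval_time`."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (eval_time + seconds_ahead) + intercept

# Hypothetical disk losing 2 bytes per second, sampled every minute for an hour.
roomy = [(t, 50_000.0 - 2.0 * t) for t in range(0, 3601, 60)]
tight = [(t, 30_000.0 - 2.0 * t) for t in range(0, 3601, 60)]

roomy_forecast = predict_linear(roomy, eval_time=3600, seconds_ahead=4 * 3600)
tight_forecast = predict_linear(tight, eval_time=3600, seconds_ahead=4 * 3600)
```

Both disks shrink at the same rate, but only the second forecast drops to zero or below within four hours, so only it would satisfy the <= 0 condition.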
Optimizing Query Performance
- Minimize Range Selectors: Restrict the range in square brackets ([5m]) to the smallest necessary interval. Larger ranges increase memory and CPU usage.
- Avoid Unnecessary Sorting: Functions like topk sort results, which can be resource-intensive. Filter data first:
topk(3, sum(rate(container_cpu_usage_seconds_total[5m])) by (container) > 0)
- Precompute with Recording Rules: Move frequently used aggregations or transformations into recording rules to reduce query-time overhead.
Conclusion
Advanced PromQL techniques enable precise analysis of time-series data, from dynamic aggregations to predictive forecasting. Mastery of vector matching, subqueries, and function selection ensures accurate results while maintaining system performance. As queries grow in complexity, prioritize testing against historical data to validate behavior under edge cases such as counter resets.