How to Choose MAX vs LAST vs AVG for Metric Rollups

April 29, 2026

Emily Ashley
Employee

This is a practical walkthrough for choosing rollup functions like MAX, LAST, and AVG when you aggregate gauges from short sample intervals (for example, 2 seconds) into longer windows (for example, 30 seconds) with our Rollup Metrics Function.

We’ll use Cribl Internal Metrics as examples, but the reasoning applies to any gauges in your environment.

1. Start with the question, not the function

Before you touch any rollup settings, ask: what question am I trying to answer with this gauge?

“Did this ever reach an interesting state in this window?”
- Examples: “Did anything go red?” “Did throttling ever turn on?”
“What is it right now at the end of the window?”
- Examples: “What’s the current health?” “What’s CPU right now?”
“What was it typically doing over this period?”
- Examples: “What’s the average CPU over 5 minutes?” “What’s our normal queue size?”
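To make the three questions concrete, here is a minimal Python sketch of what a rollup does to one window of samples. It is not the Rollup Metrics Function's actual implementation; the `rollup` helper, the `AGGS` table, and the sample values are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def rollup(samples, window_s, agg):
    """Group (timestamp_s, value) samples into fixed windows and aggregate each.

    `samples` must be in time order so that LAST means "final sample seen".
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        buckets[ts - ts % window_s].append(value)
    return {start: agg(values) for start, values in sorted(buckets.items())}

# Hypothetical names for the three rollup choices discussed in this post.
AGGS = {
    "MAX": max,                          # "did it ever reach X?"
    "LAST": lambda values: values[-1],   # "what is it at the end?"
    "AVG": mean,                         # "what was it typically?"
}

# One 30-second window of 2-second samples: flat at 10 with a single spike to 90.
samples = [(t, 10) for t in range(0, 30, 2)]
samples[7] = (14, 90)

results = {name: rollup(samples, 30, fn) for name, fn in AGGS.items()}
for name, per_window in results.items():
    print(name, per_window)
```

The same fifteen samples give three different answers: MAX reports the spike (90), LAST reports the final sample (10), and AVG lands in between (about 15.3). None of them is wrong; each answers a different question.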

2. Classify the gauge: three useful buckets

Next, decide what kind of gauge you’re dealing with. We’ll anchor in Cribl Internal Metrics and then generalize.

A. Ordinal health gauges

health.inputs
health.outputs

These usually use a small set of integer states, such as:

0 = green
1 = warning
2 = trouble

They’re stateful gauges: the numeric value is really “which state are we in?” with an ordering. Fractional “in‑between” values like 0.75 health don’t mean much to a human.

General pattern: any gauge where the value is a discrete state (OK/WARN/TROUBLE, etc.) belongs here.

B. Binary / boolean gauges

blocked.outputs

These are 0/1 gauges that say whether something is happening or not.

General pattern: toggles, switches, feature flags, “is this on?”

C. Continuous gauges

system.cpu_perc
system.load_avg
system.mem_rss
pq.queue_size

These are numeric values where “in‑between” points are meaningful (52% CPU really is between 50% and 55%).

General pattern: resource usage, queue depths, latencies, utilization, and similar signals.

3. Choosing rollups for each gauge type

Now combine the question you’re asking with the type of gauge you have.

A. Ordinal health gauges – “what state did we reach?”

health.inputs
health.outputs

They use discrete states (0/1/2). In practice, people look at them to answer:

“How healthy was this over time?”
“Did we hit warning or trouble at any point in that period?”

A rollup of MAX lines up nicely with that mental model:

If health ever went to 2 (trouble) within the window, the rolled‑up value is 2.
If it only reached 1 (warning), the rolled‑up value is 1.
If it stayed at 0, the rolled‑up value is 0.

That preserves the highest severity reached in each window, which is exactly what most humans expect from a health graph.

General guidance for ordinal health gauges (any system):

If you care about “highest severity in the window,” use MAX.
Avoid AVG – intermediate values like 0.7 or 1.3 are hard to interpret.
Use LAST only when the specific need is “what state did we end in?”

You can extend this to other internal health‑style gauges (for example, a pq.health gauge that uses a similar ordinal scheme).
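Here is a small sketch of why AVG is awkward for ordinal gauges. The sample values are invented; the state labels come from the 0/1/2 scheme above.

```python
from statistics import mean

# State labels from the 0/1/2 scheme described earlier.
STATES = {0: "green", 1: "warning", 2: "trouble"}

# Hypothetical 2-second samples of health.outputs over one 30-second window:
# a short stretch of warning, one sample of trouble, then recovery.
window = [0, 0, 0, 1, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0]

worst = max(window)      # MAX: highest severity reached
final = window[-1]       # LAST: state at the end of the window
average = mean(window)   # AVG: usually not a valid state at all

print("MAX :", worst, STATES[worst])
print("LAST:", final, STATES[final])
print("AVG :", round(average, 2), STATES.get(average, "no matching state"))
```

MAX maps cleanly to "trouble" and LAST to "green", while the average (about 0.33) does not correspond to any state a human can act on.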

B. Binary / boolean gauges – “ever true” vs “currently true”

For Cribl‑style boolean gauges like blocked.outputs, there are two natural ways to read them:

Ever true in the window?
- Example: “Did throttling happen at all in the last N seconds?”
- Here, MAX works well: if the gauge was ever 1, the rolled‑up value is 1.
Currently true at the end of the window?
- Example: “Is throttling on right now?”
- Here, LAST makes more sense: you want the final observed state.

You can use AVG and read it as “percentage of samples that were true,” but that’s more of a specialized use case and should be intentional.

General guidance for boolean gauges:

“Ever happened in this window?” → lean MAX.
“Is it happening now?” → lean LAST.
“What fraction of time was it true?” → AVG, but only when you explicitly want that.
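As a rough illustration, the three readings of a boolean gauge can be computed side by side. The window contents and the `read_boolean` helper below are hypothetical, not taken from a real deployment.

```python
# Hypothetical blocked.outputs samples (0/1) for three consecutive windows.
windows = {
    "w1": [0, 0, 0, 0, 0],   # never blocked
    "w2": [0, 1, 1, 0, 0],   # blocked mid-window, recovered by the end
    "w3": [0, 0, 0, 1, 1],   # still blocked at the end of the window
}

def read_boolean(values):
    """Return the three readings of a 0/1 gauge for one window."""
    return {
        "ever_true": max(values) == 1,                # MAX
        "true_at_end": values[-1] == 1,               # LAST
        "fraction_true": sum(values) / len(values),   # AVG
    }

summary = {name: read_boolean(values) for name, values in windows.items()}
for name, reading in summary.items():
    print(name, reading)
```

Windows w2 and w3 have the same fraction of true samples, but only LAST tells you that w3 ended while still blocked, and only MAX flags w2 on an "ever blocked" panel.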

C. Continuous gauges – “now, typical, or worst?”

For continuous gauges like system.cpu_perc or pq.queue_size, the “right” rollup depends on what story you want your panel to tell.

Using Cribl Internal Metrics as concrete examples:

“What is it now?” views
- Example: a dashboard tile showing current CPU or queue depth.
- LAST is a natural choice: you want the most recent value at the end of the window.
“What’s typical over time?” views
- Example: 5‑minute average CPU, usual queue length.
- AVG is often a good fit: it gives a smoothed sense of “normal.”
“Did it ever spike?” views
- Example: “Did CPU reach 95% or higher in this interval?”
- MAX answers that directly by showing you the peak.

General guidance for continuous gauges:

Use LAST when you care most about the value at the end of the window (“right now”).
Use AVG when you want the typical level across the window.
Use MAX when you care about peaks and threshold crossings (“did it ever exceed X?”).
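A quick sketch of the threshold case, using invented system.cpu_perc samples and the 95% threshold from the example above. It shows how a short spike survives MAX but disappears under AVG and LAST.

```python
from statistics import mean

THRESHOLD = 95.0  # from the "95% or higher" example above

# Hypothetical system.cpu_perc samples for one window, with a brief spike.
cpu = [41.0, 43.5, 40.2, 97.8, 96.1, 44.0, 42.3, 39.9]

rolled = {
    "LAST": cpu[-1],
    "AVG": mean(cpu),
    "MAX": max(cpu),
}

# Would an alert on "value >= THRESHOLD" fire for each rollup choice?
crossed = {name: value >= THRESHOLD for name, value in rolled.items()}

for name in rolled:
    print(f"{name}: {rolled[name]:.1f} crossed={crossed[name]}")
```

Only the MAX rollup crosses the threshold here. If the panel or alert is supposed to answer "did it ever exceed X?", AVG and LAST would quietly hide the event.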

FINALE! Putting it all together with a checklist.

When you’re configuring rollups for Cribl Internal Metrics gauges—or any other gauges—this quick checklist tends to work well:

What am I really asking about this gauge?
- “Did it ever reach state X?” → lean MAX.
- “What is it right now?” → lean LAST.
- “What’s typical over this period?” → lean AVG (for continuous gauges).
What type of gauge is it?
- Ordinal health (like health.inputs/outputs)
  - Treat the values as states; MAX is usually a good fit for “highest severity reached.”
- Boolean (like blocked.outputs)
  - MAX for “ever true,” LAST for “currently true”
  - AVG only for “fraction of time true” use cases
- Continuous (like CPU, queue size)
  - Choose LAST/AVG/MAX based on whether you care about “now,” “typical,” or “worst.”
Does the rolled‑up gauge tell the story I think it tells?
- Look at a few real examples (including Cribl Internal Metrics) and do a gut-check that what’s drawn in the graph matches your mental picture of what happened during that time window.

If the story and the numbers line up, you’ve picked a good rollup for that gauge—and the same reasoning will keep working as you add more metrics over time.
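If it helps, the checklist can be written down as a small lookup. This is only a sketch of the guidance in this post: `GaugeType`, `Question`, and `suggest_rollup` are hypothetical names, and the result is a starting suggestion, not a rule.

```python
from enum import Enum

class GaugeType(Enum):
    ORDINAL = "ordinal"        # e.g. health.inputs, health.outputs
    BOOLEAN = "boolean"        # e.g. blocked.outputs
    CONTINUOUS = "continuous"  # e.g. system.cpu_perc, pq.queue_size

class Question(Enum):
    EVER = "Did it ever reach state X?"
    NOW = "What is it right now?"
    TYPICAL = "What's typical over this period?"

def suggest_rollup(gauge_type, question):
    """Return the rollup this post leans toward for a gauge type and question."""
    if question is Question.EVER:
        return "MAX"
    if question is Question.NOW:
        return "LAST"
    # Question.TYPICAL: AVG, except for ordinal gauges, where fractional
    # states are hard to interpret and MAX is the usual fallback.
    if gauge_type is GaugeType.ORDINAL:
        return "MAX"
    return "AVG"

for gauge_type in GaugeType:
    for question in Question:
        print(f"{gauge_type.value:<10} | {question.value:<34} -> "
              f"{suggest_rollup(gauge_type, question)}")
```

For boolean gauges, the AVG suggestion only makes sense when you explicitly want "fraction of time true"; the last checklist step (looking at real graphs) still applies whatever the lookup returns.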

P.S. A quick note on gauges vs counters in Cribl Internal Metrics

Cribl Internal Metrics include both gauges and counters. We’ve focused on gauges, but you may be wondering about counters. Counters are rolled up with their own logic.

In Cribl Internal Metrics, counters are reported as delta values per reporting interval. Generally, when rolling up metrics, you can either sum those deltas or convert them to rates. In our Rollup Metrics implementation, we sum them within each time window.
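For comparison with the gauge examples, here is a minimal sketch of the two counter options mentioned above. The delta values are invented and the variable names are illustrative only.

```python
# Hypothetical per-interval deltas for one counter: one value every 2 seconds
# across a single 30-second window (15 samples).
deltas = [120, 95, 130, 0, 0, 110, 105, 98, 140, 125, 90, 100, 115, 108, 102]
WINDOW_S = 30

# Option 1: sum the deltas (the approach this post describes for Rollup Metrics).
total = sum(deltas)

# Option 2: convert the same total into a per-second rate for the window.
rate_per_s = total / WINDOW_S

print(f"events in window: {total}")
print(f"events per second: {rate_per_s:.2f}")
```

Note that MAX, LAST, or AVG over these deltas would each lose information: the window total can only be recovered by summing.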

flo
New Participant
May 11, 2026

Thank you :)
Example logs (and maybe some screenshots and config JSONs) at the end of the article would be very helpful.
I receive the events including health information, but I am missing the throttle.engaged, blocked.outputs, and system (CPU, etc.) events. Is this due to “Full fidelity” not being activated in the Cribl metric log source?
A tiny alert in SPL which some might find useful:

index=cribl-* health "endpoints{}.stats.health"!=0
| spath endpoints{}.stats.health
| eval cribl-health=case('endpoints{}.stats.health'==0, "green", 'endpoints{}.stats.health'==1, "warning", 'endpoints{}.stats.health'==2, "trouble", true(), "unknown")
```based on https://knowledge.cribl.io/guides-tutorials-15/how-to-choose-max-vs-last-vs-avg-for-metric-rollups-2219 A. Ordinal health gauges ==> health.inputs OR health.outputs These usually use a small set of integer states, such as: 0 = green 1 = warning 2 = trouble ```
| table _time,host,source,channel,ioName,ioType,cribl-health,level,endpoints{}.stats.error.message,message
| sort - _time
| head 100

Emily Ashley
Author
Employee
May 11, 2026

Oh! Good catch, and thanks so much for calling that out. I don’t think I should have included `throttle.engaged` as a Cribl Internal Metric in this example. I’ll see if I can edit the post accordingly.

blocked.outputs / backpressure.outputs and system.cpu_perc are standard Cribl Internal Metrics and should be available to you, so I’d be interested in seeing what is happening there. Are they visible in the Live Data view? Are they not making it downstream?

Emily Ashley
Author
Employee
May 11, 2026

As for the Full Fidelity question,

Full fidelity keeps the higher-cardinality field metrics for dimensions like host, index, project, source, and sourcetype, so instead of only sending rolled-up totals, Cribl also emits additional per-dimension metric series for those fields.

So with Full fidelity off, you still get aggregate metrics like cribl.logstream.total.in_events, which represent the rolled-up inbound event count.

With Full fidelity on, Cribl also emits additional field-specific series such as:

cribl.logstream.host.in_events = inbound events for a given event host value
cribl.logstream.source.in_events = inbound events per source
cribl.logstream.sourcetype.in_events = inbound events per source type

So the practical difference is:

Full fidelity off: “How many events came in overall?”
Full fidelity on: “How many events came in overall, and how were they broken out by host, index, project, source, and sourcetype?”

You can see how that would be much more resource intensive, but it also gives you some cool data!
