Graphite Series #3: Whisper Storage Schemas & Aggregations

In the Graphite Series blog posts, I'll provide a guide to help you through all of the steps involved in setting up a monitoring and alerting system using a Graphite stack. Disclaimer: I am no expert; I am just trying to help the Graphite community by providing more detailed documentation. If there's something wrong, please comment below or drop me an email at feangulo@yaipan.com.


Whisper Storage Schemas & Aggregations

In the previous blog post, we installed Carbon and Whisper - the backend components of Graphite. We then spun up a carbon-cache process to listen for incoming data points and store them using Whisper. In this blog post, I'll describe in more detail how Whisper stores data points in the filesystem and how you can control those details.

Why does this matter?

There might be some confusion when you or your fellow developers and system administrators start publishing data points and get unexpected results:

  • Why are my data points getting averaged?
  • I've been publishing data points intermittently, why are there no data points?
  • I've been publishing data points for many days, why am I only getting data for one day?

How does Whisper store data?

We first need to understand how data is stored in the Whisper files. When a Whisper file is created, it has a fixed size that will never change. Within the file there are potentially multiple "buckets" for data points at different resolutions. For example:

  • Bucket A: data points with 10-second resolution
  • Bucket B: data points with 60-second resolution
  • Bucket C: data points with 10-minute resolution

Each bucket also has a retention attribute indicating how long data points in the bucket should be retained. For example:

  • Bucket A: data points with 10-second resolution retained for 6 hours
  • Bucket B: data points with 60-second resolution retained for 1 day
  • Bucket C: data points with 10-minute resolution retained for 7 days

Given these two pieces of information, Whisper performs some simple math to figure out how many points it will actually need to keep in each bucket:

  • Bucket A: 6 hours x 60 mins/hour x 6 data points/min = 2160 points
  • Bucket B: 1 day x 24 hours/day x 60 mins/hour x 1 data point/min = 1440 points
  • Bucket C: 7 days x 24 hours/day x 6 data points/hour = 1008 points

If a Whisper file is created with this storage schema configuration, it will be roughly 56 KB in size. If you run it through the whisper-dump.py script, you'll see output like the following. Note that an archive corresponds to a bucket, and that the "seconds per point" and "points" attributes match our computations above.

Meta data:
  aggregation method: average
  max retention: 604800
  xFilesFactor: 0.5

Archive 0 info:
  offset: 52
  seconds per point: 10
  points: 2160
  retention: 21600
  size: 25920

Archive 1 info:
  offset: 25972
  seconds per point: 60
  points: 1440
  retention: 86400
  size: 17280

Archive 2 info:
  offset: 43252
  seconds per point: 600
  points: 1008
  retention: 604800
  size: 12096
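
To see where these numbers come from, here's a small standalone Python sketch (my own illustration, not Whisper code) that reproduces the math. The 12 bytes per data point and the 52-byte header follow from the sizes and offsets in the dump above.

# Standalone sketch (illustration only, not Whisper code): reproduce the archive
# math from the dump above. Each data point occupies 12 bytes, and the header
# (file metadata plus one info block per archive) takes the first 52 bytes.

POINT_SIZE = 12
HEADER_SIZE = 52

# (seconds per point, retention in seconds) for Buckets A, B and C
archives = [
    (10,      6 * 60 * 60),   # 10-second resolution for 6 hours
    (60,     24 * 60 * 60),   # 60-second resolution for 1 day
    (600, 7 * 24 * 60 * 60),  # 10-minute resolution for 7 days
]

offset = HEADER_SIZE
total_size = HEADER_SIZE
for seconds_per_point, retention in archives:
    points = retention // seconds_per_point
    size = points * POINT_SIZE
    print("offset=%5d  seconds_per_point=%3d  points=%4d  size=%5d"
          % (offset, seconds_per_point, points, size))
    offset += size
    total_size += size

print("total file size: %d bytes" % total_size)
# Prints 2160, 1440 and 1008 points at offsets 52, 25972 and 43252,
# and a total of 55348 bytes - roughly the 56 KB the file occupies on disk.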

What about the aggregations?

Aggregations come into play when data from a higher-precision bucket is moved to a lower-precision bucket. Let's use Buckets A and B from our previous example.

  • Bucket A: 10-second resolution retained for 6 hours (higher precision)
  • Bucket B: 60-second resolution retained for 1 day (lower precision)

We might have an application publishing data points every 10 seconds. Any data point published within the last 6 hours will be found in Bucket A. However, if we query for data points published more than 6 hours ago, they will be found in Bucket B.

How are data points moved to Bucket B? 

The lower precision value is divided by the higher precision value to determine the number of data points that will need to be aggregated.

  • 60 seconds (Bucket B) / 10 seconds (Bucket A) = 6 data points to aggregate

NOTE: Whisper needs the lower precision value to be cleanly divisible by the higher precision value (i.e. the division must result in a whole number). Otherwise the aggregation might not be accurate.

To aggregate the data, Whisper reads 6 10-second data points from Bucket A and applies a function to them to come up with the single 60-second data point that will be stored in Bucket B. There are five options for the aggregation function: average, sum, max, min, and last. The choice of aggregation function depends on the data points you're dealing with. 95th percentile values, for example, should probably be aggregated with the max function. For counters, on the other hand, the sum function would be more appropriate.
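
As a concrete illustration, here is what each of those functions would produce for six sample 10-second data points (the values are made up for the example; this is plain Python, not Whisper code):

# Six 10-second data points from Bucket A that fall into one 60-second slot of Bucket B.
points = [5.0, 7.0, 3.0, 8.0, 6.0, 4.0]

aggregations = {
    "average": sum(points) / len(points),  # 5.5  - typical choice for gauges and timings
    "sum":     sum(points),                # 33.0 - typical choice for counters
    "max":     max(points),                # 8.0  - typical choice for p95/p99-style metrics
    "min":     min(points),                # 3.0
    "last":    points[-1],                 # 4.0  - keeps the most recent value
}

for method, value in aggregations.items():
    print("%-7s -> %s" % (method, value))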

Whisper also applies an xFilesFactor when aggregating data points. It represents the fraction of data points in a bucket that must be non-null for an aggregation to be performed. In our previous example, Whisper determined that it needed to aggregate 6 10-second data points. It's possible that only 4 of those data points have values and the other 2 are null - due to networking issues, application restarts, etc.

If our Whisper file has an xFilesFactor of 0.5, it will aggregate the data only if at least 50% of the data points are present. If more than 50% of the data points are null, Whisper stores a null for that lower-precision slot instead. In our case, 4 out of 6 data points - about 67% - are present, so the aggregation function is applied to the non-null data points to produce the aggregated value.

You may set the xFilesFactor to any value between 0 and 1. A value of 0 indicates that the aggregation should be computed even if there is only one data point available. A value of 1 indicates that the aggregation should be computed only if all data points are present.
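
Putting the pieces together, the aggregation step behaves roughly like the following sketch (a simplification of the behavior described above, not Whisper's actual implementation):

def aggregate(points, x_files_factor, method="average"):
    """Return the aggregated value for one lower-precision slot,
    or None if too few of the higher-precision points are present."""
    known = [p for p in points if p is not None]
    if not known or len(known) / float(len(points)) < x_files_factor:
        return None  # not enough data: the lower-precision slot stays null

    if method == "average":
        return sum(known) / len(known)
    elif method == "sum":
        return sum(known)
    elif method == "max":
        return max(known)
    elif method == "min":
        return min(known)
    elif method == "last":
        return known[-1]
    raise ValueError("unknown aggregation method: %s" % method)

# 4 of 6 points present (~67%) and xFilesFactor = 0.5 -> aggregate the 4 known values
print(aggregate([5.0, None, 3.0, 8.0, None, 4.0], 0.5))    # 5.0
# Only 2 of 6 present (~33%) -> below the 0.5 threshold, so None
print(aggregate([5.0, None, None, None, None, 4.0], 0.5))  # None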


In the previous blog post, we made copies of all the example configuration files in the /opt/graphite/conf directory. The configuration files that control how Whisper files are created are:

  • /opt/graphite/conf/storage-schemas.conf
  • /opt/graphite/conf/storage-aggregation.conf

Default Storage Schemas

The storage-schemas configuration file is composed of multiple entries containing a pattern against which to match metric names and a retention definition. By default there are two entries: carbon and everything else.

The carbon entry matches metric names that begin with the carbon. prefix. Carbon processes emit their own internal metrics every 60 seconds by default (the interval is configurable). For example, a carbon-cache process emits a metric every minute for the number of metric files it has created. The retention definition indicates that data points reported every 60 seconds will be retained for 90 days.

[carbon]
pattern = ^carbon\.
retentions = 60:90d

The everything-else entry captures any metric not matched by an earlier entry - in this case, anything that is not carbon-related - by specifying a catch-all pattern (.*). The retention definition indicates that data points reported every 60 seconds will be retained for 1 day.

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
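
For reference, here's a small sketch showing how these retention strings translate into the archive sizes discussed earlier. It only handles the formats used in this post; Carbon's real parser is more lenient - notably, a bare number in the second field is treated as a point count rather than a duration - so treat this purely as an illustration.

# Simplified parser for the retention definitions used in this post
# (e.g. "60:90d" or "60s:1d").

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def parse_duration(value):
    """'90d' -> 7776000 seconds; a bare number is taken as seconds here."""
    if value.isdigit():
        return int(value)
    digits = "".join(c for c in value if c.isdigit())
    unit = value[len(digits)]          # first letter of the unit ('min' -> 'm')
    return int(digits) * UNITS[unit]

def parse_retentions(retentions):
    for definition in retentions.split(","):
        frequency, history = definition.split(":")
        seconds_per_point = parse_duration(frequency)
        retention_seconds = parse_duration(history)
        points = retention_seconds // seconds_per_point
        print("%-8s -> one point every %ds, kept for %d seconds = %d points"
              % (definition, seconds_per_point, retention_seconds, points))

parse_retentions("60:90d")   # carbon entry:  1 point/min for 90 days = 129600 points
parse_retentions("60s:1d")   # default entry: 1 point/min for 1 day   = 1440 points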

Default Storage Aggregation

The storage-aggregation configuration file is also composed of multiple entries containing:

  • a pattern against which to match metric names
  • an xFilesFactor value
  • an aggregation function

By default there are four entries:

  • Metrics ending in .min
    • Use the min aggregation function
    • At least 10% of data points should be present to aggregate
  • Metrics ending in .max
    • Use the max aggregation function
    • At least 10% of data points should be present to aggregate
  • Metrics ending in .count
    • Use the sum aggregation function
    • Aggregate if there is at least one data point
  • Any other metrics
    • Use the average aggregation function
    • At least 50% of data points should be present to aggregate

[min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min

[max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max

[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum

[default_average]
pattern = .*
xFilesFactor = 0.5
aggregationMethod = average
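
Both configuration files are evaluated top to bottom, and the first pattern that matches a metric name wins. A quick sketch of that lookup against the default aggregation entries (the metric names below are made-up examples):

import re

# The default storage-aggregation entries, in the order they appear in the file.
entries = [
    ("min",             r"\.min$",   0.1, "min"),
    ("max",             r"\.max$",   0.1, "max"),
    ("sum",             r"\.count$", 0.0, "sum"),
    ("default_average", r".*",       0.5, "average"),
]

def match_entry(metric):
    # First matching pattern wins, mirroring Carbon's behavior.
    for name, pattern, xff, method in entries:
        if re.search(pattern, metric):
            return name, xff, method

print(match_entry("servers.web01.requests.count"))  # ('sum', 0.0, 'sum')
print(match_entry("servers.web01.latency.max"))     # ('max', 0.1, 'max')
print(match_entry("servers.web01.cpu.load"))        # ('default_average', 0.5, 'average')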

The default storage schemas and storage aggregations work well for testing, but for real production metrics you might want to modify the configuration files.

Modify Storage Schemas

First off, I'll modify the carbon entry. I'd like to keep the metrics reported by Carbon every 60 seconds for 180 days (6 months). After 180 days, I'd like to roll the metrics up to a precision of 10 minutes and keep those for another 180 days.

[carbon]
pattern = ^carbon\.
retentions = 1min:180d,10min:180d

My production and staging metrics are published every 10 seconds by the Coda Hale metrics library running inside the Dropwizard applications. I'd like to keep 10-second data for 3 days. After 3 days, the data should be aggregated to 1-minute data and kept for 180 days (6 months). Finally, after 6 months, the data should be aggregated to 10-minute data and kept for 180 days.

NOTE: If my metrics library published data points at a different rate, my retention definition would need to change to match it.

[production_staging]
pattern = ^(PRODUCTION|STAGING).*
retentions = 10s:3d,1min:180d,10min:180d

Metrics that are not carbon, production, or staging metrics are probably just test metrics. I'll keep those around only for one day.

[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
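
Before settling on these retentions, it's worth estimating the per-metric disk cost. Reusing the 12-bytes-per-point figure from the whisper-dump output earlier, a rough back-of-the-envelope sketch (my own estimate, ignoring the small header overhead) looks like this:

# Rough per-metric file size for the modified retention policies above.
POINT_SIZE = 12  # bytes per data point, per the whisper-dump output earlier

policies = {
    "carbon":             [(60, 180 * 86400), (600, 180 * 86400)],
    "production_staging": [(10, 3 * 86400), (60, 180 * 86400), (600, 180 * 86400)],
    "default":            [(60, 1 * 86400)],
}

for name, archives in policies.items():
    points = sum(retention // step for step, retention in archives)
    size_mb = points * POINT_SIZE / (1024.0 * 1024.0)
    print("%-20s %7d points  ~%.2f MB per metric file" % (name, points, size_mb))

# carbon:             259200 + 25920         = 285120 points (~3.3 MB)
# production_staging: 25920 + 259200 + 25920 = 311040 points (~3.6 MB)
# default:            1440 points (~17 KB)

Keep in mind that Carbon applies these schema rules only when a Whisper file is first created; existing files keep their original archives unless you rebuild them, for example with the whisper-resize.py script that ships with Whisper.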

Modify Storage Aggregation

I'm going to keep the default storage aggregation entries, but I'll add a few more for metrics ending in .ratio, .m1_rate, and .p95.

NOTE: You need to add any new entries before the default entry, because the first pattern that matches a metric name wins.

[ratio]
pattern = \.ratio$
xFilesFactor = 0.1
aggregationMethod = average

[m1_rate]
pattern = \.m1_rate$
xFilesFactor = 0.1
aggregationMethod = sum

[p95]
pattern = \.p95$
xFilesFactor = 0.1
aggregationMethod = max
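
To confirm that newly created metric files actually pick up these settings, you can inspect them with whisper-info.py or with the whisper Python module. A minimal sketch, assuming the whisper package is importable and using a hypothetical metric path - substitute the path of one of your own .wsp files:

import whisper

# Hypothetical path: a Dropwizard p95 metric created by carbon-cache after the
# configuration changes above.
path = '/opt/graphite/storage/whisper/PRODUCTION/myapp/requests/p95.wsp'

info = whisper.info(path)
print(info['aggregationMethod'])   # expected: max  (from the [p95] entry)
print(info['xFilesFactor'])        # expected: 0.1
for archive in info['archives']:
    print(archive['secondsPerPoint'], archive['points'], archive['retention'])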

Congratulations! At this point you have configured your Graphite backend to match the data point publishing rates of your application and fully understand how the data points are stored in the filesystem. In the next blog post, we'll attempt to visualize the data using graphite-webapp.