Miscellaneous Resources on Gnocchi

Sep 14th, 2014

Like the title says, this post is a collection of links to resources that I found helpful for learning about Gnocchi.

Julien Danjou’s blog post about Gnocchi is the most recent of the links here and an excellent summary of all things Gnocchi. It explains the limitations of the previous API in Ceilometer and describes how the current implementation (using Pandas/Swift) was chosen. I found the diagrams useful for understanding the relationship between entities and resources. The post expands on the information given in the wiki, also a good reference.
Julien also did a walkthrough of the Gnocchi source code: very helpful for navigating the different parts of the project.
Gnocchi specs (I referred to this primarily for examples of the HTTP request syntax). If you look through the history, you can see the shift in approach to retention policies – the timeseries length goes from being defined in terms of number of points to being defined in terms of a lifespan.
Eoghan Glynn’s thread on the OpenStack mailing list clearly outlines the differences between Gnocchi and timeseries-oriented databases (like InfluxDB). It also goes over the features of InfluxDB that make it a good backend option for Gnocchi, such as downsampling and retention policies.
The Gnocchi repository.

Notes on Stevedore

Jul 25th, 2014

This is mostly taken from the very helpful documentation on Stevedore; when I started working on Gnocchi I found myself wondering a lot about the functions of two modules in particular, stevedore and pecan. The other day I needed to use stevedore to load a plugin and so finally had the chance to use it in practice. These are some notes from the process – hopefully they can be useful for someone needing to use stevedore for the first time.

So my basic understanding of stevedore is that it is used for managing plugins, or pieces of code that you want to load into an application. The manager classes work with plugins defined through entry points to load and enable the code.

In practice, this is what the process looks like:

Create a plugin

The documentation, which I believe is authored by Doug Hellmann, recommends making a base class with the abc module as good API practice. In my case, I wanted to make a class that would calculate the moving average of some data. So my base class, defined in the __init__.py file of my directory (/gnocchi/statistics), looked like this:

import abc
import six

@six.add_metaclass(abc.ABCMeta)
class CustomStatistics(object):

    @abc.abstractmethod
    def compute(self, data):
        '''Returns the custom statistic of the data.'''

The code is implemented in the class MovingAverage (/gnocchi/statistics/moving_statistics.py):

from gnocchi import statistics

class MovingAverage(statistics.CustomStatistics):

    def compute(self, data):
        ... do stuff ...
        return averaged_data
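
For anyone who wants something they can paste and run, here is a minimal sketch of the same base-class/plugin pattern. It assumes Python 3, where abc.ABC stands in for the six decorator, and the body of compute is a hypothetical stand-in (a trailing mean over an assumed window of 3), not the implementation elided above.

```python
import abc


class CustomStatistics(abc.ABC):
    """Base class that every statistics plugin is expected to subclass."""

    @abc.abstractmethod
    def compute(self, data):
        """Return the custom statistic of the data."""


class MovingAverage(CustomStatistics):
    """Hypothetical plugin: trailing mean over the last `window` values."""

    window = 3  # assumed window size, for illustration only

    def compute(self, data):
        averaged = []
        for i in range(len(data)):
            # Use as many values as are available, up to `window` of them.
            chunk = data[max(0, i - self.window + 1):i + 1]
            averaged.append(sum(chunk) / len(chunk))
        return averaged


result = MovingAverage().compute([1, 2, 3, 4])
```

Because compute is abstract, trying to instantiate CustomStatistics directly raises a TypeError, which is the point of the base class: a plugin that forgets to implement compute fails early.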

Create the entry point

The next step is to define an entry point for the code in your setup.cfg file. The syntax for an entry point is

plugin_namespace=
 name = module.path:thing_you_want_to_import_from_the_module

so I had

[entry_points]
gnocchi.statistics =
    moving-average = gnocchi.statistics.moving_statistics:MovingAverage

The stevedore documentation on registering plugins has more information on how to package a library in general using setuptools.
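
To make the right-hand side of an entry point less mysterious, here is a small sketch of how a "module.path:attribute" string can be turned into a Python object. The helper name resolve_entry_point is my own, and this is only an illustration of the naming convention using the standard library, not how setuptools or stevedore implement it.

```python
import importlib


def resolve_entry_point(spec):
    """Import the object named by a 'module.path:attribute' string."""
    module_path, _, attribute = spec.partition(":")
    module = importlib.import_module(module_path)
    if not attribute:
        # No ':attribute' part, so the entry point is the module itself.
        return module
    obj = module
    # The attribute part may itself be dotted, e.g. "SomeClass.method".
    for part in attribute.split("."):
        obj = getattr(obj, part)
    return obj


dumps = resolve_entry_point("json:dumps")
```

With the entry point above, the equivalent call would resolve the string 'gnocchi.statistics.moving_statistics:MovingAverage' to the MovingAverage class.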

Load the Plugins

You can use the drivers, hooks, or extensions pattern to load your plugins. I started with drivers and then moved to extensions. The difference between them is whether you want to load a single plugin (drivers) or multiple plugins at a time (extensions). I believe hooks also let you load many plugins at once, but they are meant for multiple entry points that share the same name. This allows you to invoke several functions with a single call…that’s about the limit of my knowledge on hooks.

The syntax for a driver is the following:

from stevedore import driver

mgr = driver.DriverManager(
    namespace='gnocchi.statistics',
    name='moving-average',
    invoke_on_load=True,
)

output = mgr.driver.compute(data)

The invoke_on_load argument tells the manager to call the plugin as soon as it is loaded. Here the entry point is the MovingAverage class, so calling it creates an instance. You access that instance with the driver property and then call its methods (in this case, compute). You can also pass arguments to the plugin through DriverManager (invoke_args and invoke_kwds); see the documentation for more detail.
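
As a rough mental model of what a driver lookup involves, here is a sketch using only importlib.metadata from the standard library. It is an approximation under my own assumptions (the load_driver name, the LookupError, picking the first match), not stevedore's actual implementation.

```python
from importlib import metadata


def load_driver(namespace, name):
    """Find the entry point `name` in `namespace`, load it, and call it."""
    entry_points = metadata.entry_points()
    if hasattr(entry_points, "select"):
        # Python 3.10 and later expose a select() method.
        matches = list(entry_points.select(group=namespace, name=name))
    else:
        # Python 3.8 and 3.9 return a dict keyed by group name.
        matches = [ep for ep in entry_points.get(namespace, [])
                   if ep.name == name]
    if not matches:
        raise LookupError("no plugin %r in namespace %r" % (name, namespace))
    plugin_class = matches[0].load()
    # Calling the loaded class mirrors what invoke_on_load=True asks for.
    return plugin_class()
```

If the package defining the entry point is installed, load_driver('gnocchi.statistics', 'moving-average') would return a MovingAverage instance; otherwise it raises LookupError.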

I ended up going with extensions instead of drivers, as there were multiple statistical functions I had as plugins and I wanted to load all the entry points at once. The syntax is then

from stevedore import extension

mgr = extension.ExtensionManager(
    namespace='gnocchi.statistics',
    invoke_on_load=True,
)

This loads all of the plugins in the namespace. In my case I wanted to make a dictionary of all the function names and the extension objects so I did:

configured_statistics = dict((x.name, x.obj) for x in mgr)

When a GET request to Gnocchi had a query for computing statistics on the data, the dict was consulted to see if there was a match with a configured statistics function name. If so, the extension object was called with the compute() method.

output = configured_statistics[user_query].compute(data)
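
Here is a self-contained sketch of that lookup-and-call step, including what to do when the query names a statistic that is not configured. The Mean and Maximum classes and the run_statistic helper are hypothetical stand-ins for loaded plugin objects, so the example runs without stevedore installed.

```python
class Mean:
    """Stand-in plugin object with the same compute() interface."""

    def compute(self, data):
        return sum(data) / len(data)


class Maximum:
    """Second stand-in plugin object."""

    def compute(self, data):
        return max(data)


# In the real code this dict is built from the extension manager.
configured_statistics = {"mean": Mean(), "maximum": Maximum()}


def run_statistic(user_query, data):
    """Look up the requested statistic by name and apply it to the data."""
    try:
        statistic = configured_statistics[user_query]
    except KeyError:
        raise ValueError("unknown statistic: %s" % user_query) from None
    return statistic.compute(data)


output = run_statistic("mean", [1, 2, 3])
```

Turning the KeyError into a ValueError with a readable message is just one choice; an HTTP API would more likely map it to a 400 response.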

The documentation shows an example using map() to call all the plugins. For the code below, results would be a sequence of (function name, result) pairs, one for each statistic applied to the data:

def compute_data(ext, data):
    return (ext.name, ext.obj.compute(data))

results = mgr.map(compute_data, data)

If you need the order to matter when loading the extension objects, you can use NamedExtensionManager.

That’s about it for my notes on stevedore – it’s a clean, well-designed module and I’m glad I got to learn about it.

Options for Period-Spanning Statistics in Gnocchi

Jul 9th, 2014


We would like to implement aggregation functions that span periods such as simple moving averages or exponentially weighted moving averages (EWMA). See the “Functions Defined” section at the end for descriptions of EWMA and Moving Averages. This is getting kind of long, so it’s probably enough to read A., B., Summary, and Functions Defined sections.

There are a few different options for implementing this, depending on whether Gnocchi ends up retaining raw data points or just doing roll-up, and how you deal with the prior history when calculating the period-spanning statistic. For the latter, there are mainly two strategies:

A. No prior history

Start at the timestamp specified by the user, and aggregate forward in time. This will produce a lag between the aggregate and the data only for the moving average function; e.g. if the window size is defined as covering 7 periods, the moving average will first give a result for the 4th period (if you define the moving average to be centered). EDIT: Exponential smoothing does not produce a lag, even if the input parameter span is larger than the array to be smoothed. However, the first element of the output EWMA is always the same as the first element of the input array, so your first period will have no smoothing applied. Also, if span > len(inputarray), your smooth is usually a bad fit to the data.

B. Look-back

Incorporate the values from k bins before the start timestamp specified by the user, where 2k+1 is the window size of the moving average; for EWMA look back to span bins. The exponentially weighted moving average command in Pandas takes an input parameter of span, and double exponential smoothing (Holt-Winters) takes input parameters span and beta. For both cases we can relate span to the usual input parameter alpha via span=(2-alpha)/alpha.
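
To see the difference between the two strategies for the moving average, here is a small sketch in plain Python. The function name, the use of None for "no result", and the sample numbers are my own assumptions; the window is centered and covers 2k+1 bins, with k = 3 to match the 7-period example.

```python
def centered_moving_average(values, k):
    """Centered moving average with window 2k+1.

    Positions whose window would run off either end get None.
    """
    out = []
    for n in range(len(values)):
        if n - k < 0 or n + k >= len(values):
            out.append(None)
        else:
            window = values[n - k:n + k + 1]
            out.append(sum(window) / len(window))
    return out


k = 3
history = [1.0, 2.0, 3.0]  # the k bins before the requested start
requested = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

# No prior history: only the requested bins are used.
no_history = centered_moving_average(requested, k)

# Look-back: prepend the k earlier bins, then drop them from the output.
look_back = centered_moving_average(history + requested, k)[len(history):]
```

Without history the first result appears at the 4th requested period; with look-back the leading periods are filled in. The trailing periods still have no result in either case, because a centered window also needs k bins after the point.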

Note: For the moving average statistic, if we only retain the aggregates and not raw data, we need to know the number of data points in each period (i.e. count needs to be available). If we retain all the data points, this is not necessary.

Static vs Free Parameters:

We can either define the window size/span (and beta for Holt-Winters) when we create the entity, or leave them as parameters that get specified in each query.

My picture of Gnocchi is that timestamps get acquired in time-order for the most part – basically you are usually getting more recent timestamps. For this model and for static parameters, you can calculate EWMA/MovA values as you fill up the period buckets. Originally, the idea was that even if periods expired, we could keep “ghostly remnants” (EWMA/MovA values) from the expired bins if needed to calculate new EWMA/MovA values for current bins, but this is unnecessary (I’ll ramble about why for the next few paragraphs).

Since EWMA only depends on the last forecast to make the next prediction, you generally only need to keep the EWMA values for two periods at a time: one EWMA value that gets updated as you add more timestamps to the current period, and one EWMA value that is fixed from the previous period, where no new timestamps are being added, but that is used to predict the EWMA of the current period. EDIT: For systems where there are n > 1 periods that are still being updated, you need to keep all those periods plus the first period that is locked down (no new datapoints expected to arrive). For moving averages you do need to retain the last k average values of the periods before the start timestamp as well as the count for those periods. However, the moving average is not a recursive function, so you don’t actually depend on the moving average in previous periods to calculate the moving average in the current period. The only additional feature required to do the moving average that is not currently in Gnocchi would be saving the count. This is of course assuming we’re only saving aggregates; if we retain all datapoints, then you only need to save the last k timestamp/value pairs.

In short, you don’t have to retain many EWMA/MovA values from old periods in order to calculate them for current periods (at most need to retain 1 old EWMA value for EWMA, don’t need any old MovA values for MovA, although you need to retain the mean and count of the k previous bins). If we assume k is less than the number of periods we generally retain, this means we never have to worry about needing to keep statistics from expired periods.
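
A sketch of the bookkeeping this implies for EWMA, assuming time-ordered data and a single open period. The class and method names are hypothetical; the only state kept is the frozen EWMA of the previous period plus the mean and count of the period still being filled.

```python
class RunningEwma:
    """Tracks an EWMA over period means while keeping minimal state."""

    def __init__(self, span):
        self.alpha = 2.0 / (1.0 + span)
        self.previous_ewma = None  # EWMA of the last closed period
        self.mean = 0.0            # mean of the period still open
        self.count = 0             # number of points in the open period

    def add_point(self, value):
        """Fold a new measurement into the open period's mean."""
        self.mean = (self.mean * self.count + value) / (self.count + 1)
        self.count += 1

    def current_ewma(self):
        """EWMA of the open period, given the points seen so far."""
        if self.count == 0:
            return self.previous_ewma
        if self.previous_ewma is None:
            return self.mean  # first period: no smoothing applied
        return (1 - self.alpha) * self.previous_ewma + self.alpha * self.mean

    def close_period(self):
        """Freeze the open period's EWMA and start a new period."""
        self.previous_ewma = self.current_ewma()
        self.mean, self.count = 0.0, 0


tracker = RunningEwma(span=3)  # alpha = 0.5
tracker.add_point(1.0)
tracker.add_point(3.0)
first = tracker.current_ewma()
tracker.close_period()
tracker.add_point(4.0)
tracker.add_point(6.0)
second = tracker.current_ewma()
```

Nothing older than the previous period's EWMA is consulted, which is why expired periods are not a problem in this model.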

However, if you get some out-of-order timestamp that belongs in an older period, this will affect the moving average value for periods within window_size of the bin that holds the timestamp. The effect on EWMA values is more pervasive, though probably negligible for periods > span away from the re-updated bin. In this case you need to keep EWMA values from periods before the modified bucket in order to recalculate the EWMA. If your data is arriving in a completely non-time-ordered way, there is no sense in the above scheme for calculating the EWMA/MovA values for bins as you go, since you’ll just have to recalculate everything as new points come in for old time buckets. For that case, you should use queries.

Queries: Calculate the aggregation on demand without storing partial results. If we define the window size as the number of periods it spans, a query would look something like:

GET /v1/entity/<UUID>/measures?start='2014-06-21 23:23:20'&stop='2014-06-22 23:23:20'&aggregation=ewma&span=20

GET /v1/entity/<UUID>/measures?start='2014-06-21 23:23:20'&stop='2014-06-22 23:23:20'&aggregation=mva&window=5

GET /v1/entity/<UUID>/measures?start='2014-06-21 23:23:20'&stop='2014-06-22 23:23:20'&aggregation=holtwinters&span=5&beta=.1

Adding scoped names might be useful as well (i.e. aggregation.window, aggregation.beta).
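
Here is a sketch of how a client might assemble one of these query strings with the standard library. The measures_query helper and the entity id "1234" are placeholders of mine, and I am assuming the server would accept unquoted, URL-encoded timestamps; the parameter names are the proposed ones from the queries above, not a finalized API.

```python
from urllib.parse import urlencode


def measures_query(entity_id, start, stop, aggregation, **params):
    """Build the path and query string for a period-spanning aggregation."""
    query = {"start": start, "stop": stop, "aggregation": aggregation}
    # Extra keyword arguments carry function-specific parameters
    # such as span, window, or beta.
    query.update(params)
    return "/v1/entity/%s/measures?%s" % (entity_id, urlencode(query))


url = measures_query(
    "1234",
    "2014-06-21 23:23:20",
    "2014-06-22 23:23:20",
    "ewma",
    span=20,
)
```

urlencode takes care of escaping the spaces and colons in the timestamps, which is easy to get wrong when the URL is written by hand.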

Handling boundaries: If you specify a start timestamp in the query that’s expired, start from the data that does exist. If you specify a start timestamp that needs data points from expired bins to calculate MovA/EWMA (basically you chose the oldest non-expired period), choose Strategy A of no prior history, and aggregate starting from there to the stop timestamp. Strategy B doesn’t really work here since you don’t know how many bins to retain, as span and k are not defined until the query is proposed. Another way would be to not allow the user to request Moving Average on timestamps unless the window size is such that you don’t run into expired bins (or truncate the window size so that condition is met). If we count the oldest non-expired bin as b0 and if the start timestamp in the query t_i is in bin b_i, the condition would be that the window-size < 2*i. For EWMA, since it returns values no matter what size the span is, going with Strategy A is probably easiest.

We could also implement a combination of both static parameters and query functionality. That is, specify window size, span, and beta upfront when creating the entity. Then if you later want say a different window size, do a query.

Summary:

If your data is not time ordered, you should only implement query functionality.

If your data is time ordered, you can implement static parameters (meaning you define window size/span/beta upon entity creation). Using static parameters avoids the problem of needing data from bins that have expired.

Whatever you decide to implement, be explicit in the documentation about whether Moving-Average[value, window_size] is centered on the value or not. Centering makes more sense to me conceptually, but up to the user.

In general, Strategy A of no prior history is simpler than Strategy B of taking old bins into account. In Strategy B you have to do more checking to see if the old bins aren’t expired, etc. Probably easiest to do Strategy A.

Right now the period-spanning statistics need the mean and count of each period. You can re-adjust the mean of a bin when you add new points in that time bucket by doing

(old-mean*old-count+new-value)/(old-count+1)

Really what Moving Average is doing to calculate the moving average over 3 periods is:

MovingAverage(3 bins) = (mean1*count1 + mean2*count2 + mean3*count3)/(count1+count2+count3)

so you don’t have to retain all data points, just the mean and count (which can be updated correctly even if new timepoints come in for old buckets).
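
A quick numerical check of that claim, with made-up data: three bins are rolled up into mean and count, a late point arrives for the first bin, and the count-weighted combination still matches the mean of the raw points. The function names are my own.

```python
def update_mean(old_mean, old_count, new_value):
    """Fold one new value into a bin's stored mean; returns (mean, count)."""
    new_count = old_count + 1
    return (old_mean * old_count + new_value) / new_count, new_count


def combined_mean(means, counts):
    """Count-weighted mean over several bins."""
    weighted_sum = sum(m * c for m, c in zip(means, counts))
    return weighted_sum / sum(counts)


# Raw points per bin (hypothetical data), rolled up into mean and count.
bins = [[2.0, 4.0], [6.0], [1.0, 2.0, 3.0]]
means = [sum(b) / len(b) for b in bins]
counts = [len(b) for b in bins]

# A late measurement of 8.0 arrives for the first (oldest) bin.
means[0], counts[0] = update_mean(means[0], counts[0], 8.0)

from_rollup = combined_mean(means, counts)
from_raw = (2.0 + 4.0 + 8.0 + 6.0 + 1.0 + 2.0 + 3.0) / 7
```

Both routes give 26/7, so the roll-up loses nothing for this particular statistic.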

Of course if you retained all the data points, it would be equivalent to do

MovingAverage(3 bins) = Sum(values in 3 bins wide interval)/Number of points

Notice that for count1=count2=count3 the moving average is equal to the average of the averages:

MovingAverage(3bins) = (mean1+mean2+mean3)/3

In the end, it comes down to what the data is expected to look like. If you expect almost uniform sampling, then taking the average of averages should be good enough. If however, you expect 300 points in one period and 2 in another and hugely noisy data…I can’t say for sure that the average of averages is worthless, but I think in most cases it would be a worse estimate of the mean of the k bins than when weighting by count.

The EWMA is a bit trickier. Technically what we are doing here is an exponential smooth over the mean values of the bins (if all we retain are the mean/count). E.g.:

EWMA(1st bin) = mean1
EWMA(2nd bin) = (1-alpha)*EWMA(1st bin) + alpha*mean2
etc… where alpha = 2/(1+span)

In order to do an EWMA on the raw data, you do need all the data points to be retained.
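
The recursion over bin means is short enough to write out directly. This is a plain-Python sketch (the function name is mine) of smoothing the per-period means, not the raw data.

```python
def ewma_of_means(means, span):
    """Exponentially weighted moving average over per-period means."""
    alpha = 2.0 / (1.0 + span)
    smoothed = []
    for i, mean in enumerate(means):
        if i == 0:
            # The first output equals the first input: no smoothing yet.
            smoothed.append(mean)
        else:
            smoothed.append((1 - alpha) * smoothed[-1] + alpha * mean)
    return smoothed


smoothed = ewma_of_means([1.0, 2.0, 3.0], span=3)  # alpha = 0.5
```

With span=3 the weights are easy to follow by hand: 1.0, then 0.5*1.0 + 0.5*2.0 = 1.5, then 0.5*1.5 + 0.5*3.0 = 2.25.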

Summary of the Summary:

The question seems to be, do we retain all data points or not?

Right now, you can do period-spanning statistics on the aggregated-data (where we’re only looking at period-spanning averages). It requires some work, since you have to be careful about old data points changing the aggregated mean value in old periods, and really you’re only doing the EWMA on the means, not the actual raw data. Also, the window size/span are defined as the number of periods to span, not the number of raw data points, which could be confusing. However, you’re not storing all the data points, which seems to be the intention.

If you save all the data points, doing period-spanning statistics is conceptually more straightforward. You don’t have to define the window size in terms of periods, and EWMA can be done on the raw data. Moreover, you can do other period-spanning statistics that were just impossible with roll-up (period-spanning stddevs, etc). However, you’re now retaining all the data points, which seems to be something Gnocchi was trying to avoid.

Functions Defined:

Just to define what we mean here, the output of a moving average reads:

y[n] = 1/(2k+1)*sum(x[n-k:n+k+1])

where the odd number 2k+1 is the window size and x is the input array (in the roll-up case, each element in x corresponds to the mean*count of a period, and the divisor becomes the total count over the window rather than 2k+1). Increasing the window size increases the smoothing effect. In Pandas the moving average syntax is:

y = pandas.rolling_mean(x, window_size, center=True)

If we retain all datapoints, we can call pd.rolling_mean on the raw data. If we are doing roll-up, you can’t use this syntax, but need to implement a moving_average that takes as input arrays mean, count corresponding to the periods being looked at:

moving_average(window_size=# of periods to aggregate over, mean, count)
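
One way such a function could look, as a sketch rather than anything that exists in Gnocchi: it assumes an odd window size, centers the window, weights each period's mean by its count, and returns None where the window would need periods that are not available.

```python
def moving_average(window_size, mean, count):
    """Centered, count-weighted moving average over per-period means.

    `mean` and `count` are equal-length sequences, one entry per period.
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    k = window_size // 2
    out = []
    for n in range(len(mean)):
        lo, hi = n - k, n + k + 1
        if lo < 0 or hi > len(mean):
            # The window needs periods outside the available range.
            out.append(None)
            continue
        total = sum(count[lo:hi])
        weighted = sum(m * c for m, c in zip(mean[lo:hi], count[lo:hi]))
        # A window with no points at all has no defined average.
        out.append(weighted / total if total else None)
    return out


averages = moving_average(3, mean=[3.0, 6.0, 2.0, 4.0], count=[2, 1, 3, 2])
```

Here the second period gets (3*2 + 6*1 + 2*3)/6 = 3.0 and the third gets (6*1 + 2*3 + 4*2)/6, while the first and last periods have incomplete windows.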

The output of an exponentially weighted moving average is:

y[t] = alpha*x[t] + (1-alpha)*y[t-1], t > 0
y[0] = x[0]

y = pandas.ewma(x, com=(1-alpha)/alpha) or y = pandas.ewma(x, span=(2-alpha)/alpha)

As pandas.ewma takes in span as a parameter, rather than alpha, it is perhaps clearer to say that alpha=2/(1+span). alpha is a smoothing parameter between 0 and 1; for large alpha, weights decay quickly for old data points and they “fall off” in affecting the forecast. For small alpha, weights decay more slowly and many data points contribute to the forecast.
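
Since three parameterizations (alpha, span, com) are floating around, here is a tiny sketch of the conversions between them, following the relations given above. The helper names are mine.

```python
def alpha_from_span(span):
    """alpha = 2 / (1 + span)"""
    return 2.0 / (1.0 + span)


def span_from_alpha(alpha):
    """span = (2 - alpha) / alpha"""
    return (2.0 - alpha) / alpha


def com_from_alpha(alpha):
    """com = (1 - alpha) / alpha"""
    return (1.0 - alpha) / alpha


alpha = alpha_from_span(3)      # 0.5
span = span_from_alpha(alpha)   # back to 3.0
com = com_from_alpha(alpha)     # 1.0
```

Converting a span to alpha and back returns the original span, which is a handy sanity check when translating a query's span into whatever parameter the smoothing code expects.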

First Post

May 24th, 2014

I am a graduate student in physics at Yale; this summer I’m excited to be doing an internship with OpenStack through the Outreach Program for Women. This blog will describe the work I do for this internship on Ceilometer, OpenStack’s metering service. My GitHub code is here and you can email me at this address.

Opw Project

May 24th, 2014

My project for the OPW internship will be to add support for a class of statistical functions in Ceilometer, OpenStack’s metering service. The functions I’m looking at are generally used when looking for trends over time, so things like moving averages and exponential smoothing. The initial blueprint for implementing these functions envisioned using the current statistics API and working with already supported drivers: mongodb and sqlalchemy. However, since the blueprint went out, there’s been work on a new project, Gnocchi (see here as well), which both sounds delicious and has an exciting goal: to rethink the way Ceilometer stores and aggregates metrics. Right now the plan for Gnocchi is to implement data storage using Pandas and Swift; another option is to use a database optimized for time-series data, like InfluxDB. This project is pretty new, so there are a lot of details that need to be ironed out, but ultimately using Gnocchi should lead to better performance and scalability. My project has now shifted focus to looking at how to implement the statistical functions described in the blueprint for Gnocchi.
