Julien Danjou’s blog post about Gnocchi is the most recent of the links here and an excellent summary of all things Gnocchi. It explains the limitations of the previous API in Ceilometer and describes how the current implementation (using Pandas/Swift) was chosen. I found the diagrams useful for understanding the relationship between entities and resources. The post expands on the information given in the wiki, also a good reference.
Julien also did a walkthrough of the Gnocchi source code: very helpful for navigating the different parts of the project.
Gnocchi specs (I referred to this primarily for examples of the HTTP request syntax). If you look through the history, you can see the shift in approach to retention policies – the timeseries length goes from being defined in terms of number of points to being defined in terms of a lifespan.
Eoghan Glynn’s thread on the OpenStack mailing lists clearly outlines the differences between Gnocchi and timeseries-oriented databases (like InfluxDB). It also goes over the features of InfluxDB that make it a good backend option for Gnocchi, such as downsampling and retention policies.
The Gnocchi repository.
Finally, I am also adding some old notes I made when first trying to understand the Gnocchi source code. I found it helpful to visually map out the filesystem structure (shown below) with comments on the different functional parts underneath. The diagram is not up to date and shows only the sections I was most familiar with. Although the comments below are rather simplistic and a bit embarrassing in retrospect, I thought I would post this in case a future OPW intern wants to know how other people approached their projects.
gnocchi/gnocchi/
|--- carbonara.py
|--- cli.py
|--- indexer/
| |--- __init__.py
| |--- null.py
| |--- sqlalchemy.py
|--- __init__.py
|--- openstack/
|--- rest/
| |--- app.py
| |--- __init__.py
|--- service.py
|--- storage/
| |--- __init__.py
| |--- null.py
| |--- swift.py
|--- tests/
| |--- __init__.py
| |--- test_carbonara.py
| |--- test_indexer.py
| |--- test_rest.py
| |--- test_storage.py
|--- gnocchi.egg-info
|--- openstack-common.conf
|--- README.rst
|--- requirements.txt
|--- run-mysql-tests.sh
|--- run-postgresql-tests.sh
|--- run-tests.sh
|--- setup.cfg
|--- setup.py
carbonara.py is the library that manipulates the timeseries. It is used in conjunction with the Swift storage driver.
rest/ – I spent a lot of time here, mostly in the EntityController get_measures section of the init file, which has the code for evaluating HTTP requests to the API. The app.py file uses pecan magic to create hooks for the configured storage drivers and indexers. It also sets the default port for the API.
storage/ – also looked at these files a lot when starting out. The driver-specific code is contained here. Eoghan Glynn added support for InfluxDB in this patch; Dina Belova added OpenTSDB support. As it was desirable to make the custom aggregation functions driver-independent, no changes were made to these files. However, it was useful to look at the InfluxDB patch to get a sense of the way aggregation was implemented using that backend.
tests/ – The second most helpful thing in understanding the code (other than actually running the API and seeing query results, as well as logging intermediate steps in debug mode) was to look at the test cases.
README.rst – I think this was my first contribution. Eoghan suggested it as a first step and it was a great way to get my feet wet.
setup.cfg – modify this file to add entry points for custom aggregation functions, storage drivers, and indexer options.
That’s all I have for comments on the Gnocchi source code – thanks for reading!
So my basic understanding of stevedore is that it is used for managing plugins, or pieces of code that you want to load into an application. The manager classes work with plugins defined through entry points to load and enable the code.
In practice, this is what the process looks like:
The documentation, which I believe is authored by Doug Hellmann, recommends making a base class with the abc module as good API practice. In my case, I wanted to make a class that would calculate the moving average of some data. So my base class, defined in the init file of my directory (/gnocchi/statistics), looked like this:
import abc

import six


@six.add_metaclass(abc.ABCMeta)
class CustomStatistics(object):

    @abc.abstractmethod
    def compute(self, data):
        '''Returns the custom statistic of the data.'''
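As an aside, the abc machinery enforces the contract: a subclass that forgets to implement compute() cannot be instantiated. A Python 3 stdlib-only sketch of the same idea (no six, so the syntax differs slightly from the code above):

```python
import abc


class CustomStatistics(abc.ABC):
    """Python 3 stdlib equivalent of the six-based base class above."""

    @abc.abstractmethod
    def compute(self, data):
        '''Returns the custom statistic of the data.'''


class Broken(CustomStatistics):
    pass  # forgot to implement compute() -- instantiation will fail
```

Trying `Broken()` raises a TypeError naming the missing abstract method.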
The code is implemented in the class MovingAverage (/gnocchi/statistics/moving_statistics.py):
from gnocchi import statistics


class MovingAverage(statistics.CustomStatistics):

    def compute(self, data):
        ... do stuff ...
        return averaged_data
The next step is to define an entry point for the code in your setup.cfg file. The format for an entry point is
plugin_namespace =
    name = module.path:thing_you_want_to_import_from_the_module
so I had
[entry_points]
gnocchi.statistics =
    moving-average = gnocchi.statistics.moving_statistics:MovingAverage
The stevedore documentation on registering plugins has more information on how to package a library in general using setuptools.
You can either use drivers, hooks, or the extensions pattern to load your plugins. I ended up starting with drivers and then moving to extensions. The difference between them is whether you want to load a single plugin (use drivers) or multiple plugins at a time (extensions). I believe hooks also allows you to load many plugins at once but is meant to be used for multiple entry points with the same name. This allows you to invoke several functions with a single call…that’s about the limit of my knowledge on hooks.
The syntax for a driver is the following:
from stevedore import driver
mgr = driver.DriverManager(
    namespace='gnocchi.statistics',
    name='moving-average',
    invoke_on_load=True,
)
output = mgr.driver.compute(data)
The invoke_on_load argument instantiates the plugin object when it is loaded; here the object is an instance of the MovingAverage class. You access it with the driver property and then call its methods (in this case, compute). You can also pass in arguments through DriverManager; see the documentation for more detail.
I ended up going with extensions instead of drivers, as there were multiple statistical functions I had as plugins and I wanted to load all the entry points at once. The syntax is then
from stevedore import extension
mgr = extension.ExtensionManager(
    namespace='gnocchi.statistics',
    invoke_on_load=True,
)
This loads all of the plugins in the namespace. In my case I wanted to make a dictionary of all the function names and the extension objects so I did:
configured_statistics = dict((x.name, x.obj) for x in mgr)
When a GET request to Gnocchi had a query for computing statistics on the data, the dict was consulted to see if there was a match with a configured statistics function name. If so, the extension object was called with the compute() method.
output = configured_statistics[user_query].compute(data)
The documentation shows an example using map() to call all the plugins. For the code below, results would be a sequence of function names and the resulting data once the statistic is applied:
def compute_data(ext, data):
    return (ext.name, ext.obj.compute(data))

results = mgr.map(compute_data, data)
If you need the order to matter when loading the extension objects, you can use NamedExtensionManager.
That’s about it for my notes on stevedore – it’s a clean, well-designed module and I’m glad I got to learn about it.
We would like to implement aggregation functions that span periods, such as simple moving averages or exponentially weighted moving averages (EWMA). See the “Functions Defined” section at the end for descriptions of EWMA and moving averages. This is getting kind of long, so it’s probably enough to read the A., B., Summary, and Functions Defined sections.
There are a few different options for implementing this, depending on whether Gnocchi ends up retaining raw data points or just doing roll-up, and how you deal with the prior history when calculating the period-spanning statistic. For the latter, there are mainly two strategies:
A. Start at the timestamp specified by the user, and aggregate forward in time. This will produce a lag between the aggregate and the data only for the moving average function; e.g. if the window size is defined as covering 7 periods, the moving average will first give a result for the 4th period (if you define the moving average to be centered). EDIT: Exponential smoothing does not produce a lag, even if the input parameter span is larger than the array to be smoothed. However, the first element of the output EWMA is always the same as the first element of the input array, so your first period will have no smoothing applied. Also, if span > len(input array), your smooth is usually a bad fit to the data.
B. Incorporate the values from k bins before the start timestamp specified by the user, where 2k+1 is the window size of the moving average; for EWMA, look back span bins. The exponentially weighted moving average command in Pandas takes an input parameter of span, and double exponential smoothing (Holt-Winters) takes input parameters span and beta. For both cases we can relate span to the usual input parameter alpha via span = (2-alpha)/alpha.
Note: For the moving average statistic, if we only retain the aggregates and not the raw data, we need to know the number of data points in each period (i.e., count needs to be available). If we retain all the data points, this is not necessary.
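The span/alpha relation used above can be written as a pair of tiny helpers (hypothetical names, just to make the conversion concrete):

```python
def span_to_alpha(span):
    # alpha = 2 / (1 + span): large spans give small alpha.
    return 2.0 / (1.0 + span)


def alpha_to_span(alpha):
    # Inverse relation: span = (2 - alpha) / alpha.
    return (2.0 - alpha) / alpha
```

For example, span = 3 corresponds to alpha = 0.5, and the two functions invert each other.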
Static vs Free Parameters:
We can either define the window size/span (and beta, for Holt-Winters) when we create the entity, or leave them as parameters that get specified in each query.
My picture of Gnocchi is that timestamps get acquired in time-order for the most part – basically you are usually getting more recent timestamps. For this model and for static parameters, you can calculate EWMA/MovA values as you fill up the period buckets. Originally, the idea was that even if periods expired, we could keep “ghostly remnants” (EWMA/MovA values) from the expired bins if needed to calculate new EWMA/MovA values for current bins, but this is unnecessary (I’ll ramble about why for the next few paragraphs).
Since EWMA only depends on the last forecast to make the next prediction, you generally only need to keep the EWMA values for two periods at a time – one EWMA value that gets updated as you add more timestamps to the current period, and an EWMA value that is fixed from the previous period, where no new timestamps are being added, but that is used to predict the EWMA of the current period. EDIT: For systems where there are n > 1 periods that are still being updated, you need to keep all those periods plus the first period that is locked-down (no new datapoints expected to arrive). For moving averages you do need to retain the last k average values of the periods before the start timestamp as well as the count for those periods. However, the moving average is not a recursive function, so you don’t actually depend on the moving average in previous periods to calculate the moving average in the current period. The only additional feature required to do the moving average that is not currently in Gnocchi would be saving the count. This is of course assuming we’re only saving aggregates – if we retain all datapoints, then you must only save the last k timestamp/value pairs.
In short, you don’t have to retain many EWMA/MovA values from old periods in order to calculate them for current periods (at most need to retain 1 old EWMA value for EWMA, don’t need any old MovA values for MovA, although you need to retain the mean and count of the k previous bins). If we assume k is less than the number of periods we generally retain, this means we never have to worry about needing to keep statistics from expired periods.
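The recursive EWMA bookkeeping described above can be sketched as follows (hypothetical helper name; assumes time-ordered arrival, so only the previous period's EWMA value needs to be retained):

```python
def ewma_update(prev_ewma, period_mean, span):
    # One recursive EWMA step over period means, with alpha = 2 / (1 + span).
    # prev_ewma is None for the very first period, whose EWMA is just its mean.
    alpha = 2.0 / (1.0 + span)
    if prev_ewma is None:
        return period_mean
    return (1 - alpha) * prev_ewma + alpha * period_mean
```

As each new period's mean is finalized, one call produces the next EWMA value; nothing older than the previous EWMA needs to be stored.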
However, if you get some out-of-order timestamp that belongs in an older period, this will affect the moving average value for periods within window_size of the bin that holds the timestamp. The effect on EWMA values is more pervasive – probably negligible for periods > span away from the re-updated bin. In this case you need to keep EWMA values from periods before the modified bucket in order to recalculate the EWMA. If your data is coming in in a completely non-time-ordered way, there is no sense in the above scheme of calculating the EWMA/MovA values for bins as you go, since you’ll just have to recalculate everything as new points come in for old time buckets. For that case, you should use queries.
Queries: Calculate the aggregation on demand without storing partial results. If we define the window size as the number of periods it spans, a query would look something like:
GET /v1/entity/<UUID>/measures?start='2014-06-21 23:23:20'&stop='2014-06-22 23:23:20'&aggregation=ewma&span=20
or
GET /v1/entity/<UUID>/measures?start='2014-06-21 23:23:20'&stop='2014-06-22 23:23:20'&aggregation=mva&window=5
or
GET /v1/entity/<UUID>/measures?start='2014-06-21 23:23:20'&stop='2014-06-22 23:23:20'&aggregation=holtwinters&span=5&beta=.1
Adding scoped names might be useful as well (i.e., aggregation.window, aggregation.beta)
Handling boundaries: If you specify a start timestamp in the query that’s expired, start from the data that does exist. If you specify a start timestamp that needs data points from expired bins to calculate MovA/EWMA (basically you chose the oldest non-expired period), choose Strategy A of no prior history, and aggregate from there to the stop timestamp. Strategy B doesn’t really work here, since you don’t know how many bins to retain – span and k are not defined until the query is proposed. Another way would be to not allow the user to request a moving average on timestamps unless the window size is such that you don’t run into expired bins (or to truncate the window size so that condition is met). If we count the oldest non-expired bin as b0, and the start timestamp in the query t_i is in bin b_i, the condition would be that window_size < 2*i. For EWMA, since it returns values no matter what size the span is, going with Strategy A is probably easiest.
We could also implement a combination of both static parameters and query functionality. That is, specify window size, span, and beta upfront when creating the entity. Then if you later want say a different window size, do a query.
Summary:
If your data is not time ordered, you should only implement query functionality.
If your data is time ordered, you can implement static parameters (meaning you define window size/span/beta upon entity creation). Using static parameters avoids the problem of needing data from bins that have expired.
Whatever you decide to implement, be explicit in the documentation about whether Moving-Average[value, window_size] is centered on the value or not. Centering makes more sense to me conceptually, but up to the user.
In general, Strategy A of no prior history is simpler than Strategy B of taking old bins into account. In Strategy B you have to do more checking to see if the old bins aren’t expired, etc. Probably easiest to do Strategy A.
Right now the period-spanning statistics need the mean and count of each period. You can re-adjust the mean of a bin when new points are added to that time bucket by doing
new-mean = (old-mean*old-count + new-value)/(old-count + 1)
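This incremental mean update can be sketched as (hypothetical helper name):

```python
def update_mean(old_mean, old_count, new_value):
    # Fold one late-arriving point into a bin's stored aggregate;
    # only (mean, count) need to be kept, not the raw data points.
    new_count = old_count + 1
    new_mean = (old_mean * old_count + new_value) / new_count
    return new_mean, new_count
```

For instance, a bin with mean 10.0 over 4 points that receives a new value of 20.0 becomes mean 12.0 over 5 points.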
Really what Moving Average is doing to calculate the moving average over 3 periods is:
MovingAverage(3 bins) = (mean1*count1 + mean2*count2 + mean3*count3)/(count1 + count2 + count3)
so you don’t have to retain all data points, just the mean and count (which can be updated correctly even if new timepoints come in for old buckets).
Of course if you retained all the data points, it would be equivalent to do
MovingAverage(3 bins) = Sum(values in 3 bins wide interval)/Number of points
Notice that for count1 = count2 = count3, the moving average is equal to the average of the averages:
MovingAverage(3bins) = (mean1+mean2+mean3)/3
In the end, it comes down to what the data is expected to look like. If you expect almost uniform sampling, then taking the average of averages should be good enough. If, however, you expect 300 bins in one period and 2 in another and hugely noisy data…I can’t say for sure that the average of averages is worthless, but I think in most cases it would be a worse estimate of the mean of the k bins than weighting by count.
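As a quick sketch of the difference (hypothetical helper names): the count-weighted version recovers the true mean of the raw points in the window, while the unweighted average of averages only matches it when all bins have equal counts:

```python
def weighted_moving_average(means, counts):
    # Average over several bins using only stored (mean, count) aggregates;
    # equivalent to averaging all the raw points in those bins.
    return sum(m * c for m, c in zip(means, counts)) / sum(counts)


def average_of_averages(means):
    # Unweighted alternative: exact only when all bins have equal counts.
    return sum(means) / len(means)
```

With counts [2, 2, 2] the two agree; with counts like [300, 2, 1] the weighted version stays near the mean of the heavily populated bin while the average of averages does not.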
The EWMA is a bit trickier. Technically what we are doing here is an exponential smooth over the mean values of the bins (if all we retain are the mean/count). E.g.:
EWMA(1st bin) = mean1
EWMA(2nd bin) = (1-alpha)*EWMA(1st bin) + alpha*mean2
etc., where alpha = 2/(1+span)
In order to do an EWMA on the raw data, you do need all the data points to be retained.
Summary of the Summary:
The question seems to be, do we retain all data points or not?
Right now, you can do period-spanning statistics on the aggregated-data (where we’re only looking at period-spanning averages). It requires some work, since you have to be careful about old data points changing the aggregated mean value in old periods, and really you’re only doing the EWMA on the means, not the actual raw data. Also, the window size/span are defined as the number of periods to span, not the number of raw data points, which could be confusing. However, you’re not storing all the data points, which seems to be the intention.
If you save all the data points, doing period-spanning statistics is conceptually more straightforward. You don’t have to define the window size in terms of periods, and EWMA can be done on the raw data. Moreover, you can do other period-spanning statistics that were just impossible with roll-up (period-spanning stddevs, etc). However, you’re now retaining all the data points, which seems to be something Gnocchi was trying to avoid.
Functions Defined:
Just to define what we mean here, the output of a moving average reads:
y[n] = 1/(2k+1) * sum(x[n-k:n+k+1])
where the odd number 2k+1 is the window size and x[t] the input array (in our case, each element in x corresponds to the mean*count of a period). Increasing the window size increases the smoothing effect. In Pandas the moving average syntax is: y = pandas.rolling_mean(x, window_size, center=True)
If we retain all datapoints, we can call pd.rolling_mean on the raw data. If we are doing roll-up, we can’t use this syntax and instead need to implement a moving_average that takes as input the mean and count arrays corresponding to the periods being looked at:
moving_average(window_size=# of periods to aggregate over, mean, count)
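A minimal sketch of such a roll-up helper (hypothetical implementation; edge bins that lack a full centered window yield None):

```python
def moving_average(means, counts, window_size):
    """Centered moving average over roll-up bins.

    window_size = 2k+1 periods; means/counts are the stored aggregates
    for consecutive periods (hypothetical argument names).
    """
    k = window_size // 2
    out = []
    for i in range(len(means)):
        # No full centered window is available at the edges.
        if i < k or i + k >= len(means):
            out.append(None)
            continue
        window = range(i - k, i + k + 1)
        total = sum(means[j] * counts[j] for j in window)
        n = sum(counts[j] for j in window)
        out.append(total / n)
    return out
```

With uniform counts this reduces to the plain centered moving average of the period means.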
The output of an exponentially weighted moving average is:
y[t] = alpha*x[t] + (1-alpha)*y[t-1] for t > 0, with y[0] = x[0]. In Pandas: y = pandas.ewma(x, com=(1-alpha)/alpha) or y = pandas.ewma(x, span=(2-alpha)/alpha)
As pandas.ewma takes span as a parameter rather than alpha, it is perhaps clearer to say that alpha = 2/(1+span). alpha is a smoothing parameter between 0 and 1; for large alpha, weights decay quickly for old data points and they “fall off” in affecting the forecast. For small alpha, weights decay more slowly and many data points contribute to the forecast.
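A pure-Python sketch of this recursion (not the Pandas call, just the definition above, with alpha = 2/(1+span)):

```python
def ewma(x, span):
    # y[0] = x[0]; y[t] = alpha*x[t] + (1-alpha)*y[t-1]
    alpha = 2.0 / (1.0 + span)
    y = [x[0]]
    for value in x[1:]:
        y.append(alpha * value + (1 - alpha) * y[-1])
    return y
```

Note that the first output equals the first input (no smoothing on the first element), matching the boundary behavior discussed in section A.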