Search code examples
pythondata-structuresdatezodb

Algorithm in Python to store and search daily occurrence for thousands of numbered events?


I'm investigating solutions of storing and querying a historical record of event occurrences for a large number of items.

This is the simplified scenario: I'm getting a daily log of 200 000 streetlamps (labeled sl1 to sl200000) which shows if the lamp was operational on the day or not. It does not matter for how long the lamp was in service only that it was on a given calendar day.

Other bits of information are stored for each lamp as well and the beginning of the Python class looks something like this:

class Streetlamp(object):
    """Class for streetlamp record"""
    def __init__(self, **args):
        self.location = args['location']
        self.power = args['power']
        self.inservice = ???

My py-foo is not too great and I would like to avoid a solution which is too greedy on disk/memory storage. So a solution with a dict of (year, month, day) tuples could be one solution, but I'm hoping to get pointers for a more efficient solution.

A record could be stored as a bit stream with each bit representing a day of a year starting with Jan 1. Hence, if a lamp was operational the first three days of 2010, then the record could be:

sl1000_up = dict('2010': '11100000000000...', '2011':'11111100100...')

Search across year boundaries would need a merge, leap years are a special case, plus I'd need to code/decode a fair bit with this home grown solution. It seems not quiet right. speed-up-bitstring-bit-operations, how-do-i-find-missing-dates-in-a-list and finding-data-gaps-with-bit-masking where interesting postings I came across. I also investigated python-bitstring and did some googling, but nothing seems to really fit.

Additionally I'd like search for 'gaps' to be possible, e.g. 'three or more days out of action' and it is essential that a flagged day can be converted into a real calendar date.

I would appreciate ideas or pointers to possible solutions. To add further detail, it might be of interest that the back-end DB used is ZODB and pure Python objects which can be pickled are preferred.


Solution

  • Create a 2D-array in Numpy:

    import numpy as np
    
    nbLamps = 200000
    nbDays = 365
    
    arr = np.array([nbLamps, nbDays], dtype=np.bool)
    

    It will be very memory-efficient and you can aggregate easily the days and lamps.

    In order to manipulate the days even better, have a look at scikits.timeseries. They will allow you to access the dates with datetime objects.