Search code examples
pythondistributionpoisson

Python: create a distribution from a list based on number of items that fall within certain ranges


I tagged this question with poisson as I am not sure if it will be helpful in this case.

I need to create a distribution (probably formatted as an image in the end) from a list of data.

For example:

data = [1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 10, 10, 10, 22, 30, 30, 35, 46, 58, 59, 59]

such that the data can be used to create a visual distribution. I might, for example in this case, say that the ranges are in 10 and there needs to be at least 3 items in each range to be a valid point.

With this example data, I would expect the result to be analogous to

ditribution = [1, 2, 4, 6]

since I have > 3 items in ranges 0-9, 10-19, 30-39 and 50-59. Using that result I could generate an image that has the sections segmented out (darker color) that exist in my final distribution. An example of the type of image I am trying to create can be seen below and would have been generated with far more data. Ignore the blue line for now.

I know how to do this the brute force way of iterating over every item in the list and doing my calculation like that. But, my data set may have hundreds of thousands, or even millions of numbers. My range (10) and my required number of items (3) will likely be much larger in a real world example.

distribution image

Thanks for any help.


Solution

  • If data is always sorted, a compact approach might be:

    import itertools as it
    
    d = [k+1 for k, L in
             ((k, len(list(g))) for k, g in it.groupby(data,key=lambda x:x//10))
         if L>=3]
    

    If data isn't sorted, or if you don't know, use sorted(data) as the first argument to itertools.groupby, instead of just data.

    If you prefer a less dense/compact approach, you can of course expand this, e.g. to:

    def divby10(x): return x//10
    
    distribution = []
    for k, g in it.groupby(data, key=divby10):
        L = len(list(g))
        if L < 3: continue
        distribution.append(k+1)
    

    In either case, the mechanism is that groupby first applies the callable passed as key= to each item in the iterable passed as its first argument, to obtain each item's"key"; for each consecutive group of items which have the same "key", groupby yields a tuple with two items: the value of the key, and an iterable over all items in said group.

    Here, the key is obtained by dividing an item by 10 (with truncation); len(list(g)) is the number of consecutive items with that "key". Since the items must be consecutive, you need the data to be sorted (and, it's simpler to just sort it, than sort it "by value divided by 10 with truncation";-).