Search code examples
pythonpandaspython-itertools

Count consecutive ones in a dataframe and get indices where this occurs


I have a pandas.DataFrame with integer column names, which has zeroes and ones. An example of the input:

    12  13  14  15
1   0   0   1   0
2   0   0   1   1
3   1   0   0   1
4   1   1   0   1
5   1   1   1   0
6   0   0   1   0
7   0   0   1   1
8   1   1   0   1
9   0   0   1   1
10  0   0   1   1
11  1   1   0   1
12  1   1   1   1
13  1   1   1   1
14  1   0   1   1
15  0   0   1   1

I need to count all consecutive ones which has a length/sum which is >=2, iterating through columns and returning also indices where an array of the consecutive ones occurs (start, end).

The preferred output would be a 3D DataFrame, where subcolumns "count" and "indices" refer to integer column names from the input.

An example output would look like this one:

12              13              14              15
count   indices count   indices count   indices count   indices
    3     (3,5)     2     (4,5)     2     (1,2)     3     (2,4)
    4   (11,14)     3   (11,13)     3     (5,7)     9    (7,15)
                                    2    (9,10) 
                                    4   (12,15)     

I suppose it should be solved with itertools.groupby, but still can't figure out how to apply it to such problem, where both groupby results and its indices are being extracted.


Solution

  • Here is one way to calculate the desired run lengths:

    Code:

    def min_run_length(series):
        terminal = pd.Series([0])
        diffs = pd.concat([terminal, series, terminal]).diff()
        starts = np.where(diffs == 1)
        ends = np.where(diffs == -1)
        return [(e-s, (s, e-1)) for s, e in zip(starts[0], ends[0])
                if e - s >= 2]
    

    Test Code:

    df = pd.read_fwf(StringIO(u"""
        12  13  14  15
        0   0   1   0
        0   0   1   1
        1   0   0   1
        1   1   0   1
        1   1   1   0
        0   0   1   0
        0   0   1   1
        1   1   0   1
        0   0   1   1
        0   0   1   1
        1   1   0   1
        1   1   1   1
        1   1   1   1
        1   0   1   1
        0   0   1   1"""), header=1)
    print(df.dtypes)
    
    indices = {cname: min_run_length(df[cname]) for cname in df.columns}
    print(indices)
    

    Results:

    {
     u'12': [(3, (3, 5)), (4, (11, 14))], 
     u'13': [(2, (4, 5)), (3, (11, 13))], 
     u'14': [(2, (1, 2)), (3, (5, 7)), (2, (9, 10)), (4, (12, 15))]
     u'15': [(3, (2, 4)), (9, (7, 15))], 
    }