Search code examples
pythonpandascumsum

Percentage of events before and after a sequence of zeros in pandas rows


I have a dataframe like the following:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total
-----------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5

and I want to know the % of events (the numbers in the cells) before and after the first sequence of zeros of length n appears in each row. This problem started as another question found here: Length of first sequence of zeros of given size after certain column in pandas dataframe, and I am trying to modify the code to do what I need, but I keep getting errors and can't seem to find the right way. This is what I have tried:

def func(row, n):
    """Returns the number of events before the 
    first sequence of 0s of length n is found
    """

    idx = np.arange(0, 91)

    a = row[idx]
    b = (a != 0).cumsum()
    c = b[a == 0]
    d = c.groupby(c).count()

    #in case there is no sequence of 0s with length n
    try:
        e = c[c >= d.index[d >= n][0]]
        f = str(e.index[0])
    except IndexError:
        e = [90]
        f = str(e[0])

    idx_sliced = np.arange(0, int(f)+1)
    a = row[idx_sliced]

    if (int(f) + n > 90):
        perc_before = 100
    else:
        perc_before = a.cumsum().tail(1).values[0]/row['total']

    return perc_before

As is, the error I get is:

---> perc_before = a.cumsum().tail(1).values[0]/row['total']
TypeError: ('must be str, not int', 'occurred at index 0')

Finally, I would apply this function to a dataframe and return a new column with the % of events before the first sequence of n 0s in each row, like this:

        ID      0   1   2   3   4   5   6   7   8   ... 81  82  83  84  85  86  87  88  89  90  total  %_before
---------------------------------------------------------------------------------------------------------------
0       A       2   21  0   18  3   0   0   0   2   ... 0   0   0   0   0   0   0   0   0   0    156   43
1       B       0   20  12  2   0   8   14  23  0   ... 0   0   0   0   0   0   0   0   0   0    231   21
2       C       0   38  19  3   1   3   3   7   1   ... 0   0   0   0   0   0   0   0   0   0     78   90
3       D       3   0   0   1   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0      5   100

If trying to solve this, you can test by using this sample input:

a = pd.Series([1,1,13,0,0,0,4,0,0,0,0,0,12,1,1])
b = pd.Series([1,1,13,0,0,0,4,12,1,12,3,0,0,5,1])
c = pd.Series([1,1,13,0,0,0,4,12,2,0,5,0,5,1,1])
d = pd.Series([1,1,13,0,0,0,4,12,1,12,4,50,0,0,1])
e = pd.Series([1,1,13,0,0,0,4,12,0,0,0,54,0,1,1])

df = pd.DataFrame({'0':a, '1':b, '2':c, '3':d, '4':e})
df = df.transpose()

Solution

  • Give this a try:

    def percent_before(row, n, ncols):
        """Return the percentage of activities happen before
        the first sequence of at least `n` consecutive 0s
        """
        start_index, i, size = 0, 0, 0
        for i in range(ncols):
            if row[i] == 0:
                # increase the size of the island
                size += 1
            elif size >= n:
                # found the island we want
                break
            else:
                # start a new island
                # row[start_index] is always non-zero
                start_index = i
                size = 0
    
        if size < n:
            # didn't find the island we want
            return 1
        else:
            # get the sum of activities that happen
            # before the island
            idx = np.arange(0, start_index + 1).astype(str)
            return row.loc[idx].sum() / row['total']
    
    df['percent_before'] = df.apply(percent_before, n=3, ncols=15, axis=1)
    

    Result:

       0  1   2  3  4  5  6   7  8   9  10  11  12  13  14  total  percent_before
    0  1  1  13  0  0  0  4   0  0   0   0   0  12   1   1     33        0.454545
    1  1  1  13  0  0  0  4  12  1  12   3   0   0   5   1     53        0.283019
    2  1  1  13  0  0  0  4  12  2   0   5   0   5   1   1     45        0.333333
    3  1  1  13  0  0  0  4  12  1  12   4  50   0   0   1     99        0.151515
    4  1  1  13  0  0  0  4  12  0   0   0  54   0   1   1     87        0.172414
    

    For the full frame, call apply with ncols=91.