Search code examples
pythonpandasdataframelogistic-regression

Explode / unpivot DataFrame containing count data to one row per item


Context: Data transformation for a logistic regression problem. I have the following data structure:

df = pd.DataFrame({"group": ["A", "B"], "total": [3, 5], "occurrence": [2, 1]})

I want to do sth. like pd.explode, but creating one row for item of total, i.e. 5+6 rows where occurrence number of rows hold 1 and the rest 0 (either in the occurrence columns or a new target column).

Currently I'm doing it iteratively which is quite slow on large data:

expanded = []
for ix, row in df.iterrows():
    for i in range(row["total"]):
        row["y"] = 1 if i < row["occurrence"] else 0
        expanded.append(row.copy())
df_out = pd.DataFrame(expanded).reset_index(drop=True)
df_out.drop(["total", "occurrence"], axis=1, inplace=True)
df_out


  group  y
0     A  1
1     A  1
2     A  0
3     B  1
4     B  0
5     B  0
6     B  0
7     B  0


Solution

  • You could repeat the rows, then assign a new column based on the output from groupby.cumcount:

    out = (df.loc[df.index.repeat(df['total']), ['group', 'occurrence']]
             .assign(y=lambda x: x.groupby(level=0).cumcount()
                                  .lt(x.pop('occurrence'))
                                  .astype(int))
          )
    

    Or using 's tile/repeat:

    import numpy as np
    
    out = pd.DataFrame({'group': np.repeat(df['group'], df['total']),
                        'y': np.repeat(np.tile([1, 0], len(df)),
                                       np.r_[df['occurrence'],
                                             df['total']-df['occurrence']])
                        })
    

    Output:

      group  y
    0     A  1
    0     A  1
    0     A  0
    1     B  1
    1     B  0
    1     B  0
    1     B  0
    1     B  0
    

    If you want to use explode:

    out = (df.assign(y= lambda x: [[1 if i<o else 0 for i in range(t)] for
                                   t, o in zip(x['total'], x['occurrence'])])
             .explode('y')
          )
    

    Output:

      group  total  occurrence  y
    0     A      3           2  1
    0     A      3           2  1
    0     A      3           2  0
    1     B      5           1  1
    1     B      5           1  0
    1     B      5           1  0
    1     B      5           1  0
    1     B      5           1  0
    

    timings

    On 20k rows. The pure numpy approach is the fastest.

    # original approach
    9.96 s ± 248 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # repetition
    7.17 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # numpy tile+repeat
    1.31 ms ± 23.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
    
    # explode
    18.2 ms ± 959 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # numpy + list comprehension (PaulS)
    125 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # numba (PaulS)
    3.88 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)