Search code examples
pythonpandasvectorizationone-hot-encoding

Convert count row to one hot encoding efficiently


I have a table with rows in this format where the integers are a count:

   A  B  C  D  E
0  a  2  0  3  x
1  b  1  2  0  y

I'd like to convert it into a format where each count is a one hot encoded row:

   A  B  C  D  E
0  a  1  0  0  x
1  a  1  0  0  x
2  a  0  0  1  x
3  a  0  0  1  x
4  a  0  0  1  x
5  b  1  0  0  y
6  b  0  1  0  y
7  b  0  1  0  y

I wrote inefficient code which achieves this

# Sample DataFrame
data = {
    'A': ['a', 'b'],
    'B': [2, 1],
    'C': [0, 2],
    'D': [3, 0],
    'E': ['x', 'y']
}
df = pd.DataFrame(data)

new_df = pd.DataFrame(columns=df.columns)

for index, row in df.iterrows():
    first_val = row.iloc[0]
    last_val = row.iloc[-1]
    middle_vals = row.iloc[1:-1].astype(int)
    for i in range(len(middle_vals)):
        new_data = [first_val] + [1 if i == j else 0 for j in range(len(middle_vals))] + [last_val]
        new_rows = pd.DataFrame([new_data] * middle_vals.iloc[i], columns=df.columns)
        new_df = pd.concat([new_df, new_rows], ignore_index=True)

Any tips for vectorizing this operation which is incredibly slow? I realize a concat operation per iteration is a big issue, so I did try a batching solution where I collect chunks of new_rows and then concat. This remains slow.


Solution

  • Here is a full solution, I would expect this to be faster than reshaping:

    import numpy as np
    
    num_cols = ['B', 'C', 'D']
    
    # convert to numpy array
    a = df[num_cols].to_numpy()
    
    # build indices to repeat
    idx = np.repeat(np.arange(a.shape[0]), a.sum(1))
    # array([0, 0, 0, 0, 0, 1, 1, 1])
    
    # build column indices to repeat
    cols = np.repeat(np.tile(np.arange(a.shape[1]), a.shape[0]), a.flat)
    # array([0, 0, 2, 2, 2, 0, 1, 1])
    
    # assign 1s
    b = np.zeros((len(idx), len(num_cols)), dtype=int)
    b[np.arange(len(idx)), cols] = 1
    
    # repeat and update DataFrame
    out = df.iloc[idx]
    out.loc[:, num_cols] = b
    

    Output:

       A  B  C  D  E
    0  a  1  0  0  x
    0  a  1  0  0  x
    0  a  0  0  1  x
    0  a  0  0  1  x
    0  a  0  0  1  x
    1  b  1  0  0  y
    1  b  0  1  0  y
    1  b  0  1  0  y
    

    timings

    # setup
    N = 10_000
    df = (pd.DataFrame(np.clip(np.random.randint(-10, 15, (N, 10)), 0, 15)).assign(A=np.arange(N), E=np.arange(N))
          [['A', 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 'E']]
         )
    
    # numpy approach (this one)
    45.2 ms ± 6.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # list approach (@EmiOB)
    249 ms ± 36.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    # pure pandas melt+repeat+pivot approach
    3.68 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    comparison for up to ~2M rows (10 cols of numbers) timings