Search code examples
pythonpandasgroup-bysumone-hot-encoding

Unable to allocate GiB - pandas get_dummies on multiple columns


The following code works, but it breaks when applied to my dataframe due to Pandas unable to allocate GiB for an array with shape X and data type Y. I tried to solve the problem with a dask.dataframe, but it also didn't work. I hope that my code can be improved. Maybe anyone here knows how.

It starts with a dataframe comparable to:

import pandas as pd
 
data = {'item':['a', 'b', 'c', 'd', 'e', 'f'], 
        'trial1':['blue', 'green', 'red', 'blue', 'blue', 'green'], 
        'trial2':['green', 'blue', '', '', '', 'red'],
        'trial3':['red', '', '', '', '', 'blue'],
        'trial4':['gray', '', '', '', '', 'gray'],
        'trial5':['black','', '', '', '', '']}
 
df = pd.DataFrame(data)
df

    item    trial1  trial2  trial3  trial4  trial5
0   a       blue    green   red     gray    black
1   b       green   blue            
2   c       red                 
3   d       blue                
4   e       blue                
5   f       green   red     blue    gray    

Note that each color occurs only once for every item, that is, there are no duplicated cells in a row in columns trial1 .. trial5. (The original dataframe has 10 trials, 300000 items, and 30000 unique 'colors'). I want for every item a one-hot encoding for its unique colors. First, I compute the one-hot encoding for every trial:

columns = ['trial1', 'trial2', 'trial3', 'trial4', 'trial5']
oneHot = pd.get_dummies(df[columns], sparse=True, prefix='', prefix_sep='')

Second, I sum up columns that refer to the same color; the result will either be 1 or 0. This is the code that either breaks or runs for days with my original dataframe:

oneHotAgg = oneHot.groupby(oneHot.columns, axis=1).sum()
oneHotAgg = oneHotAgg.iloc[:, 1:] # don't know why this column without name is added; just delete it

Third, I combine the one-hot encoding with the items again:

result = pd.concat([df.item, oneHotAgg], axis=1)

The resulting dataframe looks as follows:

    item    black   blue    gray    green   red
0   a       1       1       1       1       1
1   b       0       1       0       1       0
2   c       0       0       0       0       1
3   d       0       1       0       0       0
4   e       0       1       0       0       0
5   f       0       1       1       1       1

Are there other solutions to this problem? Especially more efficient solutions? Any suggestions are welcome!!!

P.S. There are some solutions available for my problem, on which my code is based. Especially this questions was very helpful. Note the comment on the solution proposed by BENY (accepted answer).


Solution

  • Use:

    columns = ['trial1', 'trial2', 'trial3', 'trial4', 'trial5']
    df1 = df.melt('item').loc[lambda x: x['value'].ne('')]
    oneHot = pd.crosstab(df1['item'], df1['value']).gt(0).astype(int)
    print (oneHot)
    value  black  blue  gray  green  red
    item                                
    a          1     1     1      1    1
    b          0     1     0      1    0
    c          0     0     0      0    1
    d          0     1     0      0    0
    e          0     1     0      0    0
    f          0     1     1      1    1