The following code works, but it breaks when applied to my dataframe due to Pandas unable to allocate GiB for an array with shape X and data type Y
. I tried to solve the problem with a dask.dataframe
, but it also didn't work. I hope that my code can be improved. Maybe anyone here knows how.
It starts with a dataframe comparable to:
import pandas as pd
data = {'item':['a', 'b', 'c', 'd', 'e', 'f'],
'trial1':['blue', 'green', 'red', 'blue', 'blue', 'green'],
'trial2':['green', 'blue', '', '', '', 'red'],
'trial3':['red', '', '', '', '', 'blue'],
'trial4':['gray', '', '', '', '', 'gray'],
'trial5':['black','', '', '', '', '']}
df = pd.DataFrame(data)
df
item trial1 trial2 trial3 trial4 trial5
0 a blue green red gray black
1 b green blue
2 c red
3 d blue
4 e blue
5 f green red blue gray
Note that each color occurs only once for every item, that is, there are no duplicated cells in a row in columns trial1 .. trial5
. (The original dataframe has 10 trials, 300000 items, and 30000 unique 'colors'). I want for every item a one-hot encoding for its unique colors. First, I compute the one-hot encoding for every trial:
columns = ['trial1', 'trial2', 'trial3', 'trial4', 'trial5']
oneHot = pd.get_dummies(df[columns], sparse=True, prefix='', prefix_sep='')
Second, I sum up columns that refer to the same color; the result will either be 1
or 0
. This is the code that either breaks or runs for days with my original dataframe:
oneHotAgg = oneHot.groupby(oneHot.columns, axis=1).sum()
oneHotAgg = oneHotAgg.iloc[:, 1:] # don't know why this column without name is added; just delete it
Third, I combine the one-hot encoding with the items again:
result = pd.concat([df.item, oneHotAgg], axis=1)
The resulting dataframe looks as follows:
item black blue gray green red
0 a 1 1 1 1 1
1 b 0 1 0 1 0
2 c 0 0 0 0 1
3 d 0 1 0 0 0
4 e 0 1 0 0 0
5 f 0 1 1 1 1
Are there other solutions to this problem? Especially more efficient solutions? Any suggestions are welcome!!!
P.S. There are some solutions available for my problem, on which my code is based. Especially this questions was very helpful. Note the comment on the solution proposed by BENY (accepted answer).
Use:
columns = ['trial1', 'trial2', 'trial3', 'trial4', 'trial5']
df1 = df.melt('item').loc[lambda x: x['value'].ne('')]
oneHot = pd.crosstab(df1['item'], df1['value']).gt(0).astype(int)
print (oneHot)
value black blue gray green red
item
a 1 1 1 1 1
b 0 1 0 1 0
c 0 0 0 0 1
d 0 1 0 0 0
e 0 1 0 0 0
f 0 1 1 1 1