I have a dataframe as follows:
user item affinity
0 1 13 0.1
1 2 11 0.4
2 3 14 0.9
3 4 12 1.0
From this I want to create an encoded dataset (for fastFM) as follows:
user1 user2 user3 user4 item11 item12 item13 item14 affinity
1 0 0 0 0 0 1 0 0.1
0 1 0 0 1 0 0 0 0.4
0 0 1 0 0 0 0 1 0.9
0 0 0 1 0 1 0 0 1.0
Do I need DictVectorizer from sklearn? If so, is there a way to convert the original dataframe to a dictionary that can be passed to DictVectorizer, which will in turn give me the encoded dataset shown above?
You can use get_dummies with concat. If the values in columns user or item are numeric, cast them to string with astype:
import pandas as pd

df = pd.DataFrame({'item': {0: 13, 1: 11, 2: 14, 3: 12},
'affinity': {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.0},
'user': {0: 1, 1: 2, 2: 3, 3: 4}},
columns=['user','item','affinity'])
print(df)
user item affinity
0 1 13 0.1
1 2 11 0.4
2 3 14 0.9
3 4 12 1.0
df1 = df.user.astype(str).str.get_dummies()
df1.columns = ['user' + str(x) for x in df1.columns]
print(df1)
user1 user2 user3 user4
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
df2 = df.item.astype(str).str.get_dummies()
df2.columns = ['item' + str(x) for x in df2.columns]
print(df2)
item11 item12 item13 item14
0 0 0 1 0
1 1 0 0 0
2 0 0 0 1
3 0 1 0 0
print(pd.concat([df1, df2, df.affinity], axis=1))
user1 user2 user3 user4 item11 item12 item13 item14 affinity
0 1 0 0 0 0 0 1 0 0.1
1 0 1 0 0 1 0 0 0 0.4
2 0 0 1 0 0 0 0 1 0.9
3 0 0 0 1 0 1 0 0 1.0
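The same result can probably be reached in one call with the top-level pd.get_dummies, which accepts a columns list and a prefix per column. This is a minimal sketch, assuming a pandas version that supports the dtype argument of get_dummies; the name encoded is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]})

# One-hot encode both columns at once. prefix_sep='' yields names such as
# 'user1' and 'item11' rather than 'user_1' and 'item_11'.
encoded = pd.get_dummies(df, columns=['user', 'item'],
                         prefix=['user', 'item'], prefix_sep='', dtype=int)

# get_dummies keeps the untouched columns first, so move affinity to the end.
encoded = encoded[[c for c in encoded.columns if c != 'affinity'] + ['affinity']]
print(encoded)
```

With this variant no explicit astype(str) is needed, because the prefix is prepended to the string form of each value.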
Timings:
len(df) = 4:
In [49]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 690 µs per loop
len(df) = 40:
df = pd.concat([df]*10).reset_index(drop=True)
In [51]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 719 µs per loop
len(df) = 400:
df = pd.concat([df]*100).reset_index(drop=True)
In [43]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 748 µs per loop
len(df) = 4k:
df = pd.concat([df]*1000).reset_index(drop=True)
In [41]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 761 µs per loop
len(df) = 40k:
df = pd.concat([df]*10000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
1000 loops, best of 3: 1.83 ms per loop
len(df) = 400k:
df = pd.concat([df]*100000).reset_index(drop=True)
%timeit pd.concat([df1,df2, df.affinity], axis=1)
100 loops, best of 3: 15.6 ms per loop
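As for the DictVectorizer route asked about in the question, a rough sketch could look like the following. It assumes scikit-learn is installed; the names records, vec, X and y are illustrative. Note that the feature names come out as 'user=1', 'item=13' and so on, not 'user1', 'item13':

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'user': [1, 2, 3, 4],
                   'item': [13, 11, 14, 12],
                   'affinity': [0.1, 0.4, 0.9, 1.0]})

# Cast the ids to str so DictVectorizer one-hot encodes them as categories
# instead of treating them as numeric feature values.
records = df[['user', 'item']].astype(str).to_dict(orient='records')

vec = DictVectorizer()            # returns a scipy.sparse matrix by default
X = vec.fit_transform(records)    # one row per record, one column per category
y = df['affinity'].values         # target kept separately

print(vec.feature_names_)
print(X.toarray())
```

Here the target stays in a separate array and X is sparse, which may be convenient if the encoded matrix has many columns.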