
How to use scikit-learn's DictVectorizer to get an encoded dataframe from a dense dataframe in Python?


I have a dataframe as follows:

   user  item  affinity
0     1    13       0.1
1     2    11       0.4
2     3    14       0.9
3     4    12       1.0

From this I want to create an encoded dataset (for fastFM) as follows:

  user1  user2  user3  user4  item11  item12  item13  item14  affinity
      1      0      0      0       0       0       1       0       0.1
      0      1      0      0       1       0       0       0       0.4
      0      0      1      0       0       0       0       1       0.9
      0      0      0      1       0       1       0       0       1.0

Do I need DictVectorizer from sklearn? If so, is there a way to convert the original dataframe to a dictionary that can be passed to DictVectorizer, which will in turn give me the encoded dataset shown above?


Solution

  • You can use get_dummies with concat. If the values in the user or item columns are numeric, cast them to strings with astype first:

    import pandas as pd

    df = pd.DataFrame({'item': {0: 13, 1: 11, 2: 14, 3: 12},
                       'affinity': {0: 0.1, 1: 0.4, 2: 0.9, 3: 1.0},
                       'user': {0: 1, 1: 2, 2: 3, 3: 4}},
                      columns=['user', 'item', 'affinity'])
    print(df)
       user  item  affinity
    0     1    13       0.1
    1     2    11       0.4
    2     3    14       0.9
    3     4    12       1.0
    
    # One-hot encode user: get_dummies on a string column yields 0/1 indicator columns
    df1 = df.user.astype(str).str.get_dummies()
    df1.columns = ['user' + str(x) for x in df1.columns]
    print(df1)
       user1  user2  user3  user4
    0      1      0      0      0
    1      0      1      0      0
    2      0      0      1      0
    3      0      0      0      1
    
    # Same for item
    df2 = df.item.astype(str).str.get_dummies()
    df2.columns = ['item' + str(x) for x in df2.columns]
    print(df2)
       item11  item12  item13  item14
    0       0       0       1       0
    1       1       0       0       0
    2       0       0       0       1
    3       0       1       0       0
    
    # Stitch the indicator frames and the affinity column back together
    print(pd.concat([df1, df2, df.affinity], axis=1))
       user1  user2  user3  user4  item11  item12  item13  item14  affinity
    0      1      0      0      0       0       0       1       0       0.1
    1      0      1      0      0       1       0       0       0       0.4
    2      0      0      1      0       0       0       0       1       0.9
    3      0      0      0      1       0       1       0       0       1.0
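
    As a side note, newer pandas can produce the same encoding in one call. A minimal sketch, assuming a pandas version whose get_dummies supports the columns and prefix_sep arguments:

    # Encode the listed columns in place; other columns (affinity) pass through.
    # prefix_sep='' turns ('user', 1) into 'user1' rather than 'user_1'.
    encoded = pd.get_dummies(df, columns=['user', 'item'], prefix_sep='')
    # Recent pandas emits boolean dummies; cast to int for the 0/1 layout above.
    dummy_cols = [c for c in encoded.columns if c != 'affinity']
    encoded[dummy_cols] = encoded[dummy_cols].astype(int)
    print(encoded[dummy_cols + ['affinity']])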
    

    Timings (df1 and df2 are rebuilt from the enlarged df before each run):

    len(df) = 4:

    In [49]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
    The slowest run took 4.91 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 690 µs per loop
    

    len(df) = 40:

    df = pd.concat([df]*10).reset_index(drop=True)
    
    In [51]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
    The slowest run took 5.56 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 719 µs per loop
    

    len(df) = 400:

    df = pd.concat([df]*100).reset_index(drop=True)
    
    In [43]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
    The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 748 µs per loop
    

    len(df) = 4k:

    df = pd.concat([df]*1000).reset_index(drop=True)
    
    In [41]: %timeit pd.concat([df1,df2, df.affinity], axis=1)
    The slowest run took 4.67 times longer than the fastest. This could mean that an intermediate result is being cached 
    1000 loops, best of 3: 761 µs per loop
    

    len(df) = 40k:

    df = pd.concat([df]*10000).reset_index(drop=True)
    
    %timeit pd.concat([df1,df2, df.affinity], axis=1)
    1000 loops, best of 3: 1.83 ms per loop
    

    len(df) = 400k:

    df = pd.concat([df]*100000).reset_index(drop=True)
    
    %timeit pd.concat([df1,df2, df.affinity], axis=1)
    100 loops, best of 3: 15.6 ms per loop
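
  • To answer the DictVectorizer part of the question: you don't need it for this encoding, but it is a reasonable alternative, especially since fastFM works with scipy sparse matrices and DictVectorizer produces one directly. A minimal sketch (note that the feature names come out as 'user=1' rather than the user1 layout above):

    from sklearn.feature_extraction import DictVectorizer

    # Cast to str so DictVectorizer treats user/item as categorical values
    # rather than raw numbers, then turn each row into a feature dict.
    dicts = df[['user', 'item']].astype(str).to_dict('records')
    # e.g. [{'user': '1', 'item': '13'}, {'user': '2', 'item': '11'}, ...]

    v = DictVectorizer()          # sparse=True by default
    X = v.fit_transform(dicts)    # scipy.sparse one-hot matrix, ready for fastFM
    y = df['affinity'].values     # target vector

    # Feature names come out as 'item=11', ..., 'user=4'
    # (use get_feature_names() on scikit-learn older than 1.0).
    print(v.get_feature_names_out())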