Search code examples
rpython-2.7numpypandasgraphlab

Data transformation for machine learning


I have dataset with SKU IDs and their counts, i need to feed this data into a machine learning algorithm, in a way that SKU IDs become columns and COUNTs are at the intersection of transaction id and SKU ID. Can anyone suggest how to achieve this transformation.

CURRENT DATA

TransID     SKUID      COUNT
1           31         1  
1           32         2 
1           33         1  
2           31         2  
2           34         -1  

DESIRED DATA

TransID      31      32      33      34
  1          1        2      1       0
  2          2        0      0       -1  

Solution

  • In Pandas, you can use pivot:

    >>> df.pivot('TransID', 'SKUID').fillna(0)
            COUNT         
    SKUID      31 32 33 34
    TransID               
    1           1  2  1  0
    2           2  0  0 -1
    

    To avoid ambiguity, it is best to explicitly label your variables:

    df.pivot(index='TransID', columns='SKUID').fillna(0)
    

    You can also perform a groupby and then unstack SKUID:

    >>> df.groupby(['TransID', 'SKUID']).COUNT.sum().unstack('SKUID').fillna(0)
    SKUID    31  32  33  34
    TransID                
    1         1   2   1   0
    2         2   0   0  -1