python pandas csv parsing dummy-data

How to parse a pandas DataFrame object


I read a CSV file into a pandas DataFrame, then build dummy variables and concatenate them. For example, I have a column named "Genre" whose values look like "comedy, drama" and "action, comedy". When I build the dummies and concatenate them, I get one column per whole string, but I want the values split apart. That is, I want the columns 'Genre.comedy', 'Genre.drama', 'Genre.action' instead of 'Genre.comedy,drama' and 'Genre.action,comedy'. Here is my code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import csv
from sklearn import preprocessing
trainset = pd.read_csv("/Users/yada/Downloads/IMDBMovieData.csv", encoding='latin-1')
X = trainset.drop(['Description', 'Runtime'], axis=1)
features = ['Genre','Actors']
for f in features:
    X_dummy = pd.get_dummies(X[f], prefix = f)
    X = X.drop([f], axis = 1)
    X = pd.concat((X, X_dummy), axis = 1)

These are some rows of my CSV file: [image: csv]


Solution

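  • I think you need str.get_dummies with add_prefix. str.get_dummies splits each string on the given separator and creates one indicator column per distinct value: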

    features = ['Genre','Actors']
    for f in features:
        # split each cell on ', ' and build one 0/1 column per distinct value
        X_dummy = X[f].str.get_dummies(', ').add_prefix(f + '.')
        X = X.drop([f], axis=1)
        X = pd.concat((X, X_dummy), axis=1)
    

    Or:

    import pandas as pd

    # small sample frame standing in for the CSV
    trainset = pd.DataFrame({'Description':list('abc'),
                             'Genre':['comedy, drama','action, comedy','action'],
                             'Actors':['a, b','a, c','d, a'],
                             'Runtime':[1,3,5],
                             'E':[5,3,6],
                             'F':list('aaa')})
    
    print(trainset)
      Description           Genre Actors  Runtime  E  F
    0           a   comedy, drama   a, b        1  5  a
    1           b  action, comedy   a, c        3  3  a
    2           c          action   d, a        5  6  a
    
    X = trainset.drop(['Description', 'Runtime'], axis=1)
    features = ['Genre','Actors']
    # pop removes each column from X and returns it, so no separate drop is needed
    X_dummy_list = [X.pop(f).str.get_dummies(', ').add_prefix(f + '.') for f in features]
    X = pd.concat([X] + X_dummy_list, axis=1)
    print(X)
    
       E  F  Genre.action  Genre.comedy  Genre.drama  Actors.a  Actors.b  \
    0  5  a             0             1            1         1         1   
    1  3  a             1             1            0         1         0   
    2  6  a             1             0            0         1         0   
    
       Actors.c  Actors.d  
    0         0         0  
    1         1         0  
    2         0         1
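
One caveat: str.get_dummies(', ') splits on exactly a comma followed by one space. The column names in the question ('Genre.comedy,drama') suggest the real file may sometimes have no space after the comma, in which case those cells would not be split. A minimal sketch of a more tolerant variant follows. The helper name multi_label_dummies and the sample data are made up for illustration, and it assumes no label contains a '|' character:

```python
import pandas as pd


def multi_label_dummies(series: pd.Series, prefix: str) -> pd.DataFrame:
    """Return one 0/1 column per distinct label in a comma-separated column.

    Hypothetical helper, not part of pandas.
    """
    # Rewrite "a,b", "a, b" and " a ,b " as "a|b" so that the default
    # separator of str.get_dummies ('|') can be used.
    cleaned = (
        series.fillna("")
        .str.strip()
        .str.replace(r"\s*,\s*", "|", regex=True)
    )
    return cleaned.str.get_dummies().add_prefix(prefix + ".")


# Sample data with inconsistent spacing after the comma (an assumption
# about how the real CSV might look).
sample = pd.DataFrame({
    "Genre": ["comedy, drama", "action,comedy", "action"],
    "E": [5, 3, 6],
})

genre_dummies = multi_label_dummies(sample["Genre"], "Genre")
result = pd.concat([sample.drop(columns="Genre"), genre_dummies], axis=1)
print(result)
```

With this sample, the result keeps column E and adds Genre.action, Genre.comedy and Genre.drama, regardless of whether a space follows the comma. If the file is known to use ', ' consistently, the plain str.get_dummies(', ') call above is simpler.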