Tags: python, pandas, encoding, one-hot-encoding, countvectorizer

Encoding multiple columns


Suppose a dataframe has two or more feature columns holding numerical and text values, plus one Label/Target column. If I want to apply a model such as an SVM, how can I use only the columns I am interested in? For example:

Data                                     Num      Label/Target   No_Sense
What happens here?                       group1   1              Migrate
Customer Management                      group2   0              Change Stage
Life Cycle Stages                        group1   1              Restructure
Drop-down allows to select status type   group3   1              Restructure Status

and so on.

The approach I have taken is:

1. Encode the "Num" column:

one_hot = pd.get_dummies(df['Num'])
df = df.drop('Num',axis = 1)
df = df.join(one_hot)

2. Encode the "Data" column:

def bag_words(df):
        
    df = basic_preprocessing(df)
    
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
    
    list_corpus = df["Data"].tolist()
    list_labels = df["Label/Target"].tolist()
        
    X = count_vectorizer.transform(list_corpus)
        
    return X, list_labels

Then I apply bag_words to the dataset:

X, y = bag_words(df)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

Is there anything that I missed in these steps? How can I select only "Data" and "Num" features in my training dataset? (as I think "No_Sense" is not so relevant for my purposes)
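On the narrower point of leaving "No_Sense" out: a minimal sketch, assuming the frame looks like the table above, is to select only the wanted columns by name before any encoding. The names `feature_cols` and `X_raw` are illustrative, not from the original code.

```python
import pandas as pd

# Toy frame mirroring the example table in the question.
df = pd.DataFrame({
    "Data": ["What happens here?", "Customer Management",
             "Life Cycle Stages", "Drop-down allows to select status type"],
    "Num": ["group1", "group2", "group1", "group3"],
    "Label/Target": [1, 0, 1, 1],
    "No_Sense": ["Migrate", "Change Stage", "Restructure", "Restructure Status"],
})

# Hypothetical name: the columns that should feed the model.
feature_cols = ["Data", "Num"]

# "No_Sense" is never selected, so it cannot reach the training data.
X_raw = df[feature_cols]
y = df["Label/Target"]
```

The encoding steps can then work on `X_raw` alone, and `df.drop(columns=["No_Sense"])` is an equivalent way to discard the column.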

EDIT: I have tried with

def bag_words(df):
            
    df = basic_preprocessing(df)
        
    count_vectorizer = CountVectorizer()
    count_vectorizer.fit(df['Data'])
        
    list_corpus = df["Data"].tolist()+ df["group1"].tolist()+df["group2"].tolist()+df["group3"].tolist() #<----
    list_labels = df["Label/Target"].tolist()
            
    X = count_vectorizer.transform(list_corpus)
            
    return X, list_labels

but I get the error:

TypeError: 'int' object is not iterable
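The exact source of the TypeError is not visible from the snippet, but the edited function has a structural problem either way: adding the lists together appends the 0/1 dummy values as extra "documents" (rows) rather than as extra columns, so the corpus contains integers, which CountVectorizer cannot tokenize, and it ends up four times longer than the label list. A rough sketch of combining the two encodings column-wise instead, assuming SciPy is available and using `scipy.sparse.hstack`:

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy frame mirroring the example table in the question.
df = pd.DataFrame({
    "Data": ["What happens here?", "Customer Management",
             "Life Cycle Stages", "Drop-down allows to select status type"],
    "Num": ["group1", "group2", "group1", "group3"],
    "Label/Target": [1, 0, 1, 1],
})

# One-hot encode "Num" as integer 0/1 columns.
one_hot = pd.get_dummies(df["Num"], dtype=int)

# Bag-of-words counts for the text column only.
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(df["Data"])

# Stack side by side: one row per sample, text counts followed by the dummies.
X = hstack([X_text, csr_matrix(one_hot.to_numpy())]).tocsr()
y = df["Label/Target"].to_numpy()
```

Here `X` keeps one row per sample, so it lines up with `y` and can be passed to `train_test_split` as before.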

Solution

  • I hope this helps you:

    import pandas as pd
    import numpy as np
    import re
    
    from sklearn.feature_extraction.text import CountVectorizer
    
    # This part only recreates your df from the string you posted;
    # remove it in your own code.
    
    data="""
    Data                        Num     Label/Target   No_Sense
    What happens here?         group1         1          Migrate
    Customer Management        group2         0          Change Stage
    Life Cycle Stages          group1         1          Restructure
    Drop-down allows to select status type  group3   1   Restructure Status
    """
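    # split the string into lines, then split each row on runs of 2+ spaces
    lines = [line.strip() for line in data.strip().splitlines()]
    df = pd.DataFrame(np.array( [ re.split(r'\s{2,}', line) for line in lines[1:] ] ), 
                    columns = lines[0].split())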
    
    
    # what you want starts here:
    # dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to booleans)
    one_hot = pd.get_dummies(df['Num'], dtype=int)
    df = df.drop('Num', axis=1)
    df = df.join(one_hot)
    
    # at this point you have 3 new features for the 'Num' variable
    
    def bag_words(df):
    
        count_vectorizer = CountVectorizer()
        count_vectorizer.fit(df['Data'])
        matrix = count_vectorizer.transform(df['Data'])
    
        # `encoded_df` has 15 new features: the result of fitting
        # the CountVectorizer to the 'Data' variable
        data_cols = ["Data" + str(i) for i in range(matrix.shape[1])]
        encoded_df = pd.DataFrame(data=matrix.toarray(), columns=data_cols, index=df.index)
        
        # add them to the dataframe (join returns a new frame, so assign the result)
        df = df.join(encoded_df)
        
        # get the numpy arrays that you can use in training
        X = df.loc[:, data_cols + ["group1", "group2", "group3"]].to_numpy()
        # 1-D label array, the shape scikit-learn estimators expect
        y = df["Label/Target"].to_numpy()
    
        return X, y
    
    X, y = bag_words(df)
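As a possible alternative to joining the encoded columns by hand, scikit-learn's `ColumnTransformer` can apply a different encoder to each column inside a `Pipeline`, so the vocabulary and the categories are learned from the training rows only. The sketch below is one way to set this up, not the method used above. The last four rows of the toy frame are invented purely so that a stratified split is possible, and the linear kernel is an arbitrary choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# First four rows come from the question; the last four are made up
# only so that each class has enough samples for a stratified split.
df = pd.DataFrame({
    "Data": ["What happens here?", "Customer Management",
             "Life Cycle Stages", "Drop-down allows to select status type",
             "Archive closed accounts", "Which stage is next?",
             "Export status report", "Customer contact list"],
    "Num": ["group1", "group2", "group1", "group3",
            "group2", "group1", "group3", "group2"],
    "Label/Target": [1, 0, 1, 1, 0, 1, 0, 0],
})

X_raw = df[["Data", "Num"]]
y = df["Label/Target"]

X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y, test_size=0.25, stratify=y, random_state=40
)

model = Pipeline([
    ("features", ColumnTransformer([
        # CountVectorizer needs a 1-D column of strings: pass the name as a string.
        ("bow", CountVectorizer(), "Data"),
        # OneHotEncoder needs 2-D input: pass the name inside a list.
        # handle_unknown="ignore" encodes unseen groups as all zeros.
        ("num", OneHotEncoder(handle_unknown="ignore"), ["Num"]),
    ])),
    ("svm", SVC(kernel="linear")),
])

# Encoders are fitted on the training rows only, then reused for the test rows.
model.fit(X_train, y_train)
pred = model.predict(X_test)
```

With so few rows the predictions themselves mean little; the point is only that splitting first and fitting the encoders inside the pipeline avoids learning the vocabulary from the test rows.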