python csv scikit-learn classification sklearn-pandas

Processing Word Data For Input into Scikit-Learn's SVC Algorithm


Let's say people email me with problems they are experiencing with a program. I would like to teach the machine to classify these emails into "issue type" classes based on the words used in each email.

I have created two CSV files which respectively contain:

  • the word contents of each email
  • the class each email would be labeled as

Here is an image showing the two CSV files

I'm attempting to feed these data into Scikit-Learn's SVC algorithm in Python 3. But, as far as I can tell, the CSV file with email contents can’t be directly passed into SVC; it seems to only accept floats.

I tried running the following code:

import pandas as pd 
import os 
from sklearn import svm 
from pandas import DataFrame 


data_file = "data.csv" 
data_df = pd.read_csv(data_file, encoding='ISO-8859-1')

classes_file = "classes.csv" 
classes_df = pd.read_csv(classes_file, encoding='ISO-8859-1')


X = data_df.values[:-1] #training data
y = classes_df.values[:-1] #training labels
     #The SVM classifier requires the specific variables X and y
         #an array X of size [n_samples, n_features] holding the training samples, 
         #and an array y of class labels (strings or integers), size [n_samples]

clf = svm.SVC(gamma=0.001, C=100)
clf.fit(X, y)

When I run this, I receive a "ValueError" on the final line, stating "could not convert string to float", followed by the contents of the first email in the "data.csv" file. Do I need to convert these email contents to floats in order to feed them into the SVC algorithm? If so, how would I go about doing that?

I've been reading http://scikit-learn.org/stable/datasets/index.html#external-datasets, which states:

Categorical (or nominal) features stored as strings (common in pandas DataFrames) will need converting to integers, and integer categorical variables may be best exploited when encoded as one-hot variables

That led me to their documentation on preprocessing data, but I'm afraid I've become a bit lost as to where to go next. I'm not sure exactly what I need to do with my email contents for them to work with the SVC algorithm.

I'd greatly appreciate any insights anyone could offer on how to approach this problem.


Solution

  • Yes, you need to encode the categorical features and then use them for the SVC.

    You can use DictVectorizer for the data_df features and then LabelEncoder for the classes_df.

    This is the sample data that I used: https://www.dropbox.com/sh/kne5wopgzeuah0u/AABKTuc3_1czzI0hIpZWPkLwa?dl=0

    Using your exact same data the following works fine:

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    from sklearn import preprocessing
    from sklearn import svm 
    
    data_file = "data.csv" 
    data_df = pd.read_csv(data_file, encoding='ISO-8859-1')
    
    classes_file = "classes.csv" 
    classes_df = pd.read_csv(classes_file, encoding='ISO-8859-1')
    
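    # label encoding (LabelEncoder expects a 1-D array, so flatten the single-column frame)
    lab_enc = preprocessing.LabelEncoder()
    labels_new = lab_enc.fit_transform(classes_df.values.ravel())
    
    # vectorize training data: one dict per row, mapping column name to value
    # (Series.iteritems() was removed in pandas 2.0; items() works in old and new versions)
    train_as_dicts = [dict(r.items()) for _, r in data_df.iterrows()]
    train_new = DictVectorizer(sparse=False).fit_transform(train_as_dicts)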
    
    clf = svm.SVC(gamma=0.001, C=100)
    clf.fit(train_new, labels_new)
    

    This works fine.
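
To make it clearer what the two encoders produce, here is a small sketch on made-up data. The rows, the class names and the variable names are hypothetical stand-ins for data.csv and classes.csv, not the real files.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows standing in for data.csv and classes.csv
rows = [
    {"Description": "cannot log in"},
    {"Description": "app crashes on start"},
    {"Description": "cannot log in"},
]
classes = ["login", "crash", "login"]

# DictVectorizer one-hot encodes string values: one column per distinct string
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(rows)

# LabelEncoder maps each class name to an integer (names are sorted first)
y = LabelEncoder().fit_transform(classes)

print(vec.feature_names_)  # ['Description=app crashes on start', 'Description=cannot log in']
print(X)
print(y)                   # [1 0 1]
```

Note that each whole string becomes one category, so rows 0 and 2 get identical feature vectors only because their text is identical.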

    Hope this helps

    EDIT

    I used the following text found on the internet as a feature in data.csv.

    The following is the first element of the Description column.

    But shortly after that first report, it was shown the initial statement was misleading. The Times reported that Trump Jr. accepted the meeting in hopes that it would yield damaging information on Hillary Clinton, and Trump Jr. said it had not. After the Times obtained an email chain showing an acquaintance, Rob Goldstone, offered Trump Jr. a meeting where he could obtain information as part of a Russian government effort to help his father's campaign, Trump Jr. posted the emails online.But shortly after that first report, it was shown the initial statement was misleading. The Times reported that Trump Jr. accepted the meeting in hopes that it would yield damaging information on Hillary Clinton, and Trump Jr. said it had not. After the Times obtained an email chain showing an acquaintance, Rob Goldstone, offered Trump Jr. a meeting where he could obtain information as part of a Russian government effort to help his father's campaign, Trump Jr. posted the emails online.

    The length is:

    len(data_df['Description'][0])
    
    982
    

    The code worked fine again.
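
One caveat with long free text: DictVectorizer treats each whole email as a single categorical value, so two emails share a feature only when their text is identical. If the goal is to classify on the words used, a bag-of-words encoding such as TfidfVectorizer may be a better fit. The following is only a sketch of that alternative, not the approach used above; the toy emails, the labels and the choice of a linear kernel are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn import svm

# Hypothetical toy emails and issue types (not taken from the real CSV files)
emails = [
    "I cannot log in to my account",
    "login fails with wrong password error",
    "the program crashes when I open a file",
    "crash on startup after the update",
]
labels = ["login", "login", "crash", "crash"]

# TfidfVectorizer turns each email into a vector of per-word weights;
# the pipeline applies the same vocabulary at fit time and at predict time.
# kernel="linear" is an assumption; it is a common choice for sparse text features.
model = make_pipeline(TfidfVectorizer(), svm.SVC(kernel="linear", C=100))
model.fit(emails, labels)

print(model.predict(["program crashes on startup"]))
```

With the real files, the list of emails would come from the text column (for example data_df['Description']) and the labels from the single column of classes_df.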

    EDIT 2

    I am using:

    sklearn.__version__
    '0.18.2'
    
    pandas.__version__
    u'0.20.3'