Tags: python, scikit-learn, svm, prediction, dictvectorizer

Converting string data to float before passing to SVM classifier


I have a dataset as follows:

X_data =

BankNum  | ID
00987772 | AB123
00987772 | AB123
00987772 | AB123
00987772 | ED245
00982123 | GH564

And another one as:

y_data =

ID    | Labels
AB123 | High
ED245 | Low
GH564 | Low

I'm doing the following:

from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np

clf = svm.SVC(gamma=0.001, C=100., probability=True)
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.20, random_state=42)
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)

But how do I convert X_data to float before calling clf.fit()? Can I use DictVectorizer in this case? If so, how do I use it?

Also, I'm passing X_data and y_data through train_test_split to measure prediction accuracy, but will it split them correctly? That is, will each ID in X_data be paired with its correct label from y_data?

UPDATE:

Can someone please tell me if I'm doing the following correctly?

import pandas as pd
from sklearn import svm
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

new_df = pd.merge(df, df3, on="ID")
columns = ['BankNum', 'ID']
labels = new_df['Labels']
le = LabelEncoder()
labels = le.fit_transform(labels)
X_train, X_test, y_train, y_test = train_test_split(new_df[columns], labels, test_size=0.25, random_state=42)
# Reassign instead of inplace=True to avoid modifying a slice of new_df
X_train = X_train.fillna('NA')
X_test = X_test.fillna('NA')
x_cat_train = X_train.to_dict(orient='records')
x_cat_test = X_test.to_dict(orient='records')
vectorizer = DictVectorizer(sparse=False)
vec_x_cat_train = vectorizer.fit_transform(x_cat_train)
vec_x_cat_test = vectorizer.transform(x_cat_test)
x_train = vec_x_cat_train
x_test = vec_x_cat_test
clf = svm.SVC(gamma=0.001, C=100., probability=True)
clf.fit(x_train, y_train)

Solution

  • As we discussed in the comments, my suggestion is to first merge the x_data and y_data datasets on the ID column (named index in my code):

    dataset = pd.merge(left=x_data, right=y_data, on='index')
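
To illustrate why the merge settles the alignment question, here is a minimal sketch on toy frames shaped like the ones in the question. The values are illustrative, the key is the question's ID column rather than index, and the validate argument is an optional safety check I added, not part of the original suggestion.

```python
import pandas as pd

# Toy frames mirroring the question's layout (illustrative values only).
x_data = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982123"],
    "ID": ["AB123", "AB123", "AB123", "ED245", "GH564"],
})
y_data = pd.DataFrame({
    "ID": ["AB123", "ED245", "GH564"],
    "Labels": ["High", "Low", "Low"],
})

# Many rows per ID on the left, one label per ID on the right.
# validate="many_to_one" raises if y_data ever holds duplicate IDs.
dataset = pd.merge(left=x_data, right=y_data, on="ID", validate="many_to_one")
print(dataset)
```

Each row of the merged frame now carries its own label, so features and labels taken from the same frame stay paired when train_test_split shuffles the rows.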
    

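    Then you can convert the Bank_Num column to float with astype. np.float64 is enough here, and np.float128 is not available on every platform. Note that the cast drops leading zeros, so 00987772 becomes 987772.0:

    dataset['Bank_Num'] = dataset['Bank_Num'].astype(np.float64)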
    

    NB (update): LabelEncoder also works for Bank_Num if the column contains values that are not purely numeric (le is the LabelEncoder instance created in the next step):

    dataset['Bank_Num'] = le.fit_transform(dataset['Bank_Num'])
    

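    Encode the ID column with LabelEncoder to get an integer representation of it. Use dataset['index'] rather than dataset.index, because the attribute form returns the DataFrame's row index, not the column:

    from sklearn.preprocessing import LabelEncoder, LabelBinarizer
    le = LabelEncoder()
    dataset['index'] = le.fit_transform(dataset['index'])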
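
As a small illustration of what LabelEncoder produces, the sketch below encodes a few IDs taken from the question and maps the codes back. The list of IDs is an illustrative assumption, not the real data.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative IDs in the style of the question's data.
ids = ["AB123", "AB123", "ED245", "GH564"]

le = LabelEncoder()
codes = le.fit_transform(ids)      # one integer per distinct string, in sorted order
print(codes)
print(le.classes_)                 # the distinct strings, indexed by their code
print(le.inverse_transform(codes)) # recover the original strings
```

Keep in mind that these integers carry an arbitrary order, and the scikit-learn documentation describes LabelEncoder as meant for target values rather than input features. For identifier-like columns, one-hot encoding (for example with DictVectorizer, as in the question's update) may be a better fit.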
    

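    Encode the y label with LabelBinarizer. For two classes it returns a single column of shape (n, 1), so flatten it before assigning:

    lb = LabelBinarizer()
    dataset['label'] = lb.fit_transform(dataset['label']).ravel()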
    

    Now the whole dataset consists of ints and floats, so your SVC can work with it. Before training, you need to split the data.

    It is a good idea to keep the test set smaller than the training set, so prefer a test_size below 0.5. Find more about training set and test set size here

    For example:

    X_train, X_test, y_train, y_test = train_test_split(dataset[['index','Bank_Num']], dataset.label, test_size=0.25, random_state=42)
    

    With this, you can train your classifier without any problems:

    clf.fit(X_train, y_train)
    

    NB: in my code, index is equivalent to your ID.

    Let me know if this helps and how I can improve my answer.
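
Since the question asks about DictVectorizer, here is a hedged end-to-end sketch that combines the merge with the one-hot approach from the question's update instead of the integer encoding above. The data is synthetic and made up for illustration, the column names follow the question (BankNum, ID, Labels), and stratify=y is my own addition so that both classes appear in each split of such a small sample.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Synthetic data for illustration only.
x_data = pd.DataFrame({
    "BankNum": ["00987772"] * 6 + ["00982123"] * 6,
    "ID": ["AB123"] * 4 + ["ED245"] * 4 + ["GH564"] * 4,
})
y_data = pd.DataFrame({
    "ID": ["AB123", "ED245", "GH564"],
    "Labels": ["High", "Low", "Low"],
})

# Attach the label to every row, then encode it as integers.
dataset = pd.merge(x_data, y_data, on="ID")
y = LabelEncoder().fit_transform(dataset["Labels"])

X_train, X_test, y_train, y_test = train_test_split(
    dataset[["BankNum", "ID"]], y, test_size=0.25, random_state=42, stratify=y
)

# DictVectorizer one-hot encodes string values as "column=value" features.
vectorizer = DictVectorizer(sparse=False)
X_train_vec = vectorizer.fit_transform(X_train.to_dict(orient="records"))
X_test_vec = vectorizer.transform(X_test.to_dict(orient="records"))

clf = SVC(gamma=0.001, C=100.0)
clf.fit(X_train_vec, y_train)
predicted = clf.predict(X_test_vec)

print(vectorizer.feature_names_)
print(predicted)
```

Fitting the vectorizer on the training rows only, then calling transform on the test rows, keeps the feature columns identical across both sets. Values that appear only in the test set are silently ignored.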