python csv machine-learning feature-extraction training-data

I would like to use a feature set (vector) for each data point in Python for my machine learning algorithm. How can I do it?


I have data in the following form

   Class           Feature set list
   classlabel1 -    [size,time]      example:[6780.3,350.00]
   classlabel2 -    [size,time]
   classlabel3 -    [size,time]
   classlabel4 -    [size,time]

How do I save this data in an Excel sheet, and how can I train a model using this feature set? I am currently working with an SVM classifier.

I have tried storing the feature set list in a DataFrame and saving that DataFrame to a CSV file, but size and time end up split into two different columns.

The DataFrame is saved to the CSV file in the following way:

col0    col1         col2
62309   396.5099154  label1

I would like to train and test on the combined feature vector [size,time]. Is this possible, and is it the right approach? If so, how can I do it?


Solution

  • First, in response to your question:

    I would like to train and test on the combined feature vector [size,time]. Is this possible, and is it the right approach? If so, how can I do it?

    Merging the two into a single value is not the right thing to do: they are on different scales (if they are what their names suggest), and merging them would lose the information each one provides. For any supervised ML algorithm they are two independent features, so I would suggest keeping them as separate features rather than combining them into one. Having size and time in two separate columns is therefore fine; each row is then the feature vector [size, time].
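To make the point concrete, here is a minimal sketch of what "separate features" looks like in practice: a two-column array in which each row is one sample's [size, time] vector. Only the first two rows echo numbers from the question; the other rows and the labels are made-up values for illustration.

```python
import numpy as np

# Each row is one sample's feature vector [size, time].
# Rows 3 and 4 and all labels are hypothetical.
X = np.array([
    [6780.3, 350.0],
    [62309.0, 396.51],
    [1200.0, 120.0],
    [450.0, 80.0],
])
y = np.array(["label1", "label2", "label1", "label2"])

print(X.shape)  # (4, 2): 4 samples, 2 features
print(X[0])     # the first sample's feature vector

# The two columns span very different ranges, which is why they are
# treated as separate features rather than merged into one number.
size_range = X[:, 0].max() - X[:, 0].min()
time_range = X[:, 1].max() - X[:, 1].min()
print(size_range, time_range)
```

An array shaped like this (n_samples rows, n_features columns) is what scikit-learn estimators take as their input.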

    Now let's move on to the next part:

    How do I save this data in an Excel sheet, and how can I train a model using this feature set? I am currently working with an SVM classifier.

    1. Storing data: You can store the data in whichever format you like, but I would prefer CSV, as it is convenient and loads quickly with pandas.

    sample_data.csv

    size,time,class_label
    100,150,label1
    200,250,label2
    240,180,label1
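A file in this layout could be produced from the feature lists roughly as follows. This is only a sketch: the variable names `features` and `labels` are hypothetical, and the values and column names simply mirror the sample file above.

```python
import pandas as pd

# Hypothetical in-memory data: one [size, time] list per sample, plus its label.
features = [[100, 150], [200, 250], [240, 180]]
labels = ["label1", "label2", "label1"]

# Two feature columns plus one label column. Size and time landing in
# separate columns is the expected layout, not a problem.
df = pd.DataFrame(features, columns=["size", "time"])
df["class_label"] = labels

# index=False keeps the pandas row index out of the file.
df.to_csv("sample_data.csv", index=False)
```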
    

    Below is code for reading the data from the CSV file and training an SVM:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn import preprocessing
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score
    
    # Load the data (malformed lines raise an error by default)
    data = pd.read_csv("sample_data.csv")
    
    # Split into the target (class labels) and the features (size, time)
    Y = data["class_label"].values
    X = data.drop("class_label", axis=1).values
    
    # Encode the class labels as integers
    label_encoded_Y = preprocessing.LabelEncoder().fit_transform(Y)
    
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(
        X, label_encoded_Y, train_size=0.8, test_size=0.2)
    
    # Now use whichever training algorithm you want
    clf = SVC(gamma='auto')
    clf.fit(x_train, y_train)
    
    # Use the trained classifier to predict, then score the predictions
    y_pred = clf.predict(x_test)
    print(accuracy_score(y_test, y_pred))
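Since size and time are on different scales, and SVMs are generally sensitive to feature scale, it may help to standardize the features before fitting. One possible sketch uses a scikit-learn pipeline. The synthetic data is made up purely so the example runs on its own, and the cluster centers, split size and seeds are arbitrary assumptions, not values from the question.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (an assumption for illustration):
# two classes whose [size, time] vectors cluster around different centers.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[5000.0, 300.0], scale=[500.0, 20.0], size=(50, 2)),
    rng.normal(loc=[60000.0, 400.0], scale=[500.0, 20.0], size=(50, 2)),
])
y = np.array(["label1"] * 50 + ["label2"] * 50)

# Stratify so both classes appear in the training and the test split.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The scaler is fit on the training split only, then reused at predict time.
model = make_pipeline(StandardScaler(), SVC())
model.fit(x_train, y_train)

y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the test data out of the scaling statistics, which would otherwise leak information into training.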