Search code examples
pythonpandasnumpyscikit-learnsvm

Fitting SVC model on Khan Gene Data Set in Python


I have four .csv files containing the training data (data points and their classes) plus test data (data points and their classes) that has been stored into X_train, y_train, X_test and y_test variables respectively.

I need to train a CSV model and test it with the test data and as sklearn.svm.SVC gets numpy arrays as input I tried converting pandas data frames to numoy arrays as follows:

X_train_gene = pd.read_csv("Khan_xtrain.csv").drop('Unnamed: 0', axis=1).values.ravel()
y_train_gene = pd.read_csv("Khan_ytrain.csv").drop('Unnamed: 0', axis=1).values.ravel()
X_test_gene = pd.read_csv("Khan_xtest.csv").drop('Unnamed: 0', axis=1).values.ravel()
y_test_gene = pd.read_csv("Khan_ytest.csv").drop('Unnamed: 0', axis=1).values.ravel()

Then I tried the following lines of code to train my model:

from sklearn.svm import SVC
svm_gene = SVC(C=10, kernel='linear')
svm_gene.fit(X_train_gene, y_train_gene)

But I get a value error:

ValueError: Expected 2D array, got 1D array instead: array=[ 0.7733437 -2.438405 -0.4825622 ... -1.115962 -0.7837286 -1.339411 ]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

The full error message can be viewed in the following image:

The Error Message

How can I fix this issue?


Solution

  • I fixed the issue by using the following lines of code:

    X_train_gene = pd.read_csv("file_path/file.csv", header=None)
    #deleting the first column which contains numbers
    del X_train_gene[X_train_gene.columns[0]]
    #deleting the first row which contains strings and predictors' names
    X_train_gene = X_train_gene.iloc[1:]
    
    y_train_gene = pd.read_csv("file_path/file.csv", header=None)
    #deleting the first column which contains numbers
    del y_train_gene[y_train_gene.columns[0]]
    #deleting the first row which contains strings and predictors' names
    y_train_gene = y_train_gene.iloc[1:]
    
    X_test_gene = pd.read_csv("file_path/file.csv", header=None)
    #deleting the first column which contains numbers
    del X_test_gene[X_test_gene.columns[0]]
    #deleting the first row which contains strings and predictors' names
    X_test_gene = X_test_gene.iloc[1:]
    
    y_test_gene = pd.read_csv("file_path/file.csv", header=None)
    #deleting the first column which contains numbers
    del y_test_gene[y_test_gene.columns[0]]
    #deleting the first row which contains strings and predictors' names
    y_test_gene = y_test_gene.iloc[1:]
    
    
    #Converting the pandas data frames to numpy arrays as the SVC functions accepts numpy arrays as input.
    X_train_gene = X_train_gene.values
    
    y_train_gene = y_train_gene.values.ravel()
    
    X_test_gene = X_test_gene.values
    
    y_test_gene = y_test_gene.values.ravel()