Tags: python, dataframe, svm, scaling, training-data

Issue with SVC Classifier on CSV file with Array Values in "features" Column


Hello Stack Overflow community,

I am facing an issue while trying to apply a Support Vector Classifier (SVC) to data loaded from a CSV file. Here is the link to the CSV file; download it to view it properly. The file has two columns: "features" and "labels". The "features" column contains array (vector) values, which are quite long, and the "labels" column has two classes: "Controlled" and "Abnormal". However, I am getting a ValueError with the message "could not convert string to float."

Here is a snippet of my code:

X = feature_df_wav2vec['features'].apply(lambda x: np.array(x).reshape(-1, 1))
y = feature_df_wav2vec['label']
#X = X.astype(float)

label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)
X

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))

X_train_scaled = X_train.copy()
X_train_scaled = scaler.fit_transform(np.vstack(X_train).T).flatten() #error-here

X_test_scaled = X_test.copy()
X_test_scaled = scaler.transform(np.vstack(X_test).T).flatten()

svm_classifier = SVC(kernel='linear', C=1.0)
svm_classifier.fit(X_train, y_train)

X_train_scaled = scaler.fit_transform(np.vstack(X_train).T).flatten() #facing error from this line

I have tried methods like scaling, converting data types, etc., but none have resolved the issue. Could someone please guide me on how to properly preprocess the "features" column before fitting the SVC model?


Solution

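  • The problem is in the first line. After the CSV is read, each entry in the "features" column is a string rather than a numeric array, so np.array(x) just wraps that string and the scaler later fails with "could not convert string to float":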

    X = feature_df_wav2vec['features'].apply(lambda x: np.array(x).reshape(-1, 1))
    

    Use np.fromstring to parse each "features" string into a NumPy array (x[1:-1] strips the surrounding brackets):

    X = feature_df_wav2vec['features'].apply(lambda x: np.fromstring(x[1:-1], sep=' ')).values
    

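    Full code:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, MinMaxScaler
    from sklearn.svm import SVC

    # feature_df_wav2vec is the DataFrame loaded from the CSV file.
    # Parse each "[v1 v2 ...]" string into a 1-D float array.
    X = feature_df_wav2vec['features'].apply(lambda x: np.fromstring(x[1:-1], sep=' ')).values
    y = feature_df_wav2vec['label']

    label_encoder = LabelEncoder()
    y_encoded = label_encoder.fit_transform(y)

    X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.3, random_state=42)

    scaler = MinMaxScaler(feature_range=(0, 1))

    # Stack the per-row arrays into a 2-D matrix of shape (n_samples, n_features).
    # Do not transpose or flatten: the scaler and SVC both expect one row per sample.
    X_train_scaled = scaler.fit_transform(np.vstack(X_train))
    X_test_scaled = scaler.transform(np.vstack(X_test))

    # Fit on the scaled training data so the scaling step is actually used.
    svm_classifier = SVC(kernel='linear', C=1.0)
    svm_classifier.fit(X_train_scaled, y_train)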
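
For anyone who wants to try the approach without the original CSV, here is a minimal, self-contained sketch on synthetic data. The random vectors, the 8-dimensional feature size, the parse_vector helper, and the use of a scikit-learn pipeline are all assumptions made for illustration; they are not taken from the question's file. It assumes each "features" entry is a bracketed, space-separated string, as the x[1:-1] slicing above implies.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 20 rows of 8-dim vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 8))
vectors[10:] += 3.0  # shift the second class so the toy problem is easy
labels = ["Controlled"] * 10 + ["Abnormal"] * 10

# Mimic what the CSV holds: each vector stored as a "[v1 v2 ...]" string.
rows = ["[" + " ".join(f"{v:.6f}" for v in row) + "]" for row in vectors]
df = pd.DataFrame({"features": rows, "label": labels})


def parse_vector(text):
    """Hypothetical helper: turn '[0.1 0.2 0.3]' into a 1-D float array."""
    return np.fromstring(text.strip()[1:-1], sep=" ")


# Parse every row, then stack into one (n_samples, n_features) matrix.
X = np.vstack(df["features"].apply(parse_vector).to_list())
y = LabelEncoder().fit_transform(df["label"])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training split only and reuses it
# when scoring, so the test data never influences the scaling.
model = make_pipeline(MinMaxScaler(), SVC(kernel="linear", C=1.0))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and classifier in one pipeline is a design choice, not a requirement; it simply removes the chance of fitting the SVC on unscaled data by mistake.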