I have been using scikit-learn's SVC model (svm.SVC) for a binary classification problem.
Example row from the dataset:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 NaN S
I transformed the data into numbers using the OneHotEncoder and the ColumnTransformer from scikit-learn:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
categorical_features = ["Name", "Sex", "Ticket", "Cabin", "Embarked"]
encoder = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  encoder,
                                  categorical_features)],
                                remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X
This returned a scipy.sparse._csr.csr_matrix, so I converted it into a DataFrame using:
transformed_X = pd.DataFrame(transformed_X)
Then I re-split the data and fit the model on the training portion:
from sklearn.model_selection import train_test_split
transformed_X_train, transformed_X_test, y_train, y_test = train_test_split(transformed_X,
                                                                            y,
                                                                            test_size=0.2)
from sklearn import svm
clf = svm.SVC()
clf.fit(transformed_X_train, y_train)
Unfortunately, I got an error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a real number, not 'csr_matrix'
...
ValueError: setting an array element with a sequence.
I tried searching online, but I couldn't find a good solution to my problem. Can someone please help? I don't know what I'm doing wrong. Any help would be appreciated :)
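The 'csr_matrix' named in the TypeError above suggests that pd.DataFrame(transformed_X) did not expand the sparse matrix into numeric columns. As a rough sketch only, using a small made-up matrix (toy_sparse is an assumption, not the Titanic data), these are two conversions that should give one numeric column per encoded feature. SVC.fit also accepts a sparse matrix directly, so the conversion may not be needed at all.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Made-up 3x4 sparse matrix standing in for the encoder output (an assumption).
toy_sparse = sparse.csr_matrix(np.array([[1.0, 0.0, 0.0, 22.0],
                                         [0.0, 1.0, 0.0, 38.0],
                                         [0.0, 0.0, 1.0, 26.0]]))

# Option 1: make a dense copy first. Simple, but it can use a lot of memory
# when one-hot encoding creates many columns (e.g. Name or Ticket).
dense_df = pd.DataFrame(toy_sparse.toarray())

# Option 2: keep the values sparse inside pandas.
sparse_df = pd.DataFrame.sparse.from_spmatrix(toy_sparse)

print(dense_df.shape, sparse_df.shape)
```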
I got it! I first filled in the missing values in the dataframe before encoding it. Then, when I one-hot-encoded it, I passed the entire training set to the transformer instead of only X, like so:
transformed_X = transformer.fit_transform(train)
transformed_X
The difference between X and the full training set is that X was the training set without the target values (in this case, whether each passenger survived or not).
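The "fill in missing data, then encode" step can also be sketched with scikit-learn's own imputer, so that everything happens in one pipeline. This is only an illustration under assumptions: the toy rows are made up, the column choice and imputation strategies are mine, and unlike the call above it keeps the target column out of the features.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

# Made-up rows that mimic a few Titanic columns (not the real data).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Embarked": ["S", "C", np.nan, "S", "Q", "S"],
    "Age": [22.0, 38.0, 26.0, np.nan, 35.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 8.46, 11.13],
    "Survived": [0, 1, 1, 0, 0, 1],
})
X = toy.drop(columns="Survived")
y = toy["Survived"]

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
# Numeric columns: fill gaps with the column median.
numeric = SimpleImputer(strategy="median")

preprocess = ColumnTransformer([
    ("cat", categorical, ["Sex", "Embarked"]),
    ("num", numeric, ["Age", "Fare"]),
])

# The transformer output goes straight into SVC, with no DataFrame step in between.
model = Pipeline([("preprocess", preprocess), ("svc", SVC())])
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Wrapping the steps in a Pipeline also means the same imputation and encoding are applied automatically when predicting on new rows.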
Thanks! :)