When I run the following code in Jupyter Lab
import numpy as np
from sklearn.feature_selection import SelectKBest,f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
selector = SelectKBest(f_classif,k=5)
selector.fit(titanic[predictors],titanic["Survived"])
Then it went errors and note that ValueError: could not convert string to float: 'Mme'
,details are like these:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_17760/1637555559.py in <module>
5 predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
6 selector = SelectKBest(f_classif,k=5)
----> 7 selector.fit(titanic[predictors],titanic["Survived"])
......
ValueError: could not convert string to float: 'Mme'
I tried to print titanic[predictors]
and titanic["Survived"]
,then the details are follows:
Pclass Sex Age SibSp Parch Fare Embarked FamilySize Title NameLength
0 3 0 22.0 1 0 7.2500 0 1 1 23
1 1 1 38.0 1 0 71.2833 1 1 3 51
2 3 1 26.0 0 0 7.9250 0 0 2 22
3 1 1 35.0 1 0 53.1000 0 1 3 44
4 3 0 35.0 0 0 8.0500 0 0 1 24
... ... ... ... ... ... ... ... ... ... ...
886 2 0 27.0 0 0 13.0000 0 0 6 21
887 1 1 19.0 0 0 30.0000 0 0 2 28
888 3 1 28.0 1 2 23.4500 0 3 2 40
889 1 0 26.0 0 0 30.0000 1 0 1 21
890 3 0 32.0 0 0 7.7500 2 0 1 19
891 rows × 10 columns
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: Survived, Length: 891, dtype: int64
How to Solve this Problem?
When you are trying to fit some algorithm (in your case SelectKBest
), you need to be aware of your data. And, almost all time you need to preprocess it.
Take a look to your data:
Most of algorithm don't accept categorical features, and you will need to make a transformation to numerical one (evaluate the use of OneHotEncoder
).
In your case it seems you have a categorical value called Mme
, which is in the feature Title
. Check it.
You will have the same problem with NaN values.
In conclusion, before start fitting, you have to preprocess your data.