X_train
------------------------------------------------------------------------------------------
| bias | word.lower | word[-3:] | word.isupper | word.isdigit | POS | BOS | EOS |
------------------------------------------------------------------------------------------
0 | 1.0 | headache, | HE, | True | False | NNP | True | False |
1 | 1.0 | mostly | tly | False | False | NNP | False | False |
2 | 1.0 | but | BUT | True | False | NNP | False | False |
...
...
...
y_train
------------
| OBI |
------------
0 | B-ADR |
1 | O |
2 | O |
...
...
...
I'm trying to do Named Entity Recognition (NER) with a Decision Tree. My feature dataframe and label dataframe look like the above. When I run the following code, it returns ValueError: could not convert string to float: 'headache,'. Are my data in the proper form (I'm following this tutorial)? Do features have to be float numbers for multiclass classification with a Decision Tree? If so, how should I proceed with the OBI labeling, given that most token features, if not all, are either strings or Booleans?
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
DT = DecisionTreeClassifier()
DT.fit(X_train, y_train)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-15-aa02be64ac27> in <module>
1 DT = DecisionTreeClassifier()
----> 2 DT.fit(X_train, y_train)
d:\python\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
888 """
889
--> 890 super().fit(
891 X, y,
892 sample_weight=sample_weight,
d:\python\lib\site-packages\sklearn\tree\_classes.py in fit(self, X, y, sample_weight, check_input, X_idx_sorted)
154 check_X_params = dict(dtype=DTYPE, accept_sparse="csc")
155 check_y_params = dict(ensure_2d=False, dtype=None)
--> 156 X, y = self._validate_data(X, y,
157 validate_separately=(check_X_params,
158 check_y_params))
d:\python\lib\site-packages\sklearn\base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
427 # :(
428 check_X_params, check_y_params = validate_separately
--> 429 X = check_array(X, **check_X_params)
430 y = check_array(y, **check_y_params)
431 else:
d:\python\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
d:\python\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
596 array = array.astype(dtype, casting="unsafe", copy=False)
597 else:
--> 598 array = np.asarray(array, order=order, dtype=dtype)
599 except ComplexWarning:
600 raise ValueError("Complex data not supported\n"
d:\python\lib\site-packages\numpy\core\_asarray.py in asarray(a, dtype, order)
83
84 """
---> 85 return array(a, dtype, copy=False, order=order)
86
87
ValueError: could not convert string to float: 'headache,'
Yes, the features need to be numeric (not necessarily float; scikit-learn casts numeric input to float internally, as the traceback's dtype=DTYPE check shows). So if a column has 4 distinct text labels, you need to convert them to 4 numbers. To do this, use scikit-learn's LabelEncoder. Assuming your data is in a pandas dataframe df:
import pandas as pd
from sklearn import preprocessing
from collections import defaultdict
# select text columns
cat_cols = df.select_dtypes(include='object').columns
# this is a way to apply label_encoder to all category cols at once, returning a label encoder per categorical column, in a dict d
d = defaultdict(preprocessing.LabelEncoder)
# transform all text columns to numbers
df[cat_cols] = df[cat_cols].apply(lambda x: d[x.name].fit_transform(x.astype(str)))
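To make that snippet concrete, here is a self-contained run on a toy frame shaped like the question's X_train (the rows and column names are made up for illustration):

```python
import pandas as pd
from collections import defaultdict
from sklearn import preprocessing

# Toy feature frame shaped like the question's X_train (values are made up)
df = pd.DataFrame({
    "bias": [1.0, 1.0, 1.0],
    "word.lower": ["headache,", "mostly", "but"],
    "POS": ["NNP", "NNP", "NNP"],
    "word.isupper": [True, False, True],
})

# Encode every text column with its own LabelEncoder, kept in dict d;
# bool and float columns are already numeric and are left alone
cat_cols = df.select_dtypes(include="object").columns
d = defaultdict(preprocessing.LabelEncoder)
df[cat_cols] = df[cat_cols].apply(lambda x: d[x.name].fit_transform(x.astype(str)))

# Codes are assigned alphabetically: 'but' < 'headache,' < 'mostly'
print(df["word.lower"].tolist())  # → [1, 2, 0]
```

Every column is now numeric, so DecisionTreeClassifier.fit will accept the frame.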
Once you have converted all columns to numbers, you may also wish to one-hot encode. Do this for categorical and Boolean columns (here it's shown for your categorical columns only).
# you should probably also one-hot the categorical columns
df = pd.get_dummies(df, columns=cat_cols)
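For instance, a label-encoded POS column with codes 0 and 1 expands into one indicator column per code (the generated column names follow the `column_value` pattern):

```python
import pandas as pd

# A label-encoded categorical column (codes are hypothetical)
df = pd.DataFrame({"POS": [0, 1, 0]})
df = pd.get_dummies(df, columns=["POS"])

print(df.columns.tolist())  # → ['POS_0', 'POS_1']
```

This matters for trees less than for linear models, but it stops the classifier from treating the arbitrary label codes as ordered quantities.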
You can map the encoded values back to the original strings afterwards using the dict d of label encoders. Note that inverse_transform expects an array-like of codes, not a bare scalar:
d[col_name].inverse_transform([value])
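A minimal round trip with a single encoder (the names le and codes are illustrative):

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
codes = le.fit_transform(["headache,", "mostly", "but"])

# Recover the original string for the first token's code
print(le.inverse_transform([codes[0]]))  # → ['headache,']
```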
This tutorial is particularly useful for understanding these concepts.