I am carrying out principal component analysis (PCA) for dimensionality reduction on the features in my dataset. However, I keep encountering this error message when I try to fit my model to my features:
TypeError: data type not understood
This is the code I have:
from sklearn.preprocessing import MinMaxScaler

# select columns 0-29, 31 and 32 (column 30 is skipped)
a = dat.iloc[:, list(range(30)) + [31, 32]]
scaler = MinMaxScaler(feature_range=(0, 1))
rescaled = scaler.fit_transform(a)
Here is a sample of the data under a:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 ... V22 V23 V24 V25 V26 V27 V28 Amount Hours Fraudulent
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 ... 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0 0.206
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 ... -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0 0.206
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 ... 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0 0.206
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609 0.377436 -1.387024 ... 0.005274 -0.190321 -1.175575 0.647376 -0.221929 0.062723 0.061458 123.50 0 0.206
Here is the output of a.dtypes:
Time float64
V1 float64
V2 float64
V3 float64
V4 float64
V5 float64
V6 float64
V7 float64
V8 float64
V9 float64
V10 float64
V11 float64
V12 float64
V13 float64
V14 float64
V15 float64
V16 float64
V17 float64
V18 float64
V19 float64
V20 float64
V21 float64
V22 float64
V23 float64
V24 float64
V25 float64
V26 float64
V27 float64
V28 float64
Amount float64
Hours category
Fraudulent float64
In general, scikit-learn is designed to work with numeric dtypes (integers and floats). In pandas you will often have category, object (dtype('O')), datetime64, timedelta64, or other non-numeric dtypes. Pandas is designed for analysis, so it will "do the right thing" with these types. Scikit-learn, however, needs to perform linear algebra operations, and how you represent the data numerically affects that linear algebra. For this reason, deciding how to do the conversion is usually the responsibility of the analyst rather than the library.
For the dtypes in this example, you will need to make an explicit decision about how to represent them numerically for scikit-learn.
For example, for a category dtype you could do one-hot encoding using the pandas get_dummies function. This creates a new column for every possible value in the original column, containing a 1 if the row had that value and a 0 if not:
In [2]: import pandas as pd
In [3]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
In [4]: s
Out[4]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): [a, b, c]
In [5]: pd.get_dummies(s)
Out[5]:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
Here's how this might look in your example:
a = a.drop('Hours', axis=1).join(pd.get_dummies(a.Hours))
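As a minimal sketch of what that line does, here is the same drop-and-join pattern on a small toy frame standing in for your a (the prefix argument is an optional addition here, just to get readable column names like Hours_0 instead of bare category labels):

```python
import pandas as pd

# Toy frame standing in for your `a`: one numeric column, one categorical
a = pd.DataFrame({
    "Amount": [149.62, 2.69, 378.66],
    "Hours": pd.Series(["0", "1", "0"], dtype="category"),
})

# Drop the categorical column and join one indicator column per category
a = a.drop("Hours", axis=1).join(pd.get_dummies(a.Hours, prefix="Hours"))
print(a.columns.tolist())  # ['Amount', 'Hours_0', 'Hours_1']
```

Every column is now numeric, so the frame can be passed to MinMaxScaler or PCA directly.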
However, in this case I would expect Hours to be more naturally represented as a float or an integer. So, instead, you could do:
a.Hours = a.Hours.astype(float)
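Putting it together, here is a minimal end-to-end sketch (toy data standing in for your frame) showing that once the category column is cast to float, the scaler fits without error:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for your `a`: Hours starts out as a category dtype
a = pd.DataFrame({
    "Amount": [149.62, 2.69, 378.66],
    "Hours": pd.Series([0, 1, 2], dtype="category"),
})

a.Hours = a.Hours.astype(float)  # now every column is float64

scaler = MinMaxScaler(feature_range=(0, 1))
rescaled = scaler.fit_transform(a)  # succeeds: all inputs are numeric
print(rescaled.shape)  # (3, 2)
```

After rescaling, each column lies in [0, 1], which is what MinMaxScaler(feature_range=(0, 1)) guarantees, and the result can be fed straight into PCA.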