
Data type not understood when fitting PCA


I am carrying out principal component analysis for dimensionality reduction on the features in my dataset. However, I keep encountering this error message when I try to fit my model to my features:

TypeError: data type not understood

This is the code I have:

from sklearn.preprocessing import MinMaxScaler

a = dat.iloc[:, list(range(30)) + [31, 32]]  # columns 0-29, 31, 32
scaler = MinMaxScaler(feature_range=(0, 1))
rescaled = scaler.fit_transform(a)

Here is a sample of the data under a:

    Time    V1         V2         V3          V4           V5         V6          V7          V8           V9       ...   V22         V23          V24        V25         V26         V27    V28    Amount  Hours   Fraudulent
0   0.0 -1.359807   -0.072781   2.536347    1.378155    -0.338321   0.462388    0.239599    0.098698    0.363787    ... 0.277838    -0.110474   0.066928    0.128539    -0.189115   0.133558    -0.021053   149.62  0   0.206
1   0.0 1.191857    0.266151    0.166480    0.448154    0.060018    -0.082361   -0.078803   0.085102    -0.255425   ... -0.638672   0.101288    -0.339846   0.167170    0.125895    -0.008983   0.014724    2.69    0   0.206
2   1.0 -1.358354   -1.340163   1.773209    0.379780    -0.503198   1.800499    0.791461    0.247676    -1.514654   ... 0.771679    0.909412    -0.689281   -0.327642   -0.139097   -0.055353   -0.059752   378.66  0   0.206
3   1.0 -0.966272   -0.185226   1.792993    -0.863291   -0.010309   1.247203    0.237609    0.377436    -1.387024   ... 0.005274    -0.190321   -1.175575   0.647376    -0.221929   0.062723    0.061458    123.50  0   0.206

Here is the output of a.dtypes:

Time           float64
V1             float64
V2             float64
V3             float64
V4             float64
V5             float64
V6             float64
V7             float64
V8             float64
V9             float64
V10            float64
V11            float64
V12            float64
V13            float64
V14            float64
V15            float64
V16            float64
V17            float64
V18            float64
V19            float64
V20            float64
V21            float64
V22            float64
V23            float64
V24            float64
V25            float64
V26            float64
V27            float64
V28            float64
Amount         float64
Hours         category
Fraudulent     float64

Solution

  • In general, scikit-learn is designed to work with numeric datatypes (integers and floats). Often in pandas you'll have category, object (dtype('O')), datetime64, timedelta64, or other non-numeric types. Pandas is designed for analysis, so it will "do the right thing" with these types. Scikit-learn needs to perform linear algebra operations, and how you represent the data numerically affects the linear algebra. For this reason, the decision of how to do this conversion is usually the responsibility of the analyst rather than the library.

    For the data types in this example, you will need to make an explicit decision about how to represent them numerically for scikit-learn.

    For example, for a categorical dtype, you could do one-hot encoding using pandas' get_dummies function. This creates a new column for every possible value in the original column, containing a 1 in rows where the original column held that value and a 0 otherwise:

    In [2]: import pandas as pd
    
    In [3]: s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')
    
    In [4]: s
    Out[4]:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): [a, b, c]
    
    In [5]: pd.get_dummies(s)
    Out[5]:
       a  b  c
    0  1  0  0
    1  0  1  0
    2  0  0  1
    3  1  0  0
    

    Here's how this might look in your example:

    a = a.drop('Hours', axis=1).join(pd.get_dummies(a.Hours))
    

    However, in this case I would expect Hours to be more naturally represented just as a float or integer. So, instead you could do:

    a.Hours = a.Hours.astype(float)
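
    Once Hours is a numeric dtype, the scaling and PCA steps from the question should run without the TypeError. Here is a minimal end-to-end sketch; the synthetic DataFrame, column names, and `n_components=2` are illustrative stand-ins for your data, not part of the original question:

    ```python
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import MinMaxScaler

    # Synthetic stand-in for the question's DataFrame: float feature columns
    # plus an 'Hours' column stored as a category dtype.
    rng = np.random.default_rng(0)
    dat = pd.DataFrame(rng.normal(size=(100, 3)), columns=['V1', 'V2', 'V3'])
    dat['Hours'] = pd.Series(rng.integers(0, 24, size=100)).astype('category')

    # Convert the categorical column to float so scikit-learn can consume it.
    dat['Hours'] = dat['Hours'].astype(float)

    # Rescale all features to [0, 1], then fit PCA on the rescaled array.
    scaler = MinMaxScaler(feature_range=(0, 1))
    rescaled = scaler.fit_transform(dat)

    pca = PCA(n_components=2)
    reduced = pca.fit_transform(rescaled)
    print(reduced.shape)  # (100, 2)
    ```

    If Hours really is cyclical (e.g. hour of day), a plain float may still not be the best numeric representation, but it is enough to get past the dtype error.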