Tags: python, pandas, scikit-learn, pca

sklearn/PCA - Error while trying to transform high-dimensional data


I encountered an error while trying to reduce my high-dimensional vectors to 2 dimensions using PCA.

This is my input data; each row holds a 300-dimensional vector:

                                                  vector
0      [0.01053525, -0.007869658, 0.0024931028, -0.04...
1      [-0.024436072, -0.016484523, 0.03859031, 0.000...
2      [0.015011676, -0.020465894, 0.004854744, -0.00...
3      [-0.010836455, -0.006562917, 0.00265073, 0.022...
4      [-0.018123362, -0.026007563, 0.04781856, -0.03...
...                                                  ...
45124  [-0.016111804, -0.041917775, 0.010192914, -0.0...
45125  [0.0311568, -0.013044083, 0.030656694, -0.0126...
45126  [-0.021875003, -0.005635035, 0.0076896898, -0....
45127  [-0.0062000924, -0.041035958, 0.0077403532, 0....
45128  [0.007794927, 0.0019561667, 0.15995999, -0.054...

[45129 rows x 1 columns]

My Code:

import pandas as pd
from sklearn.decomposition import PCA

data = pd.read_parquet('1.parquet', engine='fastparquet')
pca = PCA(n_components=2)  # reduce to 2 dimensions
reduced = pca.fit_transform(data)

Error:

TypeError                                 Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.

Edit

>>> data.shape
(45129, 1)
>>> data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries, 0 to 45128
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   vector  45129 non-null  object
dtypes: object(1)
memory usage: 352.7+ KB
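
The object dtype above is the crux: each cell of vector is a Python list, so the frame can only become a (45129, 1) array of objects. When scikit-learn validates the input it tries to cast that array to float, calling float() on each cell, and a list cannot be cast, which produces the chained TypeError/ValueError above. A minimal sketch reproducing the failure on toy data (not the question's file):

import numpy as np
import pandas as pd

# Toy frame shaped like the question's: one object column whose cells are lists.
df = pd.DataFrame({"vector": [[0.1, 0.2], [0.3, 0.4]]})
arr = np.asarray(df)
print(arr.shape, arr.dtype)  # (2, 1) object

try:
    # Roughly what scikit-learn's input validation attempts under the hood.
    np.asarray(df, dtype=np.float64)
except (TypeError, ValueError) as err:
    print(err)  # "setting an array element with a sequence." (or similar)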



Solution

  • Scikit-learn can't handle a DataFrame column whose cells are lists (dtype object), so you'll need to expand that column into separate numeric columns. Since every row holds a list of the same length, this is cheap even at ~45,000 rows; once the data is expanded, PCA works as expected. (A sketch applying the same expansion to the question's actual frame follows the toy example below.)

    import pandas as pd
    from sklearn.decomposition import PCA

    df = pd.DataFrame({"a": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})
    expanded_df = pd.DataFrame(df.a.tolist())  # one new column per list element
    expanded_df
    #       0     1     2
    # 0  0.01  0.02  0.03
    # 1  0.04  0.40  0.10

    pca = PCA(n_components=2)
    reduced = pca.fit_transform(expanded_df)
    reduced
    # array([[ 1.93778224e-01,  1.43048962e-17],
    #        [-1.93778224e-01,  1.43048962e-17]])
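
A minimal sketch applying the same expansion to the question's own frame (assuming the single column is named vector, as in the data.info() output above, and that every list really has length 300):

    import pandas as pd
    from sklearn.decomposition import PCA

    data = pd.read_parquet('1.parquet', engine='fastparquet')
    expanded = pd.DataFrame(data['vector'].tolist())  # (45129, 300) numeric frame

    pca = PCA(n_components=2)
    reduced = pca.fit_transform(expanded)             # (45129, 2) array

An equivalent one-liner for building the matrix is np.vstack(data['vector'].to_numpy()), which stacks the lists into a single (45129, 300) float array; both forms work because every list has the same length.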