
Issue with Scikit-learn data analysis


I am attempting to take a .dat file of about 90,000 lines with two variables (wavelength and intensity) and run scikit-learn's PCA (sklearn.decomposition.PCA) on it.

Here is a small set of that data:

wavelength                intensity
   [um]                 [W/m**2/um/sr]
196.078431372549       1.108370393265022E-003
192.307692307692       1.163428008597600E-003
188.679245283019       1.223639983609668E-003

The code I am using to analyze the data is below:

from sklearn.decomposition import PCA

# `data` holds the two columns loaded from the .dat file
pca = PCA(n_components=2)
pca.fit(data)
print(pca.components_)

This is the error I get when I try to fit two PCA components to one of the data sets:

ValueError: Datatype coercion is not allowed

Any help resolving this would be much appreciated.


Solution

  • I think the problem in your case is the column names, especially [W/m**2/um/sr].

    Also, when using PCA, remember to rescale the input variables to comparable units with StandardScaler.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    
    # Sample rows from the question, with the units kept in the column names
    data = pd.DataFrame({
        'wavelength [um]': [196.078431372549, 192.307692307692, 188.679245283019],
        'intensity [W/m**2/um/sr]': [1.108370393265022E-003, 1.163428008597600E-003, 1.223639983609668E-003],
    })
    
    # Standardize each column to zero mean and unit variance before PCA
    scaler = StandardScaler(with_mean=True, with_std=True)
    pca = PCA(n_components=2)
    pca.fit(scaler.fit_transform(data))
    print(pca.components_)
    

    This worked for me. You may just need to convert the column names to strings:

    data.columns = data.columns.astype(str)
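
If the header rows are indeed the culprit, another option is to skip them on load and hand PCA a plain float array. This is only a minimal sketch: it assumes the file is whitespace-delimited with exactly the two header rows shown in the question. The column names `wavelength_um` and `intensity`, and the in-memory `sample` standing in for the real .dat file, are placeholders of mine rather than anything from the original post.

```python
import io

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the .dat file: the three sample rows from the question.
# For the real file, pass its path to pd.read_csv instead of `sample`.
sample = io.StringIO(
    "wavelength                intensity\n"
    "   [um]                 [W/m**2/um/sr]\n"
    "196.078431372549       1.108370393265022E-003\n"
    "192.307692307692       1.163428008597600E-003\n"
    "188.679245283019       1.223639983609668E-003\n"
)

# Skip both header rows and assign simple string column names (assumed names).
data = pd.read_csv(
    sample,
    sep=r"\s+",
    skiprows=2,
    header=None,
    names=["wavelength_um", "intensity"],
)

# Convert explicitly to a float NumPy array so PCA does not have to coerce types.
X = data.to_numpy(dtype=float)

# Standardize the columns, then fit PCA on the scaled values.
pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(X))
print(pca.components_)
```

Converting to a NumPy array up front also makes it easy to check `X.dtype` and `X.shape` before fitting, which helps narrow down whether the error comes from loading the data or from PCA itself.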