python, python-3.x, scikit-learn, ascii, pca

Python: PCA issue with data analysis


I am attempting to do some data analysis with PCA from the sklearn package. The issue I'm running into is that my code fails when it tries to analyse the data.

An example of some of the data is as follows

wavelength          intensity
; [um]              [W/m**2/um/sr]
196.078431372549    1.108370393265022E-003
192.307692307692    1.163428008597600E-003
188.679245283019    1.223639983609668E-003

The code written so far is as follows:

from astropy.io import ascii
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler(with_mean=True, with_std=True)  # scales the data

# read the data file
data_crescent = ascii.read('earth_crescent.dat', data_start=4958, data_end=13300, delimiter=' ')

# pull each variable out of its column in the data file
y_intensity_crescent = data_crescent['col2'][:]
x_wave_crescent = data_crescent['col1'][:]

# standardize the intensity variable
standard_y_crescent = StandardScaler().fit_transform(y_intensity_crescent)

# run PCA on the data
pca = PCA(n_components=2)
principalCrescentY = pca.fit_transform(standard_y_crescent)
principalDfcrescent = pd.DataFrame(data=principalCrescentY,
                                   columns=['principal component 1', 'principal component 2'])

finalDfcrescent = pd.concat([principalDfcrescent, [y_intensity_crescent]], axis=1)

When run, the code produces this error:

    ValueError: Expected 2D array, got 1D array instead:

Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample

As I understand it, the data needs to be reshaped into a 2D array before PCA can analyze it and produce the expected results. Any workaround would be much appreciated!


Solution

  • The problem is that you are giving only one feature, y_intensity_crescent, to your pca object when you call principalCrescentY=pca.fit_transform(standard_y_crescent). In other words, you are passing a single dimension to the PCA algorithm. The error itself appears because scikit-learn estimators expect a 2D array of shape (n_samples, n_features), and a single column is 1D. Roughly: principal component analysis takes several features and combines them into components, each of which is a combination of those features. If you want 2 components, you need at least 2 features.

    Here is an example of how to use it properly: PCA tutorial using sklearn
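    As a rough illustration of the point above, here is a minimal sketch that builds a 2D array with two features and runs PCA on it. It uses synthetic numbers in place of earth_crescent.dat, and the names x_wave, y_intensity and X are made up for the example. Treating wavelength and intensity as the two features is only an assumption; whether that is a meaningful PCA input depends on what you want to learn from the data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two columns of the data file (assumed values).
rng = np.random.default_rng(0)
x_wave = np.linspace(100.0, 200.0, 50)               # wavelength [um]
y_intensity = 1e-3 + 1e-5 * rng.standard_normal(50)  # intensity [W/m**2/um/sr]

# reshape(-1, 1) turns the 1D column into a 2D array with one feature.
# That satisfies the shape check, but one feature allows at most one component.
single_feature = y_intensity.reshape(-1, 1)          # shape (50, 1)

# Stack both columns so each row is one sample with two features.
X = np.column_stack([x_wave, y_intensity])           # shape (50, 2)

# Standardize each feature, then project onto two principal components.
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_std)                # shape (50, 2)

print(components.shape)
print(pca.explained_variance_ratio_)
```

    With only the intensity column, reshape(-1, 1) removes the "Expected 2D array" error, but PCA(n_components=2) would still fail because the number of components cannot exceed the number of features.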