I would like to preserve the column Unique_id of my dataframe my_df to facilitate further analysis after applying PCA, but the column is lost during dimensionality reduction. Is there a way to keep it?
my_df looks like this:
Unique_id EndValue Peak_val Score f1 f2 f3 f4 f5 f6
0 15 44.5 46.5 377.17 38.366667 17.757471 -0.610802 -1.028477 45.372727 0.150168
1 15 45.0 47.0 1268.37 46.909091 0.090909 -2.846050 6.100000 46.909091 0.090909
2 18 45.0 47.0 373.16 45.030303 0.030303 5.480078 28.031250 45.651685 0.252298
3 18 45.0 47.0 369.68 45.000000 0.000000 0.000000 -3.000000 46.052632 0.052632
4 19 45.0 47.0 1414.97 46.000000 0.000000 0.000000 -3.000000 46.000000 0.000000
I followed the method suggested by this thread to preserve the index of the original data and attach it to my_df after applying PCA.
During PCA, I removed the Unique_id column as it is not a feature. I then tried:
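import pandas as pd
from sklearn.decomposition import PCA

def pca2(data, n, pc_count=None):
    # Project data onto its first n principal components (pc_count is unused)
    return PCA(n_components=n).fit_transform(data)

results_2d = pca2(my_df, 2)
df_temp = pd.DataFrame(results_2d, index=my_df.index)
df_temp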
Output looked like this:
0 1
0 -1.863714e+03 -14.793301
1 -2.754914e+03 -10.330997
2 -1.859704e+03 23.473387
3 -1.856224e+03 5.703049
4 -2.901514e+03 -19.540132
5 -1.786054e+03 17.621220
6 -2.555565e+03 38.636828
7 -1.667134e+03 11.647753
I realise that it's hard to verify that the indexing is correct. Is there a way I can check my results? And is there a better way to preserve the actual Unique_id?
You should use ColumnTransformer:
>>> import pandas as pd
>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.decomposition import PCA
>>> df = pd.DataFrame({'c1': [1, 2, 3, 4],
...                    'c2': [3., 5.5, 8., 10.5],
...                    'c_to_preserve': [-5, -3, 6, 10]})
>>> featurizer = ColumnTransformer([('pca', PCA(n_components=1), ['c1', 'c2']),
...                                 ('preserve', 'passthrough', ['c_to_preserve'])])
>>> featurizer.fit_transform(df)
array([[ 4.03887361, -5. ],
[ 1.3462912 , -3. ],
[-1.3462912 , 6. ],
[-4.03887361, 10. ]])
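The result above is a plain NumPy array, so the column labels are gone. A minimal sketch of wrapping it back into a DataFrame follows; the column name pc1 is made up for illustration, and the column order assumes that ColumnTransformer concatenates the outputs in the order the transformers are listed (PCA output first, passthrough column second).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

df = pd.DataFrame({'c1': [1, 2, 3, 4],
                   'c2': [3., 5.5, 8., 10.5],
                   'c_to_preserve': [-5, -3, 6, 10]})

featurizer = ColumnTransformer([('pca', PCA(n_components=1), ['c1', 'c2']),
                                ('preserve', 'passthrough', ['c_to_preserve'])])

# fit_transform returns an array; rows stay in the same order as df
transformed = featurizer.fit_transform(df)

# 'pc1' is an illustrative label for the single principal component
result = pd.DataFrame(transformed,
                      columns=['pc1', 'c_to_preserve'],
                      index=df.index)
print(result)
```

Note that the preserved column comes back as float, because the array holds a single dtype.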
See the sklearn.compose.ColumnTransformer documentation for more information.
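As for checking that rows still line up, one possible approach (a sketch, not the only way) is to fit PCA on the feature columns only, reattach Unique_id by index, and then re-project a single row to confirm it matches the stored scores at the same position. The data below is just the first five rows and a subset of columns from the question; the names features, scores, pc1 and pc2 are chosen here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# First five rows from the question, with a subset of the feature columns
my_df = pd.DataFrame({
    'Unique_id': [15, 15, 18, 18, 19],
    'EndValue': [44.5, 45.0, 45.0, 45.0, 45.0],
    'Peak_val': [46.5, 47.0, 47.0, 47.0, 47.0],
    'Score': [377.17, 1268.37, 373.16, 369.68, 1414.97],
    'f1': [38.366667, 46.909091, 45.030303, 45.0, 46.0],
})

# Keep Unique_id out of the PCA input
features = my_df.drop(columns='Unique_id')

pca = PCA(n_components=2)
scores = pd.DataFrame(pca.fit_transform(features),
                      columns=['pc1', 'pc2'],
                      index=my_df.index)

# Reattach the identifier; both frames share the original index
result = pd.concat([my_df[['Unique_id']], scores], axis=1)

# Spot check: projecting row k on its own should reproduce scores row k
k = 3
row_again = pca.transform(features.iloc[[k]])[0]
assert np.allclose(row_again, scores.iloc[k].to_numpy())

print(result)
```

The spot check works because PCA transforms each row independently once fitted, so row order is never changed by fit_transform.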