I am using PCA in Python to reduce the dimensionality of the data I have. The current data has 768 rows and 10 columns.
I am using the following code to implement PCA:
import numpy as np
import pandas as pd  # missing in the original snippet; pd.read_csv is used below
from sklearn import decomposition

demo_df = pd.read_csv('data.csv')
pca = decomposition.PCA(n_components=4)
comps = pca.fit_transform(demo_df)
np.savetxt('data_reduced.csv', comps, delimiter=',')
According to my understanding, the resultant file should contain 768 rows and 4 columns (because n_components=4).
But the resultant data has n-1 rows, i.e. 767.
Why is one row missing from the data?
Yes, your understanding is correct. But check the shape of demo_df before passing it to PCA: it will already have only 767 rows. PCA is not eliminating any sample from your data.
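You can see this row loss in isolation with a small, made-up example (the file name tiny.csv and its contents are hypothetical, just for illustration):

```python
import pandas as pd

# Write a hypothetical 3-row CSV that has NO header line.
with open('tiny.csv', 'w') as f:
    f.write('1,2\n3,4\n5,6\n')

# With the default header='infer', the first data row is
# consumed as column names, so only 2 rows of data remain.
df = pd.read_csv('tiny.csv')
print(df.shape)  # (2, 2) -- one row "missing"
```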
The difference arises from the use of read_csv(). Have a look at the documentation of pandas.read_csv(): it has a parameter header, described as follows:
header : int or list of ints, default ‘infer’
Row number(s) to use as the column names, and the start of the data. Default behavior is as if set to 0 if no names passed, otherwise None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
By default, it uses the first line of the file as column headings unless those headings are provided explicitly via the names parameter.
So if you don't want the first line of your file to be treated as column headers, pass header=None to read_csv(), like this:
demo_df = pd.read_csv('data.csv', header=None)
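To confirm the fix, here is the same small, hypothetical example as above, this time with header=None, so all three rows survive:

```python
import pandas as pd

# Hypothetical 3-row CSV with no header line, for illustration.
with open('tiny.csv', 'w') as f:
    f.write('1,2\n3,4\n5,6\n')

# header=None tells pandas the file has no header row,
# so every line is read as data.
df = pd.read_csv('tiny.csv', header=None)
print(df.shape)  # (3, 2) -- no row lost
```

With your actual file, demo_df should then report a shape of (768, 10), and the PCA output will have 768 rows as expected.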