
Performing PCA on a dataframe in Python with sklearn


I have a sample input file with many rows, one per variant, and columns that represent the components.

A01_01  A01_02  A01_03  A01_04  A01_05  A01_06  A01_07  A01_08  A01_09  A01_10 A01_11   A01_12  A01_13  A01_14  A01_15  A01_16  A01_17  A01_18  A01_19  A01_20  A01_21  A01_22  A01_23  A01_24  A01_25  A01_26  A01_27  A01_28  A01_29  A01_30  A01_31  A01_32  A01_33  A01_34  A01_35  A01_36  A01_37  A01_38  A01_39  A01_40  A01_41  A01_42  A01_43  A01_44  A01_45  A01_46  A01_47  A01_48  A01_49  A01_50  A01_51  A01_52  A01_53  A01_54  A01_55  A01_56  A01_57  A01_58  A01_59  A01_60  A01_61  A01_62  A01_63  A01_64  A01_65  A01_66  A01_67  A01_69  A01_70  A01_71
0   1   0   0   1   1   1   1   1   0   0   0   0   0   0   0   1   1   0   1   1   1   0   1   0   1   0   0   1   0   1   0   0   0   0   0   0   1   1   1   0   1   0   0   0   0   1   0   1   1   0   1   1   0   0   1   1   1   1   1   1   1   1   0   0   1   0   0   0   1
0   1   0   0   1   1   1   1   1   0   0   0   0   0   0   0   1   1   0   1   1   1   0   1   0   1   0   0   1   0   1   0   0   0   0   0   0   1   1   1   0   1   0   0   0   0   1   0   1   1   0   1   1   0   0   1   1   1   1   1   1   1   1   0   0   1   0   0   0   1
0   1   0   0   1   1   1   1   1   0   0   0   0   0   0   0   1   1   0   1   1   1   0   1   0   1   0   0   1   0   1   0   0   0   0   0   0   1   1   1   0   1   0   0   0   0   1   0   1   1   0   1   1   0   0   1   1   1   1   1   1   1   1   0   0   1   0   0   0   1
0   1   0   0   1   1   1   1   1   0   0   0   0   0   0   0   1   1   0   1   1   1   0   1   0   1   0   0   1   0   1   0   0   0   0   0   0   1   1   1   0   1   0   0   0   0   1   0   1   1   0   1   1   0   0   1   1   1   1   1   1   1   1   0   0   1   0   0   0   1 
0   1   0   0   1   1   1   1   1   0   0   0   0   0   0   0   1   1   0   1   1   1   0   1   0   1   0   0   1   0   1   0   0   0   0   0   0   1   1   1   0   1   0   0   0   0   1   0   1   1   0   1   1   0   0   1   1   1   1   1   1   1   1   0   0   1   0   0   0   1
0   1   0   0   1   1   1   1   1   0   0   0   0   0   0   0   1   1   0   1   1   1   0   1   0   1   0   0   1   0   1   0   0   0   0   0   0   1   1   1   0   1   0   0   0   0   1   0   1   1   0   1   1   0   0   1   1   1   1   1   1   1   1   0   0   1   0   0   0   1

I first import this .txt file as:

#!/usr/bin/env python
from sklearn.decomposition import PCA

inputfile=vcf=open('sample_input_file', 'r')

I would like to perform principal component analysis and plot the first two components (meaning the first two columns).

I am not sure if this is the way to go about it after reading about sklearn.

PCA for two components:

pca = PCA(n_components=2)
pca.fit(inputfile) #not sure how this read in this file

Therefore, I need help importing my input file as a dataframe so that I can perform PCA on it in Python.


Solution

  • sklearn works with numpy arrays.

    So you want to use numpy.loadtxt:

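    import numpy
    from sklearn.decomposition import PCA

    # skiprows=1 skips the header line of column names
    data = numpy.loadtxt('sample_input_file', skiprows=1)
    pca = PCA(n_components=2)
    pca.fit(data)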
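
    Since the question asks for a dataframe and a plot of the first two components, here is a minimal sketch of one way that could look. It assumes the file is whitespace-delimited with a single header row, which is why it uses pandas.read_csv with sep=r"\s+". The function names project_to_two_components and plot_first_two are hypothetical, and the small table in the demo is made up for illustration; it is not the asker's data. Note that the first two principal components are the two columns of the transformed output, not the first two columns of the input.

```python
import io

import pandas as pd
from sklearn.decomposition import PCA


def project_to_two_components(source):
    """Read a whitespace-delimited table with a header row and return
    (DataFrame, fitted PCA, projected array of shape (n_rows, 2))."""
    # sep=r"\s+" splits on any run of spaces or tabs
    df = pd.read_csv(source, sep=r"\s+")
    pca = PCA(n_components=2)
    # fit_transform learns the components and projects every row onto them
    projected = pca.fit_transform(df.to_numpy(dtype=float))
    return df, pca, projected


def plot_first_two(projected):
    """Scatter plot of the first component against the second (needs matplotlib)."""
    import matplotlib.pyplot as plt

    plt.scatter(projected[:, 0], projected[:, 1])
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()


# Made-up stand-in for 'sample_input_file' (4 rows, 3 columns), illustration only
demo = io.StringIO(
    "A01_01 A01_02 A01_03\n"
    "0 1 0\n"
    "1 1 0\n"
    "0 0 1\n"
    "1 0 1\n"
)
df, pca, projected = project_to_two_components(demo)
print(projected.shape)  # (4, 2)
```

    With the real file, project_to_two_components('sample_input_file') followed by plot_first_two(projected) should give the scatter plot. If every row is identical, as in the six rows shown in the question, there is no variance to capture and all points collapse onto a single spot.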