I have a sample input file that has many rows of all variants, and columns represent the number of components.
A01_01 A01_02 A01_03 A01_04 A01_05 A01_06 A01_07 A01_08 A01_09 A01_10 A01_11 A01_12 A01_13 A01_14 A01_15 A01_16 A01_17 A01_18 A01_19 A01_20 A01_21 A01_22 A01_23 A01_24 A01_25 A01_26 A01_27 A01_28 A01_29 A01_30 A01_31 A01_32 A01_33 A01_34 A01_35 A01_36 A01_37 A01_38 A01_39 A01_40 A01_41 A01_42 A01_43 A01_44 A01_45 A01_46 A01_47 A01_48 A01_49 A01_50 A01_51 A01_52 A01_53 A01_54 A01_55 A01_56 A01_57 A01_58 A01_59 A01_60 A01_61 A01_62 A01_63 A01_64 A01_65 A01_66 A01_67 A01_69 A01_70 A01_71
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
0 1 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 0 1 1 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 0 1 0 0 0 1
I first import this .txt file as:
#!/usr/bin/env python
from sklearn.decomposition import PCA
inputfile=vcf=open('sample_input_file', 'r')
I would like to performing principal component analysis and plotting the first two components (meaning the first two columns)
I am not sure if this the way to go about it after reading about
sklearn
PCA for two components:
pca = PCA(n_components=2)
pca.fit(inputfile) #not sure how this read in this file
Therefore, I need help importing my input file as a dataframe for Python to perform PCA on it
sklearn
works with numpy arrays.
So you want to use numpy.loadtxt
:
data = numpy.loadtxt('sample_input_file', skiprows=1)
pca = PCA(n_components=2)
pca.fit(data)