I have a csv file in excel which contains the output from a BLAST search in the following format:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 1 hits found
Cryptocephalus ctg7180000094003 79.59 637 110 9 38 655 1300 1935 1.00E-125 444
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 4 hits found
Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051
Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780
Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532
Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340
I have imported this as a csv into python using
with open('BLASToutput.csv', 'rU') as csvfile:
contents = csv.reader(csvfile, delimiter=' ', quotechar='|')
for row in contents:
table = ', '.join(row)
What I now want to be able to do is extract columns of data as a list. My overall aim is to count all the matches which have over 98% identity (the third column).
The issue is that, since this is not in the typical csv format, there are no headers at the top so I cant extract a column based on its header. I was thinking if I could extract the third column as a list I can then use normal list tools in python to extract just the numbers I want but I have never used pythons csv module and I'm struggling to find an appropriate command. Other questions on SO are similar but dont refer to my specific case where there are no headers and empty cells. If you could help me I would be very grateful!
I managed to find one way based on:
Python: split files using mutliple split delimiters
import csv
csvfile = open("SANDoubleSuperMatrix.csv", "rU")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
identity = []
for line in reader:
identity.append(line[2])
print identity