I have 2 (pretty big, ~15k lines) csv tables in the following formats:
Disease/Trait Mapped_gene p-Value
Wegener's granulomatosis HLA-DPB1 2.00E-50
Wegener's granulomatosis TENM3 - DCTD 2.00E-06
Brugada syndrome SCN5A 1.00E-14
Brugada syndrome SCN10A 1.00E-68
Brugada syndrome HEY2 - NCOA7 5.00E-17
Major depressive disorder IRF8 - FENDRR 3.00E-07
Identifier Homologues Symbol
CG11621 5286 HEY2
CG11621 5287 IRF8
CG11621 5287 PIK3C2B
CG11621 5288 PIK3C2G
CG11621 5288 PIK3C2G
CG11949 2035 DCTD
CG11949 2035 EPB41
CG11949 2036 EPB41L1
CG11949 2037 EPB41L2
I'd like to use Python to compare the tables such that if any of the "Symbol" column from table 2 match "Mapped_gene" from table 1, the matching rows from each table can be merged together and put into an output file.
I've tried using the Pandas plugin, but couldn't make it work. Has anyone got any better ideas?
Thanks.
This should work as you want:
import csv
diseases = {}
# Load the disease file in memory
with csv.reader(open('table1.csv', 'rb')) as dfile:
# Skip the header
dfile.next()
for disease, gene, pvalue in dfile:
diseases[gene] = (disease, pvalue)
with csv.reader(open('table2.csv', 'rb')) as idfile, csv.writer(open('output.csv', 'wb')) as output:
# Skip the header
idfile.next()
for ident, homologue, symbol in idfile:
if symbol in diseases:
output.writerow((ident, homologue, symbol) + diseases[symbol])
It assumes that every gene name under Mapped_gene
is unique, though. It can be easily extended to cope with duplicates, otherwise.