Search code examples
pythonfilecomparison

How can I compare files quicker in Python?


Is there any way to make this script faster? I'm using one file to compare another file to print lines, if second column are equal.

import csv
output =[]
a = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf', 'r')
list1 = a.readlines()
reader1 = a.read()
b = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf', 'r')
list2 = b.readlines()
reader2 = b.read()

f3 = open('/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf', 'w')

for line1 in list1:
        separar = line1.split("\t")
        gene = separar[2]
        for line2 in list2:
        separar2 = line2.split("\t")
                gene2 = separar2[2]
        if gene == gene2:
                        print line1
                        f3.write(line1)

Input example (for both files):

1   14107321    rs187821037 C   T   100 PASS    AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout

1   14107321    rs187821037 C   T   100 PASS    AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout

1   14107321    rs187821037 C   T   100 PASS    AA=C;SNPSOURCE=LOWCOV,EXOME;AN=2184;AVGPOST=0.9996;VT=SNP;THETA=0.0006;RSQ=0.7640;LDAF=0.0006;AC=1;ERATE=0.0003;AF=0.0005;AFR_AF=0.0020;STATUS=sample_dropout

The command line below works equally for same purpose in bash:

awk 'FNR==NR {a[$3]; next} $3 in a' Neandertais.vcf Phase1_missing.vcf > teste.vcf

How can I improve this Python script?


Solution

  • If you store your lines in dictionaries that are keyed by the column that you are interested in, you can easily use Python's built-in set functions (which run at C speed) to find the matching lines. I tested a slightly modified version of this (filenames changed, and changed split('\t') to split() because of stackoverflow formatting) and it seems to work fine:

    import collections
    
    # Use 'rb' to open files
    
    infn1 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Phase1_missing.vcf'
    infn2 = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais.vcf'
    outfn = '/home/lucas/Doutorado/Projeto Eduardo/Exoma Neandertal/Listas_eduardo/Neandertais_and_YRI.vcf'
    
    def readfile(fname):
        '''
        Read in a file and return a dictionary of lines, keyed by the item in the second column
        '''
        results = collections.defaultdict(list)
        # Read in binary mode -- it's quicker
        with open(fname, 'rb') as f:
            for line in f:
                parts = line.split("\t")
                if not parts:
                    continue
                gene = parts[2]
                results[gene].append(line)
        return results
    
    dict1 = readfile(infn1)
    dict2 = readfile(infn2)
    
    with open(outfn, 'wb') as outf:
        # Find keys that appear in both files
        for key in set(dict1) & set(dict2):
            # For these keys, print all the matching
            # lines in the first file
            for line in dict1[key]:
                print(line.rstrip())
                outf.write(line)