Tags: python, python-2.7, bioinformatics, text-manipulation

Python - Comparing files delimiting characters in line


Hi there. I'm a beginner in Python and I'm struggling to do the following:

I have a file like this (10k+ lines):

EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"
EgrG_000095800 /product="DNA polymerase epsilon subunit 3"
EgrG_000095850 /product="crossover junction endonuclease EME1"
EgrG_000095900 /product="lysine specific histone demethylase 1A"
EgrG_000096000 /product="charged multivesicular body protein 6"
EgrG_000096100 /product="NADH ubiquinone oxidoreductase subunit 10"

and this one (600+ lines):

EgrG_000076200.1
EgrG_000131300.1
EgrG_000524000.1
EgrG_000733100.1
EgrG_000781600.1
EgrG_000094950.1

All the IDs in the second file also appear in the first one, so I want to extract the lines of the first file that correspond to the IDs in the second one.

I wrote the following script:

f1 = open('egranulosus_v3_2014_05_27.tsv').readlines()
f2 = open('eg_es_final_ids').readlines()
fr = open('res.tsv','w')

for line in f1:
     if line[0:14] == f2[0:14]:
        fr.write('%s'%(line))

fr.close()
print "Done!"

My idea was to compare the leading characters of each line (the EgrG_XXXX ID) in one file against those in the other, and then write the matching lines to a new file. I tried several modifications; this is just the core of my idea. I got nothing. With one of the modifications, I got just one line.


Solution
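
The comparison in the question most likely fails because f2 is a list of lines, so f2[0:14] is a slice of that list (its first 14 lines) rather than the first 14 characters of an ID. A string never compares equal to a list, so nothing gets written. A small sketch of the difference, using two ID lines made up for illustration:

```python
# One line from the first file and a short, made-up stand-in for f2.
line = 'EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"\n'
f2 = ['EgrG_000095700.1\n', 'EgrG_000131300.1\n']

# Slicing the list yields a list, and a str is never equal to a list.
str_vs_list = line[0:14] == f2[0:14]

# Slicing one element of the list compares ID text against ID text.
str_vs_str = line[0:14] == f2[0][0:14]

print(str_vs_list)
print(str_vs_str)
```

Comparing every line against every ID would fix the comparison but is slow for large files. Indexing the first file by ID, as in the code that follows, avoids the nested loop.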

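# Index every line of the annotation file by its ID (the first field).
# Adjust the file names to match your own files.
line_storage = {}
with open('egranulosus_v3_2014_05_27.txt', 'r') as infile:
    for line in infile:
        data = line.split()
        if not data:
            continue  # skip blank lines
        line_storage[data[0]] = line.rstrip('\n')

# Look up each wanted ID, dropping the ".1" suffix first.
with open('eg_es_final_ids.txt', 'r') as infile, open('my_output.txt', 'w') as outfile:
    for line in infile:
        lookup_key = line.strip().split('.')[0]
        match = line_storage.get(lookup_key)
        if match is not None:  # skip IDs missing from the annotation file
            outfile.write(match + '\n')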
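
An alternative sketch, not part of the original answer: if the output should keep the order of the first file, the wanted IDs can be loaded into a set and the annotation lines filtered one by one. The function name filter_by_ids and the in-memory sample lists are assumptions made for illustration, and the two ID lines are invented so that they match the sample annotations. With real files, the open file objects can be passed in place of the lists.

```python
def filter_by_ids(annotation_lines, id_lines):
    """Return the annotation lines whose first field is a wanted ID."""
    # Drop the ".1" suffix so "EgrG_000095700.1" becomes "EgrG_000095700".
    wanted = set(line.strip().split('.')[0] for line in id_lines if line.strip())
    kept = []
    for line in annotation_lines:
        fields = line.split()
        # Skip blank lines; keep lines whose ID is in the wanted set.
        if fields and fields[0] in wanted:
            kept.append(line.rstrip('\n'))
    return kept


# Sample data: annotation lines from the question, ID lines made up to match.
annotations = [
    'EgrG_000095700 /product="ubiquitin carboxyl terminal hydrolase 5"\n',
    'EgrG_000095800 /product="DNA polymerase epsilon subunit 3"\n',
    'EgrG_000095850 /product="crossover junction endonuclease EME1"\n',
]
ids = ['EgrG_000095850.1\n', 'EgrG_000095700.1\n']

matches = filter_by_ids(annotations, ids)
for match in matches:
    print(match)
```

Set membership tests take roughly constant time, so the cost grows with the size of the two files rather than their product. IDs that do not occur in the annotation lines are simply skipped.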