Search code examples
pythonpython-2.7bioinformaticsfastatext-manipulation

Print match and line after match


I have this file containing 82 pairs of IDs:

EmuJ_000063620.1    EgrG_000063620.1    253 253
EmuJ_000065200.1    EgrG_000065200.1    128 128
EmuJ_000081200.1    EgrG_000081200.1    1213    1213
EmuJ_000096200.1    EgrG_000096200.1    295 298
EmuJ_000114700.1    EgrG_000114700.1    153 153
EmuJ_000133800.1    EgrG_000133800.1    153 153
EmuJ_000139900.1    EgrG_000144400.1    2937    2937
EmuJ_000164600.1    EgrG_000164600.1    167 167

and I have two other files with the sequences for EmuJ_* IDs and EgrG_* IDs as follows:

EgrG_sequences.fasta:

>EgrG_000632500.1
MKKKSHRKSPEGNHSLTKAANKDTAKCNEERGRNIGQSNEEENATRSEKDREGDEDRNLREYVISIAQKYYPHLVSCMRQDDDNQASADARGADGANDEEHCPKHCPRLNAQKYYLYSATCNHHCEDSQASCDEEGDGKRLLKQCLLWLTERYYPSLAARIRQCNDDQASSNAHGADETDDGDRRLKQALLLFAKKLYPCVTTCIRHCVADHTSHDARGVDEEVDGEQLLKQCLHSSAQKFYPRLAACVCHCDADHASTETCGALGVGNAERCPQQCPCLCAQQYYVQSATCVHHCDNEQSSPETRGVKEDVDVEQLLKQCLLMFAEKFHPTLAAGIRSCADDESSHVASVEGEDDADKQRLKQYLLLFAQKYYPHLIAYIQKRDDDQSSSSVRDSGEEANEEEERLKQCLLLFAQKLYPRLVAYTGRCDSNQSTSDGCSVDGEEAEKHYLKQSLLLLAQKYYPSLAAYLRQFDDNQSSSDVRSVDEEEAEKRHLKQGLLFFAEKYYPSLATYIRRCDDDQSSSDARVVDEVDDEDRRLKQGLLLLAQKYYPPLANYIRHSQSSFNVCGADEKEDEEHCLNQLPRLCAQEAYIRSSSCSHHCDDDQASNDTLVVDKEEEEKYRLKQGLLLLAQKFYPPLATCIHQCDDQSSHDTRGVDEEEAEEQLLKKCLLMFAEKFYPSLAATIHHSVYDQASFDMRDVDTENDETHCLSLSAENYSTASTTCIHHSDGDQSTSDACGVEEGDVEEQRLKRGLLLLAQKYYPSLAAYICQCDDYQPSSDVCGVGEEDTGEERLKQCLLLFAKKFYPSLASRNSQCGDNLILNDEVVGETVINSDTDTDEVTPVEKSTAVCDEVDEVPFKYVGSPTPLSDVDVDSLEKVIPPNDLTAHSSFQNSLDHSVEGGYPDRAFYIGRHTVESADSTAPLSKSSSTKLYFSNTDEFPTEEEVSSPIAPLSIQRRIRIYLEDLENVRKVSLIPLCKTDKFGNPQEEIIIDSNLDDDTDESKLSSVDVEFTMEQADATPLDLEAQDEDLKNCVAIILKHIWSELMECIRREGLSDVYELSLGDRRIEVPQDDVCLVR*
>EgrG_000006700.1
MTDTKGPDESYFEKEAFSSLPQPVDSPSASATDTDRIPVVAVSLPVSSGSIDVNCNCSCYLIICETKLIIDYQMTRKW*

and so on. The same for EmuJ_sequences.fasta I need to get the sequences for each pair and write one after the other maintaining the order like this:

>EmuJ_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYENQCRLDTEVKRLCSNIQTFNRQVDMWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EgrG_000063620.1
AEPGSGDFDANALRDLANEHQRRVQQKQADLETYELQVLDSVLELTSQLSLNLNEKISKAYDNQCRLDTEVKRLCSNIQTFNCQVDLWNKEILDINSALKELGDAETWSQKLCRDVQIIHDTLQAADK*
>EmuJ_000065200.1
MLCLITPFPSVVPVCVRTCVCMCPCPLLLILYTWSAYLVPFSLPLCLYAHFHIRFLPPFSSLSIPRFLTHSLFLPSYPPLTMLRMKKSLAPCPAERR*
>EgrG_000065200.1
MLCLVTSFPSAVPVCMRTCVCMCSCPLLLILYTWSAYLVPFSLPLCLYTHLHIRFLPPFPSLAIPRFLTHPLFLPTSLYVADKKEPSAMPRRASLRQMLLIVLLQELH*
>EmuJ_000081200.1
MNSLRIFAVVITCLMVVGFSYSIHPTFPSYQSVVWHSSANTGYECRDGICGYRCSNPWCHGFGSILHPQMGVQEMWGSAAHGRHAHSRAMTEFLAKASPEDVTMLIESTPNIDEVITSLDGEAVTILINKLPNLRRVMEELKPQTKMHIVSKLCGKVGSAMEWTEARRNDGSGMWNEYGSGWEGIDAIQDLEAEVIMRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPTPATSVSSMDPLVLRSIILNMPNLNDILMQVDPVYLQSALVHVPGFGAYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQRMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPRAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPTRFANVLLHVKPDYVRYIIEKHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPRVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAATASATEEPTVTVDYANLLRSKIPLIDNVIKMSDPEKVAILRDNLLDVSRILVNLDPTMLRNINSIIFNATKMLNELSVFLVEYPLEYLHKEGKSGVAVNKSEQVGTTGENGVSSIAVEKLQMVLLKIPLFDQFLKWIDQKKLHELLNKIPTLLEVIATANQETLDKINSLLHDAIATMNTAKKLIVTGICRKLAEEGKLRLPRVCPSAST*
>EgrG_000081200.1
MNLLRIFAVVITCLIVVGFGYPTHPTFPSYQTAVWHSSANTGYRCRAGICGYRCSSPWCHGFESALYPQMAMQEMWGSGAHGRHAHSRTMTEFLMKASPEDLTMLIESTPNIDEVITSLDSEAIIILINKLPNLRRVMEKLKPQTKMHIVSKLCDKVGNAMEWAGARRNDGSGMWNEYGSVWEGIDAIQDLEAEMITRCVQDCGYCAHPTMDGGYVFDPIPIKDVAVYDDSMNWQPQLPMPATLVSNMDPHVLRSIILNMPNLDDILMQVDPVHLQSALMYVPGFGTYASSMDAYTLHSMIVGLPYVRDIVASMDARLLQWMIAHIPNIDAILFGGNAVISQPTMPDMPRKAPKAEEPDAKTTEVAGGMSDEANIMDRKFMEYIISTMPNVPARFANVLLHVKPDYVRYIIENHGNLHGLLAKMNAQTLQYVIAHVPKFGVILSNMNRNTLKVVFDKLPNIAKFLADMNPNVVRAIVAKLPSLAKYTPTDPTTTALPTSVTLVPELGTEFSSYAPTASVTEASMVTVDYAHLLRSKIPLIDNVIKMSDPAKVAILRDNLLDVGTTDENGVSSITVEKLQMVLLKIPLFDQFLNWIDSKKLHALLQKIPTLLEVIATANQEALDKINLLLHDAIATMNTAKKLIVTSICRKLAEEGKLRLPRVCPSTST*

And so on.

I wrote a script in bash to do this and it worked like I wanted, it was very simple. Now I'm trying to do the same in Python (which I'm learning), but I'm having a hard time to do the same in a pythonic way. I've tried this, but I've got only the first pair and then it stopped:

rbh=open('rbh_res_eg-not-sec.txt', 'r')
ems=open('em_seq.fasta', 'r')
egs=open('eg_seq.fasta', 'r')

for l in rbh:
    emid=l.split('\t')[0]
    egid=l.split('\t')[1]
    # ids=emid+'\n'+egid 
    # print ids # just to check if split worked
    for lm in ems:      
        if emid in lm:
            print lm.strip()
            print next(ems).strip()
    for lg in egs:
        if egid in lg:
            print lg.strip()
            print next(egs).strip()

I've tried some variations but I've got only the IDs, without the sequences. So, how can I find the ID in the sequence file, print it and the line after it (the line with sequence referring to the ID)? Please, let me know if I explained it clearly.


Solution

  • Iterating over a file moves the file pointer until it reaches the end of the file (the last line), so after the first iteration of your outer loop, the ems and egs files are exhausted.

    The quick&dirty workaround would be to reset the ems and egs pointers to zero at the end of the outer loop, ie:

    for line in rbh:
        # no need to split twice
        parts = line.split("\t") 
        emid, egid = parts[0].strip(), parts[1].strip()
    
        for lm in ems:      
            if emid in lm:
                print lm.strip()
                print next(ems).strip()
        ems.seek(0) # reset the file pointer
    
        for lg in egs:
            if egid in lg:
                print lg.strip()
                print next(egs).strip()
        egs.seek(0) # reset the file pointer
    

    Note that calling next(iterator) while already iterating over iterator will consume one more item of the iterator, as illustrated here:

    >>> it = iter(range(20))
    >>> for x in it:
    ...     print x, next(it)
    ... 
    0 1
    2 3
    4 5
    6 7
    8 9
    10 11
    12 13
    14 15
    16 17
    18 19
    

    As you can see, we don't iter on each element of our range here... Given your file format it should not be a huge problem but I thought I'd still warn you about it.

    Now your algorithm is far from efficient - for each line of the rbh file it will read scan the whole ems and egs files again and again.

    _NB : the following assumes that each emid / egid will appear at most once in the fasta files._

    If your ems and egs files are not too large and you have enough available memory, you could load them into a pair of dicts and do a mere dict lookup (which is O(1) and possibly one of the most optimized operation in Python)

    # warning: totally untested code
    
    def fastamap(path):
        d = dict()
        with open(path) as f:
            for num, line in enumerate(f, 1):
                line = line.strip()
                # skip empty lines.
                if not line:
                    continue
    
                # sanity check: we should only see 
                # lines starting with ">", the "value"
                # lines being consumed by the `next(f)` call
                if not line.startswith(">"):
                    raise ValueError(
                        "in file %s: line %s doesn't start with '>'" % (
                          path, num
                        ))
    
                # ok, proceed
                d[line.lstrip(">")] = next(f).strip()
    
        return d
    
    ems = fastamap('em_seq.fasta') 
    egs = fastamap('eg_seq.fasta')
    with open('rbh_res_eg-not-sec.txt') as rhb:
        for line in rhb:
            parts = line.split("\t") 
            emid, egid = parts[0].strip(), parts[1].strip()
    
            if emid in ems:
                print emid
                print ems[emid]
    
            if egid in egs:
                print egid
                print egs[egid]
    

    If this doesn't fly because of memory issues, well bad luck you're stuck with sequential scan (unless you want to use some database system but this might be a bit overkill), but - always assuming the emid/egid each only appears once in the fasta files - you can at least exit the inner loops once you've find your target:

    for l in rbh:
        # no need to split twice, you can just unpack
        emid, egid = l.split('\t')
    
        for lm in ems:      
            if emid in lm:
                print lm.strip()
                print next(ems).strip()
                break # no need to go further
    
        ems.seek(0) # reset the file pointer
    
        # etc...