Tags: python, logic, nested-loops

How do I not print the unwanted products of a nested loop?


I have a file with many lines that look like this:-

(PTQ<38>_1:0.199472,(AagrBONN<35>*_0:0.247985,(((GBG<27>_0:0.357611,(Vocar<21>_1:0.91073,Klenit<20>_2:0.326442)<26>_1:0.070751)<31>_1:0.044341,(ME<25>_0:0.3226,SM<24>_0:0.318938)<30>_1:0.054235)<33>_1:0.094663,(EFJ<29>_3:0.314696,(((AmTr<15>_8:0.147156,((Zm<5>*_22:0.077246,Os<4>*_13:0.071153)<9>_16:0.173488,(VIT<3>_16:0.086305,(AT<1>*_10:0.182135,Potri<0>*_24:0.095723)<2>_15:0.025874)<8>_15:0.051653)<14>*_13:0.038202)<19>*_10:0.092418,(TnS<13>_7:0.205925,(PILA<7>_10:0.171663,SEGI<6>*_4:0.166892)<12>_7:0.020503)<18>_7:0.040364)<23>_7:0.091012,(Cericv<17>*_12:0.154953,(Azfi<11>_7:0.103752,Sacu<10>_10:0.11457)<16>_8:0.059988)<22>_8:0.123914)<28>*_6:0.036195)<32>*_3:0.024392)<34>_2:0.01915)<37>_2:0.028257,Pp<36>_2:0.235806)<39>_2;

The letters like PTQ and AagrBONN are abbreviations for species names, and I have a second file that gives the full name for each abbreviation:-

AT  Arabidopsis thaliana
AagrBONN    Anthoceros angustus
AmTr    Amborella trichopoda
Azfi    Azolla filiculoides
Cericv  Ceratopteris richardii
EFJ Selaginella moellendorffii
GBG Chara braunii
Klenit  Klebsormidium flaccidum
ME  Mesotaenium endlicherianum
Os  Oryza sativa
PILA    Pinus lambertiana
PTQ Marchantia polymorpha
Potri   Populus trichocarpa
Pp  Physcomitrella patens
SEGI    Sequoiadendron giganteum
SM  Spirogloea muscicola
Sacu    Salvinia cucullata
TnS Gnetum montanum
VIT Vitis vinifera
Vocar   Volvox carteri
Zm  Zea mays

For every line in the first file I need to replace the abbreviations with the full names. I tried to go about this with the following approach:-

Using the second file, I made a dict with the abbreviations as the keys and the full names as the values, and saved this dict to a pickle file. I then wrote code that loops through the lines in the first file and the keys in the dict, replacing every occurrence of each key (the abbreviation) with its paired value (the full name).

My current code is as follows:-

from os import replace
import pickle
import re

def main():

    
    with open('species_dict.pkl','rb') as handle:
        the_dict = pickle.load(handle)

    with open('Base_asr.tre','rt') as the_base_file:
        the_base_file_line = the_base_file.readlines()

    for the_line in the_base_file_line:
        for the_key in the_dict:
            x = the_line.replace(the_key,the_dict[the_key])
            print(x)
        #print(x) ??    

main()

The dict is just the tsv from the second file in dict format.

The problem is that I have 216 lines in the first file and 20 entries in the dict, so print(x) produces 216 * 20 lines, and each of those lines has only one abbreviation substituted. I am trying to find a way to print just the 216 lines, each with all of its abbreviations replaced.

I tried moving the print(x) out to the first for loop, hoping that x would be printed once per line after every dict key had been looped over, but that did not work and did something I couldn't really understand.

I know this is happening because of the nature of my nested loop, and I am conceptually lost as to how I would get just 216 proper lines instead of 216 * 20 half-broken lines.

How should I go about this?


Solution

  • In the inner loop you use the original the_line on each iteration, so you never replace all the values. You replace one value, then start with a fresh version of the_line and replace the second value, etc.

    If you want to keep the same basic code format, you should overwrite the original the_line in each iteration to capture all the replacements:

    the_line = the_line.replace(the_key,the_dict[the_key])
    

    Now, after the first iteration, the_line holds the original with the first replacement applied; the next iteration adds the second replacement to that, and so on.

    Then, outside the inner loop, you can print(the_line).
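
    Put together, the loop-based fix might look like the sketch below. The helper name expand_abbreviations and the two-entry dict are stand-ins made up for illustration; in the real script the dict would come from species_dict.pkl and the lines from Base_asr.tre.

```python
def expand_abbreviations(line, names):
    """Replace every abbreviation in `line`, carrying each result forward."""
    for abbreviation, full_name in names.items():
        # Reassign `line` so the next replacement builds on the previous one.
        line = line.replace(abbreviation, full_name)
    return line


# Tiny stand-in for the pickled dict and the tree file (illustrative only).
names = {"Zm": "Zea mays", "Os": "Oryza sativa"}
lines = ["(Zm<5>*_22:0.077246,Os<4>*_13:0.071153)<9>_16;"]

for line in lines:
    # One print per input line, after all replacements have been applied.
    print(expand_abbreviations(line, names))
```

    Keep in mind that str.replace matches an abbreviation anywhere in the line, including inside a longer abbreviation or inside a full name that was already substituted, so very short keys can cause accidental replacements.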

    FWIW, you could probably process each line with a single regex using something like:

    import re
    for the_line in the_base_file_line:
        x = re.sub(r'\w+(?=<)', lambda m: the_dict[m.group()], the_line)
        print(x)
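
    One thing to watch with the regex version: the_dict[m.group()] raises KeyError if a label in the tree is missing from the dict. A slightly more defensive sketch, assuming you would rather leave unknown labels untouched, could use dict.get with the matched text as the fallback. The names LABEL and expand_labels and the sample data are made up for illustration.

```python
import re

# Match a run of word characters only when it is immediately followed by "<".
LABEL = re.compile(r"\w+(?=<)")


def expand_labels(line, names):
    """Swap each label that precedes "<" for its full name, if one is known."""
    # dict.get falls back to the matched text, so unknown labels stay as-is
    # instead of raising KeyError.
    return LABEL.sub(lambda m: names.get(m.group(), m.group()), line)


# Tiny stand-in for the pickled dict and one tree fragment (illustrative only).
names = {"AT": "Arabidopsis thaliana", "Potri": "Populus trichocarpa"}
line = "(AT<1>*_10:0.182135,Potri<0>*_24:0.095723)<2>_15:0.025874"
print(expand_labels(line, names))
```

    Because the pattern only matches text directly before a "<", the internal node labels such as )<2> are skipped, and each abbreviation is matched as a whole word rather than as a substring.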