I have a file with many lines that look like this:-
(PTQ<38>_1:0.199472,(AagrBONN<35>*_0:0.247985,(((GBG<27>_0:0.357611,(Vocar<21>_1:0.91073,Klenit<20>_2:0.326442)<26>_1:0.070751)<31>_1:0.044341,(ME<25>_0:0.3226,SM<24>_0:0.318938)<30>_1:0.054235)<33>_1:0.094663,(EFJ<29>_3:0.314696,(((AmTr<15>_8:0.147156,((Zm<5>*_22:0.077246,Os<4>*_13:0.071153)<9>_16:0.173488,(VIT<3>_16:0.086305,(AT<1>*_10:0.182135,Potri<0>*_24:0.095723)<2>_15:0.025874)<8>_15:0.051653)<14>*_13:0.038202)<19>*_10:0.092418,(TnS<13>_7:0.205925,(PILA<7>_10:0.171663,SEGI<6>*_4:0.166892)<12>_7:0.020503)<18>_7:0.040364)<23>_7:0.091012,(Cericv<17>*_12:0.154953,(Azfi<11>_7:0.103752,Sacu<10>_10:0.11457)<16>_8:0.059988)<22>_8:0.123914)<28>*_6:0.036195)<32>*_3:0.024392)<34>_2:0.01915)<37>_2:0.028257,Pp<36>_2:0.235806)<39>_2;
The letters such as PTQ and AagrBONN are abbreviations for species names, and I have a second file that maps each abbreviation to its full name:-
AT Arabidopsis thaliana
AagrBONN Anthoceros angustus
AmTr Amborella trichopoda
Azfi Azolla filiculoides
Cericv Ceratopteris richardii
EFJ Selaginella moellendorffii
GBG Chara braunii
Klenit Klebsormidium flaccidum
ME Mesotaenium endlicherianum
Os Oryza sativa
PILA Pinus lambertiana
PTQ Marchantia polymorpha
Potri Populus trichocarpa
Pp Physcomitrella patens
SEGI Sequoiadendron giganteum
SM Spirogloea muscicola
Sacu Salvinia cucullata
TnS Gnetum montanum
VIT Vitis vinifera
Vocar Volvox carteri
Zm Zea may
For every line in the first file I need to replace the abbreviations with the full names. I tried to go about this with the following approach:-
Using the second file, I made a dict with the abbreviations as keys and the full names as values, and saved it to a pickle file. I then wrote code that loops through the lines in the first file and the keys in the dict, replacing every occurrence of each key (the abbreviation) with its paired value (the full name).
My current code is as follows:-
from os import replace
import pickle
import re

def main():
    with open('species_dict.pkl','rb') as handle:
        the_dict = pickle.load(handle)
    with open('Base_asr.tre','rt') as the_base_file:
        the_base_file_line = the_base_file.readlines()
    for the_line in the_base_file_line:
        for the_key in the_dict:
            x = the_line.replace(the_key,the_dict[the_key])
            print(x)
        #print(x) ??

main()
The dict is just the tsv from the second file in dict format.
The problem is that I have 216 lines in the first file and 20 entries in the dict, so when I print(x) I end up with 216 * 20 lines, each with only one abbreviation substituted. I am trying to find a way to print just the 216 lines, once each, with every abbreviation replaced.
I tried moving the print(x) up to the outer for loop, hoping that x would print once per line after every dict key had been looped over, but that did not work and did something I couldn't really understand.
I know this is happening because of the nature of my nested loop, and I am conceptually lost as to how I would get just 216 proper lines instead of 216 * 20 half-broken lines.
How should I go about this?
In the inner loop you use the original the_line on each iteration, so you never replace all the values. You replace one value, then start with a fresh version of the_line and replace the second value, etc.
If you want to keep the same basic code format, you should overwrite the original the_line in each iteration to capture all the replacements:
the_line = the_line.replace(the_key,the_dict[the_key])
Now, after the first iteration, the_line will be the original with the first replacement applied; the next iteration adds the second replacement, and so on. Then, outside the inner loop, you can print(the_line).
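Put together, the fix might look something like the sketch below. The helper name replace_abbreviations and the two-entry sample dict are made up for illustration; in your script the dict would come from the pickle and the lines from Base_asr.tre.

```python
def replace_abbreviations(line, names):
    """Return `line` with every abbreviation in `names` replaced by its full name."""
    for abbreviation, full_name in names.items():
        # Reassign `line` so each replacement builds on the previous one.
        line = line.replace(abbreviation, full_name)
    return line


# Hypothetical sample data standing in for the pickled dict and the tree file.
names = {"PTQ": "Marchantia polymorpha", "Pp": "Physcomitrella patens"}
lines = ["(PTQ<38>_1:0.199472,Pp<36>_2:0.235806)<39>_2;"]

for line in lines:
    # Printing happens once per line, after all keys have been applied.
    print(replace_abbreviations(line, names))
```

One thing to watch with plain str.replace is that it replaces a key wherever it occurs in the line, including inside text that an earlier replacement has already inserted, so short keys can occasionally hit unintended places.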
FWIW, you could probably process each line with a single regex using something like:
import re

for the_line in the_base_file_line:
    x = re.sub(r'\w+(?=<)', lambda m: the_dict[m.group()], the_line)
    print(x)
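Here is a slightly more defensive, self-contained sketch of the same regex idea. It assumes every species label in the tree is a run of word characters directly followed by "<" (as in your sample line), and it swaps the_dict[...] for dict.get so that a label missing from the dict is left unchanged instead of raising KeyError. The names LABEL and expand_labels, the label XYZ, and the sample data are all hypothetical.

```python
import re

# A run of word characters immediately followed by "<", e.g. "PTQ" in "PTQ<38>".
LABEL = re.compile(r"\w+(?=<)")


def expand_labels(line, names):
    """Replace each label that precedes "<" with its full name, if one is known."""
    # dict.get falls back to the matched text, so unknown labels stay as they are.
    return LABEL.sub(lambda m: names.get(m.group(), m.group()), line)


# Hypothetical sample data; "XYZ" is deliberately absent from the dict.
names = {"AT": "Arabidopsis thaliana", "Potri": "Populus trichocarpa"}
line = "(AT<1>*_10:0.182135,Potri<0>*_24:0.095723,XYZ<9>_1:0.1)<2>_15:0.025874"

print(expand_labels(line, names))
```

Because the pattern only looks at text sitting right before a "<", each line is processed in a single pass and the full names that get inserted are never scanned again for further abbreviations.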