Search code examples
sedwhile-loopfasta

while loop with sed. Substitute matching pattern in one file with text from another


I would like to add a string from a text file (file1) to a second text file (file2). The strings from file1 should be be added sequentially to file2 after every greater than symbol >. There are 9 greater than symbols in file2 and 9 strings in file1. File1 contains 9 different strings on lines 1-9, in column 1. Like this:

...
sctC_
sctJ_
sctV_
...

This is the while loop with sed I have tried to add the string into file2:

while IFS=$'\t' read  f1 f2  ; do sed "s/^>/&$f2/" ; done < <(paste  file2 file1)

However, only the first string gets added into file2 and the first line is stripped from file2:

MRNVLYAFLLTLYRGFCWSTVLLGMLPMAHAVTPPEWNKGAYAYSAEQTLLSTILIDFANSHGVELVMDN  sctJ_
LKDTLVEAKIRAETPAAFLDRLALEHRFQWFVYNHTLYVSSQDTQASIRLEISPDAAPDLKQALSGIGLL  sctV_
DPRFGWGELPEEGVVLVTGPQTYIDLIRNFSQQREKQDERRKVMIFPLRFASVSDRTLQYRDQRIVIPGV  sctN_
ATILSELMDGQRPPPTGASGPTDAVPDSAMEAMRENTRAMLTRLATRNNPARSTDENGRLVLNGRISADV  sctQ_
RNNALLVRDDEKRREEYQQLVEQIDVPQNLVNIDAIILDVDRTALSRLEANWQGTLGNVSAGSTMMMGRS  sctR_
TLFVSDFKRFFADIQALEGEGTASIVANPSVLTLENQPAIVDFSRTAFITATGERVAQIQPITAGTSLQV  sctS_
TPRVVGQDGPRSIQLVIDIEDGRVETGRDGEATGVKRGTVSTQALIGENRALVLGGFHVEESGDRDHRIP  sctT_
LLGDIPWLGRLFTSTRHEVSRRERLFILTPHLIGDQTDPTRYVSAENRHQINDVMNRVSQRNGKHDLYSL  sctU_
VENALRDLAGKQLPAGFQSETRGTRLSEVCRSQPGLVYDSNRYQWYGNGSIRLTVGVVRNSGTRIQRFDE  
SVCGSNRTLAVAAWPKTTLAPGESTEVFLALQTLSSTAPPRRSLLASY    
>sctC_12a_02741 hypothetical protein    
MKTDLRALFLLLSLLLMGCGDPIELNRGLSENDANEVIAALGRYQIAAEKRVDKTGVTLIIDAKNMERAV  
NILNAAGLPRQSRTNLGEVFQKSGVISTPLEERARYIYALSQEVEATLTQIDGVLVARVHVVLPERIAPG  
EPVQPASAAVFIKYQPELEPDSVEPRIRRMVASSIPGLSGKNDKDLSIVFVPAEPYQDTIPVVTLGPFTL  
TPQEMVRWQWTAGLMGALIIGLLAWRLGKPYMRQWQQNRADARQQR  
>sctC_12a_02750 Invasion protein InvA   
MNLVIIWLNRIALSAMQRSEVVGAVIVMSIVFMMIIPLPTSLIDVLIAFNICVSSLLIVLAMYLPKPLAF  
STFPAVLLLTTMFRLALSISTTRQILLQQDGGHIVEAFGNYVVGGNLAVGLVIFLILTVVNFLVITKGSE  
RVAEVAARFTLDAMPGKQMSIDSDLRAGLIEAHQARQRRDNLAKESQLFGAMDGAMKFVKGDAIAGLVIV  
FINMIGGFAIGVLQHGMSAADAMHVYSVLTIGDGLIAQIPALLISLTAGMIITRVSAEGQPLDANIGREI  
AEQLTSQPKAWIISALGMFGFALLPGMPSMVFMVISLASFSSGVFQLWRIKQQGILTHSQAEADNQPAEQ  
NGHQDLRRFNPTRAYLLQFHPSMQGNPATLSLVQHIRRLRNRLVYQFGMTLPSFDIEFSDRLDEDEFQFG  
VYEIPYVKATFVTERLAVHRSSFDQGELEDAIAGSTLRDEADWLWVSPMHPLLEQETCPRWAAGELILMR  
MENAIHRSGAQFIGLQETKSILTWLESEQPELAQELQRIMPLSRFAGVLQRLASERIPLRSVRPIAEALI  
EIGQHERDVHALTDYVRLALKAQICHQYSQQNTLHVWLLTPETEELLRDSLRQTQNETFFALTQDYAATL  
LGQLRRAFPPSLPSTGQILVAQDLRTPLRVLLQEEFHHVPVLSFSELESHLSINVLGRFDLYEENTPFSA  
>sctC_12a_02752 Type III secretion ATP synthase HrcN    
MQTQAAIDFPLMTRWFQQQRRRLSDFAPVDLKGRIIGISGILLECSLPRARIGDLCLVERQDGSQVMAEV  
VGFSPRNTFLSALGALDGIAQGAAVAPLYQPHCIQVSDRLFGSVLDGFGRALEDGGESAFVQPGELHGNA  
QPVLGDAPPPTARPRIATPLPTGLRAIDGLLTLGQGQRVGIFAGAGCGKTTLLAELARNTPCDAIVFGLI  
GERGRELREFLDHELDDDLRRRTVLVCSTSDRSSMERARAAFTATAIAEAYRAAGKQVLLIIDSLTRFAR  
AQREIGLALGEPQGRGGLPPSVYTLLPRLVERAGQTQTGAITALYSVLIEQDSMNDPVADEVRSLIDGHI  
VLTRRLAEQGHYPAIDVLASLSRTMSNVVDDGHNRHAGAVRRLMAAYKQVEMLIRLGEYQSGHDALTDSA  
VNAQQDITRFLRQAMRDPMAYDDIQQQLAEVSAHAP    

How can I get the string from file1 added recursively after the greater than symbol on file2?

Thanks,

JD


Solution

  • I'm not sure I understand exactly your requirements, but Perl should handle this easily. Read the first file into an array, then iterate over the second one and use the array to add the missing information.

    perl -we 'push @s, scalar <> until eof;
              chomp @s;
              s/(?<=^>)/shift @s/e, print while <>;
             ' file1 file2
    
    • <> is a shorter version of readline, it reads a line from file in scalar context.
    • eof returns true when a file is exhausted.
    • chomp removes trailing newlines from the array.
    • (?<=...) is a lookbehind, in this case, it matches after the > at a beginning of a line
    • the /e modifier to the substitution operator s/// evaluates the replacement as code, shift extracts the first element from the array @s