
Finding and substituting a set of codes/words in a file based on a list of old and corrected ones


I have a FASTA_16S.txt file containing paragraphs of different lengths, each with a unique code (e.g. 16S317) at the top. After reading it into R, I have a list with 413 elements that looks like this:

[1]">16S317_V._rotiferianus_A\n
AAATTGAAGAGTTTGATCATGGCTCAG..."
[2]">16S318_Salmonella_bongori\n
AAATTGAAGAGTTTGATCATGGCTCAGATT..."
[3]">16S319_Escherichia_coli\n
TTGAAGAGTTTGATCATGGCTCAGATTG..."

I need to substitute the existing codes with the new ones from a table Code_16S:

     Old    New
 1. 16S317 16S001
 2. 16S318 16S307 
 3. 16S319 16S211
 4.  ...    ...

Can anybody suggest code that finds each old code and replaces it with the new one? Note that some codes appear in both the Old and New columns, so applying gsub or replace directly to the whole list did not work: after one substitution, two paragraphs carry the same code, and a later step then changes both of them.
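To make the collision concrete, here is a minimal base R sketch. The two-row table is hypothetical (it is not the real Code_16S); it is built so that a New code is also another row's Old code, and the headers are shortened to the first line only.

```r
# Hypothetical lookup table: "16S318" is both a New code and an Old code
codes <- data.frame(Old = c("16S317", "16S318"),
                    New = c("16S318", "16S001"),
                    stringsAsFactors = FALSE)
headers <- c(">16S317_V._rotiferianus_A", ">16S318_Salmonella_bongori")

# Naive approach: apply each substitution to the whole vector, one row at a time
naive <- headers
for (i in seq_len(nrow(codes))) {
  naive <- gsub(codes$Old[i], codes$New[i], naive, fixed = TRUE)
}
naive
# [1] ">16S001_V._rotiferianus_A"  ">16S001_Salmonella_bongori"
```

After the first pass both headers carry 16S318, so the second pass rewrites both of them to 16S001 and the two entries can no longer be told apart.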

My own solution is below, but I don't consider it optimal.


Solution

  • Instead of looping with lapply, it may be easier to use str_replace_all with a named vector whose names are the old codes and whose values are the new ones

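    library(stringr)
    library(tibble)
    # deframe() turns the two-column table into a named vector
    # (names = Old codes, values = New codes), which str_replace_all
    # accepts as a set of pattern = replacement pairs
    FASTA_16S <- str_replace_all(FASTA_16S, deframe(Code_16S))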
    

    -output

    FASTA_16S
    [1] ">16S001_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG..."   
    [2] ">16S307_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT..."
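One caveat, stated as an assumption worth checking against your stringr version: as far as I can tell, str_replace_all applies the pairs of a named vector one after another, so a table in which a New code is also another row's Old code could still produce the collision described in the question. The sample data here has no such overlap. A minimal base R sketch of a single-pass alternative follows; recode_headers is a hypothetical helper name, and it assumes each entry starts with ">" followed by the code, and that the code ends at the first underscore.

```r
# Hypothetical helper (not part of the answer above): look up the code at the
# start of each entry exactly once, so chained codes cannot collide.
recode_headers <- function(fasta, old, new) {
  # Match ">" plus everything up to the first "_" or newline, e.g. ">16S317"
  m <- regexpr("^>[^_\n]+", fasta)
  len <- attr(m, "match.length")        # -1 where no header was found
  current <- substring(fasta, 2, len)   # the code without the leading ">"
  idx <- match(current, old)            # row in the lookup table, NA if absent
  hit <- !is.na(idx)
  # Rebuild only the matched entries: ">" + new code + rest of the entry
  fasta[hit] <- paste0(">", new[idx[hit]],
                       substring(fasta[hit], len[hit] + 1))
  fasta
}

# Usage with a hypothetical chained table (16S317 -> 16S318 -> 16S001)
fasta <- c(">16S317_V._rotiferianus_A\n\nAAATTGAAGAG...",
           ">16S318_Salmonella_bongori\n\nAAATTGAAGAG...")
recoded <- recode_headers(fasta,
                          old = c("16S317", "16S318"),
                          new = c("16S318", "16S001"))
recoded
```

With the real data the call would be recode_headers(FASTA_16S, Code_16S$Old, Code_16S$New). Entries whose code is not in the Old column are returned unchanged, and the sequence text is never searched, so a code that happens to appear inside a sequence is left alone.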
    

    data

    FASTA_16S <- c(">16S317_V._rotiferianus_A\n\nAAATTGAAGAGTTTGATCATGGCTCAG...", 
    ">16S318_Salmonella_bongori\n\nAAATTGAAGAGTTTGATCATGGCTCAGATT..."
    )
    Code_16S <- structure(list(Old = c("16S317", "16S318", "16S319"), New = c("16S001", 
    "16S307", "16S211")), class = "data.frame", row.names = c("1.", 
    "2.", "3."))