Better looping script in R


I have a working script that loops through a data frame of taxon names and looks up the corresponding numerical ID for each name. If the ID is NA, it keeps the name from the data frame. The results are written to a new data frame. It works, but I think it's a little messy, and I am looking for suggestions to improve or simplify it. I am using the package taxize to get the IDs, but suggestions don't necessarily need to use it. Here is the example data and script:

Kingdom Phylum Class Order Family Genus
Bacteria Firmicutes Clostridia Eubacteriales Lachnospiraceae Dorea
Bacteria Firmicutes Clostridia Eubacteriales Oscillospiraceae GGB9634
Bacteria Firmicutes Clostridia Eubacteriales Clostridiaceae Clostridiaceae_unclassified
structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria"
), Phylum = c("Firmicutes", "Firmicutes", "Firmicutes"), Class = c("Clostridia", 
"Clostridia", "Clostridia"), Order = c("Eubacteriales", "Eubacteriales", 
"Eubacteriales"), Family = c("Lachnospiraceae", "Oscillospiraceae", 
"Clostridiaceae"), Genus = c("Dorea", "GGB9634", "Clostridiaceae_unclassified"
)), row.names = c(NA, -3L), class = "data.frame")

script:

sad<-taxa[1:3,] # dataframe shown above
numID <- data.frame(sad$Kingdom) # dataframe to store the IDs
taxize::taxize_options(ncbi_sleep = 0.9) # adjust http request rate

for(r in 1:length(sad[,1])){
  for(c in 2:length(sad[1,])){
    sadID<-taxize::get_uid(sad[r,c], ask=F)[1]
    if(is.na(sadID)){
      numID[r,c]<- sad[r,c]
    }
    else{
      numID[r,c] <- sadID
    }
  }}
names(numID)<-names(sad)

#numID (wanted output)
structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria"),
Phylum = c("Firmicutes", "Firmicutes", "Firmicutes"), 
Class = c("186801","186801", "186801"),
Order = c("186802", "186802", "186802"), 
Family = c("186803", "216572", "31979"), 
Genus = c("189330","GGB9634", "Clostridiaceae_unclassified")), row.names = c(NA,3L), class = "data.frame")

If I were to use this script (or any other) but wanted to start with column 4 instead of 2, how could I do that, given that I am using c to index both sad and numID?

The script is working but I want to improve it.


Solution

  • get_uid accepts a vector of names, so you don't need to loop over individual cells. You can simply do:

    numID <- sad
    numID[] <- lapply(
      sad, 
      \(x) dplyr::coalesce(taxize::get_uid(x, ask = FALSE) |> unclass(), x)
    )
    

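    This passes each column to get_uid and writes the results to numID. unclass drops the uid class, leaving a plain character vector. coalesce replaces NA values with the original names. Note that this processes every column, including Kingdom, which the loop in the question skipped, so Bacteria is replaced by its ID in the output. You could improve things by only requesting IDs for unique values.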

      Kingdom     Phylum  Class  Order Family                       Genus
    1       2 Firmicutes 186801 186802 186803                      189330
    2       2 Firmicutes 186801 186802 216572                     GGB9634
    3       2 Firmicutes 186801 186802  31979 Clostridiaceae_unclassified
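
To start at column 4 and only request IDs for unique values, something along these lines might work. This is a rough sketch rather than a tested solution: replace_with_ids and lookup_id are names made up for illustration, and lookup_id is a hard-coded stand-in (using the IDs from the question) so the example runs without network access. With real data you would replace its body with the get_uid call shown in the comment.

```r
# Example data from the question.
sad <- data.frame(
  Kingdom = c("Bacteria", "Bacteria", "Bacteria"),
  Phylum  = c("Firmicutes", "Firmicutes", "Firmicutes"),
  Class   = c("Clostridia", "Clostridia", "Clostridia"),
  Order   = c("Eubacteriales", "Eubacteriales", "Eubacteriales"),
  Family  = c("Lachnospiraceae", "Oscillospiraceae", "Clostridiaceae"),
  Genus   = c("Dorea", "GGB9634", "Clostridiaceae_unclassified"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real lookup so the sketch runs offline.
# For real data, the body would be something like:
#   unclass(taxize::get_uid(x, ask = FALSE))
lookup_id <- function(x) {
  known <- c(Clostridia = "186801", Eubacteriales = "186802",
             Lachnospiraceae = "186803", Oscillospiraceae = "216572",
             Clostridiaceae = "31979", Dorea = "189330")
  unname(known[x])  # NA for names that are not found
}

# Replace names with IDs in the chosen columns only. Each distinct name
# is looked up once; names without an ID are kept as they are.
replace_with_ids <- function(df, cols, lookup = lookup_id) {
  uniq <- unique(unlist(df[cols], use.names = FALSE))
  ids <- lookup(uniq)
  ids[is.na(ids)] <- uniq[is.na(ids)]
  # Map every cell back to the ID (or original name) of its unique value.
  df[cols] <- lapply(df[cols], function(x) ids[match(x, uniq)])
  df
}

# Start at column 4; columns 1 to 3 are left untouched.
numID <- replace_with_ids(sad, cols = 4:ncol(sad))
```

Because cols selects the same columns on both sides of the assignment, there is no separate index to keep in sync between sad and numID. Changing 4:ncol(sad) to 2:ncol(sad) gives the column range used by the original loop.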