I have a working script that loops through a data frame of taxon names and finds the corresponding numerical ID for each name. If the ID is NA, it keeps the name from the data frame. The result is written to a new data frame. It works, but I think it's a little messy, and I am looking for suggestions to improve or simplify it. I am using the taxize package to get the IDs, but suggestions don't necessarily need to use it. Here is the example data and script:
Kingdom | Phylum | Class | Order | Family | Genus |
---|---|---|---|---|---|
Bacteria | Firmicutes | Clostridia | Eubacteriales | Lachnospiraceae | Dorea |
Bacteria | Firmicutes | Clostridia | Eubacteriales | Oscillospiraceae | GGB9634 |
Bacteria | Firmicutes | Clostridia | Eubacteriales | Clostridiaceae | Clostridiaceae_unclassified |
structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria"
), Phylum = c("Firmicutes", "Firmicutes", "Firmicutes"), Class = c("Clostridia",
"Clostridia", "Clostridia"), Order = c("Eubacteriales", "Eubacteriales",
"Eubacteriales"), Family = c("Lachnospiraceae", "Oscillospiraceae",
"Clostridiaceae"), Genus = c("Dorea", "GGB9634", "Clostridiaceae_unclassified"
)), row.names = c(NA, -3L), class = "data.frame")
script:
sad <- taxa[1:3, ] # data frame shown above
numID <- data.frame(sad$Kingdom) # data frame to store the IDs
taxize::taxize_options(ncbi_sleep = 0.9) # adjust HTTP request rate
for (r in 1:length(sad[, 1])) {
  for (c in 2:length(sad[1, ])) {
    sadID <- taxize::get_uid(sad[r, c], ask = F)[1]
    if (is.na(sadID)) {
      numID[r, c] <- sad[r, c]
    } else {
      numID[r, c] <- sadID
    }
  }
}
names(numID) <- names(sad)
#numID (wanted output)
structure(list(Kingdom = c("Bacteria", "Bacteria", "Bacteria"),
Phylum = c("Firmicutes", "Firmicutes", "Firmicutes"),
Class = c("186801","186801", "186801"),
Order = c("186802", "186802", "186802"),
Family = c("186803", "216572", "31979"),
Genus = c("189330","GGB9634", "Clostridiaceae_unclassified")), row.names = c(NA,3L), class = "data.frame")
If I were to use this script (or any other) but wanted to start with column 4 instead of 2, how could I do that, given that I am using c to index both sad and numID?
The script is working but I want to improve it.
`get_uid` takes vectors. You can simply do:
numID <- sad
numID[] <- lapply(
sad,
\(x) dplyr::coalesce(taxize::get_uid(x, ask = FALSE) |> unclass(), x)
)
This passes each column to `get_uid` and writes the results to `numID`. `unclass` drops the `uid` class, leaving the underlying character vector. `coalesce` replaces `NA` values with the original names. You could improve things by only requesting IDs for unique values.
  Kingdom     Phylum  Class  Order Family                       Genus
1       2 Firmicutes 186801 186802 186803                      189330
2       2 Firmicutes 186801 186802 216572                     GGB9634
3       2 Firmicutes 186801 186802  31979 Clostridiaceae_unclassified
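To start at column 4 and to request each distinct name only once, something along these lines might work. This is a rough sketch rather than tested production code: `replace_with_ids`, `cols`, and `lookup` are names made up for illustration, the columns are assumed to be character (not factor), and the default `lookup` assumes that `as.character()` on the result of `get_uid` gives a plain character vector of IDs with `NA` where nothing was found.

```r
# Hypothetical helper: replace names with IDs in the selected columns only,
# querying each distinct name once. Untouched columns are copied as-is.
replace_with_ids <- function(df, cols = seq_along(df),
                             lookup = function(x) {
                               # Assumed default: one NCBI lookup per name
                               as.character(taxize::get_uid(x, ask = FALSE))
                             }) {
  out <- df
  # Distinct names across the selected columns
  nms <- unique(unlist(df[cols], use.names = FALSE))
  ids <- lookup(nms)
  # Keep the original name wherever no ID was found
  missing <- is.na(ids)
  ids[missing] <- nms[missing]
  names(ids) <- nms
  # Map every cell back through the name -> ID table
  out[cols] <- lapply(df[cols], function(x) unname(ids[x]))
  out
}

# Usage (needs network access and taxize):
# numID <- replace_with_ids(sad, cols = 4:ncol(sad))
```

Because `cols` selects the same columns in both the input and the output, there is no separate index to keep in sync. Passing `cols = 2:ncol(sad)` would skip Kingdom, as in the original loop. In the three example rows this reduces the number of lookups from 18 (one per cell) to 10 distinct names over all columns, or 7 when starting at column 4.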