
Is there a faster way to find synonyms for a large list of taxa in R?


I have a list of about 96,000 species names for which I need to collect all synonyms. I tried the synonyms() function from the 'taxize' package, which returns the information I need, but my list is too long for it to finish. I have also looked into the 'taxizedb' package, which has been suggested as faster for some users, but I am not sure which of its functions would accomplish what I am trying to do.

Any suggestions would be greatly appreciated! Thanks!

Code so far:

library("taxize")
library("tidyverse")

# Load the list of species (~96,000)
# vspli <- read.csv(file = "AllBHLspecieslist.csv", header = TRUE)  # my code
vspli <- c("Acer obtusatum", "Acer interius", "Acer opalus", "Acer saccharum", "Acer palmatum")  # workable example

# Use taxize to search for synonyms.
# With the full list of 96k species, this line crashes before completion.
synlist1 <- synonyms(vspli, db = "itis", rows = 1)
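
One way to keep a long run from failing outright is to send the names in batches and record which batches failed, so they can be retried later. The following is only a sketch under assumptions: `lookup_in_batches`, `fake_lookup`, and the batch size are hypothetical names and values, not part of 'taxize'. The stand-in lookup lets the example run offline; in real use `lookup` might be something like `function(x) synonyms(x, db = "itis", rows = 1)`, mirroring the call above.

```r
# Hypothetical helper: run any name-lookup function over a long vector in batches.
# `lookup` is a placeholder for the real query function.
lookup_in_batches <- function(names, lookup, batch_size = 500) {
  # Assign each name to a batch of at most `batch_size` elements
  batch_id <- ceiling(seq_along(names) / batch_size)
  batches <- split(names, batch_id)

  results <- lapply(batches, function(batch) {
    # A failed batch is recorded as NULL instead of stopping the whole run
    tryCatch(lookup(batch), error = function(e) NULL)
  })

  failed <- vapply(results, is.null, logical(1))
  list(
    results = unname(results[!failed]),
    failed_names = unlist(batches[failed], use.names = FALSE)
  )
}

# Stand-in lookup so the sketch runs offline; it fails on one name on purpose
fake_lookup <- function(x) {
  if ("Acer interius" %in% x) stop("simulated failure")
  setNames(as.list(toupper(x)), x)
}

vspli <- c("Acer obtusatum", "Acer interius", "Acer opalus",
           "Acer saccharum", "Acer palmatum")
out <- lookup_in_batches(vspli, fake_lookup, batch_size = 2)
```

Here `out$failed_names` holds the names from any batch that raised an error, which can then be retried with a smaller batch size. This does not make the remote queries themselves any faster.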

Solution

  • In case anyone comes across this later: I found the 'taxadb' package, which let me solve this problem much faster. Here is the code in case it proves useful:

    library(taxadb)
    library(dplyr) #provides %>%, select(), mutate(), filter(), left_join()
    
    #create local itis database
    td_create("itis",overwrite=FALSE)
    
    allnames<-read.csv(file="AllBHLspecieslist.csv", header=TRUE)
    
    
    
    #get IDs for each scientific name
    syn1<-allnames %>%
      select(Scientific.Name) %>%
      mutate(ID=get_ids(Scientific.Name,"itis"))
    
    #Deal with NAs (names that correspond to more than one ITIS code; ~10k names)
    
    syn1_NA<-as.data.frame(syn1$Scientific.Name[is.na(syn1$ID)])
    colnames(syn1_NA)<-c("name")
    
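    #look up each unmatched name and keep all of its accepted IDs
    NA_IDS<-NULL
    for(i in unique(syn1_NA$name)){
      #select the column by name rather than by position (it was column 5)
      tmp<-as.data.frame(filter_name(i, 'itis')["acceptedNameUsageID"])
      tmp$name<-i
      NA_IDS<-rbind(NA_IDS,tmp)
    }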
    
    #join with original names
    colnames(syn1)<-c("name","ID")
    IDS<-left_join(syn1,NA_IDS,by="name") #I think it's a left join; double-check this
    
    #extract just the unique, non-missing IDs
    IDS<-data.frame(ID=c(IDS[,"ID"],IDS[,"acceptedNameUsageID"]))
    IDS<-as.data.frame(unique(IDS$ID))
    IDS<-as.data.frame(IDS[!is.na(IDS)]) #drop missing IDs
    colnames(IDS)<-"ID"
    #extract all names with synonyms in ITIS that are at the species level [literally all of them]
    #set query
    ITIS<-taxa_tbl("itis") %>%
      select(scientificName,taxonRank,acceptedNameUsageID,taxonomicStatus) %>%
      filter(taxonRank == "species")
    
    #see query
    ITIS %>% show_query()
    #retrieve results
    ITIS_names<-ITIS %>% collect()
    
    #filter to only those that match ITIS codes for all my species
    ITIS_names<-ITIS_names %>%
      filter(acceptedNameUsageID %in% IDS$ID)
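
The logic of the steps above (map each name to an accepted ID, drop missing IDs, then keep every name that shares one of those IDs) can be sketched in base R on a toy table. This is only an illustration: the names, IDs, and synonym relationships below are invented and are not real ITIS records, and only the column names are borrowed from the code above.

```r
# Made-up miniature of a taxonomy table; the IDs and synonym
# relationships are invented for illustration, not real ITIS records.
toy_names <- data.frame(
  scientificName      = c("Genus alpha", "Genus alfa", "Genus beta", "Genus gamma"),
  taxonomicStatus     = c("accepted", "synonym", "accepted", "accepted"),
  acceptedNameUsageID = c("X:1", "X:1", "X:2", "X:3"),
  stringsAsFactors    = FALSE
)

# Names we want synonyms for; "Genus delta" is absent from the table
wanted <- c("Genus alfa", "Genus beta", "Genus delta")

# Step 1: map each input name to its accepted ID (NA when there is no match)
ids <- toy_names$acceptedNameUsageID[match(wanted, toy_names$scientificName)]

# Step 2: drop missing IDs with !is.na(); note that -is.na() would not do this
ids <- unique(ids[!is.na(ids)])

# Step 3: keep every row sharing one of those accepted IDs
synonym_rows <- toy_names[toy_names$acceptedNameUsageID %in% ids, ]
```

In this toy case `synonym_rows` contains both "Genus alpha" and "Genus alfa" because they share an accepted ID, which is the same grouping the final `filter()` on `acceptedNameUsageID` relies on.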