Assigning Family to Species Based on Dataset Attributes


I have a dataset with thousands of lines and the following columns: ID, parentID, rank, and scientificName.

I want to add a new column giving the family (one of the levels in rank) that a given species belongs to. Any help would be greatly appreciated.

Example data:

ID = c('f1','f2','g1','g2','g3','g4','s1','s2','s3','s4','s5','s6') # all unique
parentID = c(NA,NA,'f1','f1','f2','f2','g1','g1','g2','g3','g3','g4')
rank = c('family','family','genus','genus','genus','genus','species','species','species','species','species','species')
scientificName = c('FamA','FamB','GenA','GenB','GenC','GenD','SpA','SpB','SpC','SpD','SpE','SpF')
dat = data.frame( ID, parentID, rank, scientificName)

My desired output (in this example) would be an extra column giving the family for each row: family = c('FamA','FamB','FamA','FamA','FamB','FamB','FamA','FamA','FamA','FamB','FamB','FamB')

I've thought about creating vectors of families and their IDs, replacing the codes in the parentID column with family names, and then doing something similar for the genera so that each species is ultimately 'linked' to its family, but it got messy and didn't work in the end. I think this can be done with the 'dplyr' package, but I'm stuck. Again, I'd appreciate any help.


Solution

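  • This is a good problem for recursion. Here's a vectorized base R solution that follows each row's parentID upward until it reaches a row with no parent (in this data, the family).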

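    find_family <- function(ID, parentID, scientificName) {
      # Recursively move every row one level up until no row has a parent left.
      find_family_id <- function(ID, parentID) {
        # Step up to the parent where there is one; top-level rows keep their ID.
        ID_new <- ifelse(!is.na(parentID), parentID, ID)
        # Look up the parent of each new ID.
        parentID_new <- parentID[match(ID_new, ID)]
        # Stop once every row has reached a top-level ID.
        if (all(is.na(parentID_new))) return(ID_new)
        find_family_id(ID_new, parentID_new)
      }
      family_ids <- find_family_id(ID, parentID)
      # Translate the top-level IDs into their names.
      scientificName[match(family_ids, ID)]
    }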
    
    dat$family <- with(dat, find_family(ID, parentID, scientificName))
    
    dat
    #    ID parentID    rank scientificName family
    # 1  f1     <NA>  family           FamA   FamA
    # 2  f2     <NA>  family           FamB   FamB
    # 3  g1       f1   genus           GenA   FamA
    # 4  g2       f1   genus           GenB   FamA
    # 5  g3       f2   genus           GenC   FamB
    # 6  g4       f2   genus           GenD   FamB
    # 7  s1       g1 species            SpA   FamA
    # 8  s2       g1 species            SpB   FamA
    # 9  s3       g2 species            SpC   FamA
    # 10 s4       g3 species            SpD   FamB
    # 11 s5       g3 species            SpE   FamB
    # 12 s6       g4 species            SpF   FamB
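
The function above climbs to the top of each chain, so it returns the family only when families are the top-level rows (parentID is NA), as they are in the example data. If the real dataset also has ranks above family (order, class, ...), a variant that stops at the row whose rank is 'family' may be safer. The following is a minimal base R sketch of that idea, not part of the original answer: `find_rank_ancestor` and its `target` argument are hypothetical names, and it assumes that ID is unique and never NA.

```r
# Example data from the question.
dat <- data.frame(
  ID = c("f1", "f2", "g1", "g2", "g3", "g4", "s1", "s2", "s3", "s4", "s5", "s6"),
  parentID = c(NA, NA, "f1", "f1", "f2", "f2", "g1", "g1", "g2", "g3", "g3", "g4"),
  rank = c(rep("family", 2), rep("genus", 4), rep("species", 6)),
  scientificName = c("FamA", "FamB", "GenA", "GenB", "GenC", "GenD",
                     "SpA", "SpB", "SpC", "SpD", "SpE", "SpF"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every row, follow parentID upward until a row of
# rank `target` is reached, then return that row's scientificName.
# Assumes ID is unique and contains no NA.
find_rank_ancestor <- function(ID, parentID, rank, scientificName,
                               target = "family") {
  current <- seq_along(ID)  # row index each row currently points at
  # A chain cannot be longer than the number of rows, so this bound also
  # guarantees termination if the data accidentally contains a cycle.
  for (step in seq_along(ID)) {
    unresolved <- !is.na(current) & !(rank[current] %in% target)
    if (!any(unresolved)) break
    # Move the unresolved rows one level up; rows without a parent become NA.
    current[unresolved] <- match(parentID[current[unresolved]], ID,
                                 incomparables = NA)
  }
  # Rows that never reached the target rank get NA.
  current[!(rank[current] %in% target)] <- NA
  scientificName[current]
}

dat$family <- with(dat, find_rank_ancestor(ID, parentID, rank, scientificName))
```

On the example data this gives the same family column as the recursive version. Rows that sit above the target rank, or whose chain never reaches it, get NA instead of the name of an unrelated top-level row.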