I have a dataset with thousands of lines and the following columns: ID, parentID, rank, and scientificName.
I want to create a new column giving the family (one of the levels in rank) that each species belongs to. If anyone could help, it would be greatly appreciated.
Example data:
ID = c('f1','f2','g1','g2','g3','g4','s1','s2','s3','s4','s5','s6') # all unique
parentID = c(NA,NA,'f1','f1','f2','f2','g1','g1','g2','g3','g3','g4')
rank = c('family','family','genus','genus','genus','genus','species','species','species','species','species','species')
scientificName = c('FamA','FamB','GenA','GenB','GenC','GenD','SpA','SpB','SpC','SpD','SpE','SpF')
dat = data.frame( ID, parentID, rank, scientificName)
My desired output (in this example) would be an extra column giving the family of each row: family = c('FamA','FamB','FamA','FamA','FamB','FamB','FamA','FamA','FamA','FamB','FamB','FamB')
I've thought about creating vectors of families and their IDs, then replacing the codes in the parentID column with family names, and then trying something similar for the genus to ultimately 'link' family info with each species, but it got kinda messy in the end (that is, it didn't work). I think what I need can be accomplished with the 'dplyr' package, but I'm stuck... Again, I'd appreciate any help.
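This is a good problem for recursion. Here's a vectorized base R solution: each call moves every row up one level, and the recursion stops once no row has a parent left. Note that it returns the top-most ancestor of each row, which is the family here because the families have no parentID.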
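find_family <- function(ID, parentID, scientificName) {
  # Recursively move every row up one level until all rows sit on a root.
  find_family_id <- function(ID, parentID) {
    # Step up to the parent where there is one; otherwise stay put.
    ID_new <- ifelse(!is.na(parentID), parentID, ID)
    # Look up the parent of each new position.
    parentID_new <- parentID[match(ID_new, ID)]
    # Done once no row has a parent left.
    if (all(is.na(parentID_new))) return(ID_new)
    find_family_id(ID_new, parentID_new)
  }
  family_ids <- find_family_id(ID, parentID)
  # Translate the root IDs into their names.
  scientificName[match(family_ids, ID)]
}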
dat$family <- with(dat, find_family(ID, parentID, scientificName))
dat
#    ID parentID    rank scientificName family
# 1  f1     <NA>  family           FamA   FamA
# 2  f2     <NA>  family           FamB   FamB
# 3  g1       f1   genus           GenA   FamA
# 4  g2       f1   genus           GenB   FamA
# 5  g3       f2   genus           GenC   FamB
# 6  g4       f2   genus           GenD   FamB
# 7  s1       g1 species            SpA   FamA
# 8  s2       g1 species            SpB   FamA
# 9  s3       g2 species            SpC   FamA
# 10 s4       g3 species            SpD   FamB
# 11 s5       g3 species            SpE   FamB
# 12 s6       g4 species            SpF   FamB
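If the real dataset has ranks above family (order, class, ...), walking all the way to the root would return those instead. Below is a minimal sketch of a variant that stops climbing as soon as a row reaches a node whose rank is 'family'. The function name find_family_by_rank, its target argument, and the family_by_rank column are made up for illustration; it assumes every non-NA parentID appears in ID, and it returns NA for rows that have no family above them.

```r
# Hypothetical helper (not part of the answer above): climb the parent links
# until each row points at a node whose rank equals `target`.
find_family_by_rank <- function(ID, parentID, rank, scientificName,
                                target = "family") {
  current <- seq_along(ID)  # row index of the node each row currently points at
  # A tree with n nodes never needs more than n upward steps; the bound also
  # keeps the loop finite if parentID accidentally contains a cycle.
  for (step in seq_along(ID)) {
    # Rows that still have to move: they point at a node of another rank.
    todo <- !is.na(current) & !(rank[current] %in% target)
    if (!any(todo)) break
    # Step up one level; a missing parent becomes NA (no family above).
    current[todo] <- match(parentID[current[todo]], ID)
  }
  found <- !is.na(current) & (rank[current] %in% target)
  out <- as.character(scientificName)[current]
  out[!found] <- NA_character_
  out
}

# Example data from the question.
dat <- data.frame(
  ID = c('f1','f2','g1','g2','g3','g4','s1','s2','s3','s4','s5','s6'),
  parentID = c(NA,NA,'f1','f1','f2','f2','g1','g1','g2','g3','g3','g4'),
  rank = c('family','family','genus','genus','genus','genus',
           'species','species','species','species','species','species'),
  scientificName = c('FamA','FamB','GenA','GenB','GenC','GenD',
                     'SpA','SpB','SpC','SpD','SpE','SpF')
)
dat$family_by_rank <- with(dat, find_family_by_rank(ID, parentID, rank, scientificName))
```

On the example data this should agree with the family column above, since the families are the roots there. The loop trades the recursion for an explicit index vector, which makes it straightforward to stop at a chosen rank.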