Tags: r, subset, reverse-lookup

R: Cross-referencing columns/reverse lookup


I've found a solution for this, but I suspect there must be a more natural or idiomatic way. Given a dataset of many observations over several years at many stations, I want a listing, by station, of the years in which each station was active. This should be trivial. The data looks roughly like this:

set.seed(668)
yrNames <- seq(1995,2015)
staNames <- c(LETTERS[1:12])
trpNames <- seq(1,6)
years <- rep(yrNames, times=sample(1:4, length(yrNames), replace=TRUE))
stations <- sample(staNames, length(years), replace=TRUE)
traps <- sample(trpNames, length(years), replace=TRUE)
data <- data.frame(YEAR=years, STATION=stations, TRAP=traps)

After WAY too many hours (working hard to think vectorwise, avoid loops) I finally worked my way to:

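library("reshape2")
# Count observations per YEAR x STATION cell; naming value.var and
# fun.aggregate explicitly avoids relying on dcast's guessed defaults.
bySta <- dcast(data, YEAR ~ STATION, value.var="TRAP", fun.aggregate=length)
# For each column, keep the years whose count is positive.
sapply(bySta, function(x) bySta$YEAR[x > 0])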

Which gives what I wanted:

# $YEAR
#  [1] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
# [16] 2010 2011 2012 2013 2014 2015
# $A
# [1] 2002 2009 2015
# $B
# [1] 1996 1999 2003 2007 2013
# $C
# [1] 2000 2002 2005 2006 2009 2010 2014
# [...]

But getting there was far from intuitive, with all kinds of dead ends. Is there a simpler way to say "list all df$x per value of df$y"?

An extra wrinkle is that I started from a list of per-year data frames created by:

dfList <- lapply(fileList, readDelimFunc)

I was happier with that list for other purposes, but for this task the extra organizational layer baffled me right away, so I mashed the data frames together into one. Could the desired listing also be (sanely) generated from that list of dfs, or is that ridiculous?


Solution

  • dplyr solution:

    library(dplyr)
    data %>% group_by(STATION) %>% summarize(years = list(unique(YEAR))) %>% as.data.frame()
    

    Results:

       STATION                                    years
    1        A                         2002, 2009, 2015
    2        B             1996, 1999, 2003, 2007, 2013
    3        C 2000, 2002, 2005, 2006, 2009, 2010, 2014
    4        D                   2003, 2005, 2010, 2014
    5        E                               1997, 2005
    6        F       1996, 1997, 1998, 2001, 2014, 2015
    7        G                               1996, 2001
    8        H                         1995, 1997, 2003
    9        I                         1996, 1997, 2008
    10       J                         1999, 2001, 2009
    11       K             2003, 2004, 2010, 2011, 2012
    12       L                   2002, 2004, 2011, 2015
    

    Note that the *apply functions are not actually "vectorized"; they are wrappers that call an ordinary R function once per element. (This dplyr solution is not "vectorized" either.)

    It's best not to get hung up on finding the most optimal solution; aim for the most sensible one instead.
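
For comparison, here is a minimal base R sketch of the same "all df$x per value of df$y" idea, using split() followed by unique(). It also touches on the list-of-data-frames part of the question. Because fileList and readDelimFunc are not shown, dfList below is a hypothetical stand-in built by splitting the example data by year, and it assumes every per-year data frame has the same columns. Treat it as one possible approach, not the definitive one.

```r
# Rebuild the example data from the question.
set.seed(668)
yrNames  <- seq(1995, 2015)
staNames <- LETTERS[1:12]
trpNames <- seq(1, 6)
years    <- rep(yrNames, times = sample(1:4, length(yrNames), replace = TRUE))
stations <- sample(staNames, length(years), replace = TRUE)
traps    <- sample(trpNames, length(years), replace = TRUE)
data     <- data.frame(YEAR = years, STATION = stations, TRAP = traps)

# "List all df$x per value of df$y":
# split YEAR into one vector per STATION, then drop repeated years.
yearsBySta <- lapply(split(data$YEAR, data$STATION), unique)

# Hypothetical stand-in for lapply(fileList, readDelimFunc):
# one data frame per year, all with identical columns (an assumption).
dfList <- split(data, data$YEAR)

# Stack the per-year data frames, then apply the same split/unique step.
combined      <- do.call(rbind, dfList)
yearsFromList <- lapply(split(combined$YEAR, combined$STATION), unique)
```

Both yearsBySta and yearsFromList are named lists with one element per station, so yearsBySta[["A"]] returns the years for station A. Stacking the list first keeps the per-station step identical in both cases, which is arguably saner than looping over the per-year data frames by hand.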