pull information from each unique pair

I have coordinates for each site and the year each site was sampled (fake dataframe below).

dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020

> dfA
    LAT LONG YEAR
 1    1   11 2001
 2    2   12 2002
 3    3   13 2003
 4    4   14 2004
 5    5   15 2005
 6    1   11 2006
 7    2   12 2007
 8    3   13 2008
 9    4   14 2009
 10   5   15 2010
 11   1   11 2011
 12   2   12 2012
 13   3   13 2013
 14   4   14 2014
 15   5   15 2015
 16   1   16 2016
 17   2   17 2017
 18   3   18 2018
 19   4   19 2019
 20   5   20 2020

I'm trying to pull out the year each unique location was sampled. So I first pulled out each unique location and the times it was sampled using the following code

dfB <- dfA %>% 
group_by(LAT, LONG) %>%
summarise(Freq = n())
dfB<-as.data.frame(dfB) 

   LAT LONG Freq
1    1   11    3
2    1   16    1
3    2   12    3
4    2   17    1
5    3   13    3
6    3   18    1
7    4   14    3
8    4   19    1
9    5   15    3
10   5   20    1

I am now trying to get the year for each unique location. I.e. I ultimately want this:

   LAT LONG Freq .  Year
1    1   11    3 .  2001,2006,2011
2    1   16    1 .  2016
3    2   12    3 .  2002,2007,2012
4    2   17    1
5    3   13    3
6    3   18    1
7    4   14    3
8    4   19    1
9    5   15    3
10   5   20    1

This is what I've tried:

1) Find which rows in dfA that corresponds with dfB:

dfB$obs_Year<-NA
idx <- match(paste(dfA$LAT,dfA$LONG), paste(dfB$LAT,dfB$LONG))

> idx
[1]  1  3  5  7  9  1  3  5  7  9  1  3  5  7  9  2  4  6  8 10

So idx[1] means dfA[1] matches dfB[1]. And that dfA[6],df[11] all match dfB[1].

I've tried this to extract info:

for (row in 1:20){
  year<-as.character(dfA$YEAR[row])
  tmp<-dfB$obs_Year[idx[row]]
  if(isTRUE(is.na(dfB$obs_Year[idx[row]]))){
    dfB$obs_Year[idx[row]]<-year
  }
  if(isFALSE(is.na(dfB$obs_Year[idx[row]]))){
    dfB$obs_Year[idx[row]]<-as.list(append(tmp,year))
  }
}

I keep getting this error code:

number of items to replace is not a multiple of replacement length

Does anyone know how to extract years from matching pairs of dfA to dfB? I don't know if this is the most efficient code but this is as far as I've gotten....Thanks in advance!

Solution

You can do this with a dplyr chain that first builds your date column and then filters down to only unique observations.

The logic is to build the date variable by grouping your data by locations, and then pasting all the dates for a given location into a single string variable which we call year_string. We then also compute the frequency but this is not strictly necessary.

The only column in your data that varies over time is YEAR, meaning that if we exclude that column you would see values repeated for locations. So we exclude the YEAR column and then ask R to return unique() values of the data.frame to us. It will pick one of the observations per location where multiple occur, but since they are identical that doesn't matter.

Code below:

library(dplyr)

dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020

# We assign the output to dfB
dfB <- dfA %>% group_by(LAT, LONG) %>% # We group by locations
  mutate( # The mutate verb is for building new variables.
    year_string = paste(YEAR, collapse = ","), # the function paste()
                          # collapses the vector YEAR into a string
                          # the argument collapse = "," says to 
                          # separate each element of the string with a comma
         Freq = n()) %>% # I compute the frequency as you did
  select(LAT, LONG, Freq, year_string) %>% 
             # Now I select only the columns that index
             # location, frequency and the combined years
  unique() # Now I filter for only unique observations. Since I have not picked
           # YEAR in the select function only unique locations are retained

dfB
#> # A tibble: 10 x 4
#> # Groups:   LAT, LONG [10]
#>      LAT  LONG  Freq year_string   
#>    <int> <int> <int> <chr>         
#>  1     1    11     3 2001,2006,2011
#>  2     2    12     3 2002,2007,2012
#>  3     3    13     3 2003,2008,2013
#>  4     4    14     3 2004,2009,2014
#>  5     5    15     3 2005,2010,2015
#>  6     1    16     1 2016          
#>  7     2    17     1 2017          
#>  8     3    18     1 2018          
#>  9     4    19     1 2019          
#> 10     5    20     1 2020

^{Created on 2019-01-21 by the reprex package (v0.2.1)}