Search code examples
runiqueextract

pull information from each unique pair


I have coordinates for each site and the year each site was sampled (fake dataframe below).

dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020

> dfA
    LAT LONG YEAR
 1    1   11 2001
 2    2   12 2002
 3    3   13 2003
 4    4   14 2004
 5    5   15 2005
 6    1   11 2006
 7    2   12 2007
 8    3   13 2008
 9    4   14 2009
 10   5   15 2010
 11   1   11 2011
 12   2   12 2012
 13   3   13 2013
 14   4   14 2014
 15   5   15 2015
 16   1   16 2016
 17   2   17 2017
 18   3   18 2018
 19   4   19 2019
 20   5   20 2020

I'm trying to pull out the year each unique location was sampled. So I first pulled out each unique location and the times it was sampled using the following code

dfB <- dfA %>% 
group_by(LAT, LONG) %>%
summarise(Freq = n())
dfB<-as.data.frame(dfB) 

   LAT LONG Freq
1    1   11    3
2    1   16    1
3    2   12    3
4    2   17    1
5    3   13    3
6    3   18    1
7    4   14    3
8    4   19    1
9    5   15    3
10   5   20    1

I am now trying to get the year for each unique location. I.e. I ultimately want this:

   LAT LONG Freq .  Year
1    1   11    3 .  2001,2006,2011
2    1   16    1 .  2016
3    2   12    3 .  2002,2007,2012
4    2   17    1
5    3   13    3
6    3   18    1
7    4   14    3
8    4   19    1
9    5   15    3
10   5   20    1

This is what I've tried:

1) Find which rows in dfA that corresponds with dfB:

dfB$obs_Year<-NA
idx <- match(paste(dfA$LAT,dfA$LONG), paste(dfB$LAT,dfB$LONG))

> idx
[1]  1  3  5  7  9  1  3  5  7  9  1  3  5  7  9  2  4  6  8 10

So idx[1] means dfA[1] matches dfB[1]. And that dfA[6],df[11] all match dfB[1].

I've tried this to extract info:

for (row in 1:20){
  year<-as.character(dfA$YEAR[row])
  tmp<-dfB$obs_Year[idx[row]]
  if(isTRUE(is.na(dfB$obs_Year[idx[row]]))){
    dfB$obs_Year[idx[row]]<-year
  }
  if(isFALSE(is.na(dfB$obs_Year[idx[row]]))){
    dfB$obs_Year[idx[row]]<-as.list(append(tmp,year))
  }
}

I keep getting this error code:

number of items to replace is not a multiple of replacement length

Does anyone know how to extract years from matching pairs of dfA to dfB? I don't know if this is the most efficient code but this is as far as I've gotten....Thanks in advance!


Solution

  • You can do this with a dplyr chain that first builds your date column and then filters down to only unique observations.

    The logic is to build the date variable by grouping your data by locations, and then pasting all the dates for a given location into a single string variable which we call year_string. We then also compute the frequency but this is not strictly necessary.

    The only column in your data that varies over time is YEAR, meaning that if we exclude that column you would see values repeated for locations. So we exclude the YEAR column and then ask R to return unique() values of the data.frame to us. It will pick one of the observations per location where multiple occur, but since they are identical that doesn't matter.

    Code below:

    library(dplyr)
    
    dfA<-matrix(nrow=20,ncol=3)
    dfA<-as.data.frame(dfA)
    colnames(dfA)<-c("LAT","LONG","YEAR")
    #fill LAT
    dfA[,1]<-rep(1:5,4)
    #fill LONG
    dfA[,2]<-c(rep(11:15,3),16:20)
    #fill YEAR
    dfA[,3]<-2001:2020
    
    # We assign the output to dfB
    dfB <- dfA %>% group_by(LAT, LONG) %>% # We group by locations
      mutate( # The mutate verb is for building new variables.
        year_string = paste(YEAR, collapse = ","), # the function paste()
                              # collapses the vector YEAR into a string
                              # the argument collapse = "," says to 
                              # separate each element of the string with a comma
             Freq = n()) %>% # I compute the frequency as you did
      select(LAT, LONG, Freq, year_string) %>% 
                 # Now I select only the columns that index
                 # location, frequency and the combined years
      unique() # Now I filter for only unique observations. Since I have not picked
               # YEAR in the select function only unique locations are retained
    
    dfB
    #> # A tibble: 10 x 4
    #> # Groups:   LAT, LONG [10]
    #>      LAT  LONG  Freq year_string   
    #>    <int> <int> <int> <chr>         
    #>  1     1    11     3 2001,2006,2011
    #>  2     2    12     3 2002,2007,2012
    #>  3     3    13     3 2003,2008,2013
    #>  4     4    14     3 2004,2009,2014
    #>  5     5    15     3 2005,2010,2015
    #>  6     1    16     1 2016          
    #>  7     2    17     1 2017          
    #>  8     3    18     1 2018          
    #>  9     4    19     1 2019          
    #> 10     5    20     1 2020
    

    Created on 2019-01-21 by the reprex package (v0.2.1)