I have coordinates for each site and the year each site was sampled (fake dataframe below).
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
> dfA
   LAT LONG YEAR
1    1   11 2001
2    2   12 2002
3    3   13 2003
4    4   14 2004
5    5   15 2005
6    1   11 2006
7    2   12 2007
8    3   13 2008
9    4   14 2009
10   5   15 2010
11   1   11 2011
12   2   12 2012
13   3   13 2013
14   4   14 2014
15   5   15 2015
16   1   16 2016
17   2   17 2017
18   3   18 2018
19   4   19 2019
20   5   20 2020
I'm trying to pull out the years in which each unique location was sampled. I first pulled out each unique location and the number of times it was sampled, using the following code:
library(dplyr)
dfB <- dfA %>%
  group_by(LAT, LONG) %>%
  summarise(Freq = n())
dfB <- as.data.frame(dfB)
   LAT LONG Freq
1    1   11    3
2    1   16    1
3    2   12    3
4    2   17    1
5    3   13    3
6    3   18    1
7    4   14    3
8    4   19    1
9    5   15    3
10   5   20    1
I am now trying to get the year for each unique location. I.e. I ultimately want this:
   LAT LONG Freq Year
1    1   11    3 2001,2006,2011
2    1   16    1 2016
3    2   12    3 2002,2007,2012
4    2   17    1
5    3   13    3
6    3   18    1
7    4   14    3
8    4   19    1
9    5   15    3
10   5   20    1
This is what I've tried:
1) Find which rows in dfA correspond to rows in dfB:
dfB$obs_Year<-NA
idx <- match(paste(dfA$LAT,dfA$LONG), paste(dfB$LAT,dfB$LONG))
> idx
[1] 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 2 4 6 8 10
So idx[1] means that row 1 of dfA matches row 1 of dfB, and rows 6 and 11 of dfA also match row 1 of dfB.
I've tried this to extract info:
for (row in 1:20){
  year<-as.character(dfA$YEAR[row])
  tmp<-dfB$obs_Year[idx[row]]
  if(isTRUE(is.na(dfB$obs_Year[idx[row]]))){
    dfB$obs_Year[idx[row]]<-year
  }
  if(isFALSE(is.na(dfB$obs_Year[idx[row]]))){
    dfB$obs_Year[idx[row]]<-as.list(append(tmp,year))
  }
}
I keep getting this error code:
number of items to replace is not a multiple of replacement length
Does anyone know how to extract years from matching pairs of dfA to dfB? I don't know if this is the most efficient code, but this is as far as I've gotten. Thanks in advance!
You can do this with a dplyr chain that first builds the combined year column and then filters down to only unique observations.
The logic is to build the year variable by grouping your data by location and then pasting all the years for a given location into a single string variable, which we call year_string. We then also compute the frequency, but this is not strictly necessary.
The only column in your data that varies over time is YEAR, meaning that if we exclude that column you would see values repeated for locations. So we exclude the YEAR column and then ask R to return the unique() rows of the data.frame. It keeps one of the rows per location where several occur, but since they are identical that doesn't matter.
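To see what the collapsing step does in isolation, here is a minimal sketch on a single location's years. The vector name years is just for illustration and is not part of your data.

```r
# Years for one hypothetical location (illustrative vector, not from dfA)
years <- c(2001, 2006, 2011)

# paste() with collapse = "," joins all elements into one string,
# separated by commas
year_string <- paste(years, collapse = ",")
year_string
```

Inside a grouped mutate(), the same call is evaluated once per location, so every row of a group gets that group's combined string.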
Code below:
library(dplyr)
dfA<-matrix(nrow=20,ncol=3)
dfA<-as.data.frame(dfA)
colnames(dfA)<-c("LAT","LONG","YEAR")
#fill LAT
dfA[,1]<-rep(1:5,4)
#fill LONG
dfA[,2]<-c(rep(11:15,3),16:20)
#fill YEAR
dfA[,3]<-2001:2020
# We assign the output to dfB
dfB <- dfA %>%
  group_by(LAT, LONG) %>%  # We group by locations
  mutate(                  # The mutate verb is for building new variables
    # paste() collapses the vector YEAR into a string; the argument
    # collapse = "," says to separate each element with a comma
    year_string = paste(YEAR, collapse = ","),
    Freq = n()             # I compute the frequency as you did
  ) %>%
  # Now I select only the columns that index location, frequency
  # and the combined years
  select(LAT, LONG, Freq, year_string) %>%
  # Now I filter for only unique observations. Since I have not picked
  # YEAR in the select function, only unique locations are retained
  unique()
dfB
#> # A tibble: 10 x 4
#> # Groups:   LAT, LONG [10]
#>      LAT  LONG  Freq year_string
#>    <int> <int> <int> <chr>
#>  1     1    11     3 2001,2006,2011
#>  2     2    12     3 2002,2007,2012
#>  3     3    13     3 2003,2008,2013
#>  4     4    14     3 2004,2009,2014
#>  5     5    15     3 2005,2010,2015
#>  6     1    16     1 2016
#>  7     2    17     1 2017
#>  8     3    18     1 2018
#>  9     4    19     1 2019
#> 10     5    20     1 2020
Created on 2019-01-21 by the reprex package (v0.2.1)
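As a possible alternative, the same table can probably be built in one step with summarise() instead of mutate() followed by unique(). This is only a sketch under the assumption that you want one row per LAT/LONG pair; the name dfB_alt is made up here so it does not overwrite dfB.

```r
library(dplyr)

# Same toy data as in the question, built directly with data.frame()
dfA <- data.frame(
  LAT  = rep(1:5, 4),
  LONG = c(rep(11:15, 3), 16:20),
  YEAR = 2001:2020
)

# summarise() returns one row per group, so no unique() step is needed
dfB_alt <- dfA %>%
  group_by(LAT, LONG) %>%
  summarise(
    Freq = n(),                                # times each location was sampled
    year_string = paste(YEAR, collapse = ",")  # all years for that location
  ) %>%
  ungroup()                                    # drop any remaining grouping

dfB_alt <- as.data.frame(dfB_alt)
dfB_alt
```

Because summarise() orders the result by the grouping columns, the rows come out sorted by LAT and then LONG, which matches the layout you sketched in the question.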