I have a dataframe with timeseries data columns and start year and end year.
df = data.frame(y2000=c(12,636),y2001=c(234, 76),y2002=c(3434, 46),y2003=c(36,35),y2004=c(6, 64), y2005=c(56,65), y2006=c(43,65), y2007=c( 6, 56),y2008=c( 64, 66),y2009=c(63, 5656),y2010 = c(65,54),startyear= c(2006, 2001), endyear= c(2009, 2005))
For each row I want to calculate mean value within the start and end years and also before and after the start and end period. The desired output looks like:
y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010 startyear endyear before_mean within_mean after_mean
12 234 3434 36 6 56 43 6 64 63 65 2006 2009 629.6666667 44 65
636 76 46 35 64 65 65 56 66 5656 54 2001 2005 636 57.2 1179.4
I have tried different match and index technique but can't wrap my mind around this one.
1. dplyr/tidyr
It may be better to convert the 'wide' format to 'long' format. We could use dplyr/tidyr
to get the mean
. Create a 'ind' column, reshape the data to 'long' using gather
, split the 'variable' column into two columns ('var1', 'var2') with extract
, group by 'ind', get the mean
values of the 'value' column after subsetting it based on the different logical index created (i.e. var2 < startyear
, var2 >= startyear & var2 <= endyear
, and var2 >endyear
)
library(dplyr)
library(tidyr)
dS <- df %>%
mutate(ind=row_number()) %>%
gather(variable, value, starts_with('y')) %>%
extract(variable, c('var1', 'var2'), '([^0-9]+)([0-9]+)',
convert=TRUE) %>%
group_by(ind) %>%
summarise(before_mean= mean(value[var2 < startyear]),
within_mean = mean(value[var2 >= startyear &
var2 <= endyear]),
after_mean=mean(value[var2 >endyear])) %>%
as.data.frame()
nm1 <- paste(c('before', 'within', 'after'), 'mean', sep="_")
dS
# ind before_mean within_mean after_mean
#1 1 629.6667 44.0 65.0
#2 2 636.0000 57.2 1179.4
We can create additional columns in 'df' from the above output
df[nm1] <- dS
2. base R
We can use base R
methods and without changing the format of the dataset. From the original dataset ('df'), make an index ('indx') of numeric column names, remove the non-numeric part and convert to numeric ('v1').
indx <- grep('\\d+', names(df))
v1 <- as.numeric(sub('[^0-9]+', '', names(df)[indx]))
Loop the rows of 'df' (lapply
), match
the 'startyear' with 'v1', use that index ('i1') to get the columns, unlist
, and calculate the mean
. The same can be done by matching the 'endyear' with 'v1' to get the index ('i2'). Based on 'i1', and 'i2', calculate the 'within_mean' and 'after_mean'. rbind
the list elements and assign the output to new columns ('nm1') in 'df'.
df[nm1] <- do.call(rbind,lapply(1:nrow(df), function(i) {
i1 <- match(df$startyear[i], v1)
before_mean<- mean(unlist(df[i,1:(i1-1),drop=FALSE]))
i2 <- match(df$endyear[i], v1)
within_mean <- mean(unlist(df[i,i2:i1]))
after_mean <- mean(unlist(df[i,match(v1[(i2+1):length(v1)],v1)]))
data.frame(before_mean,within_mean, after_mean) }))
df[nm1]
# before_mean within_mean after_mean
#1 629.6667 44.0 65.0
#2 636.0000 57.2 1179.4