Tags: r, indexing, group-summaries

Match and summarize dataframe in R


I have a dataframe with time-series data in yearly columns, plus a start year and an end year for each row.

df = data.frame(y2000 = c(12, 636), y2001 = c(234, 76), y2002 = c(3434, 46),
                y2003 = c(36, 35), y2004 = c(6, 64), y2005 = c(56, 65),
                y2006 = c(43, 65), y2007 = c(6, 56), y2008 = c(64, 66),
                y2009 = c(63, 5656), y2010 = c(65, 54),
                startyear = c(2006, 2001), endyear = c(2009, 2005))

For each row I want to calculate the mean value within the start and end years (inclusive), and also the means before the start year and after the end year. The desired output looks like:

y2000 y2001 y2002 y2003 y2004 y2005 y2006 y2007 y2008 y2009 y2010 startyear endyear before_mean within_mean after_mean
   12   234  3434    36     6    56    43     6    64    63    65      2006    2009 629.6666667          44         65
  636    76    46    35    64    65    65    56    66  5656    54      2001    2005         636        57.2     1179.4

I have tried different match and index techniques but can't wrap my mind around this one.


Solution

  • 1. dplyr/tidyr

    It may be better to convert the 'wide' format to 'long' format. We could use dplyr/tidyr to get the means. Create an 'ind' column (the row number), reshape the data to 'long' using gather, and split the 'variable' column into two columns ('var1', 'var2') with extract, so that 'var2' holds the year as a number. Then group by 'ind' and take the mean of the 'value' column after subsetting it with three logical indexes: var2 < startyear (before), var2 >= startyear & var2 <= endyear (within), and var2 > endyear (after).

    library(dplyr)
    library(tidyr)
    
    dS <- df %>%
              # row identifier, so each original row becomes one group
              mutate(ind = row_number()) %>%
              # wide to long: one row per (ind, year column)
              gather(variable, value, starts_with('y')) %>%
              # split e.g. 'y2000' into var1 = 'y' and var2 = 2000 (integer)
              extract(variable, c('var1', 'var2'), '([^0-9]+)([0-9]+)',
                      convert = TRUE) %>%
              group_by(ind) %>%
              summarise(before_mean = mean(value[var2 < startyear]),
                        within_mean = mean(value[var2 >= startyear &
                                                 var2 <= endyear]),
                        after_mean = mean(value[var2 > endyear])) %>%
              as.data.frame()
    
    nm1 <-  paste(c('before', 'within', 'after'), 'mean', sep="_")
    dS
    #   ind before_mean within_mean after_mean
    #1   1    629.6667        44.0       65.0
    #2   2    636.0000        57.2     1179.4
    

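    We can create additional columns in 'df' from the above output. Select only the three mean columns, because 'dS' also contains 'ind' and assigning all four columns to the three names in 'nm1' would shift the values.

    df[nm1] <- dS[nm1]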
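
    In current tidyr releases gather is superseded by pivot_longer. The following is a minimal sketch of the same reshape-and-summarise idea with pivot_longer, assuming a tidyr version that provides it (1.0.0 or later). The names 'dS2' and 'year' are introduced here for illustration only and are not part of the original answer.

    ```r
    library(dplyr)
    library(tidyr)

    # Same example data as in the question
    df <- data.frame(y2000 = c(12, 636), y2001 = c(234, 76), y2002 = c(3434, 46),
                     y2003 = c(36, 35), y2004 = c(6, 64), y2005 = c(56, 65),
                     y2006 = c(43, 65), y2007 = c(6, 56), y2008 = c(64, 66),
                     y2009 = c(63, 5656), y2010 = c(65, 54),
                     startyear = c(2006, 2001), endyear = c(2009, 2005))

    dS2 <- df %>%
      mutate(ind = row_number()) %>%
      # 'y2000' ... 'y2010' become rows; names_prefix drops the leading 'y'
      pivot_longer(starts_with("y"), names_to = "year", names_prefix = "y") %>%
      # the year arrives as character, so convert it before comparing
      mutate(year = as.integer(year)) %>%
      group_by(ind) %>%
      summarise(before_mean = mean(value[year < startyear]),
                within_mean = mean(value[year >= startyear & year <= endyear]),
                after_mean  = mean(value[year > endyear])) %>%
      as.data.frame()

    dS2
    ```

    As with 'dS', the result carries the 'ind' column, so only the three mean columns should be copied back into 'df'.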
    

    2. base R

    We can also use base R methods without changing the format of the dataset. From the original dataset ('df'), make an index ('indx') of the columns whose names contain digits, then strip the non-numeric part of those names and convert the rest to numeric ('v1').

     indx <- grep('\\d+', names(df))
     v1 <- as.numeric(sub('[^0-9]+', '', names(df)[indx]))
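
    To make the two helper objects concrete, here is a small sketch that runs the same two calls on a vector of column names. The vector 'nms' is a stand-in for names(df) and is an assumption made for illustration.

    ```r
    # Stand-in for names(df): eleven year columns followed by the two period columns
    nms <- c(paste0("y", 2000:2010), "startyear", "endyear")

    indx <- grep('\\d+', nms)                        # positions of names containing digits
    v1 <- as.numeric(sub('[^0-9]+', '', nms[indx]))  # 'y2000' becomes 2000, and so on

    indx  # positions 1 to 11
    v1    # years 2000 to 2010
    ```

    A position in 'v1' equals a column position in 'df' only because the year columns come first in this dataset. If they did not, the position would have to be mapped back through 'indx'.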
    

    Loop over the rows of 'df' (lapply). For each row, match the 'startyear' with 'v1' to get the position of the first year of the period ('i1'), and match the 'endyear' with 'v1' to get the position of the last ('i2'). The year columns before 'i1' give 'before_mean', those from 'i1' to 'i2' give 'within_mean', and those after 'i2' give 'after_mean'. rbind the list elements and assign the output to new columns ('nm1') in 'df'.

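    pos <- seq_along(v1)   # positions of the year columns within 'indx'
    df[nm1] <- do.call(rbind, lapply(seq_len(nrow(df)), function(i) {
           # values of the year columns for row i
           vals <- unlist(df[i, indx])
           i1 <- match(df$startyear[i], v1)
           i2 <- match(df$endyear[i], v1)
           # logical subsetting gives NaN rather than a wrong column
           # when a period starts at the first year or ends at the last
           before_mean <- mean(vals[pos < i1])
           within_mean <- mean(vals[pos >= i1 & pos <= i2])
           after_mean <- mean(vals[pos > i2])
           data.frame(before_mean, within_mean, after_mean) }))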
     df[nm1]
     #    before_mean within_mean after_mean
     #1    629.6667        44.0       65.0
     #2    636.0000        57.2     1179.4
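
    The row loop can also be avoided. The following is a sketch of a vectorized base R variant that was not part of the original answer: it builds one logical mask per period with outer and takes masked row means. The names 'yr_cols', 'yrs', 'm', 'masked_mean' and 'res' are hypothetical, and the sketch assumes there are no NA values in the year columns.

    ```r
    # Same example data as in the question
    df <- data.frame(y2000 = c(12, 636), y2001 = c(234, 76), y2002 = c(3434, 46),
                     y2003 = c(36, 35), y2004 = c(6, 64), y2005 = c(56, 65),
                     y2006 = c(43, 65), y2007 = c(6, 56), y2008 = c(64, 66),
                     y2009 = c(63, 5656), y2010 = c(65, 54),
                     startyear = c(2006, 2001), endyear = c(2009, 2005))

    yr_cols <- grep("^y\\d+$", names(df), value = TRUE)  # year column names
    yrs <- as.numeric(sub("^y", "", yr_cols))            # 2000, ..., 2010
    m <- as.matrix(df[yr_cols])                          # rows x years matrix

    # mask[i, j] is TRUE when year j falls in that period for row i
    before <- outer(df$startyear, yrs, ">")
    after  <- outer(df$endyear, yrs, "<")
    within <- !before & !after

    # Row-wise mean over the masked cells; an empty period gives NaN (0/0)
    masked_mean <- function(mask) rowSums(m * mask) / rowSums(mask)

    res <- data.frame(before_mean = masked_mean(before),
                      within_mean = masked_mean(within),
                      after_mean  = masked_mean(after))
    res
    ```

    Because the masks are built from the year values rather than from column positions, this version does not depend on the year columns coming first in 'df'.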