Search code examples
rdaterangecriteriacalculated-columns

How to calculate column average with range criteria given by two other variables?


Below is a sample dataset.

id<-c(1,2,3,4)
start<-c("Jul 2001","Jun 2001","May 2001","May 2001")
end<-c("Aug 2001","Sep 2001","Jul 2001","Nov 2001")

X1 <- runif(n=4, min=1, max=10)
X2 <- runif(n=4, min=1, max=10)
X3 <- runif(n=4, min=1, max=10)
X4 <- runif(n=4, min=1, max=10)
X5 <- runif(n=4, min=1, max=10)
X6 <- runif(n=4, min=1, max=10)
X7 <- runif(n=4, min=1, max=10)
X8 <- runif(n=4, min=1, max=10)
X9 <- runif(n=4, min=1, max=10)
X10 <- runif(n=4, min=1, max=10)
X11 <- runif(n=4, min=1, max=10)
X12 <- runif(n=4, min=1, max=10)

df <- data.frame(id,start,end,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12)

colnames(df)<-c("id","start","end","Jan 2001","Feb 2001","Mar 2001","Apr 2001","May 2001","Jun 2001",
            "Jul 2001","Aug 2001","Sep 2001","Oct 2001","Nov 2001","Dec 2001")

df
  id    start      end Jan 2001 Feb 2001 Mar 2001 Apr 2001 May 2001 Jun 2001 Jul 2001
1  1 Jul 2001 Aug 2001 6.384065 2.537499 6.562912 2.423018 6.908553 7.287870 7.089380
2  2 Jun 2001 Sep 2001 8.594478 2.824641 8.430340 8.508628 2.806191 6.989283 7.375734
3  3 May 2001 Jul 2001 1.657620 2.548688 4.172271 8.448615 8.426294 8.832702 8.294754
4  4 May 2001 Nov 2001 5.176202 4.827898 7.044409 9.117314 2.053103 2.610455 2.601701
  Aug 2001 Sep 2001 Oct 2001 Nov 2001 Dec 2001
1 7.393482 1.865180 5.316736 6.737959 8.783017
2 7.816893 4.021888 7.086448 1.728219 1.553020
3 5.443161 7.489278 9.848638 7.072435 1.294177
4 8.853365 8.899155 5.768139 1.414094 2.322848

I would like to calculate the column mean for each id, from respective start to end month (including the start and end). E.g.

id start    end        average
2  Jun 2001 Sep 2001   average of Jun, Jul, Aug and Sep 2001

My first thought is to assign index to each month. So that there is no need to deal with the yearmon data format. Seems to make it easier.

# generate index for month data
df.i <- df
df.i$start.i[df.i$start == "Jan 2001"] <- 1
df.i$start.i[df.i$start == "Feb 2001"] <- 2
df.i$start.i[df.i$start == "Mar 2001"] <- 3
df.i$start.i[df.i$start == "Apr 2001"] <- 4
df.i$start.i[df.i$start == "May 2001"] <- 5
df.i$start.i[df.i$start == "Jun 2001"] <- 6
df.i$start.i[df.i$start == "Jul 2001"] <- 7
df.i$start.i[df.i$start == "Aug 2001"] <- 8
df.i$start.i[df.i$start == "Sep 2001"] <- 9
df.i$start.i[df.i$start == "Oct 2001"] <- 10
df.i$start.i[df.i$start == "Nov 2001"] <- 11
df.i$start.i[df.i$start == "Dec 2001"] <- 12

df.i$end.i[df.i$end == "Jan 2001"] <- 1
df.i$end.i[df.i$end == "Feb 2001"] <- 2
df.i$end.i[df.i$end == "Mar 2001"] <- 3
df.i$end.i[df.i$end == "Apr 2001"] <- 4
df.i$end.i[df.i$end == "May 2001"] <- 5
df.i$end.i[df.i$end == "Jun 2001"] <- 6
df.i$end.i[df.i$end == "Jul 2001"] <- 7
df.i$end.i[df.i$end == "Aug 2001"] <- 8
df.i$end.i[df.i$end == "Sep 2001"] <- 9
df.i$end.i[df.i$end == "Oct 2001"] <- 10
df.i$end.i[df.i$end == "Nov 2001"] <- 11
df.i$end.i[df.i$end == "Dec 2001"] <- 12


colnames(df.i)<-c("id","start","end","1","2","3","4","5","6",
            "7","8","9","10","11","12","start.i","end.i")


 df.i
  id    start      end        1        2        3        4        5        6        7
1  1 Jul 2001 Aug 2001 6.384065 2.537499 6.562912 2.423018 6.908553 7.287870 7.089380
2  2 Jun 2001 Sep 2001 8.594478 2.824641 8.430340 8.508628 2.806191 6.989283 7.375734
3  3 May 2001 Jul 2001 1.657620 2.548688 4.172271 8.448615 8.426294 8.832702 8.294754
4  4 May 2001 Nov 2001 5.176202 4.827898 7.044409 9.117314 2.053103 2.610455 2.601701
          8        9       10       11       12 start.i end.i
1 7.393482 1.865180 5.316736 6.737959 8.783017       7     8
2 7.816893 4.021888 7.086448 1.728219 1.553020       6     9
3 5.443161 7.489278 9.848638 7.072435 1.294177       5     7
4 8.853365 8.899155 5.768139 1.414094 2.322848       5    11

Thank you.


Solution

  • Your data, set seed for reproducibility.

    id<-c(1,2,3,4)
    start<-c("Jul 2001","Jun 2001","May 2001","May 2001")
    end<-c("Aug 2001","Sep 2001","Jul 2001","Nov 2001")
    set.seed(123)
    df <- data.frame(id, start, end, matrix(runif(n=4*12, min=1, max=10), ncol=12))
    df$start <- as.character(df$start)
    df$end <- as.character(df$end)
    colnames(df)<-c("id", "start", "end", paste(month.abb, 2001))
    

    You can try apply. This will "loop" through every row, subsetting by the names of start and end. Important, start & end names must match the colnames of df. And finally the mean is calculated over the subset.

    apply(df, 1, function(x, y) mean(as.numeric(x[which(y == x[2]):which(y == x[3])])), colnames(df))
    [1] 5.251895 6.273809 5.537480 6.815905