Search code examples
raggregate

How to compute total number per each month and year?


Hey I have a data frame of 2 columns: date and cases:

 date            cases
2021-05-01          2
2022-03-01          3
2021-05-21          8

what I want is to calculate the total number per cases in each month of each year. I've found that code:

 total_cases <- aggregate(cbind(cases)~month(date),
         data=datalab,FUN=sum)

but it is not separated by year. it calculates to total cases for each month but combines the MAY of 2021 with MAY of 2022 together and I want that separately.


Solution

  • Using months (note the s), and strftime.

    with(datalab, aggregate(datalab['cases'], list(month=months(date), year=strftime(date, '%Y')), FUN=sum))
    #   month year cases
    # 1   May 2021    10
    # 2 March 2022     3
    

    You could also use the 1st to 7th substring.

    aggregate(datalab['cases'], list(month=substr(datalab$date, 1, 7)), FUN=sum)
    #     month cases
    # 1 2021-05    10
    # 2 2022-03     3
    

    Or, more concise, though not as nice.

    aggregate(cases ~ substr(date, 1, 7), data=datalab, FUN=sum)
    #   substr(date, 1, 7) cases
    # 1            2021-05    10
    # 2            2022-03     3
    

    Update

    To also aggregate by country, just put it in the second (i.e. the by=) list; it is referenced by the with.

    with(datalab2, aggregate(datalab2['cases'], list(month=months(date), year=strftime(date, '%Y'),
                                                     country=country), FUN=sum))
    #   month year country cases
    # 1   May 2021       A    15
    # 2 March 2022       A     5
    # 3   May 2021       B    16
    # 4 March 2022       B     6
    # 5   May 2021       C    12
    # 6 March 2022       C     5
    

    Data:

    datalab <- structure(list(date = c("2021-05-01", "2022-03-01", "2021-05-21"
    ), cases = c(2L, 3L, 8L)), class = "data.frame", row.names = c(NA, 
    -3L))
    
    datalab <- transform(datalab, date=as.Date(date))
    
    datalab2 <- structure(list(date = structure(c(18748, 19052, 18768, 18748, 
    19052, 18768, 18748, 19052, 18768), class = "Date"), country = structure(c(1L, 
    1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), levels = c("A", "B", "C"), class = "factor"), 
        cases = c(7L, 5L, 8L, 9L, 6L, 7L, 2L, 5L, 10L)), class = "data.frame", row.names = c(NA, 
    -9L))