Search code examples
rgroup-bymean

Calculating MAD in two different ways in R return different results


(I have posted a similar question at Cross Validated, but I believe this is more fitting for Stack Overflow).


I have a large dataframe data with following columns:

date        time        orig      new
2001-01-01  00:30:00    345       856
2001-01-01  00:32:43    4575      9261
2001-01-01  00:51:07    6453      2352
...
2001-01-01  23:57:51    421       168
2001-01-02  00:06:14    5612      3462
...
2001-01-31  23:49:11    14420     8992
2001-02-01  00:04:32    213       521
...

I want to calculate the monthly aggregated MAD, which can be calculated by mean(abs(orig - new)) when grouped by month. Ideally, at the end, I want the solutions (dataframe) in a following form:

month       mad
2001-01-01  7452.124
2001-02-01  3946.734
2001-03-01  995.938
...

I calculated the monthly MAD in two different ways.

Approach 1

I grouped data by month and took an average of the summed absolute differences (which is a "mathematical" way to do it, as I explained):

data %>%
   group_by(
       month = lubridate::floor_date(date, 'month')
   ) %>%
   summarise(mad = mean(abs(orig - new)))

Approach 2

I grouped data by hour and got the MAD grouped by hour, and then re-grouped it by month and took an average. This is counter-intuitive, but I used the hourly grouped dataframe for other analyses and tried computing the monthly MAD from this dataframe directly.

data_grouped_by_hour <- data %>%
   group_by(
        day = lubridate::floor_date(date, 'day'),
        hour = as.POSIXlt(time)$hour
   ) %>%
   summarise(mad = mean(abs(orig - new)))

data_grouped_by_hour %>%
   group_by(
       month = lubridate::floor_date(date, 'month')
   ) %>%
   summarise(mad = mean(mad))

As hinted from the post title, these approaches return different values. I assume my first approach is correct, as it is more concise and follows the accurate concept, but I wonder why the second approach does not return the same value.

I want to note that I would prefer Approach 2 so that I don't have to make separate tables for every analysis with different time unit. Any insights are appreciated.


Solution

  • Because average of average is not the same as complete average.

    This is a common misconception. Let's try to understand with the help of an example -

    Consider a list with 2 elements a and b

    x <- list(a = c(1, 5, 4, 3, 2, 8), b = c(6, 5))
    

    Now, similar to your question we will take average in 2 ways -

    1. Average of all the values of x
    res1 <- mean(unlist(x))
    res1
    #[1] 4.25
    
    1. Average of each element separately and then complete average.
    sapply(x, mean)
    #       a        b 
    #3.833333 5.500000 
    
    res2 <- mean(sapply(x, mean))
    res2
    #[1] 4.666667
    

    Notice that res1 and res2 has different values because the 2nd case is average of averages.

    The same logic applies in your case as well when you take daily average and then monthly which is average of averages.