Tags: r, dplyr, group-by, median, summarize

Calculate the median date based on two grouping conditions


I have the following data frame:

> head(df)
# A tibble: 6 x 6
# Groups:   lat, decade [2]
    lat  long date       year  decade    month_day
  <dbl> <dbl> <date>     <chr> <chr>     <chr>    
1    55    18 1952-02-03 1952  1950-1959 02-03    
2    55    18 1958-02-08 1958  1950-1959 02-08    
3    55    18 1958-02-08 1958  1950-1959 02-08    
4    55    18 1958-02-08 1958  1950-1959 02-08    
5    55    18 1965-02-07 1965  1960-1969 02-07    
6    55    18 1966-03-03 1966  1960-1969 03-03    
   


> summary(df)
          lat             long         date                year              decade         
     Min.   :55.00   Min.   :18   Min.   :1951-03-22   Length:1414        Length:1414       
     1st Qu.:56.00   1st Qu.:18   1st Qu.:1987-01-01   Class :character   Class :character  
     Median :58.00   Median :18   Median :2004-04-02   Mode  :character   Mode  :character  
     Mean   :59.07   Mean   :18   Mean   :1999-02-16                                        
     3rd Qu.:62.00   3rd Qu.:18   3rd Qu.:2014-01-01                                        
     Max.   :68.00   Max.   :18   Max.   :2021-03-28                                        
      month_day        
     Length:1414       
     Class :character  
     Mode  :character 

I would like to get the median month_day for each degree of latitude (lat) and each decade.

I have tried this but cannot get past an error:

df = df %>%
  group_by(lat, decade) %>%
  summarise(across(month_day, median)) %>%
  ungroup

Error in `summarise()`:
! Problem while computing `..1 = across(month_day, median)`.
Caused by error:
! `month_day` must return compatible vectors across groups.
i Result type for group 1 (lat = 55, decade = "1950-1959"): <double>.
i Result type for group 2 (lat = 55, decade = "1960-1969"): <character>.

I do not know how to solve it. Thank you very much for your help.

EDIT:

> ds_filtered_median[ds_filtered_median$lat == '57', ]
# A tibble: 124 x 6
     lat  long date       year  decade    month_day
   <dbl> <dbl> <date>     <chr> <chr>     <chr>    
 1    57    18 1955-04-08 1955  1950-1959 04-08    
 2    57    18 1957-02-19 1957  1950-1959 02-19    
 3    57    18 1958-04-06 1958  1950-1959 04-06    
 4    57    18 1959-01-01 1959  1950-1959 01-01    
 5    57    18 1960-01-03 1960  1960-1969 01-03    
 6    57    18 1961-01-02 1961  1960-1969 01-02    
 7    57    18 1962-01-02 1962  1960-1969 01-02    
 8    57    18 1963-01-01 1963  1960-1969 01-01    
 9    57    18 1964-01-19 1964  1960-1969 01-19    
10    57    18 1965-01-12 1965  1960-1969 01-12    
# ... with 114 more rows

Solution

  • You can convert each date to the number of days since the start of its year, take the median of that number, and then convert the median back to a date using any January 1 as the reference. The result can be off by one day in leap years, though. I used lubridate for the date manipulation.

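    library(dplyr)
    library(lubridate)
    
    df %>%
      mutate(
        date = ymd(date),
        # days elapsed since January 1 of the same year (0 for January 1)
        days_since_january = as.numeric(date - ymd(paste(year(date), 1, 1, sep = "-")))
      ) %>%
      group_by(lat, decade) %>%
      # the median of a numeric column has the same type in every group
      summarise(across(days_since_january, median), .groups = "keep") %>%
      # map the day offset back to a month-day, using 1960 as the reference year
      mutate(median_month_date = format(ymd("1960-01-01") + days(floor(days_since_january)), "%m-%d"))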
    
    
    # A tibble: 2 x 4
    # Groups:   lat, decade [2]
        lat decade    days_since_january median_month_date
      <dbl> <chr>                  <dbl> <chr>            
    1    55 1950-1959                 38 02-08            
    2    55 1960-1969                 49 02-19  
    
    # A tibble: 2 x 4
    # Groups:   lat, decade [2]
        lat decade    days_since_january median_month_date
      <int> <chr>                  <dbl> <chr>            
    1    57 1950-1959               72   03-13            
    2    57 1960-1969                1.5 01-02
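
As a side note on the error itself, here is a minimal base R sketch (no packages assumed; the vectors are small illustrative samples in the same format as month_day, not the full data). It suggests why `median()` on the character column produces mixed types: for an odd number of strings it returns the middle string, while for an even number it tries to average two strings and falls back to a numeric NA. The second part illustrates the leap-year caveat, since the reference year 1960 used above is itself a leap year.

```r
# Part 1: median() on character vectors (illustrative values only)
odd  <- median(c("02-07", "03-03", "02-08"))   # odd length: middle element after sorting
even <- suppressWarnings(                      # even length: mean() of two strings warns
  median(c("02-03", "02-08", "02-08", "02-08"))
)
class(odd)    # "character"
class(even)   # "numeric"
is.na(even)   # TRUE

# Part 2: the same calendar day has a different offset in leap and non-leap years
offset <- as.numeric(as.Date("1959-03-01") - as.Date("1959-01-01"))  # 59 (1959 is not a leap year)
back   <- format(as.Date("1960-01-01") + offset, "%m-%d")            # mapped back via leap year 1960
back          # "02-29", one day earlier than the original 03-01
```

Choosing a non-leap reference year would not remove the shift; it would move it to dates taken from leap years instead.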