Search code examples
rdataframedplyraggregateposixct

How to convert weather data recorded at 15 minutes interval to hourly data?


Our weather station recorded temperature data at 15 minutes interval. I'd like to calculate hourly minimum, daily maximum and daily mean temperatures. How can I do this in R programming language, preferably in dplyr syntax?

My data looks like this

                   date temperature
1   2014-05-26 11:45:00        25.0
2   2014-05-26 12:00:00        25.2
3   2014-05-26 12:15:00        25.3
4   2014-05-26 12:30:00        25.1
5   2014-05-26 12:45:00        25.4
[...]
96  2014-05-27 11:30:00        26.3
97  2014-05-27 11:45:00        25.7
98  2014-05-27 12:00:00        24.9
99  2014-05-27 12:15:00        24.9
100 2014-05-27 12:30:00        26.0

Here is the reproducible example. The dput function changes date format and mdy_hm won't parse it properly. I think this the reason some of the answers are not working for me.

df <- structure(list(date = structure(c(1401104700, 1401105600, 1401106500, 
1401107400, 1401108300, 1401109200, 1401110100, 1401111000, 1401111900, 
1401112800, 1401113700, 1401114600, 1401115500, 1401116400, 1401117300, 
1401118200, 1401119100, 1401120000, 1401120900, 1401121800, 1401122700, 
1401123600, 1401124500, 1401125400, 1401126300, 1401127200, 1401128100, 
1401129000, 1401129900, 1401130800, 1401131700, 1401132600, 1401133500, 
1401134400, 1401135300, 1401136200, 1401137100, 1401138000, 1401138900, 
1401139800, 1401140700, 1401141600, 1401142500, 1401143400, 1401144300, 
1401145200, 1401146100, 1401147000, 1401147900, 1401148800, 1401149700, 
1401150600, 1401151500, 1401152400, 1401153300, 1401154200, 1401155100, 
1401156000, 1401156900, 1401157800, 1401158700, 1401159600, 1401160500, 
1401161400, 1401162300, 1401163200, 1401164100, 1401165000, 1401165900, 
1401166800, 1401167700, 1401168600, 1401169500, 1401170400, 1401171300, 
1401172200, 1401173100, 1401174000, 1401174900, 1401175800, 1401176700, 
1401177600, 1401178500, 1401179400, 1401180300, 1401181200, 1401182100, 
1401183000, 1401183900, 1401184800, 1401185700, 1401186600, 1401187500, 
1401188400, 1401189300, 1401190200, 1401191100, 1401192000, 1401192900, 
1401193800), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    temperature = c(25, 25.2, 25.3, 25.1, 25.4, 26, 25.9, 25.6, 
    26.8, 27.8, 26.8, 26, 26, 26.3, 27, 27, 26.2, 25.8, 24.9, 
    25.1, 26.3, 25.6, 25.3, 25.2, 25.1, 24.8, 24.7, 24, 23, 22.7, 
    22.5, 22.5, 22.2, 21.9, 21.5, 21.1, 20.8, 20.5, 20.3, 20.3, 
    20.2, 20, 19.8, 19.6, 19.2, 19.1, 19.1, 18.9, 18.8, 18.6, 
    18.3, 18.2, 18.2, 18.2, 18.1, 17.9, 17.8, 17.7, 17.8, 18, 
    18.1, 18, 18.1, 18.6, 18.7, 18.5, 18.3, 18.1, 18.1, 18.6, 
    18.8, 18.6, 18.6, 18.3, 18.2, 18, 17.8, 18, 18.2, 18.9, 19.8, 
    19.6, 19.5, 19.7, 20.2, 21.5, 22.4, 23, 24, 23.3, 23.2, 23.7, 
    24.5, 24.8, 24.9, 26.3, 25.7, 24.9, 24.9, 26)), row.names = c(NA, 
-100L), class = c("tbl_df", "tbl", "data.frame"))

df$date = as.POSIXct(df$date)

I used the following code to convert data recorded at 15 minutes to daily, but I'd like to change to hourly.

library(dplyr)
library(lubridate)
df %>%
   group_by(date = as.Date(date, "%m/%d/%Y")) %>%       # how to group by hour?
    summarise(min_temp = min(temperature, na.rm = TRUE),
              max_temp = max(temperature, na.rm = TRUE),
              mean_temp = mean(temperature, na.rm = TRUE)) %>%
              ungroup()

Solution

  • An approach with round_date to get hourly grouping. Using toy data.

    (... to make sure your date is of class date use: df$date <- as.POSIXct(df$date, format="%m/%d/%Y %I:%M %p"))

    EDIT, using posted df

    library(dplyr)
    library(lubridate)
    
    df %>% 
      group_by(grp = cumsum(format(round_date(date, "hour"), "%H:%M") == 
                                     format(date, "%H:%M"))) %>% 
      summarize(date = unique(format(date, "%m/%d/%Y %I:00 %p")), 
                max_temperature = max(temperature), 
                min_temerature = min(temperature), 
                mean_temperature = mean(temperature)) %>% 
      select(-grp)
    # A tibble: 26 × 4
       date                max_temperature min_temerature mean_temperature
       <chr>                         <dbl>          <dbl>            <dbl>
     1 05/26/2014 11:00 AM            25             25               25  
     2 05/26/2014 12:00 PM            25.4           25.1             25.2
     3 05/26/2014 01:00 PM            26.8           25.6             26.1
     4 05/26/2014 02:00 PM            27.8           26               26.6
     5 05/26/2014 03:00 PM            27             26.2             26.6
     6 05/26/2014 04:00 PM            26.3           24.9             25.5
     7 05/26/2014 05:00 PM            25.6           25.1             25.3
     8 05/26/2014 06:00 PM            24.8           23               24.1
     9 05/26/2014 07:00 PM            22.7           22.2             22.5
    10 05/26/2014 08:00 PM            21.9           20.8             21.3
    # … with 16 more rows
    # ℹ Use `print(n = ...)` to see more rows
    

    If you actually want a summarized table over all hours add one additional summarize

    df %>% 
      group_by(grp = cumsum(format(round_date(date, "hour"), "%H:%M") == 
                                     format(date, "%H:%M"))) %>% 
      summarize(Hour = unique(format(date, "%H:00")), 
                max_temperature = max(temperature), 
                min_temperature = min(temperature), 
                mean_temperature = mean(temperature)) %>% 
      select(-grp) %>% 
      summarize(across(ends_with("temperature"), mean), .by = Hour) %>%
      arrange(Hour)
    # A tibble: 24 × 4
       Hour  max_temperature min_temperature mean_temperature
       <chr>           <dbl>           <dbl>            <dbl>
     1 00:00            18.6            18.2             18.3
     2 01:00            18.2            17.8             18  
     3 02:00            18.1            17.7             17.9
     4 03:00            18.7            18               18.4
     5 04:00            18.5            18.1             18.2
     6 05:00            18.8            18.6             18.7
     7 06:00            18.3            17.8             18.1
     8 07:00            19.8            18               18.7
     9 08:00            20.2            19.5             19.8
    10 09:00            24              21.5             22.7
    # … with 14 more rows
    # ℹ Use `print(n = ...)` to see more rows