Search code examples
rdataframelag

R - Grouped data with DoD change


Say I have a raw dataset (already in data frame and I can convert that easily to xts.data.table with as.xts.data.table), the DF is like the following:

Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature
-------------------------
2018-02-03 | New York City | NY | US | 18 | 22 | 19
2018-02-03 | London | LDN |UK | 10 | 25 | 15
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29
2018-02-02 | New York City | NY | US | 12 | 30 | 18
2018-02-02 | London | LDN | UK | 12 | 15 | 14
2018-02-02 | Singapore | SG | SG | 27 | 31 | 30

and so on (many more cities and many more days).

And I would like to make this to show both the current day temperature and the day over day change from the previous day, together with the other info on the city (state, country). i.e., the new data frame should be something like (from the example above):

Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature| ChangeInDailyMin | ChangeInDailyMax | ChangeInDailyMedian
-------------------------
2018-02-03 | New York City | NY | US | 18 | 22 | 19 | 6 | -8 | 1
2018-02-03 | London | LDN |UK | 10 | 25 | 15 | -2 | -10 | 1
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 | 1 | 1 | -1
2018-02-03 | New York City | NY | US | ...

and so on. i.e., add 3 more columns to show the day over day change.

Note that in the dataframe I may not have data everyday, however my change is defined as the differences between temperature on day t - temperature on the most recent date where I have data on the temperature.

I tried to use the shift function but R was complaining about the := sign.

Is there any way in R I could get this to work?

Thanks!


Solution

  • You can use dplyr::mutate_at and lubridate package to transform data in desired format. The data needs to be arranged in Date format and difference of current record with previous record can be taken with help of dplyr::lag function.

    library(dplyr)
    library(lubridate)
    
    df %>% mutate_if(is.character, funs(trimws)) %>%  #Trim any blank spaces
      mutate(Date = ymd(Date)) %>%                    #Convert to Date/Time
      group_by(City, State, Country) %>%               
      arrange(City, State, Country, Date) %>%         #Order data date
      mutate_at(vars(starts_with("Daily")), funs(Change = . - lag(.))) %>%
      filter(!is.na(DailyMinTemperature_Change))
    

    Result:

    # # A tibble: 3 x 10
    # # Groups: City, State, Country [3]
    # Date       City          State Country DailyMinTemperature DailyMaxTemperature DailyMedianTemperature DailyMinTemperature_Change DailyMaxT~ DailyMed~
    #   <date>     <chr>         <chr> <chr>                 <dbl>               <dbl>                  <int>                      <dbl>      <dbl>     <int>
    # 1 2018-02-03 London        LDN   UK                     10.0                25.0                     15                      -2.00      10.0          1
    # 2 2018-02-03 New York City NY    US                     18.0                22.0                     19                       6.00     - 8.00         1
    # 3 2018-02-03 Singapore     SG    SG                     28.0                32.0                     29                       1.00       1.00        -1
    # 
    

    Data:

    df <- read.table(text = 
    "Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature
    2018-02-03 | New York City | NY | US | 18 | 22 | 19
    2018-02-03 | London | LDN |UK | 10 | 25 | 15
    2018-02-03 | Singapore | SG | SG | 28 | 32 | 29
    2018-02-02 | New York City | NY | US | 12 | 30 | 18
    2018-02-02 | London | LDN | UK | 12 | 15 | 14
    2018-02-02 | Singapore | SG | SG | 27 | 31 | 30",
    header = TRUE, stringsAsFactors = FALSE, sep = "|")