Search code examples
rsubtraction

How to average correct data: Temperature - Average Temperature per day and station


I have to subtract the daily average temperature to each temperature value during the day also taking in account the station it comes from, along 9 months. I already have the average, my data is something like this:

 >>> df1
   Station     Date                Temperature
0  Station1    2022-05-1 9:30:00   7,4
1  Station1    2022-05-1 9:45:00   7,45
2  Station1    2022-05-1 10:00:00  8,2
3  Station1    2022-05-1 10:15:00  8,4
4  Station1    2022-05-1 10:30:00  8,9
5  Station1    2022-05-1 9:30:00   7,5
6  Station2    2022-05-1 9:45:00   7,56
7  Station2    2022-05-1 10:00:00  8,4
8  Station2    2022-05-1 10:15:00  8,7
9  Station2    2022-05-1 10:30:00  8,1
10 ...

>>> df2
   Station     Date        AverageTemperaturePerDayAndStation
0  Station1    2022-05-1   8
1  Station1    2022-05-2   8,3
2  Station1    2022-05-3   8,6
3  Station1    2022-05-4   8,4
4  Station1    2022-05-5   7,9
5  Station2    2022-05-1   6
6  Station2    2022-05-2   7,3
7  Station2    2022-05-3   8,6
8  Station2    2022-05-4   7,4
9  Station2    2022-05-5   6,9
10 ...

So I want R to substranct Temperature - AverageTemperaturePerDayAndStation just like this:

>>> df3
   Station     Date                CorrectedTemperature 
0  Station1    2022-05-1 9:30:00   7,4  - 8
1  Station1    2022-05-1 9:45:00   7,45 - 8
2  Station1    2022-05-1 10:00:00  8,2  - 8
3  Station1    2022-05-1 10:15:00  8,4  - 8
4  Station1    2022-05-1 10:30:00  8,9  - 8
5  Station1    2022-05-1 9:30:00   7,5  - 8
6  Station2    2022-05-1 9:45:00   7,56 - 6
7  Station2    2022-05-1 10:00:00  8,4  - 6
8  Station2    2022-05-1 10:15:00  8,7  - 6
9  Station2    2022-05-1 10:30:00  8,1  - 6
10 ...

Solution

  • I would recommend the following:

    #generate dataframes df1 and df2
    df1 <- data.frame(Station =c(rep("Station1",5), rep("Station2",5)),
                      Date = c("2022-05-1 9:30:00","2022-05-1 9:45:00","2022-05-1 10:00:00","2022-05-1 10:15:00", "2022-05-1 10:30:00",
                               "2022-05-1 9:30:00","2022-05-1 9:45:00","2022-05-1 10:00:00","2022-05-1 10:15:00", "2022-05-1 10:30:00"),
                      Temperature = c(7.4, 7.45, 8.2, 8.4, 8.9, 7.5, 7.56, 8.4, 8.7, 8.1))
    df2 <- data.frame(Station=c(rep("Station1",5), rep("Station2",5)),
                      Date = c("2022-05-01","2022-05-02","2022-05-03","2022-05-04","2022-05-05",
                               "2022-05-01","2022-05-02","2022-05-03","2022-05-04","2022-05-05"),
                      AverageTemperaturePerDayAndStation =c(8, 8.3, 8.6, 8.4, 7.9, 6, 7.3, 8.6, 7.4, 6.9))
    
    #save Time and date in separate columns
    df1$Time <- format(as.POSIXct(df1$Date), format = "%H:%M:%S")
    df1$Date <- as.Date(df1$Date)
    
    #change format of Date in df2 with as.Date
    df2$Date <- as.Date(df2$Date)
    
    #use left join (i.e. keep all entries from the first dataframe) and join by both Date and Station
    df3 <- dplyr::left_join(df1, df2, by =c("Date","Station"))
    

    Then, you can calculate the new column CorrectedTemperature.

    df3$CorrectedTemperature <- df3$Temperature - df3$AverageTemperaturePerDayAndStation
    

    Hope this could help.