Search code examples
rfor-loopsummarycbind

Summarize selected entries in a data set with a query


I am still pretty new to R and try to summarize data in a specific way. To illustrate it here, I am using the weather data from the nasaweather package. As an example, I would like to get the average temperature on a specific day, and display it for the 3 origins and the 12 months contained in this data set.

I think I can ge it done using the following code, where I specify the day I am interested in, create an empty data frame to be filled in, and then run a for loop through the months where I calculate the average temperature for each origin, cbind them with the month, and rbind them to the data frame. Finally I adjust the column names and print out the result:

library(nasaweather)
library(magrittr)
library(dplyr)

query_day = 15
data_output <- data.frame(month = numeric(),
                EWR = numeric(), 
                JFK = numeric(),
                LGA = numeric())

for (i in 1:12) {
  data_subset <- weather %>%
    filter(day == query_day, month == i) %>%
    summarize(
      EWR = mean(temp[origin == "EWR"]),
      JFK = mean(temp[origin == "JFK"]),
      LGA = mean(temp[origin == "LGA"]))
  data_output <- rbind(data_output, cbind(i, data_subset))
  rm(data_subset)
}

names(data_output) <- c("month", "EWR", "JFK", "LGA")
print(data_output)

In my hands this yields the following:

   month     EWR     JFK     LGA
1      1 39.3725 39.0875 38.9150
2      2 42.1625 39.3425 42.9050
3      3 37.4150 36.7775 37.3025
4      4 50.1275 48.1550 49.2050
5      5 58.8725 55.7150 59.1575
6      6 70.7825 70.2950 71.5700
7      7 86.9900 85.1225 87.2000
8      8 69.2075 69.0725 69.9425
9      9 60.6350 61.2125 61.7375
10    10 59.8850 58.3850 60.5150
11    11 45.7475 45.1700 49.0700
12    12 32.4950 38.0975 34.0325

which is exactly what I want. I just figured that my code seems to be far too complicated and would like to ask whether there is an easier way to get this job done?


Solution

  • There's a variety of problems with your code... but the main one is the fact that you didn't group_by first. As soon as you include that, this becomes easy peesy. Look at my comments to your code first, and then the finalized code at the bottom:

    library(nasaweather) ## Wrong package
    # library(magrittr) ## not needed, it's called by dplyr
    library(dplyr)
    
    query_day = 15
    #  data_output <- data.frame(month = numeric(), ## We won't need to specify this explicitly 
    ## (but you are right that you should specify this in a for loop. Go one step
    ## further by actually telling the data.frame how many rows to expect. 
    ## But not in this case cause we won't use for loop)
                            #  EWR = numeric(), 
                            #  JFK = numeric(),
                            #  LGA = numeric())
    
    for (i in 1:12) { ## You don't need to do a for loop... you can do it with the summarize_by function.
      data_subset <- weather %>%
        filter(day == query_day, month == i) %>%
        summarize(       ## Before doing summarize, you need a group_by to say what to summarize_by
          EWR = mean(temp[origin == "EWR"]),
          JFK = mean(temp[origin == "JFK"]),
          LGA = mean(temp[origin == "LGA"]))
      data_output <- rbind(data_output, cbind(i, data_subset)) ## If you're doing the group_by, this step isn't required. 
      # rm(data_subset) ## You don't have to remove temporary datasets...
    ## When the for loop ends, they are automatically removed.
    }
    
    names(data_output) <- c("month", "EWR", "JFK", "LGA") 
    print(data_output) 
    
    ################### Better code:
    library(nycflights13) ## your the package you waant is nycflights13... not nasaweather
    library(dplyr)
    
    query_day = 15
    
    weather %>%
      filter(day == query_day) %>%
      group_by(month) %>%
      summarize(
          EWR = mean(temp[origin == "EWR"]),
          JFK = mean(temp[origin == "JFK"]),
          LGA = mean(temp[origin == "LGA"])) -> data_output
    
    data_output
    

    Yields:

    > data_output
    # A tibble: 12 × 4
       month     EWR     JFK     LGA
       <dbl>   <dbl>   <dbl>   <dbl>
    1      1 39.3725 39.0875 38.9150
    2      2 42.1625 39.3425 42.9050
    3      3 37.4150 36.7775 37.3025
    4      4 50.1275 48.1550 49.2050
    5      5 58.8725 55.7150 59.1575
    6      6 70.7825 70.2950 71.5700
    7      7 86.9900 85.1225 87.2000
    8      8 69.2075 69.0725 69.9425
    9      9 60.6350 61.2125 61.7375
    10    10 59.8850 58.3850 60.5150
    11    11 45.7475 45.1700 49.0700
    12    12 32.4950 38.0975 34.0325