Search code examples
routliers

How to cross-reference tibbles in R?


library(nycflights13)
library(tidyverse)

My task is

Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).

I have generated a tibble with the average flight times between every two airports:

# A tibble: 224 x 3
# Groups:   origin [?]
   origin  dest mean_time
    <chr> <chr>     <dbl>
 1    EWR   ALB  31.78708
 2    EWR   ANC 413.12500
 3    EWR   ATL 111.99385
 4    EWR   AUS 211.24765
 5    EWR   AVL  89.79681
 6    EWR   BDL  25.46602
 7    EWR   BNA 114.50915
 8    EWR   BOS  40.31275
 9    EWR   BQN 196.17288
10    EWR   BTV  46.25734
# ... with 214 more rows

Now I need to sweep through flights and extract all rows, whose air_time is outside say (mean_time/2, mean_time*2). How do I do that?


Solution

  • You would first join that average flight time data frame onto your original flights data and then apply the filter. Something like this should work.

    library(nycflights13)
    library(tidyverse)
    
    data("flights")
    
    #get mean time
    mean_time <- flights %>%
        group_by(origin, dest) %>%
        summarise(mean_time = mean(air_time, na.rm = TRUE))
    
    #join mean time to original data
    df <- left_join(flights, mean_time)
    
    flag_flights <- df %>%
        filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
    > flag_flights
    # A tibble: 29 x 20
        year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin  dest air_time distance  hour minute
       <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl>   <chr>  <int>   <chr>  <chr> <chr>    <dbl>    <dbl> <dbl>  <dbl>
     1  2013     1    16      635            608        27      916            725       111      UA    541  N837UA    EWR   BOS       81      200     6      8
     2  2013     1    21     1851           1900        -9     2034           2012        22      US   2140  N956UW    LGA   BOS       76      184    19      0
     3  2013     1    28     1917           1825        52     2118           1935       103      US   1860  N755US    LGA   PHL       75       96    18     25
     4  2013    10     7     1059           1105        -6     1306           1215        51      MQ   3230  N524MQ    JFK   DCA       96      213    11      5
     5  2013    10    10      950            959        -9     1155           1115        40      EV   5711  N829AS    JFK   IAD       97      228     9     59
     6  2013     2    17      841            840         1     1044           1003        41      9E   3422  N913XJ    JFK   BOS       86      187     8     40
     7  2013     3     8     1136           1001        95     1409           1116       173      UA   1240  N17730    EWR   BOS       82      200    10      1
     8  2013     3     8     1246           1245         1     1552           1350       122      AA   1850  N3FEAA    JFK   BOS       80      187    12     45
     9  2013     3    12     1607           1500        67     1803           1608       115      US   2132  N946UW    LGA   BOS       77      184    15      0
    10  2013     3    12     1612           1557        15     1808           1720        48      UA   1116  N37252    EWR   BOS       81      200    15     57
    # ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>