Search code examples
rdplyrlag

group_by lag returns unlagged variable


I'm completely stumped and baffled trying to create a lagged variable in order to calculate the miles travelled between two observations.

>   fuel_sheet %>% dplyr::select(vehicle, date, mileage)
+                                                       
# A tibble: 64 x 3                                      
# Groups:   vehicle [2]                                 
   vehicle         date       mileage                   
   <fct>           <date>       <dbl>                   
 1 VW T4 Campervan 2015-09-30  150631                   
 2 VW T4 Campervan 2015-10-16  150866                   
 3 VW T4 Campervan 2015-10-18  151395                   
 4 VW T4 Campervan 2015-12-06  151731                   
 5 VW T4 Campervan 2016-01-22  151968                   
 6 VW T4 Campervan 2016-01-24  152360                   
 7 VW T4 Campervan 2016-01-25  152731                   
 8 VW T4 Campervan 2016-02-21  153066                   
 9 VW T4 Campervan 2016-03-29  153500                   
10 VW T4 Campervan 2016-04-16  153751                   
# … with 54 more rows                                   
> fuel_sheet <- fuel_sheet %>%                                    
    group_by(vehicle) %>%                                       
    mutate(miles_lag = lag(mileage, n=1, order_by=date)) %>%                   
    mutate(miles = mileage - lag(mileage, n=1, order_by=date))  
> fuel_sheet %>% dplyr::select(vehicle, date, mileage, miles_lag, miles)
# A tibble: 64 x 5                                   
# Groups:   vehicle [2]                              
   vehicle         date       mileage miles_lag miles
   <fct>           <date>       <dbl>     <dbl> <dbl>
 1 VW T4 Campervan 2015-09-30  150631    150631     0
 2 VW T4 Campervan 2015-10-16  150866    150866     0
 3 VW T4 Campervan 2015-10-18  151395    151395     0
 4 VW T4 Campervan 2015-12-06  151731    151731     0
 5 VW T4 Campervan 2016-01-22  151968    151968     0
 6 VW T4 Campervan 2016-01-24  152360    152360     0
 7 VW T4 Campervan 2016-01-25  152731    152731     0
 8 VW T4 Campervan 2016-02-21  153066    153066     0
 9 VW T4 Campervan 2016-03-29  153500    153500     0
10 VW T4 Campervan 2016-04-16  153751    153751     0
# … with 54 more rows                                

If I try using diff() in place of lag to directly calculate the difference I get an array that is short by one element.

Found many threads here on SO on this sort of thing, one example is r - diff operation within a group, after a dplyr::group_by() - Stack Overflow, but can not see why my code isn't behaving as expected.

There is no conflict with plyr as its not loaded and I can not see any other conflicts that would be affecting things here.

> sessionInfo()
R version 4.0.2 (2020-06-22)                                                                            
Platform: x86_64-pc-linux-gnu (64-bit)                                                                  
Running under: Gentoo/Linux                                                                             
                                                                                                        
Matrix products: default                                                                                
BLAS:   /usr/lib64/libblas.so.3.8.0                                                                     
LAPACK: /usr/lib64/liblapack.so.3.8.0                                                                   
                                                                                                        
locale:                                                                                                 
 [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C              LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
 [5] LC_MONETARY=en_GB.utf8    LC_MESSAGES=en_GB.utf8    LC_PAPER=en_GB.utf8       LC_NAME=C            
 [9] LC_ADDRESS=C              LC_TELEPHONE=C            LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C  
                                                                                                        
attached base packages:                                                                                 
[1] graphics  grDevices utils     datasets  stats     methods   base                                    
                                                                                                         
other attached packages:                                                                                 
 [1] reprex_0.3.0        xtable_1.8-4        orgutils_0.4-1      lubridate_1.7.9     googlesheets4_0.2.0 
 [6] chron_2.3-56        ascii_2.3           spelling_2.1        RSQLite_2.2.0       rmsfact_0.0.3       
[11] rmarkdown_2.3       reticulate_1.16     magrittr_1.5        kableExtra_1.2.1    knitr_1.29          
[16] Hmisc_4.4-1         Formula_1.2-3       survival_3.2-3      lattice_0.20-41     googlesheets_0.3.0  
[21] foreign_0.8-80      fcuk_0.1.21         devtools_2.3.1      usethis_1.6.1       forcats_0.5.0       
[26] stringr_1.4.0       dplyr_1.0.2         purrr_0.3.4         readr_1.3.1         tidyr_1.1.2         
[31] tibble_3.0.3        ggplot2_3.3.2       tidyverse_1.3.0                                             
                                                                                                         
loaded via a namespace (and not attached):                                                               
 [1] googledrive_1.0.1   colorspace_1.4-1    ellipsis_0.3.1      rprojroot_1.3-2     htmlTable_2.0.1     
 [6] base64enc_0.1-3     fs_1.5.0            rstudioapi_0.11     remotes_2.2.0       bit64_4.0.5         
[11] fansi_0.4.1         xml2_1.3.2          splines_4.0.2       pkgload_1.1.0       jsonlite_1.7.1      
[16] broom_0.7.0         cluster_2.1.0       dbplyr_1.4.4        png_0.1-7           clipr_0.7.0         
[21] compiler_4.0.2      httr_1.4.2          backports_1.1.9     assertthat_0.2.1    Matrix_1.2-18       
[26] gargle_0.5.0        cli_2.0.2           later_1.1.0.1       htmltools_0.5.0     prettyunits_1.1.1   
[31] tools_4.0.2         gtable_0.3.0        glue_1.4.2          Rcpp_1.0.5          cellranger_1.1.0    
[36] vctrs_0.3.4         xfun_0.17           ps_1.3.4            testthat_2.3.2      rvest_0.3.6         
[41] lifecycle_0.2.0     textutils_0.2-0     stringdist_0.9.6    scales_1.1.1        promises_1.1.1      
[46] hms_0.5.3           parallel_4.0.2      RColorBrewer_1.1-2  curl_4.3            memoise_1.1.0       
[51] gridExtra_2.3       rpart_4.1-15        latticeExtra_0.6-29 stringi_1.5.3       desc_1.2.0          
[56] checkmate_2.0.0     pkgbuild_1.1.0      rlang_0.4.7         pkgconfig_2.0.3     evaluate_0.14       
[61] htmlwidgets_1.5.1   bit_4.0.4           processx_3.4.4      tidyselect_1.1.0    R6_2.4.1            
[66] generics_0.0.2      DBI_1.1.0           whisker_0.4         pillar_1.4.6        haven_2.3.1         
[71] withr_2.2.0         nnet_7.3-14         modelr_0.1.8        crayon_1.3.4        utf8_1.1.4          
[76] jpeg_0.1-8.1        grid_4.0.2          readxl_1.3.1        data.table_1.13.0   blob_1.2.1          
[81] callr_3.4.4         digest_0.6.25       webshot_0.5.2       httpuv_1.5.4        openssl_1.4.2       
[86] munsell_0.5.0       viridisLite_0.3.0   askpass_1.1         sessioninfo_1.1.1                       
> tidyverse_conflicts()                                                                                       
── Conflicts ───────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──                                                                                                            
✖ lubridate::as.difftime() masks base::as.difftime()                                                          
✖ lubridate::date()        masks base::date()                                                                 
✖ lubridate::days()        masks chron::days()                                                                
✖ ascii::expand()          masks tidyr::expand()                                                              
✖ magrittr::extract()      masks tidyr::extract()                                                             
✖ stats::filter()          masks dplyr::filter()                                                              
✖ kableExtra::group_rows() masks dplyr::group_rows()                                                          
✖ lubridate::hours()       masks chron::hours()                                                               
✖ lubridate::intersect()   masks base::intersect()                                                            
✖ stats::lag()             masks dplyr::lag()                                                                 
✖ lubridate::minutes()     masks chron::minutes()                                                             
✖ lubridate::seconds()     masks chron::seconds()                                                             
✖ magrittr::set_names()    masks purrr::set_names()                                                           
✖ lubridate::setdiff()     masks base::setdiff()                                                              
✖ Hmisc::src()             masks dplyr::src()                                                                 
✖ Hmisc::summarize()       masks dplyr::summarize()                                                           
✖ lubridate::union()       masks base::union()                                                                
✖ lubridate::years()       masks chron::years()                                                               

Any suggestions or pointers would be gratefully received, and if I need to provide more information please let me know.

Thanks in advance.

EDIT : As requested dput() of the data, I knew there was an easy way to dump out the data for others to use but couldn't for the life of me remember what it was, had been looking at the reprex package to help without success...

> dput(fuel_sheet %>% dplyr::select(vehicle, date, mileage))                                    
structure(list(vehicle = structure(c(3L, 3L, 3L, 3L, 3L, 3L,                                    
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 3L,                                 
1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,                                 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,                                 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("Ford Focus",                               
"Honda Jazz", "VW T4 Campervan"), class = "factor"), date = structure(c(16708,                  
16724, 16726, 16775, 16822, 16824, 16825, 16852, 16889, 16907,                                  
16914, 16915, 16915, 16973, 17009, 17011, 17015, 17025, 17047,                                  
17092, 17117, 17117, 17144, 17146, 17173, 17237, 17284, 17295,                                  
17319, 17333, 17361, 17423, 17434, 17453, 17466, 17558, 17579,                                  
17634, 17635, 17641, 17642, 17710, 17743, 17807, 17807, 17929,                                  
17995, 17999, 18004, 18042, 18074, 18077, 18106, 18129, 18139,                                  
18223, 18227, 18337, 18469, 18470, 18493, 18501, 18505, 18509                                   
), class = "Date"), mileage = c(150631, 150866, 151395, 151731,                                 
151968, 152360, 152731, 153066, 153500, 153751, 154363, 154619,                                 
154866, 155473, 156008, 156199, 156637, 156997, 157157, 61476,                                  
61774, 157458, 62175, 62409, 63209, 157769, 158325, 158803, 158956,                             
159314, 159706, 160126, 160550, 160986, 161363, 162000, 162449,                                 
162879, 163215, 163636, 164005, 164613, 164802, 165352, 165651,                                 
165946, 166488, 166987, 167131, 167749, 168013, 168361, 168853,                                 
169371, 169996, 170636, 170640, 170954, 171573, 171763, 172286,                                 
172616, 173105, 173266)), row.names = c(NA, -64L), groups = structure(list(                     
    vehicle = structure(c(1L, 3L), .Label = c("Ford Focus", "Honda Jazz",                       
    "VW T4 Campervan"), class = "factor"), .rows = structure(list(                              
        c(20L, 21L, 23L, 24L, 25L), c(1L, 2L, 3L, 4L, 5L, 6L,                                   
        7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,                                
        19L, 22L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L,                                  
        35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L, 45L,                                  
        46L, 47L, 48L, 49L, 50L, 51L, 52L, 53L, 54L, 55L, 56L,                                  
        57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L)), ptype = integer(0), class = c("vctrs_list_of",
    "vctrs_vctr", "list"))), row.names = 1:2, class = c("tbl_df",                               
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",                                    
"tbl_df", "tbl", "data.frame"))                                                                 

EDIT 2 : Omitting order_by doesn't help...

> fuel_sheet %>%  group_by(vehicle) %>% mutate(miles = mileage - lag(mileage, n=1))    
# A tibble: 64 x 10                                                                    
# Groups:   vehicle [2]                                                                
   vehicle         mileage litre   gbp station   date       month  year miles_lag miles
   <fct>             <dbl> <dbl> <dbl> <fct>     <date>     <dbl> <dbl>     <dbl> <dbl>
 1 VW T4 Campervan  150631  78.0  81.9 Tescos    2015-09-30     9  2015    150631     0
 2 VW T4 Campervan  150866  33.6  36.9 Nisa      2015-10-16    10  2015    150866     0
 3 VW T4 Campervan  151395  66    72.5 Tescos    2015-10-18    10  2015    151395     0
 4 VW T4 Campervan  151731  44    44.4 Morrisons 2015-12-06    12  2015    151731     0
 5 VW T4 Campervan  151968  34.1  34   Other     2016-01-22     1  2016    151968     0
 6 VW T4 Campervan  152360  49.1  48.1 Morrisons 2016-01-24     1  2016    152360     0
 7 VW T4 Campervan  152731  53.2  52.0 Tescos    2016-01-25     1  2016    152731     0
 8 VW T4 Campervan  153066  44.0  43   Tescos    2016-02-21     2  2016    153066     0
 9 VW T4 Campervan  153500  56.9  58   Tescos    2016-03-29     3  2016    153500     0
10 VW T4 Campervan  153751  27.6  29   Tescos    2016-04-16     4  2016    153751     0
# … with 54 more rows                                                                  

Solution

  • In the conflicts you have

    stats::lag()             masks dplyr::lag()
    

    If you specify dplyr::lag() inside both mutate's it should work. (I checked that specifying stats::lag() I get the same output than yours.)