How to remove values from a vector which are larger than previous values in the vector


I'm working with a large dataset (400M observations) of cars. Each entry has a VIN, identifying it as a specific car. I group the dataset by VIN, so each group is one car (usually with multiple entries). I'm currently using the dplyr package (I recently learned R using the tidyverse).

I am trying to remove all observations where a vehicle increases in price as the car gets older (appreciates in value over time).

Let's look at one car as an example. I've removed all columns except the relevant ones, data_year (year in which the data was recorded) and veh_price (vehicle price), below:

ex_car <- structure(list(data_year = c(2003, 2008, 2009, 2009, 2010, 2012, 
2013, 2015), veh_price = c(2341, 50, 50, 100, 2600, 450, 3750, 
23000)), row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"
))

The rows are already sorted by data_year (ascending order). As mentioned above, I am attempting to filter out any rows where the vehicle price increases from the previous row.

data_year     veh_price
2003          2341
2008          50
2009          50
2009          100
2010          2600
2012          450
2013          3750
2015          23000

The desired output:

data_year     veh_price
2003          2341  
2008          50        
2009          50            

I considered using mutate and lead to create a new column holding the next value, calculating the difference, and then filtering out all negative values (see below):

filtered_sample <- ex_car|> 
  mutate(
    next_price = lead(veh_price),
    diff_p = veh_price - next_price,
  ) |> 
  filter( diff_p >= 0) 

However, I'm concerned that adding 2 columns is too inefficient to run on the large dataset. Is there a better, more efficient way of solving this problem? (I've heard data.table is optimized for larger operations. Is there a way to use this package to achieve the same goal?)

When I was first approaching this issue, I thought of iterating once through the vector, storing a temporary lowest_price variable (initialized as the first value in the vector). If the next price is lower or equal (<=), set it as the new lowest_price; if it is greater (>), remove the value.

However, I couldn't figure out how to do this well in R.

Thanks for your time, please let me know if I failed to comprehensively describe the problem or if you need any more information.


Solution

  • UPDATE: x == cummin(.) might be prone to R FAQ 7.31, a problem with tests of equality on high-precision floating-point numbers. While not likely with this data, if the price-like variable has high precision and/or you want to be a bit safer in this regard, use @GregorThomas's suggested alternative of diff(.) <= 0, though this needs to be made cumulative so that everything after the first increase is dropped:

    # dplyr
    ex_car %>%
      group_by(VIN) %>%
      filter(cumall(c(TRUE, diff(veh_price) <= 0)))
    # data.table
    as.data.table(ex_car)[, .SD[c(TRUE, cumsum(diff(veh_price) > 0) == 0),], by = .(VIN)]
    # base
    ind <- with(ex_car, !!ave(veh_price, VIN, FUN = function(z) c(TRUE, cumsum(diff(z) > 0) == 0)))
    ex_car[ind,]
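One thing worth checking before swapping approaches: as far as I can tell, the two filters agree on the example car but are not interchangeable in general. The cummin test keeps every row that sets or ties the lowest price so far, while the diff-based test stops at the first increase and drops everything after it. A minimal base R sketch, using made-up prices that are not from the question:

```r
# Hypothetical prices (not from the question): a rise followed by a new low.
price <- c(5, 3, 4, 2)

# Running-minimum rule: keep every value that sets or ties the lowest price so far.
keep_cummin <- price == cummin(price)

# First-increase rule (base R form of the diff-based filter): keep values
# only until the first increase, then drop everything after it.
keep_diff <- c(TRUE, cumsum(diff(price) > 0) == 0)

price[keep_cummin]  # 5 3 2
price[keep_diff]    # 5 3
```

Which behaviour is wanted depends on whether a later, lower price should bring a car's rows back in; the question's lowest_price description reads like the running-minimum rule.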
    

    We can use cummin here (relying on non-decreasing data_year). (This is for cumulative min, not a mis-spelling of the spice cumin ;-)
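To see why this matches the single pass with a lowest_price variable described in the question, here is a small sketch comparing an explicit loop with cummin on the example prices. The names keep_loop, keep_cummin and running_min are just for illustration, and lowest_price starts at Inf rather than the first value so the loop also copes with an empty vector:

```r
# Example prices from the question, already ordered by data_year.
veh_price <- c(2341, 50, 50, 100, 2600, 450, 3750, 23000)

# Explicit single pass: keep a price only if it does not exceed
# the lowest price seen so far.
keep_loop <- logical(length(veh_price))
lowest_price <- Inf
for (i in seq_along(veh_price)) {
  if (veh_price[i] <= lowest_price) {
    lowest_price <- veh_price[i]
    keep_loop[i] <- TRUE
  }
}

# Vectorised equivalent: cummin() gives the running minimum at each position.
running_min <- cummin(veh_price)
keep_cummin <- veh_price == running_min

running_min             # 2341 50 50 50 50 50 50 50
veh_price[keep_cummin]  # 2341 50 50
identical(keep_loop, keep_cummin)  # TRUE
```

The vectorised form avoids the R-level loop, which is what makes it practical on a large grouped dataset.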

    dplyr

    library(dplyr)
    ex_car %>%
      filter(veh_price == cummin(veh_price))
    # # A tibble: 3 × 2
    #   data_year veh_price
    #       <dbl>     <dbl>
    # 1      2003      2341
    # 2      2008        50
    # 3      2009        50
    

    Grouping by your VIN:

    ex_car %>%
      group_by(VIN) %>%
      filter(veh_price == cummin(veh_price))
    ### or with dplyr_1.1.0 or newer
    ex_car %>%
      filter(veh_price == cummin(veh_price), .by = VIN)
    

    data.table

    library(data.table)
    as.data.table(ex_car)[ veh_price == cummin(veh_price),]
    #    data_year veh_price
    #        <num>     <num>
    # 1:      2003      2341
    # 2:      2008        50
    # 3:      2009        50
    

    Grouping by your VIN:

    as.data.table(ex_car)[, .SD[veh_price == cummin(veh_price)], by = .(VIN)]
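Since the example data has no VIN column, here is a self-contained sketch on a made-up two-car table (car_dt and its VIN values are hypothetical). It uses the .I row-index idiom, which collects the rows to keep per group and then subsets once; this is often reported to be faster than subsetting .SD for every group, though it is worth benchmarking on your own data:

```r
library(data.table)

# Hypothetical two-car table; the VIN values are made up for illustration.
car_dt <- data.table(
  VIN       = rep(c("A", "B"), each = 4),
  data_year = c(2003, 2008, 2009, 2010, 2005, 2006, 2007, 2008),
  veh_price = c(2341, 50, 100, 40, 900, 950, 800, 800)
)

# .I holds row numbers in the full table; collect the ones to keep per VIN...
keep_rows <- car_dt[, .I[veh_price == cummin(veh_price)], by = VIN]$V1

# ...then subset once, instead of building a .SD subset for every group.
result <- car_dt[keep_rows]
result
```

For car A this keeps the rows priced 2341, 50 and 40; for car B it keeps 900, 800 and 800.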
    

    base R

    ind <- with(ex_car, veh_price == cummin(veh_price))
    ex_car[ind,]
    # # A tibble: 3 × 2
    #   data_year veh_price
    #       <dbl>     <dbl>
    # 1      2003      2341
    # 2      2008        50
    # 3      2009        50
    

    Grouping by your VIN requires the addition of ave:

    ind <- with(ex_car, as.logical(ave(veh_price, VIN, FUN = function(z) z == cummin(z))))
    ex_car[ind,]
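A self-contained base R sketch of the grouped version, on a made-up two-car data frame (car_df and its VIN values are hypothetical). One detail to watch: because veh_price is numeric, ave returns the per-group comparison as numeric 0/1, so it is converted back to logical before being used as a row index:

```r
# Hypothetical two-car data frame; the VIN values are made up for illustration.
car_df <- data.frame(
  VIN       = rep(c("A", "B"), each = 4),
  data_year = c(2003, 2008, 2009, 2010, 2005, 2006, 2007, 2008),
  veh_price = c(2341, 50, 100, 40, 900, 950, 800, 800)
)

# ave() applies the function within each VIN; the logical result is stored
# back into a numeric vector, so convert the 0/1 values to TRUE/FALSE.
ind <- as.logical(ave(car_df$veh_price, car_df$VIN,
                      FUN = function(z) z == cummin(z)))

car_df[ind, ]
```

Indexing with the raw 0/1 vector would not work as a filter, since 0 selects nothing and 1 selects the first row each time.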