Search code examples
rvariablesinferencefeature-engineeringdata-wrangling

Variable creation - Inferring age


I have a grouped dataframe;

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,30000)
DF <- data.frame(Truck, OilChanged, Odometer)

# Truck OilChanged Odometer
# 1      A       True     1000
# 2      A     NewOil     1000
# 3      A      False     2000
# 4      A      False     3000
# 5      B      False      700
# 6      B      False      800
# 7      B      False      900
# 8      B      False     1000
# 9      C       True    20000
# 10     C     NewOil    20000
# 11     C       True    30000
# 12     C     NewOil    30000

I'm trying to infer the age of the oil (in kilometers) when possible. Inference is only possible once an oil change has occurred. If no oil change occurs the age of the oil will remain a mystery (example: Truck B).

Below is the desired result;

Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
Odometer <- c(1000, 1000, 2000, 3000,700,800,900,1000,20000,20000,30000,30000)
OilAge <- c(NA,0,1000,2000,NA,NA,NA,NA,NA,0,10000,0)
Result <- data.frame(Truck, OilChanged, Odometer, OilAge)


# Truck OilChanged Odometer OilAge
# 1      A       True     1000     NA
# 2      A     NewOil     1000      0
# 3      A      False     2000   1000
# 4      A      False     3000   2000
# 5      B      False      700     NA
# 6      B      False      800     NA
# 7      B      False      900     NA
# 8      B      False     1000     NA
# 9      C       True    20000     NA
# 10     C     NewOil    20000      0
# 11     C       True    30000  10000
# 12     C     NewOil    30000      0

Note: The odometer reading between a True oilchanged row and it's following NewOil row will always be the same. Because an oil sample is taken directly before the oil is changed. But both rows must be maintained for downstream calculations to function properly, such as rate-of-change formulas.

NA in the OilAge column means the age is a mystery.


Solution

  • Please let me know if this solution works for you.

    Truck <- c('A','A','A','A','B','B','B','B','C','C','C','C')
    OilChanged <- c('True','NewOil','False','False','False','False','False','False','True','NewOil','True','NewOil')
    Odometer <- c(1000, 1000, 2000,3000,700,800,900,1000,20000,20000,30000,30000)
    DF <- data.frame(Truck, OilChanged, Odometer)
    
    DF %>%
      group_by(Truck) %>%
      mutate(status = length(unique(OilChanged)),
             OilAge = ifelse(OilChanged == "NewOil", 0,
                             ifelse(OilChanged == "False", Odometer - (Odometer - lag(Odometer)),
                                    ifelse(OilChanged == "True", Odometer - lag(Odometer), NA)))) %>%
      mutate(OilAge = ifelse(status !=1, OilAge, NA)) %>%
      subset(select = c(Truck, OilChanged, Odometer, OilAge))