Tags: r, dataframe, dplyr, row

Correct lag() of column associated with an observation


I have a data frame, shown below, in which I create a lagged column by calling lag() on the observed values in the values column. Each row is associated with a specific journey. The lag() call currently ignores journey boundaries: the first row of each new journey receives the last value of the previous journey, even though it should have no previous recording. I would like to correct this and then drop those first rows from my data frame.

Printing df_output shows the desired result, but the rows are currently selected manually.

My real data frame contains a large number of rows, and therefore many journeys.

# Reproducible example
library(dplyr)  # lag() below is dplyr::lag(); stats::lag() does not shift a plain vector

df <- data.frame(
  tours = c("kuu122", "kuu122", "ansc123123", "ansc123123", "ansc123123",
            "ansc123123", "baa3999", "baa3999", "baa3999", "baa3999"),
  order = c(4, 5, rep(c(1, 2, 3, 4), 2)),
  journey = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  values = c(50, 60, 10, 20, 15, 13, 28, 15, 22, 14)
)

# Get the observed values at order_t
observed_values <- df$values
# Create lagged column (shifts across the whole column, ignoring journeys)
df$prev_values <- lag(observed_values, 1)

# TODO
# Remove row if prev_values are the first observation on a new journey
#???


df_output <- df[c(2, 4:6, 8:10),]
df_output

Solution

  • Using base R with duplicated

    subset(df, duplicated(journey))
    

    -output

             tours order journey values
    2      kuu122     5       1     60
    4  ansc123123     2       2     20
    5  ansc123123     3       2     15
    6  ansc123123     4       2     13
    8     baa3999     2       3     15
    9     baa3999     3       3     22
    10    baa3999     4       3     14
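
If the lagged column itself should also respect journey boundaries, one option is to compute it per journey before dropping the first rows. The following is a minimal base R sketch, not part of the original answer. It assumes the rows are already sorted by order within each journey, and it reuses the column name prev_values from the question.

```r
# Reproducible data from the question
df <- data.frame(
  tours = c("kuu122", "kuu122", rep("ansc123123", 4), rep("baa3999", 4)),
  order = c(4, 5, rep(c(1, 2, 3, 4), 2)),
  journey = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  values = c(50, 60, 10, 20, 15, 13, 28, 15, 22, 14)
)

# Lag values within each journey: the first row of a journey gets NA
# (assumption: rows are already ordered within each journey)
df$prev_values <- ave(df$values, df$journey,
                      FUN = function(x) c(NA, head(x, -1)))

# Drop the first row of every journey, as in the duplicated() answer
df_output <- subset(df, duplicated(journey))
df_output
```

With this approach, every remaining prev_values entry comes from the same journey as its row, so no value leaks across a journey boundary.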