Search code examples
rdelete-rowmultiple-conditions

How can I remove rows on more conditions in R?


I have session id's, client id's, a conversion column and all with a specific date. I want to delete the rows after the last purchase of a client. My data looks as follows:

SessionId       ClientId        Conversion         Date
    1               1                0             05-01
    2               1                0             06-01
    3               1                0             07-01
    4               1                1             08-01
    5               1                0             09-01
    6               2                0             05-01 
    7               2                1             06-01
    8               2                0             07-01
    9               2                1             08-01
    10              2                0             09-01

As output I want:

SessionId       ClientId        Conversion         Date
    1               1                0             05-01
    2               1                0             06-01
    3               1                1             07-01
    6               2                0             05-01 
    7               2                1             06-01
    8               2                0             07-01
    9               2                1             08-01

I looks quite easy, but it has some conditions. Based on the client id, the sessions after the last purchase of a cutomer need to be deleted. I have many observations, so deleting after a particular date is not possible. It need to check every client id on when someone did a purchase.

I have no clue what kind of function I need to use for this. Maybe a certain kind of loop?

Hopefully someone can help me with this.


Solution

  • If your data is already ordered according to Date, for each ClientId we can select all the rows before the last conversion took place.

    This can be done in base R :

    subset(df, ave(Conversion == 1, ClientId, FUN = function(x) seq_along(x) <= max(which(x))))
    

    Using dplyr :

    library(dplyr)
    df %>% group_by(ClientId) %>% filter(row_number() <= max(which(Conversion == 1)))
    

    Or data.table :

    library(data.table)
    setDT(df)[, .SD[seq_len(.N) <= max(which(Conversion == 1))], ClientId]