Search code examples
rdatedplyrfiltersubset

How do I filter out data based on a value in column while capturing minimum date criteria in another column?


I need to get a limited data based on the criteria of Ind being 'Y'. But it should only capture the first row when Ind changes from 'N' or 'O' value to 'Y'. In the Check_Date it should update that first value.

Input Data:

ID  Date    Ind 
2   201905  N    
2   201906  N    
2   201907  N    
2   201908  N    
2   201909  N    
2   201910  N    
2   201911  N    
2   201912  Y   
2   202001  Y    
2   202002  Y    
2   202003  Y    
2   202004  Y    
2   202005  N    
2   202006  N    
2   202007  N    
2   202008  Y   
3   201906  N    

Result:

ID  Date    Ind Check_Date
2   201912  Y   201912
2   202008  Y   202008

I didn't find a complete approach when I searched and was only able to filter out the data with Ind as Y. When I applied minimum condition to the date based on below code, it gave me limited data with first instance of ID that was Y on a particular Date. What am I doing wrong?

library(dplyr)

PO %>% 
  group(ID)
  filter(Date == min(Date)) %>%
  filter(Ind == 'Y') %>%
  slice(1) %>% # takes the first occurrence if there is a tie
  ungroup()

Solution

  • Using data.table

    library(data.table)
    setDT(df1)[df1[, !duplicated(Ind) & Ind == 'Y', .(ID, rleid(Ind))]$V1][, 
         Check_Date := Date][]
    

    -output

          ID   Date    Ind Check_Date
       <int>  <int> <char>      <int>
    1:     2 201912      Y     201912
    2:     2 202008      Y     202008