Search code examples
rdatetimetimeposixctdate-math

Find most recent rows from one dataframe based on a second dataframe


I have two dataframes, dfburg and dfpolice. For each row in dfburg (a list of burglaries), I want to know what row in dfpolice (a list of police officer activities) was the most recent, and the row that occurred afterwards.

The end goal is to see for each burglary how close to the incident did an officer patrol the area, and after the burglary did they stick around and patrol some more.

So if a dfburg incident occurs 8/22 at 9am, I would want to pull two rows from dfpolice, the ones closest to 9am on 8/22, both before and after.

Thank you. I have datetime vars for both tables.

Here is dfpolice enter image description here

Here is dfburg enter image description here

So for each row in dfburg, there would be two rows of dfpolice, the one closest to it before and after, there will be duplicates.


Solution

  • I think an easy way to do this would be to combine the tables with something like dplyr::bind_rows(), and then create a before and after column that copy the police_id and fill up or down from there. This would require having a matching datetime column in each table, but that's easy enough.

    # I took some liberties with column names, as I didn't see the tables before writing
    dfpolice %>% 
      dplyr::bind_rows(dfburg) %>% 
      dplyr::arrange(dt) %>% 
      dplyr::select(burg_id, dt, police_id_before = police_id, police_id_after = police_id) %>% 
      tidyr::fill(police_id_before, .direction = 'down') %>% 
      tidyr::fill(police_id_after, .direction = 'up') %>% 
      dplyr::filter(complete.cases(.))
    

    From there you can just reference the dfpolice and dfburg table for whatever information you want, as the resulting table is the burglary id, the date of the burglary, and the police id for the call immediately before and immediately after.