Search code examples
rdataframedplyrfiltertidyverse

R: Filtering and adding a repeating identical value in a column for hundreds of rows in a dataframe


Issue:

I have a data frame that has 16000 rows, and I am going to use this data to plot a boat track line in QGIS. One of the columns is called ' Course', which is the subsequent number order that the boat went out per field season.

For instance, course 1 ranges from 6/16/17 to 8/13/17. The boat track data frame has five columns for ID, date, time, course, latitude and longitude. These values were recorded every second throughout each day, which generally ranged from 6 am to 6 pm. Therefore, for each field course, there are hundreds of rows. Altogether there are 11 courses over a 5 year time period (2016-2018).

Is there any way that the data can be filtered per row by date to input a repeating number such as the value '2' (course 2) in the 'Course' column for the dates 8/14/17 to 10/16/17, and the repeating value of '3' (course 3) for the dates of 10/17/17 to 03/01/18, and so forth?

Unfortunately, I cannot share my data and I cannot find public any data that resembles my data frame that I can supply for this question using the function dput().

I have been using packages such as dplyr and tidyverse to try to solve this conundrum and I have so far been unsuccessful in finding a solution.

Would anyone be able to help?

Many thanks in advance.


Solution

  • As mentioned, it is helpful to make up example data for something like this. Here, I created 2 data.frames. One will have your data (16k rows), and the second will have your filter criteria (e.g., which course and date range you want to keep).

    df1 <- data.frame(
      ID = 1:10,
      date = seq.Date(as.Date("2016-01-01"), as.Date("2019-01-01"), by = 120),
      course = rep(1:5, each = 2)
    )
    df1
    
       ID       date course
    1   1 2016-01-01      1
    2   2 2016-04-30      1
    3   3 2016-08-28      2
    4   4 2016-12-26      2
    5   5 2017-04-25      3
    6   6 2017-08-23      3
    7   7 2017-12-21      4
    8   8 2018-04-20      4
    9   9 2018-08-18      5
    10 10 2018-12-16      5
    
    df2 <- data.frame(
      course = c(2, 3),
      start_date = as.Date(c("2016-01-01", "2017-05-09")),
      end_date = as.Date(c("2016-12-01", "2018-09-09"))
    )
    df2
    
          course start_date   end_date
    1      2 2016-01-01 2016-12-01
    2      3 2017-05-09 2018-09-09
    

    They don't have all the columns, but hopefully this will give you the idea.

    In my example, I would be filtering df1 where:

    • course is 2, and date falls between 1/1/16 and 12/1/16
    • course is 3, and date falls between 5/9/17 and 9/9/18

    Once you have that, you can try using fuzzyjoin package to merge the two, using fuzzy_semi_join. It will keep rows in the first data frame that match the criteria in the second data frame.

    library(fuzzyjoin)
    
    fuzzy_semi_join(
      df1,
      df2,
      by = c("course", "date" = "start_date", "date" = "end_date"),
      match_fun = c(`==`, `>=`, `<=`)
    )
    

    Output

      ID       date course
    3  3 2016-08-28      2
    6  6 2017-08-23      3
    

    In this case, only two rows meet those criteria.


    As an alternative, you can merge using the data.table package. This may be a faster solution. It should give you the same result.

    library(data.table)
    
    setDT(df1)
    setDT(df2)
    
    df1[df2, .(ID, x.date, course), on = .(course, date >= start_date, date <= end_date)]