Search code examples
rdateextract

In R: extracting data from a column based on closest date


I have two datasets. The first dataset is the main one, is wide, and has id as a unique/identifying variable. The second dataset has the same id variable, is long, and has multiple observations for id (i.e., id is not unique).

The second dataset only has three variables: id, date, and value. Id and date together also are not unique - there are repeated combinations. The value variable is a number. Annoyingly, some of the id/date repeats have different numbers in the value field, so I need to identify those so I can manually pick the value I need.

My ultimate goal is to copy the value field from the second dataset into the first, using the observation whose date is the closest to a key date from the first dataset. For example, say that in the first dataset, id==1 has a key date of 5/1/2024. Then in the second dataset, there is one row with id==1, date==4/30/2024, and value==3; and there is another row with id==1, date==5/5/2024, and value==4. I want 3 to be pulled into the first dataset into a new column for id==1.

https://github.com/markelphelps/sample-data-1 Check out the files in my github. The second dataset started as df2_long, and I tried to make it into df2_wide. Instead the best I could do was:

df_long2 <- reshape(df_long,
                       timevar = "date",
                       idvar = "id",
                       direction = "wide")

I hoped it would make two new columns, date and value. Instead it makes a new column for each unique date. Using pivot_wider gave me a list-col because of the repeats in date.

I do get the error “multiple rows for each date;” I’m guessing the easiest way to address that is to delete some before the reshape, but I’m not sure how. Something like “print each row of df where id’s are equal and date’s are equal.” Maybe something with a for loop or lapply/sapply that recycles the command for each unique id value.

Next I need to find closest date

which.min(abs(df1_main$key_date - df2_long$date))

I read that abs() was necessary to account for df2_long$date being sometimes before and sometimes after df1_main$key_date. But I get an error that says “abs not defined for “Date” objects.” When I take out abs(), the command returns “integer(0),” which doesn’t seem super useful.

The next step after finding that closest date is to do that for all id's, then turn them into a new variable.

edit: leaving work for the day, will check/respond to comments tomorrow

edit2: apologies for the mixup in df names, i forgot to rename the ones in my github. as you inferred df1_main == df ex_original; df2_long == df ex_long; and df3_wide == df ex_wide. I added the renamed versions (complete duplicates) in case that makes things more clear for anyone.

edit 3: thanks all, very informative! The best fit for my data was the comment from @Eugenio.Gastelum96 directing me here: https://stackoverflow.com/a/63754287/9588300. The comment by @teru to do leftjoin() with dplyr worked perfectly with the sample data. next week I'll try it on my real data and check back in if something else doesn't work.


Solution

  • Data table solution using a rolling join (returns the first record if there are ties):

    library(data.table)
    
    df_ex_original = fread("df ex_original.csv")
    df_ex_long = fread("df ex_long.csv")
    
    df_ex_original[, `:=`(key_date=as.Date(key_date, format="%m/%d/%Y"),
                         value=NULL)]
    
    df_ex_long[, date:=as.Date(date, format="%m/%d/%Y")]
    
    setkeyv(df_ex_original, c('id', 'key_date'))
    setkeyv(df_ex_long, c('id', 'date'))
    
    df_ex_long[df_ex_original, roll="nearest"]
    

    Key: <id, date>
          id       date value
       <int>     <Date> <int>
    1:     1 2024-01-06     3
    2:     2 2024-01-01     1
    3:     3 2024-01-10     1
    4:     4 2024-01-12     0
    5:     5 2024-01-14     6
    6:     6 2024-01-02     5
    

    Answering the comment:

    data.table commands update by reference, so there's no need to use <-. See the help page for more details. It shows how to use := when updating multiple columns.

    Assigning a variable to NULL effectively deletes that variable from the data and I didn't want the value variable in df1 since you wanted to obtain its values from df2. I could have used an update but this seemed like a better way to go.

    In my as.Date() command, I had an inexcusable error. Sorry! Please replace "%M" with "%m".