Search code examples
rif-statementnested-loopslookupdummy-variable

R ifelse dummy variable on multiple condition lookup


I have a huge dataset and I would like to insert a dummy variable column based on a set of conditions:

I have my main df (A) in which I have 5 million rows and 10 columns, where 4 of them are date;hour;minute;second and these go from 2020 to 2023.

enter image description here

On the other df (B) I have the same columns but I have only 30 rows.

I want A to look at B and put a 1 to all the rows where date,hour,minute,second match the date,hour,minute,second of B, and 0 to all the rest. so, in the end, i should find my self with a column where I have 30 1 and 4.999.970 0

Even better would be to have like date,hour,minute matching exactly, and second matching "more or less" (say like +/- 5 seconds)

Can you help please?

I thought a solution could have been:

A$dummy <- for (i in A){

ifelse("A$date"=="B$date"&"A$hour"=="B$hour"&
"A$minute"=="B$minute"&or("A$second">="B$second"-5,"A$second"<="B$second"+5),1,0)
}

Solution

  • Here is a solution using tidyverse (including code that generates example data).

    # Create data frame A
    set.seed(42)  # For reproducibility
    n_rows_A <- 100
    
    df_A <- data.frame(
      date = rep(Sys.Date(), n_rows_A),
      sym = rep("EUR_ab6m_05y", n_rows_A),
      hour = sample(0:23, n_rows_A, replace = TRUE),
      minute = sample(0:59, n_rows_A, replace = TRUE),
      second = sample(0:59, n_rows_A, replace = TRUE)
    )
    
    # Create data frame B with 30 rows
    n_rows_B <- 30
    
    # Select 15 random rows from df_A
    matching_indices <- sample(n_rows_A, n_rows_B / 2)
    df_B_matching <- df_A[matching_indices, ]
    
    # Create 15 rows with different 'second' values
    non_matching_indices <- sample(setdiff(1:n_rows_A, matching_indices), n_rows_B / 2)
    df_B_non_matching <- df_A[non_matching_indices, ]
    df_B_non_matching$second <- sample(0:59, n_rows_B / 2, replace = TRUE)
    
    # Combine matching and non-matching rows for data frame B
    df_B <- rbind(df_B_matching, df_B_non_matching)
    
    rownames(df_B) <- NULL
    
    # EXACTLY MATCHING SECONDS
    # Create dummy variable indicating whether there is an exact match between date,
    # sym, hour, minute and **second** in A, and date, sym, hour, minute and
    # **second** in B:
    df_A <- dplyr::left_join( df_A
                            , df_B %>% mutate( dummy = 1 )
                            , by = c( "date", "sym", "hour", "minute", "second")
    ) %>% 
      mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )
    

    It is fully possible to use the same approach to create a dummy variable indicating whether a row in A has a matching row in B within a +/- 5 second margin. However, based on the data in the image you provided, some rows in A would probably have multiple "close-enough" matches in B. This could easily result in duplicates of rows from A.

    Assuming you want to keep the A to its original number of rows and add a column to A indicating whether or not there is one or more rows in B that match within a +/- 5 second margin, you could do this:

    # ALTERNATIVE 2 - SECONDS WITHIN +/- 5 
    
    library(lubridate)
    df_A <- dplyr::inner_join( df_A
                               , df_B 
                               , by = c( "date", "sym") # only use date as a key
                               , suffix = c("", "_b")
    ) %>% 
      mutate( 
        datetime_a = ymd_hms( paste0( date, " "
                                      , sprintf("%02d", hour), ":"
                                      , sprintf("%02d", minute), ":"
                                      , sprintf("%02d", second)
        ))
        , datetime_b = ymd_hms( paste0( date, " "
                                        , sprintf("%02d", hour_b), ":"
                                        , sprintf("%02d", minute_b), ":"
                                        , sprintf("%02d", second_b)
        ) )
        , diff_seconds = abs( as.integer( difftime( datetime_a, datetime_b, units = "secs") ) )
        
      ) %>% 
      # Remove rows where the difference between datetime_a and datetime_b is
      # greater than 5 seconds:
      filter( diff_seconds <= 5 ) %>%
      # Then add the dummy variable:
      mutate( dummy = 1 ) %>% 
      # Remove any unnecessary columns:
      select( date, sym, hour, minute, second, dummy ) %>% 
      # Add the dummy column to A using a left_join:
      dplyr::left_join( df_A, ., by = c("date", "sym", "hour", "minute", "second")) %>% 
      # Rows without any matches have dummy = NA. Recode:
      mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )