I have a huge dataset and I would like to insert a dummy variable column based on a set of conditions:
I have my main df (A) in which I have 5 million rows and 10 columns, where 4 of them are date;hour;minute;second and these go from 2020 to 2023.
On the other df (B) I have the same columns but I have only 30 rows.
I want A to look at B and put a 1 to all the rows where date,hour,minute,second match the date,hour,minute,second of B, and 0 to all the rest. so, in the end, i should find my self with a column where I have 30 1
and 4.999.970 0
Even better would be to have like date,hour,minute matching exactly, and second matching "more or less" (say like +/- 5 seconds)
Can you help please?
I thought a solution could have been:
A$dummy <- for (i in A){
ifelse("A$date"=="B$date"&"A$hour"=="B$hour"&
"A$minute"=="B$minute"&or("A$second">="B$second"-5,"A$second"<="B$second"+5),1,0)
}
Here is a solution using tidyverse (including code that generates example data).
# Create data frame A
set.seed(42) # For reproducibility
n_rows_A <- 100
df_A <- data.frame(
date = rep(Sys.Date(), n_rows_A),
sym = rep("EUR_ab6m_05y", n_rows_A),
hour = sample(0:23, n_rows_A, replace = TRUE),
minute = sample(0:59, n_rows_A, replace = TRUE),
second = sample(0:59, n_rows_A, replace = TRUE)
)
# Create data frame B with 30 rows
n_rows_B <- 30
# Select 15 random rows from df_A
matching_indices <- sample(n_rows_A, n_rows_B / 2)
df_B_matching <- df_A[matching_indices, ]
# Create 15 rows with different 'second' values
non_matching_indices <- sample(setdiff(1:n_rows_A, matching_indices), n_rows_B / 2)
df_B_non_matching <- df_A[non_matching_indices, ]
df_B_non_matching$second <- sample(0:59, n_rows_B / 2, replace = TRUE)
# Combine matching and non-matching rows for data frame B
df_B <- rbind(df_B_matching, df_B_non_matching)
rownames(df_B) <- NULL
# EXACTLY MATCHING SECONDS
# Create dummy variable indicating whether there is an exact match between date,
# sym, hour, minute and **second** in A, and date, sym, hour, minute and
# **second** in B:
df_A <- dplyr::left_join( df_A
, df_B %>% mutate( dummy = 1 )
, by = c( "date", "sym", "hour", "minute", "second")
) %>%
mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )
It is fully possible to use the same approach to create a dummy variable indicating whether a row in A
has a matching row in B
within a +/- 5 second margin. However, based on the data in the image you provided, some rows in A
would probably have multiple "close-enough" matches in B
. This could easily result in duplicates of rows from A
.
Assuming you want to keep the A
to its original number of rows and add a column to A
indicating whether or not there is one or more rows in B
that match within a +/- 5 second margin, you could do this:
# ALTERNATIVE 2 - SECONDS WITHIN +/- 5
library(lubridate)
df_A <- dplyr::inner_join( df_A
, df_B
, by = c( "date", "sym") # only use date as a key
, suffix = c("", "_b")
) %>%
mutate(
datetime_a = ymd_hms( paste0( date, " "
, sprintf("%02d", hour), ":"
, sprintf("%02d", minute), ":"
, sprintf("%02d", second)
))
, datetime_b = ymd_hms( paste0( date, " "
, sprintf("%02d", hour_b), ":"
, sprintf("%02d", minute_b), ":"
, sprintf("%02d", second_b)
) )
, diff_seconds = abs( as.integer( difftime( datetime_a, datetime_b, units = "secs") ) )
) %>%
# Remove rows where the difference between datetime_a and datetime_b is
# greater than 5 seconds:
filter( diff_seconds <= 5 ) %>%
# Then add the dummy variable:
mutate( dummy = 1 ) %>%
# Remove any unnecessary columns:
select( date, sym, hour, minute, second, dummy ) %>%
# Add the dummy column to A using a left_join:
dplyr::left_join( df_A, ., by = c("date", "sym", "hour", "minute", "second")) %>%
# Rows without any matches have dummy = NA. Recode:
mutate( dummy = ifelse( is.na(dummy), yes = 0, no = dummy) )