I have been working with a dataset that contains the CLIENT (identification), date of creation of an account (DATE ENTER) and the GATE use for customers in order to place orders in a website (Table A), but some of these attempts fail and that why a second data frame (Table B) is provided with the CLIENT (identification), the ERROR_DATE and the MAGIC_IP that identifies the error as a side note all GATES and all MAGIC_IPs will always be different even if this implies the same customer identification.
Table A
GATE | CLIENT | DATE_ENTER |
---|---|---|
A | 1 | 1 |
B | 1 | 1 |
C | 1 | 3 |
D | 1 | 4 |
E | 1 | 5 |
F | 1 | 6 |
G | 1 | 7 |
H | 1 | 8 |
I | 1 | 9 |
J | 1 | 10 |
K | 1 | 11 |
L | 1 | 12 |
M | 1 | 16 |
Table B
HOST | CLIENT | DATE_ERROR |
---|---|---|
14558556 | 1 | 2 |
14558557 | 1 | 3 |
14558558 | 1 | 4 |
14558559 | 1 | 5 |
14558560 | 1 | 6 |
14558561 | 1 | 7 |
14558562 | 3 | 8 |
14558563 | 2 | 9 |
14558564 | 3 | 10 |
14558565 | 3 | 11 |
14558566 | 1 | 12 |
14558567 | 1 | 13 |
What I want to create is a table where I can see for each "CLIENT","DATE ENTER" and "GATE" what was the ERROR_DATE and MAGIC_IP immediately after or in other words the very next one. The expected result should look like this:
GATE | CLIENT | DATE_ENTER | DATE_ERROR | MAGIC_IP |
---|---|---|---|---|
A | 1 | 1 | 2 | 154 |
B | 1 | 1 | 2 | 154 |
C | 1 | 3 | 4 | 156 |
D | 1 | 4 | 5 | 157 |
E | 1 | 5 | 6 | 158 |
F | 1 | 6 | 7 | 159 |
G | 1 | 7 | 10 | 160 |
H | 1 | 8 | 10 | 160 |
I | 1 | 9 | 10 | 160 |
J | 1 | 10 | 11 | 163 |
K | 1 | 11 | 12 | 164 |
L | 1 | 12 | 13 | 165 |
M | 1 | 16 | N.A | N.A |
In this table I am able to see for each GATE. CLIENT AND "DATE_ENTER" what was the very next or the closest nearest MAGIC_IP and ERROR_DATE for each client, thank you so much to everyone willing to help me out,
What have I tried?
I was thinking about grouping by CLIENT,DATE_ENTER and GATE after left joining the two tables by CLIENT, and then create a variable that would subtract all the dates and pick for each group the smallest positive value as an index to correlate the information about MAGIC_IP and ERROR_DATE, but I cant seem to find a way to do so at the moment thank you for all of your help guys
Data:
TableA<-tribble(~GATE, ~CLIENT, ~DATE_ENTER,
"A", "1", "1",
"B", "1", "1",
"C", "1", "3",
"D", "1", "4",
"E", "1", "5",
"F", "1", "6",
"G", "1", "7",
"H", "1", "8",
"I", "1", "9",
"J", "1", "10",
"K", "1", "11",
"L", "1", "12")
TableB<-tribble(~HOST, ~CLIENT, ~DATE_ERROR,
"14558556", "1", "2",
"14558557", "1", "3",
"14558558", "1", "4",
"14558559", "1", "5",
"14558560", "1", "6",
"14558561", "1", "7",
"14558562", "3", "8",
"14558563", "2", "9",
"14558564", "3", "10",
"14558565", "3", "11",
"14558566", "1", "12",
"14558567", "1", "13")
Join TableA
and TableB
by CLIENT
and select the row which is the smallest positive difference between DATE_ERROR
and DATE_ENTER
.
library(dplyr)
TableA %>%
mutate(row = row_number()) %>%
left_join(TableB, by = 'CLIENT') %>%
type.convert(as.is = TRUE) %>%
group_by(row) %>%
filter({tmp = DATE_ERROR - DATE_ENTER
tmp == min(tmp[tmp > 0], na.rm = TRUE)
}) %>%
ungroup() %>%
select(-row)