Search code examples
rdatetimegroup-bysummarizedata-wrangling

R: Grouping and summarizing data to find the nth nearest event after date of reference


I have been working with a dataset that contains the CLIENT (identification), date of creation of an account (DATE ENTER) and the GATE use for customers in order to place orders in a website (Table A), but some of these attempts fail and that why a second data frame (Table B) is provided with the  CLIENT (identification), the ERROR_DATE and the MAGIC_IP that identifies the error as a side note all GATES and all MAGIC_IPs will always be different even if this implies the same customer  identification.

Table A

GATE CLIENT DATE_ENTER
A 1 1
B 1 1
C 1 3
D 1 4
E 1 5
F 1 6
G 1 7
H 1 8
I 1 9
J 1 10
K 1 11
L 1 12
M 1 16

Table B

HOST CLIENT DATE_ERROR
14558556 1 2
14558557 1 3
14558558 1 4
14558559 1 5
14558560 1 6
14558561 1 7
14558562 3 8
14558563 2 9
14558564 3 10
14558565 3 11
14558566 1 12
14558567 1 13

What I want to create is a table where I can see for each "CLIENT","DATE ENTER" and "GATE" what was the  ERROR_DATE and MAGIC_IP immediately  after or in other words the very next one.  The expected result should look like this:

GATE CLIENT DATE_ENTER DATE_ERROR MAGIC_IP
A 1 1 2 154
B 1 1 2 154
C 1 3 4 156
D 1 4 5 157
E 1 5 6 158
F 1 6 7 159
G 1 7 10 160
H 1 8 10 160
I 1 9 10 160
J 1 10 11 163
K 1 11 12 164
L 1 12 13 165
M 1 16 N.A N.A

In this table I am able to see for each GATE. CLIENT AND "DATE_ENTER" what was the very next or the closest nearest MAGIC_IP and ERROR_DATE for each client, thank you so much to everyone willing to help me out,

What have I tried?

I was thinking about grouping by CLIENT,DATE_ENTER and GATE after left joining the two tables by CLIENT, and then create a variable that would subtract all the dates and pick for each group the smallest positive value as an index to correlate the information about MAGIC_IP and ERROR_DATE, but I cant seem to find a way to do so at the moment thank you for all of your help guys

Data:

TableA<-tribble(~GATE,  ~CLIENT,    ~DATE_ENTER,
"A",    "1",    "1",
"B",    "1",    "1",
"C",    "1",    "3",
"D",    "1",    "4",
"E",    "1",    "5",
"F",    "1",    "6",
"G",    "1",    "7",
"H",    "1",    "8",
"I",    "1",    "9",
"J",    "1",    "10",
"K",    "1",    "11",
"L",    "1",    "12")

TableB<-tribble(~HOST,  ~CLIENT,    ~DATE_ERROR,
                "14558556", "1",    "2",
                "14558557", "1",    "3",
                "14558558", "1",    "4",
                "14558559", "1",    "5",
                "14558560", "1",    "6",
                "14558561", "1",    "7",
                "14558562", "3",    "8",
                "14558563", "2",    "9",
                "14558564", "3",    "10",
                "14558565", "3",    "11",
                "14558566", "1",    "12",
                "14558567", "1",    "13")

Solution

  • Join TableA and TableB by CLIENT and select the row which is the smallest positive difference between DATE_ERROR and DATE_ENTER.

    library(dplyr)
    
    TableA %>%
      mutate(row = row_number()) %>%
      left_join(TableB, by = 'CLIENT') %>%
      type.convert(as.is = TRUE) %>%
      group_by(row) %>%
      filter({tmp = DATE_ERROR - DATE_ENTER
              tmp == min(tmp[tmp > 0], na.rm = TRUE)
              }) %>%
      ungroup() %>%
      select(-row)