I have a sample table which looks somewhat like this:
| Date | Vendor_Id | Requisitioner | Amount |
|------------|:---------:|--------------:|--------|
| 1/17/2019 | 98 | John | 2405 |
| 4/30/2019 | 1320 | Dave | 1420 |
| 11/29/2018 | 3887 | Michele | 596 |
| 11/29/2018 | 3887 | Michele | 960 |
| 11/29/2018 | 3887 | Michele | 1158 |
| 9/21/2018 | 4919 | James | 857 |
| 10/25/2018 | 4919 | Paul | 1162 |
| 10/26/2018 | 4919 | Echo | 726 |
| 10/26/2018 | 4919 | Echo | 726 |
| 10/29/2018 | 4919 | Andrew | 532 |
| 10/29/2018 | 4919 | Andrew | 532 |
| 11/12/2018 | 4919 | Carlos | 954 |
| 5/21/2018 | 2111 | June | 3580 |
| 5/23/2018 | 7420 | Justin | 224 |
| 5/24/2018 | 1187 | Sylvia | 3442 |
| 5/25/2018 | 1187 | Sylvia | 4167 |
| 5/30/2018 | 3456 | Ama | 4580 |
Based on each requisitioner and vendor id, I need to find the difference in the date such that it should be something like this:
| Date | Vendor_Id | Requisitioner | Amount | Date_Diff |
|------------|:---------:|--------------:|--------|-----------|
| 1/17/2019 | 98 | John | 2405 | NA |
| 4/30/2019 | 1320 | Dave | 1420 | 103 |
| 11/29/2018 | 3887 | Michele | 596 | NA |
| 11/29/2018 | 3887 | Michele | 960 | 0 |
| 11/29/2018 | 3887 | Michele | 1158 | 0 |
| 9/21/2018 | 4919 | James | 857 | NA |
| 10/25/2018 | 4919 | Paul | 1162 | NA |
| 10/26/2018 | 4919 | Paul | 726 | 1 |
| 10/26/2018 | 4919 | Paul | 726 | 0 |
| 10/29/2018 | 4919 | Paul | 532 | 3 |
| 10/29/2018 | 4919 | Paul | 532 | 0 |
| 11/12/2018 | 4917 | Carlos | 954 | NA |
| 5/21/2018 | 2111 | Justin | 3580 | NA |
| 5/23/2018 | 7420 | Justin | 224 | 2 |
| 5/24/2018 | 1187 | Sylvia | 3442 | NA |
| 5/25/2018 | 1187 | Sylvia | 4167 | 1 |
| 5/30/2018 | 3456 | Ama | 4580 | NA |
Now, if the difference in the date is <=3 days within each requisitioner and vendor id, and sum of the amount is >5000, I need to create a subset of that. The final output should be something like this:
| Date | Vendor_Id | Requisitioner | Amount | Date_Diff |
|-----------|:---------:|--------------:|--------|-----------|
| 5/24/2018 | 1187 | Sylvia | 3442 | NA |
| 5/25/2018 | 1187 | Sylvia | 4167 | 1 |
Initially, when I tried working with date difference, I used the following code:
df=df %>% mutate(diffdate= difftime(Date,lag(Date,1)))
However, the difference doesn't make sense as they are huge numbers such as 86400 and some huge random numbers. I tried the above code when data type for 'Date' field was initially Posixct. Later when I changed it to 'Date' data type, the date differences were still the same huge random numbers. Also, is it possible to group the date differences based on requisitioners and vendor id's as mentioned in the 2nd table above?
EDIT: I'm coming across a new challenge now. In the problem set, I need to filter out the values whose date differences are less than 3 days. Let us assume that the table with date difference appears something like this:
| MasterCalendarDate | Vendor_Id | Requisitioner | Amount | diffdate |
|--------------------|:---------:|--------------:|--------|----------|
| 1/17/2019 | 98 | John | 2405 | #N/A |
| 4/30/2019 | 1320 | Dave | 1420 | 103 |
| 11/29/2018 | 3887 | Michele | 596 | #N/A |
| 11/29/2018 | 3887 | Michele | 960 | 0 |
| 11/29/2018 | 3887 | Michele | 1158 | 0 |
| 9/21/2018 | 4919 | Paul | 857 | #N/A |
| 10/25/2018 | 4919 | Paul | 1162 | 34 |
| 10/26/2018 | 4919 | Paul | 726 | 1 |
| 10/26/2018 | 4919 | Paul | 726 | 0 |
When we look at the requisitioner 'Paul', the date diff between 9/21/2018 and 10/25/2018 is 34 and between that of 10/25/2018 and 10/26/2018 is 1 day. However, when I filter the data for date difference <=3 days, I miss out on 10/25/2018 because of 34 days difference. I have multiple such occurrences. How can I fix it?
I think you need to convert your date variable using as.Date()
, then you can compute the lagged time difference using difftime()
.
# create toy data frame
df <- data.frame(date=as.Date(paste(sample(2018:2019,100,T),
sample(1:12,100,T),
sample(1:28,100,T),sep = '-')),
req=sample(letters[1:10],100,T),
amount=sample(100:10000,100,T))
# compute lagged time difference in days -- diff output is numeric
df %>% arrange(req,date) %>% group_by(req) %>%
mutate(diff=as.numeric(difftime(date,lag(date),units='days')))
# as above plus filtering based on time difference and amount
df %>% arrange(req,date) %>% group_by(req) %>%
mutate(diff=as.numeric(difftime(date,lag(date),units='days'))) %>%
filter(diff<10 | is.na(diff), amount>5000)
# A tibble: 8 x 4
# Groups: req [7]
date req amount diff
<date> <fct> <int> <dbl>
1 2018-05-13 a 9062 NA
2 2019-05-07 b 9946 2
3 2018-02-03 e 5697 NA
4 2018-03-12 g 7093 NA
5 2019-05-16 g 5631 3
6 2018-03-06 h 7114 6
7 2018-08-12 i 5151 6
8 2018-04-03 j 7738 8