I have a list of accounts (300k plus rows), going back six years, with a user number, open and close dates, and other information, such as location. We offer a variety of accounts, and a user can have one or several, in any combination, and both in succession as well as overlapping.
I've been asked to find out how many users we have in any given month. They'd like it split by location, as well as total.
so I have a table like this:
User Open Close Area
1 A 2018-02-13 2018-07-31 West
2 B 2018-02-26 2018-06-04 North
3 B 2018-02-27 2018-03-15 North
4 C 2018-02-27 2018-05-26 South
5 C 2018-03-15 2018-06-03 South
6 D 2018-03-20 2018-07-02 East
7 E 2018-04-01 2018-06-19 West
8 E 2018-04-14 2018-05-04 West
9 F 2018-03-20 2018-04-19 North
10 G 2018-04-26 2018-07-04 South
11 H 2017-29-12 2018-03-21 East
12 I 2016-11-29 2020-04-10 West
13 J 2018-01-31 2018-12-20 West
14 K 2017-10-31 2018-10-30 North
15 K 2018-10-31 2019-10-30 North
And I want to get to one that looks something like this:
Month Total North East South West
1 Feb 18 3 1 0 1 1
2 Mar 18 5 2 1 1 1
3 Apr 18 7 2 1 2 2
4 May 18 6 1 1 2 2
5 Jun 18 6 1 1 2 2
6 Jul 18 3 0 1 1 1
I can filter the data to get to what I need for individual months using
df%>%
filter(Open <= as.Date("2018-04-30") & Close >= as.Date("2018-04-01")) %>%
distinct(PERSON_ID, .keep_all = TRUE) %>%
count(Area)
But what I can't figure out is how to repeat that for every month in the data set automatically. Is there any where of getting r to repeat the above for every month in my data set, and then pass the results into a second table?
Any and all help gratefully received, and many thanks for your time.
Edit: added examples to the source data where Matin Gal's solution returned NA for years
This is a general solution working for dates spanning over more than one year.
library(dplyr)
library(tidyr)
library(lubridate)
data %>%
group_by(rn = row_number()) %>%
mutate(seq = list(seq(month(Open), month(Close) + 12 * (year(Close) - year(Open))))) %>%
unnest(seq) %>%
mutate(
seq_2 = (seq - 1) %% 12 + 1,
month = month(seq_2, label = TRUE),
year = year(Open + months(seq - first(seq)))
) %>%
ungroup() %>%
distinct(User, month, year, Area) %>%
count(month, year, Area) %>%
pivot_wider(
names_from = "Area",
values_from = "n",
values_fill = 0
) %>%
mutate(Total = rowSums(across(c(North, South, West, East))))
returns
month year North South West East Total
<ord> <dbl> <int> <int> <int> <int> <dbl>
1 Feb 2018 1 1 1 0 3
2 Mar 2018 2 1 1 1 5
3 Apr 2018 2 2 2 1 7
4 May 2018 1 2 2 1 6
5 Jun 2018 1 2 2 1 6
6 Jul 2018 0 1 1 1 3
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), User = c("A",
"B", "B", "C", "C", "D", "E", "E", "F", "G"), Open = structure(c(17575,
17588, 17589, 17589, 17605, 17610, 17622, 17635, 17610, 17647
), class = "Date"), Close = structure(c(17743, 17686, 17605,
17677, 17685, 17714, 17701, 17655, 17640, 17716), class = "Date"),
Area = c("West", "North", "North", "South", "South", "East",
"West", "West", "North", "South")), problems = structure(list(
row = 10L, col = "Area", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(id = structure(list(), class = c("collector_double",
"collector")), User = structure(list(), class = c("collector_character",
"collector")), Open = structure(list(format = ""), class = c("collector_date",
"collector")), Close = structure(list(format = ""), class = c("collector_date",
"collector")), Area = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))