Search code examples
rdateduration

R - how to count users based on open/close dates of accounts, but with users having multiple accounts


I have a list of accounts (300k plus rows), going back six years, with a user number, open and close dates, and other information, such as location. We offer a variety of accounts, and a user can have one or several, in any combination, and both in succession as well as overlapping.

I've been asked to find out how many users we have in any given month. They'd like it split by location, as well as total.

so I have a table like this:

   User    Open       Close      Area 
 1 A       2018-02-13 2018-07-31 West 
 2 B       2018-02-26 2018-06-04 North
 3 B       2018-02-27 2018-03-15 North
 4 C       2018-02-27 2018-05-26 South
 5 C       2018-03-15 2018-06-03 South
 6 D       2018-03-20 2018-07-02 East 
 7 E       2018-04-01 2018-06-19 West 
 8 E       2018-04-14 2018-05-04 West 
 9 F       2018-03-20 2018-04-19 North
10 G       2018-04-26 2018-07-04 South
11 H       2017-29-12 2018-03-21 East
12 I       2016-11-29 2020-04-10 West
13 J       2018-01-31 2018-12-20 West
14 K       2017-10-31 2018-10-30 North
15 K       2018-10-31 2019-10-30 North

And I want to get to one that looks something like this:

      Month  Total North  East South  West
1 Feb 18     3     1     0     1     1
2 Mar 18     5     2     1     1     1
3 Apr 18     7     2     1     2     2
4 May 18     6     1     1     2     2
5 Jun 18     6     1     1     2     2
6 Jul 18     3     0     1     1     1

I can filter the data to get to what I need for individual months using

 df%>%
   filter(Open <= as.Date("2018-04-30") & Close >= as.Date("2018-04-01")) %>%
distinct(PERSON_ID, .keep_all = TRUE) %>%
   count(Area) 

But what I can't figure out is how to repeat that for every month in the data set automatically. Is there any where of getting r to repeat the above for every month in my data set, and then pass the results into a second table?

Any and all help gratefully received, and many thanks for your time.

Edit: added examples to the source data where Matin Gal's solution returned NA for years


Solution

  • This is a general solution working for dates spanning over more than one year.

    library(dplyr)
    library(tidyr)
    library(lubridate)
    
    data %>%
      group_by(rn = row_number()) %>%
      mutate(seq = list(seq(month(Open), month(Close) + 12 * (year(Close) - year(Open))))) %>% 
      unnest(seq) %>%
      mutate(
        seq_2 = (seq - 1) %% 12 + 1,
        month = month(seq_2, label = TRUE),
        year  = year(Open + months(seq - first(seq)))
        ) %>% 
      ungroup() %>% 
      distinct(User, month, year, Area) %>% 
      count(month, year, Area) %>% 
      pivot_wider(
        names_from = "Area", 
        values_from = "n", 
        values_fill = 0
        ) %>% 
      mutate(Total = rowSums(across(c(North, South, West, East))))
    

    returns

      month  year North South  West  East Total
      <ord> <dbl> <int> <int> <int> <int> <dbl>
    1 Feb    2018     1     1     1     0     3
    2 Mar    2018     2     1     1     1     5
    3 Apr    2018     2     2     2     1     7
    4 May    2018     1     2     2     1     6
    5 Jun    2018     1     2     2     1     6
    6 Jul    2018     0     1     1     1     3
    

    Data

    df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), User = c("A", 
    "B", "B", "C", "C", "D", "E", "E", "F", "G"), Open = structure(c(17575, 
    17588, 17589, 17589, 17605, 17610, 17622, 17635, 17610, 17647
    ), class = "Date"), Close = structure(c(17743, 17686, 17605, 
    17677, 17685, 17714, 17701, 17655, 17640, 17716), class = "Date"), 
        Area = c("West", "North", "North", "South", "South", "East", 
        "West", "West", "North", "South")), problems = structure(list(
        row = 10L, col = "Area", expected = "", actual = "embedded null", 
        file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df", 
    "tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
    "data.frame"), row.names = c(NA, -10L), spec = structure(list(
        cols = list(id = structure(list(), class = c("collector_double", 
        "collector")), User = structure(list(), class = c("collector_character", 
        "collector")), Open = structure(list(format = ""), class = c("collector_date", 
        "collector")), Close = structure(list(format = ""), class = c("collector_date", 
        "collector")), Area = structure(list(), class = c("collector_character", 
        "collector"))), default = structure(list(), class = c("collector_guess", 
        "collector")), skip = 1L), class = "col_spec"))