Search code examples
rdataframedatecensus

Creating a census table based on multiple variables


I have a relatively big data table that essentially includes -people -where they live -what they do -move-in dates -move-out dates. My goal is to derive a running weekly census table with each week as a row, and a column for each occupation and city, populated with the headcount at that time.

#MRE

library(tidyverse) 
library(lubridate)

data <- data.frame(
  first_names = c("joe", "sally", "bob", "frank", "susy"),
  move_in = as.Date(c("2020-01-01", "2021-01-04", "2020-04-01", "2018-12-20", "2019-10-12")),
  move_out = as.Date(c("2021-01-01", NA, "2021-10-01", NA, NA)),
  city = c("Denver", "Phoenix", "Austin", "Denver", "Seattle"),
  occupation = c("doctor", "doctor", "architect", "teacher", "teacher"))

#what I've tried :
  
cities = unique(data$city)[!is.na(unique(data$city))]
occupations = unique(data$occupation)[!is.na(unique(data$occupation))]
weeks <- (date = seq(from = as.Date("2020-12-27"), to = as.Date(today()), by="1 week"))

census <- matrix(data=NA, nrows=44, ncols=12) 

for (i in seq(cities)){
  for (j in seq(occupations)){
    count <- data %>% 
      filter(cities == i) %>%
      filter(occupations == j) %>% 
      sapply(weeks, function(x)
        sum(
          ((as.Date(data$move_in)) <= as.Date(x) &
           (as.Date(data$move_out)) > as.Date(x))|
          ((as.Date(data$move_in)) <= as.Date(x) &
           is.na(data$move_out))))
  
  census[j,x] <- count
}}

Solution

  • Here's a possible solution using some tidyverse verbs, since you loaded that package. We loop over the weeks you're interested in using the map_dfr function, and for each week we collect a subset of the people who are there using your logic statement above. Then, we can use group_by to skip the double outer loop and count them directly. Finally, we mutate a new column for week to keep them straight after they're bound together. Outside of the loop, we then pivot_wider to get the one-column-per-occupation and one-row-per-week format that you're looking for.

    library(tidyverse)
    
    data <- data.frame(
      first_names = c("joe", "sally", "bob", "frank", "susy"),
      move_in = as.Date(c("2020-01-01", "2021-01-04", "2020-04-01", "2018-12-20", "2019-10-12")),
      move_out = as.Date(c("2021-01-01", NA, "2021-10-01", NA, NA)),
      city = c("Denver", "Phoenix", "Austin", "Denver", "Seattle"),
      occupation = c("doctor", "doctor", "architect", "teacher", "teacher"))
    
    # Avoid needing to load lubridate by using Sys.Date() instead of today()
    weeks <- (date = seq(from = as.Date("2020-12-27"), to = as.Date(Sys.Date()), by="1 week"))
    
    map_dfr(weeks, function(week_i){
      data %>%
        filter(move_in<week_i & move_out > week_i | move_in < week_i & is.na(move_out)) %>%
        group_by(city, occupation) %>%
        count() %>%
        mutate(week=week_i)
    }) %>%
      pivot_wider(values_from = n, names_from = occupation, values_fill = 0)
    
    

    which returns

    # A tibble: 170 x 5
    # Groups:   city [4]
       city    week       architect doctor teacher
       <chr>   <date>         <int>  <int>   <int>
     1 Austin  2020-12-27         1      0       0
     2 Denver  2020-12-27         0      1       1
     3 Seattle 2020-12-27         0      0       1
     4 Austin  2021-01-03         1      0       0
     5 Denver  2021-01-03         0      0       1
     6 Seattle 2021-01-03         0      0       1
     7 Austin  2021-01-10         1      0       0
     8 Denver  2021-01-10         0      0       1
     9 Phoenix 2021-01-10         0      1       0
    10 Seattle 2021-01-10         0      0       1
    # ... with 160 more rows
    

    It looks like you're getting errors due to a couple typos. You're using the filter verb to ask for the cities column, but data only has a city column in the sample data set. Same for occupations vs occupation. Good to keep in mind for the future, but great first effort and nicely provided example!