Search code examples
rtidycensus

Capturing the range indicated by the use of "to" in a column - tidycensus (U.S. Census API)


How might I capture all the years of age in a column with values like "20 to 24 years" for one group and "22 to 24 years" for another group? This will enable me to confirm I have all the working age (18-64) variable names captured in a tidycensus (R package) U.S. Census API query.

Goal

What I want is, for ages 20-24 in this example, a data frame that extracts the ages from label entries like "22 to 24 years":

MEN  WOMEN ETHNORACE
18   18    BLACK
19   19    BLACK
20   20    BLACK
21   21    BLACK
22   22    BLACK
23   23    BLACK
24         BLACK

I can then easily create a data frame that has all the ages and compare to see if any are missing.

Census variables (tidycensus)

One can see at https://api.census.gov/data/2019/acs/acs5/variables.html that at least the American Community Survey (ACS) by the U.S. Census has age range fields with varying syntax (e.g. "20 years" and "22 to 24 years"):

Example rows from tidycensus package's load_variables function

tidycensus R package version 1.1

## Example rows from tidycensus using:
library(tidycensus)
library(magrittr)
library(stringr)

v19     <- load_variables(2019, "acs5", cache = TRUE)
v19 %>% 
  dplyr::filter(
    str_detect(label, "18|20|24") & 
               concept %in% c("SEX BY AGE",
                              "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
               grepl('FEMALE', toupper(label))
                )

v19_Total_AndBlack_Age18_24 <-
  v19 %>% dplyr::filter(
  str_detect(label, "18|20|24") & 
  concept %in% c("SEX BY AGE",
                 "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
  grepl('FEMALE', toupper(label)))

 print(v19_Total_AndBlack_Age18_24)

  name        label                                      concept                                     
  <chr>       <chr>                                      <chr>                                       
1 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
2 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
3 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
4 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
5 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
...

In this example, I want to make sure every age from 18-24 for the Total and Black populations is present in a dataframe like the following - notice the use of the Census API names from the above's v19_Total_AndBlack_Age18_24.

v19_Total_AndBlack_Age18_24 <-
  get_acs(
    year = 2019,
    geography = "zcta",
    variables = c(v19_Total_AndBlack_Age18_24$name)
 )

Notice that Total "22 to 24 years" compares to Black "20 to 24 years".

Let's focus on dataframe v19_Total_AndBlack_Age18_24 above, which lists out the Census API names and labels for ages 18 - 24, and aim to confirm all years are present.

I can get all of the numbers in the ages with a regular expression via:

unlist(str_extract_all(v19_Total_AndBlack_Age18_24$label,"\\d{2}"))
[1] "18" "19" "20" "22" "24" "18" "19" "20" "24"

But my attempts to group by the category are failing, and I still need to get a vector that spans the age ranges when the word "to" appears as in "20 to 24".

v19_Total_AndBlack_Age18_24_grp <- 
  v19_Total_AndBlack_Age18_24 %>% 
   mutate(EthnoRace = case_when(
   grepl('BLACK', concept) ~ "BLACK",
   TRUE ~ "TOTAL"))

v19_Total_AndBlack_Age18_24_grp %>% 
  group_by(EthnoRace) %>% 
  mutate(ages = str_extract_all(label, "\\d{2"))

Error

Error: Problem with `mutate()` column `ages`.
i `ages = str_extract_all(label, "\\d{2")`.
x Error in {min,max} interval. (U_REGEX_BAD_INTERVAL, context=`\d{2`)
i The error occurred in group 1: Group = "TOTAL".

Solution

  • First, for what ages, genders, and ethnoracial groups does one want data? This could be modified to only choose one gender. Gender_var needs to appear in the label column at https://api.census.gov/data/2019/acs/acs5/variables.html (or can use load_variables() like below when dataframe v19 is created).

    Parameters

    Only have to set these for code to run.

    min_age_desired <- 18
    max_age_desired <- 24
    Gender_var = c("MALE", "FEMALE")
    EthnoRace_var = c("BLACK","TOTAL")
    

    Now, let's create a QC data frame with all the ethnoracial groups and age groups we need.

    Load R packages

    library(arsenal)
    library(dplyr)
    library(stringr)
    library(tidyr)
    library(tidycensus)
    
    options(scipen = 8)
    

    Validation data frame

    has all the ages and ethnoracial groups one wants from the Census API

    AGE_var  = as.numeric(seq(min_age_desired, max_age_desired, 1))
    all_grp_qc_frm <- 
      data.frame(
        ## dupli
        expand.grid(EthnoRace_var,
                    Gender_var,
                    AGE_var
              )
      )
    
    colnames(all_grp_qc_frm) <- 
      c("EthnoRace", "Gender", "AGE")
    
    all_grp_qc_frm$AGE <- as.numeric(
      all_grp_qc_frm$AGE)
    all_grp_qc_frm$EthnoRace <- as.character(
      all_grp_qc_frm$EthnoRace)
    all_grp_qc_frm$Gender <- as.character(
      all_grp_qc_frm$Gender)
    
    all_grp_qc_frm <- all_grp_qc_frm %>% 
      arrange(EthnoRace,Gender,AGE)
    
    print(all_grp_qc_frm)
    
       EthnoRace Gender AGE
    1      BLACK   MALE  18
    2      BLACK   MALE  19
    3      BLACK   MALE  20
    4      BLACK   MALE  21
    5      BLACK   MALE  22
    6      BLACK   MALE  23
    7      BLACK   MALE  24
    8      BLACK FEMALE  18
    9      BLACK FEMALE  19
    10     BLACK FEMALE  20
    11     BLACK FEMALE  21
    12     BLACK FEMALE  22
    13     BLACK FEMALE  23
    14     BLACK FEMALE  24
    15     TOTAL   MALE  18
    16     TOTAL   MALE  19
    17     TOTAL   MALE  20
    18     TOTAL   MALE  21
    19     TOTAL   MALE  22
    20     TOTAL   MALE  23
    21     TOTAL   MALE  24
    22     TOTAL FEMALE  18
    23     TOTAL FEMALE  19
    24     TOTAL FEMALE  20
    25     TOTAL FEMALE  21
    26     TOTAL FEMALE  22
    27     TOTAL FEMALE  23
    28     TOTAL FEMALE  24
    

    Load the Census variables with tidycensus

    This is for the 2019 American Community Survey 5-year estimates

    v19     <- load_variables(2019, "acs5", cache = TRUE)
    

    Subset those variables to those needed

    There are many variables available through the Census API.

    To subset, let's first get a vector with each age from 18 to 24 separated by a pipe.

    working_age_vec <- paste0(seq(18,24,1), collapse = "|")
    

    Notice I need to use the correct concept values to get Black and the Total population across ethnoracial groups.

    v19_Total_And_EthnoRace_Age18_24 <-
      v19 %>% dplyr::filter(
      str_detect(label, working_age_vec) & 
      concept %in% c("SEX BY AGE",
                     "SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)") &
      grepl('FEMALE|MALE', toupper(label)))
    
    print(v19_Total_And_EthnoRace_Age18_24)
    
    # A tibble: 12 x 3
       name        label                                      concept                                     
       <chr>       <chr>                                      <chr>                                       
     1 B01001_007  Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE                                  
     2 B01001_008  Estimate!!Total:!!Male:!!20 years          SEX BY AGE                                  
     3 B01001_009  Estimate!!Total:!!Male:!!21 years          SEX BY AGE                                  
     4 B01001_010  Estimate!!Total:!!Male:!!22 to 24 years    SEX BY AGE                                  
     5 B01001_031  Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE                                  
     6 B01001_032  Estimate!!Total:!!Female:!!20 years        SEX BY AGE                                  
     7 B01001_033  Estimate!!Total:!!Female:!!21 years        SEX BY AGE                                  
     8 B01001_034  Estimate!!Total:!!Female:!!22 to 24 years  SEX BY AGE                                  
     9 B01001B_007 Estimate!!Total:!!Male:!!18 and 19 years   SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
    10 B01001B_008 Estimate!!Total:!!Male:!!20 to 24 years    SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
    11 B01001B_022 Estimate!!Total:!!Female:!!18 and 19 years SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
    12 B01001B_023 Estimate!!Total:!!Female:!!20 to 24 years  SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE)
    

    Pull those variables with the Census API

    Census_Total_AndBlack_Age18_24 <-
      get_acs(
        year = 2019,
        geography = "zcta",
        variables = c(v19_Total_AndBlack_Age18_24$name)
      )
    

    Get concept values like SEX BY AGE (BLACK OR AFRICAN AMERICAN ALONE) and label values like Estimate!!Total:!!Male:!!10 to 14 years.

    Census_Total_AndBlack_Age18_24 <- left_join(
      Census_Total_AndBlack_Age18_24, 
      v19 %>% 
        select(name, concept, label) %>% 
        rename(variable = name)
     )
    

    Regular expression extractions + sequence

    Regular expressions to extract ages and create a vector expressing the sequence from the lowest to highest age per range.

    Census_Total_AndBlack_Age18_24_grp <- 
      Census_Total_AndBlack_Age18_24 %>%
      distinct(label, concept) %>% 
        ## regular expression to extract all the numbers in labels like
        ## Estimate!!Total:!!Male:!!5 to 9 years
        mutate(ages = sapply(str_extract_all(label,"\\d{2}"),
                                function(x) paste(x,collapse=""))) %>% 
        mutate(start = str_sub(ages, 1, 2),
                 end = str_sub(ages, 3, 4)) %>% 
        mutate(
          start = case_when(
                is.na(start) ~ "99",
                TRUE ~ start),
          end = case_when(
                is.na(end) ~ "99",
                TRUE ~ end)) %>% 
        dplyr::filter(grepl('Female|Male', label)) %>% 
        mutate(Gender = case_when(
          grepl('Female', label) ~ "FEMALE",
          grepl('Male', label) ~ "MALE",
          TRUE ~ "MISSING")) %>% 
        mutate(EthnoRace = case_when(
          grepl('BLACK', concept) ~ "BLACK",
          TRUE ~ "TOTAL")) %>% 
        mutate(end = case_when(
          is.na(end) | end == "" ~ start,
          TRUE ~ end))
        
        Census_Total_AndBlack_Age18_24_grp_sum <- Census_Total_AndBlack_Age18_24_grp %>% 
          group_by(EthnoRace, Gender) %>% 
          summarize(AGE = as.numeric(unlist(purrr::map2(start, end, `:`)))) %>% 
          ungroup() %>% 
          distinct(EthnoRace, Gender, AGE)
    

    QC comparison

    Remember that the desired ethnoracial groups, gender groups, and ages are printed at the top of this solution.

    arsenal::comparedf(all_grp_qc_frm, Census_Total_AndBlack_Age18_24_grp_sum)
    

    [...]

    Not shared: 0 variables and 0 observations.
    
    Differences found in 0/3 variables compared.
    0 variables compared have non-identical attributes.