
Predicting race from surname and party affiliation


I am trying to use the wru package to predict race for a sample of individuals in the US from their surnames and locations. The documentation for the predict_race function can be found here.

However, the function throws an error every time I run it, and I have not been able to get it to complete. It would be very useful for my analysis, so I'm hoping someone can help me understand whether the package is buggy or whether I'm doing something wrong.

This is the code I'm using:

library(dplyr) # provides %>%, mutate() and filter()

output <- wru::predict_race(
  voter.file = df %>%
    # zero-pad the codes to the widths the documentation asks for
    mutate(county = sprintf('%03d', county),
           tract = sprintf('%06d', tract)) %>%
    filter(!is.na(tract)),
  census.geo = "tract",
  census.key = census_api_key, # obtained from: https://api.census.gov/data/key_signup.html
  party = "party_code")

Note that the documentation mentions the formatting of the geographic indicators that should be passed to the function:

  • "the two-character abbreviation for each individual's state of residence (e.g., "nj" for New Jersey)"
  • "County is three characters (e.g., "031" not "31"), tract is six characters"

That's why I'm formatting county and tract as shown in the code above. I've shared a snippet of data below which can be used to replicate the error I'm getting.
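To make the padding step concrete, here is a minimal sketch in base R of what the two sprintf calls produce for codes from the sample data. The variable names (county_chr, tract_chr, geo_id) are only for illustration; the combined identifier is built in the same state-county-tract form that the error message uses, so it can be compared against it directly.

```r
# Integer codes as they appear in the sample data
county <- c(71L, 71L)
tract  <- c(30101L, 30303L)

# Zero-pad to the fixed widths the documentation describes
county_chr <- sprintf("%03d", county)  # three characters
tract_chr  <- sprintf("%06d", tract)   # six characters

# Illustrative identifier in state-county-tract form
geo_id <- paste("OR", county_chr, tract_chr, sep = "-")
print(geo_id)
```

The second identifier is the one reported in the error, which suggests the padding itself is behaving as intended.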

This is the error I'm getting:

County 1 of 1: 071
Proceeding with last name predictions...
ℹ Downloading "wru-data-census_last_c.rds"...
  |=======================================================================================| 100%
ℹ Downloading "wru-data-first_c.rds"...
  |=======================================================================================| 100%
ℹ Downloading "wru-data-last_c.rds"...
  |=======================================================================================| 100%
ℹ Downloading "wru-data-mid_c.rds"...
  |=======================================================================================| 100%
Proceeding with Census geographic data at tract level...
Using Census geographic data from provided census.data object...
State 1 of 1: OR
Error in census_helper_new(key = census.key, voter.file = voter.file,  : 
  The following locations in the voter.file are not available in the census data (listed as state-county-tract):
OR-071-030303

My thinking was that the function rejects some particular county + tract combination, so I could split my data into small groups (say n = 10) and pass each group to the function in turn, saving any successful output to a CSV. I could then revisit the groups that failed and split them into progressively smaller pieces until, hopefully, at least some of the names get predictions. I attempted this, but I get exactly the same error and the loop breaks.
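For reference, the batching idea can be sketched so that one failing group does not stop the loop, by wrapping each call in tryCatch. This is only a rough sketch in base R: run_in_batches and fake_predict are hypothetical names, and fake_predict is a stand-in that fails on tract "030303" so the example runs without a Census API key. It does not fix the underlying tract mismatch; it only keeps the loop going and collects the failed groups.

```r
# Run `fn` on consecutive batches of `data`, catching errors per batch
run_in_batches <- function(data, fn, batch_size = 10) {
  # Assign each row to a batch index: 1, 1, ..., 2, 2, ...
  batch_id <- ceiling(seq_len(nrow(data)) / batch_size)
  batches <- split(data, batch_id)

  succeeded <- list()
  failed <- list()
  for (id in names(batches)) {
    result <- tryCatch(
      fn(batches[[id]]),
      error = function(e) {
        message("Batch ", id, " failed: ", conditionMessage(e))
        NULL  # signal failure to the code below
      }
    )
    if (is.null(result)) {
      failed[[id]] <- batches[[id]]   # keep the input rows for a retry
    } else {
      succeeded[[id]] <- result       # keep the successful output
    }
  }
  list(
    succeeded = do.call(rbind, succeeded),
    failed = do.call(rbind, failed)
  )
}

# Stand-in for the real prediction call (hypothetical, for illustration only)
fake_predict <- function(batch) {
  if (any(batch$tract == "030303")) stop("tract not available: 030303")
  batch$predicted <- TRUE
  batch
}

toy <- data.frame(
  surname = c("ALEXANDER", "BACKUS", "BARTMAN", "BELL"),
  tract   = c("030101", "030101", "030303", "030303"),
  stringsAsFactors = FALSE
)

out <- run_in_batches(toy, fake_predict, batch_size = 2)
```

With the real package, fn would be a small wrapper around the predict_race call shown earlier, taking the batch as voter.file. The failed element can then be passed back in with a smaller batch_size.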

Data for a reprex:

df <- tibble::tribble(
  ~surname, ~state, ~county, ~tract, ~party_code,
  "ALEXANDER",   "OR",     71L, 30101L,       "NAV",
  "AQUIPEL",   "OR",     71L, 30101L,       "NAV",
  "BABBITT",   "OR",     71L, 30101L,       "NAV",
  "BACKUS",   "OR",     71L, 30101L,       "DEM",
  "BACKUS",   "OR",     71L, 30101L,       "DEM",
  "BARKER",   "OR",     71L, 30101L,       "DEM",
  "BARTMAN",   "OR",     71L, 30303L,       "REP",
  "BARTMAN",   "OR",     71L, 30303L,       "REP",
  "BASS",   "OR",     71L, 30101L,       "DEM",
  "BATTERMAN",   "OR",     71L, 30303L,       "NAV",
  "BATTERMAN",   "OR",     71L, 30303L,       "NAV",
  "BEARDEN",   "OR",     71L, 30101L,       "NAV",
  "BELANDER",   "OR",     71L, 30101L,       "NAV",
  "BELL",   "OR",     71L, 30303L,       "NAV",
  "BEM",   "OR",     71L, 30101L,       "NAV",
  "BENNETT",   "OR",     71L, 30102L,       "NAV",
  "BERG",   "OR",     71L, 30101L,       "NAV",
  "BERGER",   "OR",     71L, 30303L,       "NAV",
  "BESEAU",   "OR",     71L, 30303L,       "NAV",
  "BIERER",   "OR",     71L, 30101L,       "IND",
  "BILLETTE",   "OR",     71L, 30303L,       "IND",
  "BISCHOFF",   "OR",     71L, 30101L,       "NAV",
  "BLATT",   "OR",     71L, 30101L,       "NAV",
  "BOCHART",   "OR",     71L, 30101L,       "NAV",
  "BOWLIN",   "OR",     71L, 30202L,       "NAV",
  "BURGESS",   "OR",     71L, 30303L,       "NAV",
  "BURNETT",   "OR",     71L, 30101L,       "NAV",
  "BURNETT",   "OR",     71L, 30101L,       "NAV",
  "BYE ODEA",   "OR",     71L, 30101L,       "NAV",
  "BYINGTON",   "OR",     71L, 30101L,       "NAV",
  "CARSLEY",   "OR",     71L, 30102L,       "NAV",
  "CARTWRIGHT",   "OR",     71L, 30101L,       "NAV",
  "CATES",   "OR",     71L, 30101L,       "NAV",
  "CHANDLER",   "OR",     71L, 30101L,       "NAV",
  "CHESHIER",   "OR",     71L, 30102L,       "NAV",
  "CISNEROS",   "OR",     71L, 30303L,       "NAV",
  "COE",   "OR",     71L, 30101L,       "NAV",
  "CORREA",   "OR",     71L, 30303L,       "NAV",
  "COSHOW",   "OR",     71L, 30101L,       "NAV",
  "COURTNEY",   "OR",     71L, 30101L,       "NAV",
  "CROFT",   "OR",     71L, 30101L,       "NAV",
  "CROSSLAND",   "OR",     71L, 30101L,       "NAV",
  "CRUZ",   "OR",     71L, 30102L,       "NAV",
  "CULLENS",   "OR",     71L, 30101L,       "NAV",
  "CURRIER",   "OR",     71L, 30101L,       "NAV",
  "DAHME",   "OR",     71L, 30303L,       "DEM",
  "DAHME",   "OR",     71L, 30303L,       "DEM",
  "DAVIS",   "OR",     71L, 30303L,       "NAV",
  "DAVIS",   "OR",     71L, 30101L,       "NAV",
  "DEHART",   "OR",     71L, 30303L,       "NAV",
  "DENMAN",   "OR",     71L, 30101L,       "NAV",
  "DENNIS",   "OR",     71L, 30101L,       "NAV",
  "DILLESHAW",   "OR",     71L, 30101L,       "NAV",
  "DOOTSON",   "OR",     71L, 30101L,       "NAV",
  "EIDE",   "OR",     71L, 30101L,       "NAV",
  "EILERS",   "OR",     71L, 30101L,       "NAV",
  "EKREN",   "OR",     71L, 30101L,       "DEM",
  "ELLIS",   "OR",     71L, 30101L,       "NAV",
  "ERICKSON",   "OR",     71L, 30101L,       "NAV",
  "ESKELSEN",   "OR",     71L, 30101L,       "NAV",
  "EVANS",   "OR",     71L, 30303L,       "NAV",
  "FETTIG",   "OR",     71L, 30102L,       "NAV",
  "FINDLEY",   "OR",     71L, 30102L,       "NAV",
  "FLANAGAN",   "OR",     71L, 30101L,       "DEM",
  "FRAYCHINEAUD",   "OR",     71L, 30102L,       "NAV",
  "FREY",   "OR",     71L, 30101L,       "NAV"
)

Solution

  • It turns out that if I change the line census.geo = "tract" to census.geo = "county", the code runs fine. This is not a direct answer, since the package says it supports tract-level predictions, but it is good enough for my purposes.
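If tract-level predictions are still wanted for most rows, one possible variation is to separate out the rows whose location the error message reports as unavailable and use the county-level fallback only for those. The sketch below is base R and uses four rows from the sample data; missing_locations is copied by hand from the error message, and the names tract_ok and county_only are just illustrative. I have not verified that the remaining rows then succeed at the tract level, so treat it as a starting point rather than a confirmed fix.

```r
# Locations reported as unavailable in the error message (state-county-tract)
missing_locations <- c("OR-071-030303")

# A few rows from the sample data, padded as in the question
voter <- data.frame(
  surname = c("ALEXANDER", "BARTMAN", "BENNETT", "BELL"),
  state   = "OR",
  county  = sprintf("%03d", c(71L, 71L, 71L, 71L)),
  tract   = sprintf("%06d", c(30101L, 30303L, 30102L, 30303L)),
  stringsAsFactors = FALSE
)

# Build the same identifier the error message uses and flag affected rows
location <- paste(voter$state, voter$county, voter$tract, sep = "-")
is_missing <- location %in% missing_locations

tract_ok    <- voter[!is_missing, ]  # candidates for census.geo = "tract"
county_only <- voter[is_missing, ]   # candidates for census.geo = "county"
```

Note that predictions from the two groups would be conditioned on different geographic detail, so they are not strictly comparable.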