r dplyr

Mutating a variable using a different-dimensioned lookup table


I have a dataset that includes individuals nested within countries. One of the variables is the individuals' native language. I also have another dataset, let's call it lookup, that includes a list of national languages for each country. For example (my actual data differs, but has the same basic structure):

individuals <- data.frame(
  cntry = c("AT", "AT", "AT", "BE", "BE", "BE", "HU"),
  lang = c("GER", "ENG", "FRE", "FRE", "DUT", "ARA", "HUN"))

languages <- data.frame(
  lcntry = c("AT", "DE", "BE", "BE", "HU"),
  llang = c("GER", "GER", "FRE", "DUT", "HUN")
)

I want to generate a new logical variable, natlang, in the individuals data frame that tells me whether the language recorded for each individual is in the national language list of that individual's country.

To clarify: I want natlang to only be TRUE if the language is listed as a national language for that specific country, so just doing this will not work for me:

individuals <- individuals %>% mutate(natlang = lang %in% languages$llang)

I have tried the following, which runs, but gives incorrect results:

individuals <- individuals %>% 
  mutate(natlang = lang %in% languages$llang[languages$lcntry == cntry])

I have also tried the following:

individuals <- individuals %>% 
  mutate(natlang = lang %in% filter(languages, lcntry == cntry)$llang)

This fails with the error:

Error in `mutate()`:
ℹ In argument: `natlang = lang %in% filter(languages, lcntry == cntry)$llang`.
Caused by error in `filter()`:
ℹ In argument: `lcntry == cntry`.
Caused by error:
! `..1` must be of size 5 or 1, not size 7.
Backtrace:
  1. individuals %>% ...
 17. dplyr:::dplyr_internal_error(...)

I assume that my problem has to do with the two data frames being of different lengths, and mutate trying to vectorize everything to the length of the first data frame, but I'm not sure about this.
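
The suspicion about lengths can be checked directly. The sketch below assumes R 4.0 or later (so the string columns are character vectors, not factors) and uses `per_row` as a purely illustrative name. It shows that `cntry` inside `mutate()` is the whole 7-element column and not a single row's value, and that evaluating the same expression one row at a time with `rowwise()` appears to give the intended per-country check.

```r
library(dplyr)

individuals <- data.frame(
  cntry = c("AT", "AT", "AT", "BE", "BE", "BE", "HU"),
  lang  = c("GER", "ENG", "FRE", "FRE", "DUT", "ARA", "HUN")
)
languages <- data.frame(
  lcntry = c("AT", "DE", "BE", "BE", "HU"),
  llang  = c("GER", "GER", "FRE", "DUT", "HUN")
)

# Inside mutate(), `cntry` refers to the full column, so `lcntry == cntry`
# compares a 5-element vector with a 7-element vector.
length(languages$lcntry)   # 5
length(individuals$cntry)  # 7

# rowwise() makes `cntry` and `lang` single values within each row,
# so the lookup is filtered to one country at a time.
# `per_row` is a hypothetical name used only for this illustration.
per_row <- individuals %>%
  rowwise() %>%
  mutate(natlang = lang %in% languages$llang[languages$lcntry == cntry]) %>%
  ungroup()

per_row$natlang
# TRUE FALSE FALSE TRUE TRUE FALSE TRUE
```

Row-by-row evaluation is easy to read but can be slow on large data, which is one reason a join is usually preferred.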


Solution

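  • You can simply use a left join. Add a natlang = TRUE column to the lookup table, then join it to individuals on both country and language. Rows that find a match get TRUE; rows without a match get NA:

    library(dplyr)

    left_join(
      individuals,
      mutate(languages, natlang = TRUE),
      by = c("cntry" = "lcntry", "lang" = "llang")
    )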
    

    Output:

      cntry lang natlang
    1    AT  GER    TRUE
    2    AT  ENG      NA
    3    AT  FRE      NA
    4    BE  FRE    TRUE
    5    BE  DUT    TRUE
    6    BE  ARA      NA
    7    HU  HUN    TRUE
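
The join leaves NA for non-matching rows, while the question asks for a strictly logical TRUE/FALSE column. One possible way to close that gap, sketched here with the example data, is to replace the NAs with `coalesce()`. The names `result` and `result_paste` are illustrative only, and the second variant (matching on pasted country/language keys) is an alternative added for comparison, not part of the original answer.

```r
library(dplyr)

individuals <- data.frame(
  cntry = c("AT", "AT", "AT", "BE", "BE", "BE", "HU"),
  lang  = c("GER", "ENG", "FRE", "FRE", "DUT", "ARA", "HUN")
)
languages <- data.frame(
  lcntry = c("AT", "DE", "BE", "BE", "HU"),
  llang  = c("GER", "GER", "FRE", "DUT", "HUN")
)

# Variant 1: same left join as above, then turn unmatched rows (NA) into FALSE.
result <- left_join(
  individuals,
  mutate(languages, natlang = TRUE),
  by = c("cntry" = "lcntry", "lang" = "llang")
) %>%
  mutate(natlang = coalesce(natlang, FALSE))

# Variant 2 (no join): build a combined country/language key on both sides
# and test membership. Both paste() calls are vectorized, so lengths may differ.
result_paste <- individuals %>%
  mutate(natlang = paste(cntry, lang) %in%
           paste(languages$lcntry, languages$llang))

result$natlang
# TRUE FALSE FALSE TRUE TRUE FALSE TRUE
```

Note that the join variant would duplicate rows of individuals if the lookup table contained the same country/language pair more than once, so it may be worth de-duplicating the lookup first with `distinct()`. The paste variant is not affected by duplicates.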