Search code examples
rvariablescomparisontidyversevariable-names

How to check condition based on a variable match


Say that I have these data:

data <- tibble(person=c("Jack", "Jill", "Bill"), pet=c("dog", "cat", "zebra"), pet_cat=c(0,1,0), pet_dog=c(0,1,1), pet_llama=c(1,1,1))

  person pet   pet_cat pet_dog pet_llama
  <chr>  <chr>   <dbl>   <dbl>     <dbl>
1 Jack   dog         0       0         1
2 Jill   cat         1       1         1
3 Bill   zebra       0       1         1

What I want to do is, for each person, to first find out which pet they have (Jack has a dog) and then go to the column that includes this pet (for Jack, this is the pet_dog column). And then make a new column, match, that copies the value from the pet_ column (for Jack, this is 0 because pet_dog's value for Jack is 0). Not sure if this is necessary, but one way to think about this is that pet refers to the person's preference for a pet and the pet_ columns refer to the availability of the pet in the pet store, and match tells if the person is able to buy the pet of their preference.

Furthermore, in some cases, there will not be a pet_ column matching the person's pet. In these cases, the match should be 0.

Note that for one person, say Jill, the other people's values are totally irrelevant, as are the pet_ columns that do not correspond to Jill's pet value.

This is the desired result:

data_want <- tibble(person=c("Jack", "Jill", "Bill"), pet=c("dog", "cat", "zebra"), pet_cat=c(0,1,0), pet_dog=c(0,1,1), pet_llama=c(1,1,1), match=c(0, 1, 0))

  person pet   pet_cat pet_dog pet_llama match
  <chr>  <chr>   <dbl>   <dbl>     <dbl> <dbl>
1 Jack   dog         0       0         1     0
2 Jill   cat         1       1         1     1
3 Bill   zebra       0       1         1     0

How can I do this?

Note that I think it will involve putting the person's pet value into a variable using something like get or assign combined with paste0, then going to this column and pulling the relevant value and putting it in match.


Solution

  • An option with tidyverse

    1. Reshape to long format with pivot_longer
    2. Subset the rows where value is 1
    3. Create a logical column by matching the substring from 'pet' with 'name' column - str_detect
    4. Grouped by 'pet', check if there are any TRUE in 'match', coerce it to binary (+)
    5. Join with original dataset - right_join
    6. If we want to keep the order from original data, create a sequence column (row_number()) and order the rows (arrange)
    library(dplyr)
    library(tidyr)
    library(stringr)
    data %>%
         pivot_longer(cols = contains('_')) %>% 
         filter(value == 1) %>% 
         mutate(match = str_detect(name, pet)) %>%
         group_by(pet) %>%
         summarise(match = +(any(match))) %>% 
         right_join(data %>% 
                       mutate(rn = row_number())) %>% 
         arrange(rn) %>% 
         select(names(data), match)
    

    -output

    # A tibble: 3 x 6
      person pet   pet_cat pet_dog pet_llama match
      <chr>  <chr>   <dbl>   <dbl>     <dbl> <int>
    1 Jack   dog         0       0         1     0
    2 Jill   cat         1       1         1     1
    3 Bill   zebra       0       1         1     0
    

    Or may use rowwise

    1. Create the rowwise attribute - grouped by row
    2. Use c_across to create a logical vector i.e. where value is 1
    3. Use the logical index to subset the column names that starts_with 'pet_'`,
    4. Remove the substring 'pet_' - str_remove
    5. Create a single string by concatenating them - str_c
    6. Remove the group attribute (ungroup) and use str_detect to detect if the 'pet' column values have a match with regex pattern created
    data %>%
       rowwise %>% 
       mutate(match =  str_c(str_remove(names(select(cur_data(), 
          contains('_')))[c_across(contains("_")) == 1], ".*_"), 
              collapse="|")) %>%
       ungroup %>% 
       mutate(match = +(str_detect(pet, match)))
    # A tibble: 3 x 6
      person pet   pet_cat pet_dog pet_llama match
      <chr>  <chr>   <dbl>   <dbl>     <dbl> <int>
    1 Jack   dog         0       0         1     0
    2 Jill   cat         1       1         1     1
    3 Bill   zebra       0       1         1     0
    

    Or using base R

    1. Select the column names that contains 'pet_' (nm1)
    2. Create a row/column index based on sequence of rows and matching the substring of column name with 'pet' column
    3. Get the corresponding elements from the selected columns of data with the use of 2 as a matrix
    4. Replace the elements that are NA (i.e. no match) to 0
    nm1 <- names(data)[startsWith(names(data), "pet_")]
    data$match <- as.data.frame(data[nm1])[cbind(seq_len(nrow(data)), 
           match(data$pet, sub("pet_", "", nm1)))]
    data$match[is.na(data$match)] <- 0
    

    -output

    data
    # A tibble: 3 x 6
      person pet   pet_cat pet_dog pet_llama match
      <chr>  <chr>   <dbl>   <dbl>     <dbl> <dbl>
    1 Jack   dog         0       0         1     0
    2 Jill   cat         1       1         1     1
    3 Bill   zebra       0       1         1     0