Search code examples
rdataframeyelp

R - Yelp data Business category column has multiple categories per business. Want to separate into category specific columns with values of 1 and 0


thank you in advance for anyone who is going to try and help with this.

I'm using the Yelp data set and the question I want to answer is "which categories are positively correlated with higher stars for X category (Bars for example)"

The issue I'm encountering is that for each business the categories are lumped together into one column and row per businesss_id. So I need a means to separate out each category, turn them into columns and then check if the original category column contains the category that the column was created for.

My current train of thought is to use group_by with business_id and then unnest_tokens the column, then model.matrix() that column into the split I want and then join it onto the df I'm using. But I can't get model.matrix to pass and keep business_id connected to each row.

# an example of what I am using #
df <- 
  data_frame(business_id = c("bus_1",
                             "bus_2", 
                             "bus_3"),
             categories=c("Pizza, Burgers, Caterers",
                          "Pizza, Restaurants, Bars",
                          "American, Barbeque, Restaurants"))

# what I want it to look like #
desired_df <- 
  data_frame(business_id = c("bus_1",
                             "bus_2",
                             "bus_3"),
             categories=c("Pizza, Burgers, Caterers",
                          "Pizza, Restaurants, Bars",
                          "American, Barbeque, Restaurants"),
             Pizza = c(1, 1, 0),
             Burgers = c(1, 0, 0),
             Caterers = c(1, 0, 0),
             Restaurants = c(0, 1, 1),
             Bars = c(0, 1, 0),
             American = c(0, 0, 1),
             Barbeque = c(0, 0, 1))

# where I am stuck #
df %>%
  select(business_id, categories) %>% 
  group_by(business_id) %>% 
  unnest_tokens(categories, categories, token = 'regex', pattern=", ") %>%
  model.matrix(business_id ~ categories, data = .) %>% 
  as_data_frame

Edit: After this post and the answers below I encountered a duplicate identifiers error using spread(). Which brought me to this thread https://github.com/tidyverse/tidyr/issues/426 where the answer to my question was posted, I've repasted it below.

# duplicating the error with a smaller data.frame #

library(tidyverse)
 df <- structure(list(age = c("21", "17", "32", "29", "15"), 
                        gender = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("Female", "Male"), class = "factor")), 
                   row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("age", "gender"))
 df
#> # A tibble: 5 x 2
#>   age   gender
#>   <chr> <fct> 
#> 1 21    Male  
#> 2 17    Female
#> 3 32    Female
#> 4 29    Male  
#> 5 15    Male  

df %>% 
  spread(key=gender, value=age)
#> Error: Duplicate identifiers for rows (2, 3), (1, 4, 5)

# fixing the problem #

df %>% 
  group_by_at(vars(-age)) %>%  # group by everything other than the value column. 
  mutate(row_id=1:n()) %>% ungroup() %>%  # build group index
  spread(key=gender, value=age) %>%    # spread
  select(-row_id)  # drop the index

#> # A tibble: 3 x 2
#>   Female Male 
#>   <chr>  <chr>
#> 1 17     21   
#> 2 32     29   
#> 3 NA     15   

Solution

  • Building from your nice use of tidytext::unnest_tokens(), you can also use this alternative solution

    library(dplyr)
    library(tidyr)
    library(tidytext)
    
    df %>%
      select(business_id, categories) %>% 
      group_by(business_id) %>% 
      unnest_tokens(categories, categories, token = 'regex', pattern=", ") %>% 
      mutate(value = 1) %>% 
      spread(categories, value, fill = 0)
    # business_id american barbeque  bars burgers caterers pizza restaurants
    # <chr>          <dbl>    <dbl> <dbl>   <dbl>    <dbl> <dbl>       <dbl>
    # bus_1              0        0     0       1        1     1           0
    # bus_2              0        0     1       0        0     1           1
    # bus_3              1        1     0       0        0     0           1