Search code examples
rcontingency

contingency table with strings


I have this data frame glimpse(df)

Observations: 2,211
Variables: 3
$ city       <chr> "Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas", "Las Veg...
$ categories <chr> "c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")", "Tha...
$ is_open    <chr> "0", "0", "1", "0", "1", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1...

Here is a small dput()

structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", 
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")", 
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city", 
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")

The data consists different cities city with different cuisines categories. I want to make a contingency table to visualize which cuisines are associated with closings (is_opem = 0) or openings (is_open = 1).

I want to do this with a contingency table. To do so I tried this one but I got this error:

xtabs(is_open ~., data = df)

Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument

When I convert the variables as.factor() I get lots of tables, not one. Is there a way to this so ít´s looking like below?

Categorie/City          Las Vegas     Pittsburgh
           Korean       50/50         30/70
           Sushi Bars   40/60         40/60

The numbers in the columns are the frequencies of the closings (is_opem = 0) and openings (is_open = 1) for each category per city (e.g. for Korean in Las Vegas the distribution for closings(0) and openings(1) is 50/50).


Solution

  • Here a solution using data.table to cast your data using a counting function based on stri_count from stringi package. The latter might also be achieved via table or sum(grepl()) with ifelse constructs (depending on the required flexibility with respect to the data structure, speed requirements, etc.). Please note that I have also reformatted your data into more clean "long format" with the aid of this answer. This reformatting might be skipped if you format your data this way right from the start. I hope this is what you are looking for.

    #your data
    df <- structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas")
                           ,categories = c("c(\"Korean\", \"Sushi Bars\")", 
                                         "c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")", 
                                         "Korean")
                           ,is_open = c("0", "0", "1", "0", "1"))
                           ,.Names = c("city",  "categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")
    
    library(data.table)
    library(stringi)                                  
    
    #format data to correct "long format"
    DT <- as.data.table(df)
    DT[, categories := gsub("c\\(\"|\"|\"\\)", "", categories)]
    DT <- DT[, .(categories = unlist(strsplit(as.character(categories), ", ", fixed = TRUE))), 
             by = .(city, is_open)]
    #           city is_open categories
    # 1:  Las Vegas       0     Korean
    # 2:  Las Vegas       0 Sushi Bars
    # 3: Pittsburgh       0   Japanese
    # 4: Pittsburgh       0 Sushi Bars
    # 5:  Las Vegas       1       Thai
    # 6:  Las Vegas       1     Korean
    # 7:    Phoenix       0 Sushi Bars
    # 8:    Phoenix       0   Japanese
    
    #specify all_unique_count_items to also cover items that are not present in x
    calc_count_distr <-  function(x, all_unique_count_items) {
    
        count_distribution <- sapply(all_unique_count_items, function(y) {
                                         100*round(sum(stri_count_fixed(x, y))/length(x), d =2)
                                    })
        paste(count_distribution, collapse = "/")
    }
    
    dcast.data.table(DT, categories ~ city, value.var = "is_open"
                     ,fun.aggregate = function(x) calc_count_distr(x, all_unique_count_items = unique(DT$is_open))
                     ,fill = NA)
    #   categories Las Vegas Phoenix Pittsburgh
    #1:   Japanese        NA   100/0      100/0
    #2:     Korean     50/50      NA         NA
    #3: Sushi Bars     100/0   100/0      100/0
    #4:       Thai     0/100      NA         NA