I have this data frame glimpse(df)
Observations: 2,211
Variables: 3
$ city <chr> "Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas", "Las Veg...
$ categories <chr> "c(\"Korean\", \"Sushi Bars\")", "c(\"Japanese\", \"Sushi Bars\")", "Tha...
$ is_open <chr> "0", "0", "1", "0", "1", "1", "0", "1", "0", "1", "1", "1", "0", "1", "1...
Here is a small dput()
structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas",
"Phoenix", "Las Vegas"), categories = c("c(\"Korean\", \"Sushi Bars\")",
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")",
"Korean"), is_open = c("0", "0", "1", "0", "1")), .Names = c("city",
"categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")
The data consists different cities city
with different cuisines categories
.
I want to make a contingency table to visualize which cuisines are associated with closings (is_opem = 0)
or openings (is_open = 1)
.
I want to do this with a contingency table. To do so I tried this one but I got this error:
xtabs(is_open ~., data = df)
Error in FUN(X[[i]], ...) : invalid 'type' (character) of argument
When I convert the variables as.factor()
I get lots of tables, not one. Is there a way to this so ít´s looking like below?
Categorie/City Las Vegas Pittsburgh
Korean 50/50 30/70
Sushi Bars 40/60 40/60
The numbers in the columns are the frequencies of the closings (is_opem = 0)
and openings (is_open = 1)
for each category per city (e.g. for Korean in Las Vegas the distribution for closings(0) and openings(1) is 50/50).
Here a solution using data.table
to cast
your data using a counting function based on stri_count
from stringi
package. The latter might also be achieved via table
or sum(grepl())
with ifelse
constructs (depending on the required flexibility with respect to the data structure, speed requirements, etc.). Please note that I have also reformatted your data into more clean "long format" with the aid of this answer. This reformatting might be skipped if you format your data this way right from the start. I hope this is what you are looking for.
#your data
df <- structure(list(city = c("Las Vegas", "Pittsburgh", "Las Vegas", "Phoenix", "Las Vegas")
,categories = c("c(\"Korean\", \"Sushi Bars\")",
"c(\"Japanese\", \"Sushi Bars\")", "Thai", "c(\"Sushi Bars\", \"Japanese\")",
"Korean")
,is_open = c("0", "0", "1", "0", "1"))
,.Names = c("city", "categories", "is_open"), row.names = c(NA, 5L), class = "data.frame")
library(data.table)
library(stringi)
#format data to correct "long format"
DT <- as.data.table(df)
DT[, categories := gsub("c\\(\"|\"|\"\\)", "", categories)]
DT <- DT[, .(categories = unlist(strsplit(as.character(categories), ", ", fixed = TRUE))),
by = .(city, is_open)]
# city is_open categories
# 1: Las Vegas 0 Korean
# 2: Las Vegas 0 Sushi Bars
# 3: Pittsburgh 0 Japanese
# 4: Pittsburgh 0 Sushi Bars
# 5: Las Vegas 1 Thai
# 6: Las Vegas 1 Korean
# 7: Phoenix 0 Sushi Bars
# 8: Phoenix 0 Japanese
#specify all_unique_count_items to also cover items that are not present in x
calc_count_distr <- function(x, all_unique_count_items) {
count_distribution <- sapply(all_unique_count_items, function(y) {
100*round(sum(stri_count_fixed(x, y))/length(x), d =2)
})
paste(count_distribution, collapse = "/")
}
dcast.data.table(DT, categories ~ city, value.var = "is_open"
,fun.aggregate = function(x) calc_count_distr(x, all_unique_count_items = unique(DT$is_open))
,fill = NA)
# categories Las Vegas Phoenix Pittsburgh
#1: Japanese NA 100/0 100/0
#2: Korean 50/50 NA NA
#3: Sushi Bars 100/0 100/0 100/0
#4: Thai 0/100 NA NA