A survey was conducted and one of the questions had an option to select multiple answers. When selecting multiple answers, they were all recorded in the same cell.
Additionally, the way that this information was recorded in the cell was different for each surveyor. Sometimes the delimiter was a hyphen (-) and othertimes it was a front slash (/). Also some surveyors listed the items with numbers.
An example is a list of items in a house (see below/image). I would like to create columns when each item is available (the new columns could either have 1/0 or the item name/NA) (see below outcome example).
I could do this in excel using text-to-column and lookup arrays, but there are so many excel sheets with this same column that I'd have to do it in R. Sorry I don't know how to make the example table using R code, but hopefully someone would be able to help anyway.
This is what the data looks like:
House = c("h1","h2","h3","h4","h5","h6","h7","h8","h9","h10","h11")
Items = c("Chair", "Chair- Window/Glass- "," Door- Sofa-", "Chair-
Window/Glass Frame- ", "1. Window/Glass Frame", "Chair- Door- Window-", "Chair- Sofa - Door- Table-", " 4. Table", "Couch (2)", "Window- Table- Chair- Sofa- Door- Couach", "2. Door / Chair")
table1 = as.data.table(House)
table2 = as.data.table(Items)
table = as.data.frame(append(table1, table2))
+-------+------------------------------------------+
| House | Items |
+-------+------------------------------------------+
| 001 | Chair |
| 002 | Chair- Window/Glass- |
| 003 | Door- Sofa- |
| 004 | Chair- Window/Glass Frame- |
| 005 | 1. Window/Glass Frame |
| 006 | Chair- Door- Window- |
| 007 | Chair- Sofa - Door- Table- |
| 008 | 4. Table |
| 009 | Couch (2) |
| 010 | Window- Table- Chair- Sofa- Door- Couach |
| 011 | 2. Door / Chair |
+-------+------------------------------------------+
My thought was to split using all delimiters (strsplit), remove whitespaces (trimws), get a unique list (unique), then replace all variations with the standard I want (grepl), and finally put them into columns according to categories.
items <- strsplit(df$Items, "[/.-]")
items <- trimws(items)
items <- df$Items %>%
strsplit("[/.-]") %>%
str_trim(side = "both")
items_list <- unique(items)
This is what I am trying to get: (Window and glass are the same, and chair/sofa/couch are the same, etc -- so I just need to create larger categories instead of having several columns of essentially the same thing)
+-------+-------+--------+-------+------+
| House | Chair | Window | Table | Door |
+-------+-------+--------+-------+------+
| 001 | Chair | | | |
| 002 | Chair | Window | | |
| 003 | Chair | | | Door |
| 004 | Chair | Window | | |
| 005 | | Window | | |
| 006 | Chair | Window | | Door |
| 007 | Chair | | Table | Door |
| 008 | | | Table | |
| 009 | Chair | | | |
| 010 | Chair | Window | Table | Door |
| 011 | Chair | | | Door |
+-------+-------+--------+-------+------+
You can use str_detect
(or grepl
) in map_df
(or sapply
) to generate a dataframe of logicals, coerce those to integers 0/1, and then bind that to your original dataframe. This method bypasses the hassle of splitting/cleaning/etc. the data. It simply requires that you first create pattern groups for your regex, i.e. chair|sofa|couach|couch
, window|glass
:
library(stringr)
library(dplyr)
library(purrr)
# Create regex pattern groups.
patts <- c(chair = "chair|sofa|couach|couch", window = "window|glass",
table = "table", door = "door")
# Detect pattern groups, coerce to 0/1, bind to origional dataframe.
map_df(patts, ~ str_detect(df$Items, regex(., ignore_case = T))) %>%
mutate_all(as.integer) %>%
bind_cols(df, .)
This returns the following dataframe:
# A tibble: 11 x 6
House Items chair window table door
<dbl> <chr> <int> <int> <int> <int>
1 1 Chair 1 0 0 0
2 2 "Chair- Window/Glass- " 1 1 0 0
3 3 " Door- Sofa-" 1 0 0 1
4 4 "Chair- Window/Glass Frame- " 1 1 0 0
5 5 1. Window/Glass Frame 0 1 0 0
6 6 Chair- Door- Window- 1 1 0 1
7 7 Chair- Sofa - Door- Table- 1 0 1 1
8 8 " 4. Table" 0 0 1 0
9 9 Couch (2) 1 0 0 0
10 10 Window- Table- Chair- Sofa- Door- Couach 1 1 1 1
11 11 2. Door / Chair 1 0 0 1
df <- tibble(House = c(1,2,3,4,5,6,7,8,9,10,11), Items = c("Chair", "Chair- Window/Glass- "," Door- Sofa-", "Chair- Window/Glass Frame- ", "1. Window/Glass Frame", "Chair- Door- Window-", "Chair- Sofa - Door- Table-", " 4. Table", "Couch (2)", "Window- Table- Chair- Sofa- Door- Couach", "2. Door / Chair"))