I have a dataframe with long and messy strings of household amenities. I want to break the strings into unique amenities, create for each unique amenity a new column in the dataframe, and record in the new columns the presence/absence of the individual amenities in the strings.
Using nested for
loops I've found a way to accomplish the task. However, what I'd like to know is how the same result can be obtained using one of the apply
family of functions or a dplyr
method to avoid the loops.
Reproducible data:
df <- data.frame(
id = 1:4,
amenities = c('{"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating}',
'{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed"}',
'{"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector",Carbon monoxide}',
'{Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}'))
What I've done so far is this:
amenities_clean <- gsub('[{}"]', '', df$amenities) # remove unwanted stuff
amenities_split <- strsplit(amenities_clean, ",") # split rows into individual amenities
amenities_unique <- unique(unlist(strsplit(amenities_clean, ","))) # get a list of unique amenities
df[amenities_unique] <- NA # set up the columns for each amenity
To record in the new columns whether the individual amenities are present or absent in the strings, I'm using str_detect
as well as nested for
loops:
# record presence/absence of individual amenities in each new column:
library(stringr)
for(i in 1:ncol(df[amenities_unique])){
for(j in 1:nrow(df)){
df[amenities_unique][j,i] <-
ifelse(str_detect(amenities_split[j], names(df[amenities_unique][i])), 1, 0)
}
}
While this produces warnings, they seem not to be harmful, as the result looks okay:
df
id amenities Wireless Internet
1 1 {"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating} 1
2 2 {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed"} 1
3 3 {"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector",Carbon monoxide} 0
4 4 {Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"} 0
Wheelchair accessible Kitchen Elevator Buzzer/wireless intercom Heating TV Cable TV Internet Air conditioning Smoking allowed
1 1 1 1 1 1 0 0 1 0 0
2 0 1 0 0 0 1 1 1 1 1
3 0 0 0 1 1 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0
Pets allowed Family/kid friendly Smoke detector Carbon monoxide Washer Dryer Essentials Shampoo Hangers Laptop friendly workspace
1 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0
3 0 1 1 1 0 0 0 0 0 0
4 0 0 0 0 1 1 1 1 1 1
Given the warnings and given the convolutedness of nested loops, how can the same result be obtained using a function from the apply
family of functions or using dplyr
?
Once you clean the amenities , you can use cSplit_e
from splitstackshape
.
df$amenities_clean <- gsub('[{}"]', '', df$amenities)
splitstackshape::cSplit_e(df, "amenities_clean", type = "character", fill = 0)
To solve it using one of the apply functions we can do :
temp <- strsplit(df$amenities_clean, ",")
amenities_unique <- unique(unlist(temp))
cbind(df, t(sapply(temp, function(x)
table(factor(x, levels = amenities_unique)))))