Search code examples
rdplyrapplylapplysapply

How to record presence/absence of splitted elements of large strings in individual dataframe columns


I have a dataframe with long and messy strings of household amenities. I want to break the strings into unique amenities, create for each unique amenity a new column in the dataframe, and record in the new columns the presence/absence of the individual amenities in the strings. Using nested for loops I've found a way to accomplish the task. However, what I'd like to know is how the same result can be obtained using one of the apply family of functions or a dplyr method to avoid the loops.

Reproducible data:

df <- data.frame(
  id = 1:4,
  amenities = c('{"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating}',
                '{TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed"}',
                '{"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector",Carbon monoxide}',
                '{Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}'))

What I've done so far is this:

amenities_clean <- gsub('[{}"]', '', df$amenities) # remove unwanted stuff 
amenities_split <- strsplit(amenities_clean, ",") # split rows into individual amenities
amenities_unique <- unique(unlist(strsplit(amenities_clean, ","))) # get a list of unique amenities 
df[amenities_unique] <- NA # set up the columns for each amenity

To record in the new columns whether the individual amenities are present or absent in the strings, I'm using str_detect as well as nested for loops:

# record presence/absence of individual amenities in each new column:
library(stringr)
for(i in 1:ncol(df[amenities_unique])){
  for(j in 1:nrow(df)){
    df[amenities_unique][j,i] <- 
      ifelse(str_detect(amenities_split[j], names(df[amenities_unique][i])), 1, 0)
  }
}

While this produces warnings, they seem not to be harmful, as the result looks okay:

df
  id                                                                                                amenities Wireless Internet
1  1        {"Wireless Internet","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating}                 1
2  2 {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning",Kitchen,"Smoking allowed","Pets allowed"}                 1
3  3              {"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector",Carbon monoxide}                 0
4  4                                    {Washer,Dryer,Essentials,Shampoo,Hangers,"Laptop friendly workspace"}                 0
  Wheelchair accessible Kitchen Elevator Buzzer/wireless intercom Heating TV Cable TV Internet Air conditioning Smoking allowed
1                     1       1        1                        1       1  0        0        1                0               0
2                     0       1        0                        0       0  1        1        1                1               1
3                     0       0        0                        1       1  0        0        0                0               0
4                     0       0        0                        0       0  0        0        0                0               0
  Pets allowed Family/kid friendly Smoke detector Carbon monoxide Washer Dryer Essentials Shampoo Hangers Laptop friendly workspace
1            0                   0              0               0      0     0          0       0       0                         0
2            1                   0              0               0      0     0          0       0       0                         0
3            0                   1              1               1      0     0          0       0       0                         0
4            0                   0              0               0      1     1          1       1       1                         1

Given the warnings and given the convolutedness of nested loops, how can the same result be obtained using a function from the applyfamily of functions or using dplyr?


Solution

  • Once you clean the amenities , you can use cSplit_e from splitstackshape.

    df$amenities_clean <- gsub('[{}"]', '', df$amenities) 
    splitstackshape::cSplit_e(df, "amenities_clean", type = "character", fill = 0)
    

    To solve it using one of the apply functions we can do :

    temp <- strsplit(df$amenities_clean, ",")
    amenities_unique <- unique(unlist(temp))
    cbind(df, t(sapply(temp, function(x) 
                       table(factor(x, levels = amenities_unique)))))