Search code examples
rtidyversedata-cleaning

splitting a column in columns, different number of elements


I have a column to split made like A, with a different number of elements(genres) for each observations(song). I can I split the column without specifing the destination columns in R?

Column A
"['hip hop', 'pop', 'pop rap', 'r&b', 'southern hip hop', 'trap', 'trap soul']"
"['dance pop', 'girl group', 'pop', 'post-teen pop', 'talent show', 'uk pop']"

I'm trying to have a result like that

genre 1 genre 2 genre ... genre 6 genre 7
Hip Hop pop .... trap Neo soul
dance pop girl group .... Uk pop N/A

with the number of new columns equal to the maximum number of genre a song can have( for example if the song with more genres has ten genres, I should have have ten columns) .

Another option is to create a dummy column for every genre found in the column

Hip Hop Pop Pop Rap r&b ....
1 1 1 1 ....
0 1 0 0 ....

I tried with separate in R but it gave me error


Solution

  • In base R, we could use read.csv after removing the [, ], and quotes (', ")

    df2 <- read.csv(text = gsub('\\[|\\]|\'|"', "", df1$ColumnA), 
     header = FALSE, na.strings = "", col.names = paste0("genre", 1:7))
    

    -output

    df2
         genre1      genre2   genre3         genre4            genre5  genre6     genre7
    1   hip hop         pop  pop rap            r&b  southern hip hop    trap  trap soul
    2 dance pop  girl group      pop  post-teen pop       talent show  uk pop       <NA>
    

    The second dataset can be created with mtabulate on the output above

    library(qdapTools)
    mtabulate(as.data.frame(t(df2)))
    

    -output

        girl group  pop  pop rap  post-teen pop  r&b  southern hip hop  talent show  trap  trap soul  uk pop dance pop hip hop
    V1           0    1        1              0    1                 1            0     1          1       0         0       1
    V2           1    1        0              1    0                 0            1     0          0       1         1       0
    

    data

    df1 <- structure(list(ColumnA = c("['hip hop', 'pop', 'pop rap', 'r&b', 
    'southern hip hop', 'trap', 'trap soul']", 
    "['dance pop', 'girl group', 'pop', 'post-teen pop', 'talent show', 'uk pop']"
    )), class = "data.frame", row.names = c(NA, -2L))