Search code examples
rdataframeaggregationanalysis

aggregating columns based on commas


I have the following dataframe and I'm trying to separate the commas and turn that particular name(s) into their own individual columns and specify if that particular column names exist (which are separated by commas) for that particular ID. (1 = Yes, 0 = No) Any help would be appreciated! Thanks!

ID<- c(1,2,3,4,5,6)
Details<- c("V1,V2", "V1,V3", "V1", "V2", "V3,V4", "V2,V3" )

data.frame <- data.frame(ID, Details, stringsAsFactors=FALSE)

DESIRED OUTPUT:

ID<-c(1,2,3,4,5,6)
V1<-c(1,1,1,0,0,0)
V2<-c(1,0,0,1,0,1)
V3<-c(0,1,0,0,1,1)
V4<-c(0,0,0,0,1,0)

data.frame1<-data.frame(ID, V1, V2, V3, V4, stringsAsFactors=FALSE)

Solution

  • The most straightforward way I see is to would be to build a data.frame for each of these vectors hidden in strings and bind them. purrr can help to make it quite compact. Note that column ID isn't needed, I'll work on Details directly.

    library(purrr)
    df <- map_dfr(strsplit(Details, ","),
                  ~data.frame(t(setNames(rep(1, length(.x)), .x))))
    df[is.na(df)] <- 0
    
    #   V1 V2 V3 V4
    # 1  1  1  0  0
    # 2  1  0  1  0
    # 3  1  0  0  0
    # 4  0  1  0  0
    # 5  0  0  1  1
    # 6  0  1  1  0
    

    You could also split and unlist to get distinct values, and then look them up in the original vector:

    unique_v <- unique(unlist(strsplit(Details, ",")))
    map_dfc(unique_v, ~as.numeric(grepl(.x, Details)))
    # # A tibble: 6 x 4
    #      V1    V2    V3    V4
    #   <dbl> <dbl> <dbl> <dbl>
    # 1     1     1     0     0
    # 2     1     0     1     0
    # 3     1     0     0     0
    # 4     0     1     0     0
    # 5     0     0     1     1
    # 6     0     1     1     0
    

    We could do some dirty string evaluation also if you know the number of columns:

    m <- as.data.frame(matrix(0,ncol=4,nrow=6))
    eval(parse(text=paste0("m[",ID,", c(",gsub("V","",Details),")] <- 1")))
    #   V1 V2 V3 V4
    # 1  1  1  0  0
    # 2  1  0  1  0
    # 3  1  0  0  0
    # 4  0  1  0  0
    # 5  0  0  1  1
    # 6  0  1  1  0