Search code examples
rdataframeapplydata-extraction

Convert data frame of values into binary data frame where each unique value is a column


I'm reading in data from a csv file where each row contains some number of individual strings:

e.g.

data.csv ->
    x,f,t,h,b,g
    d,g,h
    g,h,a,s,d
    f
    q,w,e,r,t,y,u,i,o

data <- read.csv("data.csv", header = FALSE)

I want to transform this input into a data frame where the columns are the set of unique strings present in the input. In this case, the columns would be the set of strings {x,f,t,h,b,g,d,a,s,q,w,e,r,y,u,i,o}. Additionally, the new data frame should contain a row for each row in the input data frame such that a column will have the value 1 if the column's name was present in that row in the input data frame, or 0 if the column's name was not present in that input row.

In this example, the desired output would be the following:

    x f t h b g d a s q w e r y u i o
   ----------------------------------
1 | 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
2 | 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 | 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
4 | 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5 | 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1

The code below is what I currently have. However, the output df ends up being a data frame with what appear to be the correct columns, but 0 rows.

I'm very inexperienced at R, and this was my attempt at putting together something that works. It seems to work as expected up until the call to apply(), which unexpectedly doesn't add anything to df.

data <- read.csv("data.csv", header = FALSE)

columnNames = c()
for (row in data) {
  for (eventName in row) {
    if (!(eventName %in% columnNames)) {
      columnNames = c(columnNames, eventName)
    }
  }
}
columnNames = t(columnNames)

df = data.frame(columnNames)
colnames(df) = columnNames
df = df[-1,]

apply(data, 1, function(row, df) {
  dat = data.frame(columnNames)
  colnames(dat) = columnNames
  dat = dat[-1,]

  for (eventName in row) {
    if (eventName != "") {
      dat[1,eventName] = 1
    }
  }

  df = rbind(df, dat)
}, df)

After the script finishes it tells me there were many warnings of the following two forms:

9: In `[<-.factor`(`*tmp*`, iseq, value = 1) : invalid factor level, NA generated
10: In `[<-.factor`(`*tmp*`, iseq, value = 1) :
  invalid factor level, NA generated

Solution

  • We can use mtabulate after splitting the column by ,

    library(qdapTools)
    mtabulate(strsplit(as.character(df1[,1]), ","))
    

    Or with base R methods by splitting the column by ,, set the names of the list output as the sequence of rows, convert the list to data.frame (stack), change the 'values' column to factor with levels specified and then get the frequency with table.

    table(transform(stack(setNames(strsplit(as.character(df1[,1]), ","), 1:nrow(df1)))[2:1], 
               values = factor(values, levels = unique(values))))
    # 
    #    x f t h b g d a s q w e r y u i o
    #  1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
    #  2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
    #  3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
    #  4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #  5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
    

    Update

    If this is not a single column,

    mtabulate(apply(df2, 1, FUN = function(x) x[x!=""]))
    

    Or

    as.data.frame.matrix(table(transform(stack(setNames(apply(df2, 1, 
         FUN = function(x) x[x!=""]), 
               1:nrow(df2)))[2:1],  values = factor(values, levels = unique(values)))))
    #    
    #    x f t h b g d a s q w e r y u i o
    #  1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
    #  2 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0
    #  3 0 0 0 1 0 1 1 1 1 0 0 0 0 0 0 0 0
    #  4 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    #  5 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1
    

    data

    df1 <- structure(list(V1 = c("x,f,t,h,b,g", "d,g,h", "g,h,a,s,d", "f", 
    "q,w,e,r,t,y,u,i,o")), .Names = "V1", class = "data.frame", 
    row.names = c(NA, -5L))
    
    df2 <- structure(list(v1 = c("x", "d", "g", "f", "q"), v2 = c("f", "g", 
    "h", "", "w"), v3 = c("t", "h", "a", "", "e"), v4 = c("h", "", 
    "s", "", "r"), v5 = c("b", "", "d", "", "t"), v6 = c("g", "", 
    "", "", "y"), v7 = c("", "", "", "", "u"), v8 = c("", "", "", 
    "", "i"), v9 = c("", "", "", "", "o")), .Names = c("v1", "v2", 
    "v3", "v4", "v5", "v6", "v7", "v8", "v9"), row.names = c(NA, 
    -5L), class = "data.frame")