Search code examples
rtidyodk

Separating untidy data in R - ODK cleanup


To tidy part of a dataset I need to separate one column into several. These data takes the form similar to this:

set.seed(2133)
df <- data.frame(a = paste(sample(1:9, 5, replace=T), sample(c("", "%2", "%3"), 5, replace=T),  sample(c("", "%3", "%4"), 5, replace=T), sep="")) 
df
      a
1     6
2 2%3%4
3   6%2
4   3%2
5 5%2%4

Tidyr's separate function wont do the job and the best idea I have is a series of ifelse statements, like this:

df$One <- ifelse(grepl("1", df$a) == T, 1, 0)
df$Two <- ifelse(grepl("2", df$a) == T, 1, 0)
          a One Two
1     6   0   0
2 2%3%4   0   1
3   6%2   0   1
4   3%2   0   1
5 5%2%4   0   1

What's the best way to go about such tidying. I'm sure many people working with Open Data Kit (ODK) for data collection will come across this.


Solution

  • We can loop the patterns (i.e. 1, 2) with sapply, get a logical matrix, coerce to binary by wrapping with + and assign it as new columns in 'df'

    df[c("One", "Two")] <- +(sapply(1:2, grepl, df$a))
    df
    #     a One Two
    #1     6   0   0
    #2 2%3%4   0   1
    #3   6%2   0   1
    #4   3%2   0   1
    #5 5%2%4   0   1
    

    If this is to get the binary output for each unique value in 'a' column, we can split the strings in 'a' by the delimiter (%), the output list can be converted to data.frame using stack, transform the 'values' column to factor class by specifying the levels and do the table

    table(transform(stack(setNames(strsplit(as.character(df$a), "[%]"),
           1:nrow(df))), values = factor(values, levels= 1:6))[2:1])
    

    Or we can use a convenient function mtabulate from qdapTools after splitting.

    library(qdapTools)
    mtabulate(strsplit(as.character(df$a), "[%]"))