r, csv, apriori, arules

Multiple line cells from csv to R transaction matrix


I've got the following problem (which is actually two problems): I have a CSV file with transactions, but all the items bought under one TransactionID are stored on multiple lines within a single cell.

It looks like this:

TransactionID    Items

1234             Milk
                 Butter
                 Bread

2345             Milk
                 Bread

3456             Beer
                 Milk

4567             Beer
                 Butter

As you can see, not every item appears in every transaction.

How can I import my data into R as a transaction matrix that looks like this:

TransactionID    Milk    Butter    Bread    Beer
1234             1       1         1        0
2345             1       0         1        0
3456             1       0         0        1
4567             0       1         0        1

Can it be done in a single, elegant step? After the import I want to analyze my data using the arules package.

Thanks in advance!


Solution

  • This is not a one-liner, and it assumes that the items within a cell are separated by a blank space. I find the unique items first, then fill in one 0/1 column per item with a double loop.

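    # Assumes df$Items is a character column with items separated by a single space
    items <- strsplit(as.character(df$Items), ' ', fixed = TRUE)
    u <- unique(unlist(items))
    for (i in seq_len(nrow(df))) {
      for (j in u) {
        # 1 if item j occurs in transaction i, otherwise 0
        df[i, j] <- as.integer(j %in% items[[i]])
      }
    }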
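
    Since the question describes cells that span several lines rather than space-separated words, here is a rough base R sketch of the same idea that splits on line breaks instead. It assumes the multi-line cells are quoted in the CSV so that read.csv() keeps each one as a single string. The inline csv_text and the names item_list and trans_matrix are made up for illustration and are not from the original post.

```r
# Hypothetical sample standing in for the real CSV file: each quoted cell
# contains line breaks, which read.csv() keeps as "\n" inside one string.
csv_text <- 'TransactionID,Items
1234,"Milk
Butter
Bread"
2345,"Milk
Bread"
3456,"Beer
Milk"
4567,"Beer
Butter"'

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Split every cell on line breaks (\n or \r\n) and trim stray whitespace.
item_list <- lapply(strsplit(df$Items, "\r?\n"), trimws)
names(item_list) <- df$TransactionID

# All distinct items, used as the columns of the matrix.
items <- sort(unique(unlist(item_list)))

# One row per transaction, one 0/1 column per item.
trans_matrix <- t(vapply(item_list,
                         function(x) as.integer(items %in% x),
                         integer(length(items))))
colnames(trans_matrix) <- items

print(trans_matrix)
```

    If the goal is only to get the data into arules, the 0/1 matrix may not be needed at all: with the package loaded, a named list such as item_list can usually be coerced directly with as(item_list, "transactions").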