R (arules) Convert dataframe into transactions and remove NA


I have a data frame. My goal is to convert it into transaction data so that I can do market basket analysis with the arules package in R. I did some research online on converting a data frame to transaction data, e.g. (How to prep transaction data into basket for arules) and (Transform csv into transactions for arules), but the result I got was different.

dput(df)

structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
Other = c(NA, NA, NA, NA, "Promo", NA)), 
.Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

Below is my dataframe structure

Transaction_ID  Fruits  Vegetables  Personal  Drink  Other
      A001        NA        NA       ToothP   Coff    NA
      A002       Apple      NA       ToothP    NA     NA
      A003      Orange      NA         NA     Coff    NA
      A004        NA      Potato     ToothB   Milk    NA
      A005       Pear       NA       ToothB   Milk   Promo
      A006      Grape      Yam         NA     Coff    NA

class for each column

sapply(df, class)
Transaction_ID         Fruits     Vegetables       Personal          Drink          Other 
"character"    "character"    "character"    "character"    "character"    "character"

Convert dataframe to transaction data

data <- as(split(df[,"Fruits"], df[,"Vegetables"],df[,"Personal"], df[,"Drink"], df[,"Other"]), "transactions")
inspect(data)

Results I got

[1] {NA,NA,ToothP,Coff,NA}
[2] {Apple,NA,ToothP,NA,NA}
[3] {Orange,NA,NA,Coff,NA}
[4] {NA,Potato,ToothB,Milk,NA}
[5] {Pear,NA,ToothB,Milk,Promo}
[6] {Grape,Yam,NA,Coff,NA}

The data frame was converted to transaction data, but is there any way to remove the NA items? If they remain in the transaction list, each NA will be counted as an item.


Solution

  • Ogustari is right. Here is the complete code that also handles the transaction IDs.

    library("arules")
    library("dplyr")  ### for tbl_df
    df <- structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
      Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
      Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
      Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
      Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
      Other = c(NA, NA, NA, NA, "Promo", NA)), 
      .Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
      class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
    
    ### remove transaction IDs
    tid <- as.character(df[["Transaction_ID"]])
    df <- df[,-1]
    
    ### make all columns factors
    for (i in seq_len(ncol(df))) df[[i]] <- as.factor(df[[i]])
    
    trans <- as(df, "transactions")
    
    ### set transactionIDs
    transactionInfo(trans)[["transactionID"]] <- tid
    
    inspect(trans)
    
        items                                                transactionID
    [1] {Personal=ToothP,Drink=Coff}                         A001
    [2] {Fruits=Apple,Personal=ToothP}                       A002
    [3] {Fruits=Orange,Drink=Coff}                           A003
    [4] {Vegetables=Potato,Personal=ToothB,Drink=Milk}       A004
    [5] {Fruits=Pear,Personal=ToothB,Drink=Milk,Other=Promo} A005
    [6] {Fruits=Grape,Vegetables=Yam,Drink=Coff}             A006
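
If the items should appear without the `Column=` prefix, as in the listing in the question, one alternative is to build a list with one vector of non-NA items per row and coerce that list instead. The following is only a minimal sketch under that assumption; `item_cols` and `baskets` are illustrative names that do not come from the answer above, and the data is rebuilt as a plain data.frame so the block runs on its own.

```r
library(arules)

# Same data as in the question, as a plain data.frame
df <- data.frame(
  Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"),
  Fruits     = c(NA, "Apple", "Orange", NA, "Pear", "Grape"),
  Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"),
  Personal   = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA),
  Drink      = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"),
  Other      = c(NA, NA, NA, NA, "Promo", NA),
  stringsAsFactors = FALSE
)

# All columns except the ID hold items
item_cols <- setdiff(names(df), "Transaction_ID")

# One character vector per row, with the NA entries dropped
baskets <- lapply(seq_len(nrow(df)), function(i) {
  row_items <- unlist(df[i, item_cols], use.names = FALSE)
  row_items[!is.na(row_items)]
})

# The list names are carried over as transaction IDs
names(baskets) <- df$Transaction_ID

trans <- as(baskets, "transactions")
inspect(trans)
```

Note that this drops the column information, so the same value occurring in two different columns would be treated as one item. If that distinction matters, the factor-based conversion above is the safer choice.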