
Capturing all items in basket (R/arulesSequences)


I am having an issue with the arulesSequences package in R. I was able to read the baskets in and create a data.frame, but only the first item column is recognized; every item beyond it is dropped. Below is a sample of my data set, which follows the format demonstrated here: Data Mining Algorithms in R/Sequence Mining/SPADE.

    [sequenceID] [eventID] [SIZE] items
    2 1 1 OB/Gyn
    15 1 1 Internal_Medicine
    15 2 1 Internal_Medicine
    15 3 1 Internal_Medicine
    56 1 2 Internal_Medicine Neurology
    84 1 1 Oncology
    151 1 2 Hematology Hematology
    151 2 1 Hematology/Oncology
    151 3 1 Hematology/Oncology
    185 1 2 Gastroenterology Gastroenterology

The dataset was exported from SAS as a [.CSV] file, then converted to a tab-delimited [.TXT] file in Excel. Headers were removed for import into R; I have placed them in brackets above only for clarity. All spaces were replaced with underscores ("_"), and item names were simplified as much as possible. Each item is listed in a separate column. The following command was used to import the file:

    baskets <- read_baskets(con = "...filepath/spade.txt", sep = "[ \t]+", info = c("sequenceID", "eventID", "SIZE"))

I am presented with no errors, so I continue with the following command:

    as(baskets, "data.frame")

This returns the data.frame as requested, but it fails to capture the items beyond the first column:

    items sequenceID eventID SIZE
    {OB/Gyn} 2 1 1
    {Internal_Medicine} 15 1 1
    {Internal_Medicine} 15 2 1
    {Internal_Medicine} 15 3 1
    {Internal_Medicine} 56 1 2
    {Oncology} 84 1 1
    {Hematology} 151 1 2
    {Hematology/Oncology} 151 2 1
    {Hematology/Oncology} 151 3 1
    {Gastroenterology} 185 1 2

Line 5 should look like:

    {Internal_Medicine, Neurology} 56 1 2

I have also tried importing the file directly as a [.CSV]. The resulting data.frame looks much like the tab-delimited attempt above, except that a comma appears in front of the first item:

    {,Internal_Medicine} 56 1 2

Any troubleshooting suggestions would be greatly appreciated. It seems like this package is picky when it comes to formatting.


Solution

  • Line 5 should look like:

    {Internal_Medicine, Neurology} 56 1 2
    

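    With the package versions shown below, the sample data from your question parses as expected. See row 5 of the output, marked with an arrow: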

    library(arulesSequences)
    packageVersion("arulesSequences")
    # [1] ‘0.2.16’
    packageVersion("arules")
    # [1] ‘1.5.0’
    txt <- readLines(n=10)
    2 1 1 OB/Gyn
    15 1 1 Internal_Medicine
    15 2 1 Internal_Medicine
    15 3 1 Internal_Medicine
    56 1 2 Internal_Medicine Neurology
    84 1 1 Oncology
    151 1 2 Hematology Hematology
    151 2 1 Hematology/Oncology
    151 3 1 Hematology/Oncology
    185 1 2 Gastroenterology Gastroenterology
    tf <- tempfile()
    writeLines(txt, tf)
    baskets <- read_baskets(con = tf, sep = "[ \t]+", info = c("sequenceID", "eventID", "SIZE"))
    as(baskets, "data.frame")
    #                            items sequenceID eventID SIZE
    # 1                       {OB/Gyn}          2       1    1
    # 2            {Internal_Medicine}         15       1    1
    # 3            {Internal_Medicine}         15       2    1
    # 4            {Internal_Medicine}         15       3    1
    # 5  {Internal_Medicine,Neurology}         56       1    2 # <----------
    # 6                     {Oncology}         84       1    1
    # 7                   {Hematology}        151       1    2
    # 8          {Hematology/Oncology}        151       2    1
    # 9          {Hematology/Oncology}        151       3    1
    # 10            {Gastroenterology}        185       1    2
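
If the same call still drops items on your own file, it may help to look at how each line splits before it ever reaches read_baskets. The following is a minimal base-R sketch, not part of arulesSequences: `check_baskets()` is a hypothetical helper name, and the sample lines are copied from the question. It assumes the same `sep` pattern as above and three leading info fields, and it compares the declared SIZE with the number of item fields actually found on each line.

```r
# Sample lines copied from the question. For a real file, replace this
# vector with readLines() on your own path.
lines <- c(
  "2 1 1 OB/Gyn",
  "56 1 2 Internal_Medicine Neurology",
  "151 1 2 Hematology Hematology",
  "151 2 1 Hematology/Oncology"
)

# check_baskets() is a hypothetical helper, not an arulesSequences function.
# It splits each line on `sep`, treats the first `n_info` fields as
# sequenceID / eventID / SIZE, and counts the remaining item fields.
check_baskets <- function(lines, sep = "[ \t]+", n_info = 3L) {
  fields <- strsplit(trimws(lines), sep)
  items  <- lapply(fields, function(f) f[-seq_len(n_info)])
  data.frame(
    declared = as.integer(vapply(fields, `[`, character(1), n_info)),  # SIZE column
    found    = lengths(items),                                         # item fields on the line
    distinct = vapply(items, function(i) length(unique(i)), integer(1))
  )
}

report <- check_baskets(lines)
print(report)
```

Rows where `found` differs from `declared` would suggest that the separator pattern is not splitting the file the way you expect, for example because of a delimiter left over from the Excel conversion. Rows where `distinct` is smaller than `found` contain repeated items; these collapse into a single item in the basket, which is why {Hematology} appears once in the output above even though its SIZE is 2.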