csv data-mining orange

How to create feature columns based on values across different columns


Hi, I am trying to do one-hot encoding in Orange in order to conduct market basket analysis.

Currently I have transaction data as follows in my CSV:

C#   Items
C1   Apple       Orange
C2   Baby Milk   Apple    Orange

What steps can I take, in Orange or in other software, to transform the data into this form?

C#   Apple   Orange   Baby Milk
C1   1       1        0
C2   1       1        1

Currently, when I preprocess the data in Orange using "Continuize Discrete Variables" with "One feature per value", I get a separate column for each individual feature value instead of one column per item.



Solution

  • It is not entirely straightforward, but you could concatenate the products in each row with a comma or semicolon, pass the result to Corpus, tokenize it with a Regexp that splits on your separator character (comma or semicolon), and then use Bag of Words from the Text add-on. I have tried the result with the Associate add-on, and it seems to work.
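Since the question also allows other software, the same three steps (concatenate, tokenize on the separator, bag of words) can be sketched outside Orange in plain Python. This is only a minimal illustration, not what the Orange widgets do internally. The column names `Item1` to `Item3`, the sample CSV text, and the function name `one_hot_baskets` are assumptions made up for the example; the real file may be laid out differently.

```python
import csv
import io
import re

# Hypothetical sample mirroring the question: one basket per row,
# with the items spread across several columns (column names are assumed).
RAW = """C#,Item1,Item2,Item3
C1,Apple,Orange,
C2,Baby Milk,Apple,Orange
"""


def one_hot_baskets(text, sep=";"):
    """Return (vocabulary, table), where table maps each customer to 0/1 flags."""
    rows = list(csv.reader(io.StringIO(text)))
    baskets = {}
    for row in rows[1:]:  # skip the header row
        if not row:
            continue
        customer, items = row[0], row[1:]
        # Step 1: concatenate the non-empty items with the separator.
        joined = sep.join(item.strip() for item in items if item.strip())
        # Step 2: tokenize by splitting on the separator with a regex,
        # so multi-word items such as "Baby Milk" stay in one piece.
        baskets[customer] = [t for t in re.split(re.escape(sep), joined) if t]

    # Step 3: bag of words, reduced to presence flags.
    # The vocabulary keeps the order in which items are first seen.
    vocabulary = []
    for tokens in baskets.values():
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    table = {
        customer: [1 if item in tokens else 0 for item in vocabulary]
        for customer, tokens in baskets.items()
    }
    return vocabulary, table


if __name__ == "__main__":
    vocabulary, table = one_hot_baskets(RAW)
    print("C#", *vocabulary, sep="\t")
    for customer, flags in table.items():
        print(customer, *flags, sep="\t")
```

Splitting on an explicit separator rather than on whitespace is what keeps "Baby Milk" as a single item, which is the same reason the answer suggests a comma or semicolon before tokenizing in Corpus.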