Search code examples
bigdataregressionrapidminerknimeelki

Predicting a numeric attribute through high dimensional nominal attributes


I'm having difficulties mining a big (100K entries) dataset of mine concerning logistics transportation. I have around 10 nominal String attributes (i.e. city/region/country names, customers/vessel identification codes, etc.). Along with those, I have one date attribute "departure" and one ratio-scaled numeric attribute "goal".

What I'm trying to do is using a training set to find out which attributes have strong correlations with "goal" and then validating these patterns by predicting the "goal" value of entries in a test set.

I assume clustering, classification and neural networks could be useful for this problem, so I used RapidMiner, Knime and elki and tried to apply some of their tools on my data. However, most of these tools only handle numeric data, so I got no useful results.

Is it possible to transform my nominal attributes into numeric ones? Or do I need to find different algorithms that can actually handle nominal data?


Solution

  • you most likely want to use tree based algorithm. These are good to use nominal features. Please be aware, that you do not want to use "id-like" attributes.

    I would recommend RapidMiner's AutoModel feature as a start. GBT and RandomForest should work well.

    Best, Martin