I am trying to analyze tourism data which looks like this:
@DATA
2013-1-01,01,1,0,1,3,3,329.2172000000005
2013-1-01,01,1,0,1,3,4,1399.7826299999915
2013-1-01,01,1,1,2,3,2,10.50964
where the last attribute is the number of travellers who fulfilled all the other conditions (hotel, specific city, specific number of nights, ...).
I am trying to cluster the tourists to segment the data and get meaningful insights, and I am fairly new to machine learning, so I am struggling a bit here. Since I don't know how many clusters the data should be split into, I did some research and saw that one common approach is to use self-organizing maps (SOM) to estimate the number of clusters and then run something like k-means or EM. So I applied WEKA's SOM to the data, but it seems to form clusters using all attributes, including the last one, instead of using the last one as an instance weight.
One possible solution I thought of is to duplicate each row once per unit of the frequency attribute, but that would make the file far too big. Any ideas?
Most implementations do not support instance weighting. It would be possible to add it, but you would need to modify the code.
Also, since your last column isn't an integer, you can't simply repeat rows anyway.
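As an aside: while WEKA's SOM does not expose instance weights, some other libraries do, so you would not necessarily have to modify code. For illustration only, here is a minimal sketch using scikit-learn's KMeans, which accepts a per-row `sample_weight`; the data values are hypothetical stand-ins for your rows, with the last column pulled out as the weight rather than used as a feature:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy rows: the categorical/ordinal attributes only.
X = np.array([
    [1, 0, 1, 3, 3],
    [1, 0, 1, 3, 4],
    [1, 1, 2, 3, 2],
])
# The fractional traveller counts (the last column of the file)
# become per-row weights instead of a clustering feature.
weights = np.array([329.2172, 1399.78263, 10.50964])

# KMeans weights each row's contribution to the centroids by
# sample_weight, so rows with many travellers dominate.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X, sample_weight=weights)
print(labels)
```

Note this sidesteps the weighting problem but not the categorical-attribute problem discussed below; k-means on category codes is still questionable.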
What is wrong with considering each row to be a cluster?
But your other attributes appear to be categorical, and such data tends to cluster badly: two rows can differ in 1 attribute, in 2 attributes, or in all of them, and that distance scale is too coarse for meaningful clustering.
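To make the coarseness concrete, here is a small sketch (with made-up category values) showing that the natural distance between purely categorical rows is just "how many attributes differ" (Hamming distance), which can only take a handful of distinct values:

```python
import numpy as np

# Hypothetical rows of purely categorical attributes.
rows = np.array([
    [1, 0, 1, 3],
    [1, 0, 1, 4],
    [2, 1, 2, 2],
])

def hamming(a, b):
    # Count the attributes on which the two rows disagree.
    return int(np.sum(a != b))

print(hamming(rows[0], rows[1]))  # 1: rows differ in one attribute
print(hamming(rows[0], rows[2]))  # 4: rows differ in every attribute
```

With only 5 or so possible distance values, almost every pair of rows is "equally far apart", so distance-based clusterers have little to work with.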
You also have a timestamp, so you probably are interested in change over time?