I want to cluster text. I kinda understand the concept of clustering text-only content from Mahout in Action:
I want to cluster text as well as other information like date-time, location, people I was with. For example, I want documents made in a 10-day visit to a distant place to be placed into a distinct cluster.
I know I must write my own tool for making vectors from date-time, location, tags and (natural) text. How do I approach this? Should I use built-in tools to vectorize text and then integrate that output to my own vectors? What about weighing the dimensions?
I cant give you full implementation details, as im not sure, but i can help you out with a piece of the puzzle. You will almost certainly need some context analysis to extract entities (such as location, time/date, person names)
For this take a look at OpenNLP.
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html
in particular look at POS tagger, and namefinder.
Once you have extracted out the relevant entities, - you 'may' be able to do something with them using Mahout classification, (once you have extracted enough entities to train your model), but this i am not sure.
good luck