I am a beginner in the field of text mining and need to work on document similarity. My aim is to compare two documents and report the similarity between them as a number. I have read a lot of theory about this, and I am planning to start with cosine similarity.
Can any of you help me with these basic questions?
1. What platform should I use (Windows or Linux)?
2. What tool should I use? People talk about Weka, Mahout and Hadoop, but I have no idea which to pick.
3. What language should I use?
Some of these questions might sound absurd, but I have to start from scratch and I need some help.
For software, I highly recommend RapidMiner, which you can grab from http://rapid-i.com. Some quick pros:
In my experience, data mining requires real discipline to achieve good results. RapidMiner should help with that.
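
If you want to see what the cosine similarity from the question actually computes before committing to a tool, here is a minimal sketch in plain Python (standard library only). It is not how RapidMiner or any other tool implements it. The function names `tokenize` and `cosine_similarity` are just illustrative, and it assumes raw word counts as the document vectors, with no stop-word removal, stemming or TF-IDF weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between the raw term-count vectors of two texts."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))

    # Only terms present in both documents contribute to the dot product.
    shared_terms = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared_terms)

    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # Assumption: an empty document is treated as having similarity 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity("the cat sat on the mat", "the cat lay on the rug"))
```

The result is a number between 0 (no words in common) and 1 (identical word distribution), which is the single similarity score you asked for. On real documents you would normally want some weighting such as TF-IDF, so that very common words do not dominate the score.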