hadoop, weka, similarity, mahout, text-mining

What platform / tool / software / language should I use for text mining?


I am a beginner in the field of text mining. I need to work on document similarity: I aim to compare two documents and express the similarity between them as a number. I have read a lot of theory about this, and I am planning to start with cosine similarity.

Can any of you help me with these basic questions:

1. What platform? (Windows/Linux)
2. What tool? (People talk about Weka / Mahout / Hadoop; I have no idea what to use.)
3. What language?

Some questions might sound absurd, but I have to start from scratch and I need some help.


Solution

  • For software, I highly recommend RapidMiner, which you can grab from http://rapid-i.com. Some quick pros:

    • Open source and implemented in Java (works on any platform)
    • Intuitive graphical "operator pipeline" for hundreds of data mining tasks
    • Excellent text mining support. See this video tutorial

    In my experience, data mining requires some real discipline to achieve desirable results. RapidMiner should help.
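
Since the question mentions starting with cosine similarity, it may help to see what that number actually is before picking a tool. Below is a minimal sketch in plain Python, not how RapidMiner or any other tool does it internally. It assumes raw term counts as the document vectors and a simple regex tokenizer; both are illustrative choices, and the function names are made up for this example. Real pipelines usually add stop-word removal, stemming, and TF-IDF weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens.

    Assumption: this crude regex tokenizer stands in for real preprocessing.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(doc_a, doc_b):
    """Return the cosine of the angle between two raw term-count vectors."""
    tf_a = Counter(tokenize(doc_a))
    tf_b = Counter(tokenize(doc_b))
    # Dot product: only terms present in both documents contribute.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a = "the cat sat on the mat"
    b = "the cat lay on the rug"
    print(round(cosine_similarity(a, b), 3))  # prints 0.75
```

With term counts the result is always between 0 (no shared words) and 1 (same word proportions), so it gives exactly the single similarity number the question asks for. This runs unchanged on Windows or Linux.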