nlp stemming lemmatization

Looking for a database or text file of English words with their different forms


I am working on a project where I need to find the root of a given word (stemming). As you know, stemming algorithms that do not use a dictionary are not accurate. I also tried WordNet, but it does not suit my project. I found the phpmorphy project, but it does not provide a Java API.

At the moment I am looking for a database or a text file of English words with their different forms, for example:

run running ran ...
include including included ...
...

Thank you for your help or advice.


Solution

  • You could download LanguageTool (Disclaimer: I'm the maintainer), which comes with a binary file english.dict. The LanguageTool Wiki describes how to dump that file as a text file:

    java -jar morfologik-tools-1.6.0-standalone.jar fsa_dump -x -d english.dict
    

    For run, the file will contain this:

    ran run VBD
    run run NN
    run run VB
    run run VBN
    run run VBP
    running run VBG
    runs run NNS
    runs run VBZ
    

    The first column is the inflected form, the second is the base form, and the third is the part-of-speech tag according to the (slightly extended) Penn Treebank tagset.
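Since the question asks for something usable from Java, here is a minimal sketch of how the dumped text file could be loaded into a lookup table from inflected form to base form. It is not part of LanguageTool or Morfologik: the class name `LemmaLookup`, its methods, and the file path passed on the command line are all hypothetical. It also assumes that the three columns are separated by whitespace, as in the listing above, and that the file is UTF-8 encoded; adjust the split pattern or charset if your dump differs.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of LanguageTool or Morfologik.
final class LemmaLookup {

    // Builds a map from each inflected form to the set of its base forms.
    // Assumption: every line has the shape "inflected base tag",
    // with the columns separated by whitespace.
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> baseForms = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3) {
                continue; // skip blank or malformed lines
            }
            // A form can appear several times with different tags,
            // so collect the base forms in a set to avoid duplicates.
            baseForms.computeIfAbsent(cols[0], k -> new LinkedHashSet<>()).add(cols[1]);
        }
        return baseForms;
    }

    // Returns the known base forms of a word, or an empty set if it is not listed.
    static Set<String> lookup(Map<String, Set<String>> baseForms, String word) {
        return baseForms.getOrDefault(word, Collections.emptySet());
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the dumped text file (assumed to be UTF-8)
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        Map<String, Set<String>> baseForms = parse(lines);
        System.out.println(lookup(baseForms, "running"));
    }
}
```

The map is keyed by inflected form because that matches the stemming use case (word in, root out). To get the grouping shown in the question (run, running, ran, ...), the same loop could instead key the map on the second column and collect the first.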