I am currently developing an automated index generator for PDF files in Java. The concept is pretty simple (right now): I iterate over every word in the PDF, compare it against an ignore list (something like the 10,000 most common words in that language), and then add it to a com.google.common.collect.HashMultimap, with the word as a String key and the set of pages the word occurs on as its values.
This works pretty well, but I still get words in all their different declension/conjugation forms as separate items in the index. I was thinking of just comparing a sub-string of those words, but in German (the language the program is intended for), with all its irregularities, this approach brings very little benefit.
Any other ideas, libraries, regexes, whatsoever? Thanks in advance.
The process of reducing words to their common root is called lemmatization. A lemmatizer will map words like eaten, eats and ate to eat.
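To make the idea concrete, here is a minimal sketch of a lookup-based lemmatizer. The class name ToyLemmatizer and its three-entry table are made up for illustration; a real lemmatizer relies on a full lexicon and usually on part-of-speech information, so treat this only as a picture of the mapping, not as how any particular library works.

```java
import java.util.Locale;
import java.util.Map;

// Toy lemmatizer: a hand-written lookup table stands in for a real lexicon.
class ToyLemmatizer {
    // Hypothetical mini-lexicon (inflected form -> lemma); a real one is far larger.
    private static final Map<String, String> LEMMAS = Map.of(
            "eaten", "eat",
            "eats", "eat",
            "ate", "eat");

    // Returns the lemma if the word is known, otherwise the lowercased word itself.
    static String lemma(String word) {
        String key = word.toLowerCase(Locale.ROOT);
        return LEMMAS.getOrDefault(key, key);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"eaten", "eats", "ate", "eat"}) {
            System.out.println(w + " -> " + lemma(w)); // every line ends in "eat"
        }
    }
}
```

Because irregular forms such as ate are simply entries in the table, this kind of mapping copes with irregularities that a sub-string comparison would miss.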
I'm not experienced with German, but several libraries that perform this task are available for English, for example Stanford CoreNLP, a full-fledged NLP library that provides many other features as well. It may support German too, but I'm not sure.
Otherwise, a Google search for "German lemmatizer" will provide enough results, I think.
You can also use a stemmer, which is a simpler alternative to a lemmatizer. A stemmer is usually a rule-based component that reduces words to a common root, but the output will not always be a valid word: for example, the word engine might be stemmed as engin. If you require that the words are still valid after this operation, lemmatization will be the better solution; otherwise stemming might be better because it is much faster to execute.
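As a rough illustration of the stemming route, the sketch below strips a few English suffixes and uses the resulting stem as the index key. Everything in it is an assumption made for the example: the class name StemmedIndex, the suffix list, and the minimum stem length are invented, and a plain TreeMap of sorted sets stands in for the Guava HashMultimap from the question. A real stemmer (the Snowball family, for instance, includes a German one) uses far more careful rules.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy index keyed by word stem instead of by surface form.
class StemmedIndex {
    // Hypothetical suffix rules, tried in order; not a real stemming algorithm.
    private static final String[] SUFFIXES = {"ing", "ed", "es", "s", "e"};
    private static final int MIN_STEM_LENGTH = 3;

    // stem -> sorted set of page numbers (stdlib stand-in for a multimap)
    private final Map<String, SortedSet<Integer>> pagesByStem = new TreeMap<>();

    // Strips the first matching suffix, as long as enough of the word remains.
    static String stem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= MIN_STEM_LENGTH) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Records that the word occurs on the given page, under its stem.
    void add(String word, int page) {
        pagesByStem.computeIfAbsent(stem(word), k -> new TreeSet<>()).add(page);
    }

    // Returns the pages recorded for any form sharing the word's stem.
    SortedSet<Integer> pages(String word) {
        return pagesByStem.getOrDefault(stem(word), new TreeSet<>());
    }

    public static void main(String[] args) {
        StemmedIndex index = new StemmedIndex();
        index.add("engine", 3);
        index.add("engines", 7);
        index.add("Engine", 12);
        // prints: engin [3, 7, 12]
        System.out.println(stem("engine") + " " + index.pages("engine"));
    }
}
```

Note that the key engin is not a real word, so for a printed index you would still want to display one of the original forms. The rules also leave irregular forms such as ate untouched, which is where a lemmatizer does better.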