r, nlp, text-mining, knime

Text mining - feature with a lot of spelling problems and variants


I'd like to make sense of the feature "color". The problem is that it has more than 15,000 distinct values with a lot of spelling problems (e.g. brwon <-> brown, oliv <-> olive), as well as finer-grained variants (lightblue <-> blue).

How can I make sense of such a feature? Are there any resources, R packages or Python modules for this?


Solution

  • R can use aspell (an aspell() function is available), but you need to install aspell on your machine, and maybe hunspell as well. Hunspell is the spell checker used in Chrome, Firefox and RStudio, for example.

    Read this journal for more information about aspell and hunspell in R.

    But this only takes care of spelling errors. To handle variants such as lightblue, you could use a regex to look for the main colors.
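
Since the question also asks about Python, here is a minimal sketch of the two ideas above (spelling correction plus a regex for main colors). It is not the aspell/hunspell approach: it swaps in fuzzy string matching from Python's standard difflib module against a small list of known colors. The BASE_COLORS list, the normalize_color helper and the 0.75 cutoff are all assumptions made for illustration and would need tuning on the real data.

```python
import re
from difflib import get_close_matches

# Hypothetical vocabulary of main colors; extend it to fit your own data.
BASE_COLORS = ["black", "white", "grey", "red", "green", "blue",
               "brown", "olive", "yellow", "orange", "pink", "purple"]

# Regex that finds a main color inside a longer value such as "lightblue".
# Longer names come first so they win over shorter ones at the same position.
BASE_COLOR_RE = re.compile("|".join(sorted(BASE_COLORS, key=len, reverse=True)))


def normalize_color(raw, cutoff=0.75):
    """Map a raw color string to a main color, or None if nothing matches."""
    value = raw.strip().lower()

    # Step 1: the regex catches variants, e.g. "lightblue" contains "blue".
    match = BASE_COLOR_RE.search(value)
    if match:
        return match.group(0)

    # Step 2: fuzzy matching catches misspellings, e.g. "brwon" is close
    # to "brown". The cutoff (an assumed value) controls how strict this is.
    close = get_close_matches(value, BASE_COLORS, n=1, cutoff=cutoff)
    return close[0] if close else None


if __name__ == "__main__":
    for raw in ["brwon", "oliv", "lightblue", "xyz"]:
        print(raw, "->", normalize_color(raw))
```

Values that come back as None are the ones worth reviewing by hand. Note that a plain substring regex can misfire (for example, "multicoloured" contains "red"), so it is worth checking the most frequent mappings before trusting the result.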