nlp nltk stanford-nlp spacy gensim

Extract meaningful words from spaceless texts


I have not done much NLP, but I need to solve the following problem. Given a string such as 'australiafreedomrally', I need to automatically extract the meaningful words, i.e. 'australia', 'freedom' and 'rally'.

Is there a Python package that can do this? Thanks.


Solution

  • Check out this thread, which among other things mentions a package that does this. In general, an approach based on a predefined list of common words can get you far. Your question also overlaps with the task of Optical Character Recognition (OCR) post-correction, for which you can find some pretrained models. However, because your problem is concentrated on a single kind of error (missing whitespace), such models will probably not perform very well on it.

    If you really want to get into this topic, you could try to train a new model for this task. I can imagine that recent popular transformer models, which use subtoken-level embeddings for unknown words, could be trained to reach decent performance, since there are models that go in a similar direction, such as grammar correction and sentence boundary correction. There are also some older papers with rule-based approaches that call this problem "word boundary detection" or, more specifically, "agglutination"; see e.g. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6351975/. In general, though, there are few off-the-shelf solutions for this problem.
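
To illustrate the word-list approach mentioned above, here is a minimal sketch in plain Python. It is not the package from the linked thread. The tiny `VOCAB` set and the `segment` helper are made up for this example, and "prefer the split with the fewest words" is just one simple heuristic; a real solution would use a large vocabulary, ideally with word frequencies.

```python
from functools import lru_cache

# Tiny illustrative vocabulary (an assumption for this sketch); a real one
# would contain tens of thousands of words, ideally with frequencies.
VOCAB = {"australia", "freedom", "rally", "free", "dom", "a", "us", "all"}


def segment(text, vocab=VOCAB):
    """Split text into the fewest vocabulary words; return None if impossible."""
    max_len = max(map(len, vocab))

    @lru_cache(maxsize=None)
    def best(start):
        # Best split of text[start:], as a tuple of words, or None.
        if start == len(text):
            return ()
        candidates = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in vocab:
                rest = best(end)
                if rest is not None:
                    candidates.append((word,) + rest)
        # Fewest words wins, so "freedom" beats "free" + "dom".
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return list(result) if result is not None else None


print(segment("australiafreedomrally"))  # ['australia', 'freedom', 'rally']
```

The recursion tries every vocabulary word that matches at the current position and memoizes the best split of the remaining text, so each position is solved only once. Strings that cannot be fully covered by the vocabulary return `None`, which is where a frequency-based or learned model would be needed.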