Tags: python, python-2.7, dictionary, nltk, text-segmentation

Fixing words with spaces using a dictionary look-up in Python?


I have extracted a list of sentences from a document and am pre-processing it to make it more sensible. I am faced with the following problem:

I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "

I would like to correct such sentences using a look-up dictionary to remove the unwanted spaces.

The final output should be "more recently the development, which is a potent "

I would assume that this is a straightforward task in text preprocessing. I need some pointers to such approaches. Thanks.


Solution

  • Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:

     thequickbrownfoxjumpsoverthelazydog
    

    The most probable segmentation is, of course:

     the quick brown fox jumps over the lazy dog
    

    Here's an article, including prototypical source code, that tackles the problem using the Google Ngram corpus:

    The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here (a minimal sketch of the same idea also appears at the end of this answer):

    Example usage:

    $ python segmentation.py t hequi ckbrownfoxjum ped
    thequickbrownfoxjumped
    ['the', 'quick', 'brown', 'fox', 'jumped']
    

    Using data, even this can be segmented:

    $ python segmentation.py lmaoro fll olwt f pwned
    lmaorofllolwtfpwned
    ['lmao', 'rofl', 'lol', 'wtf', 'pwned']
    

    Note that the algorithm is quite slow; it is only a prototype.

    Another approach uses NLTK (a plain dictionary-based sketch along those lines is also included at the end of this answer):

    As for your problem, you could simply concatenate all the string parts you have into a single string and then run a segmentation algorithm on it, as shown in the usage example below.
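
    Here is a minimal sketch of the frequency-based idea, not the article's exact code. It assumes a unigram count file; the name unigram_counts.txt and its "word count" line format are placeholders, and any word-frequency list (for example one derived from the Google Ngram corpus) would work:

     import math

     def load_counts(path="unigram_counts.txt"):
         """Read one 'word count' pair per line into a dict."""
         counts = {}
         with open(path) as f:
             for line in f:
                 word, count = line.split()
                 counts[word] = int(count)
         return counts

     COUNTS = load_counts()
     TOTAL = float(sum(COUNTS.values()))
     MAX_WORD_LEN = max(len(w) for w in COUNTS)

     def word_logprob(word):
         """Log-probability of one word; unseen words are penalised by length."""
         if word in COUNTS:
             return math.log10(COUNTS[word] / TOTAL)
         return math.log10(10.0 / (TOTAL * 10 ** len(word)))

     def segment(text):
         """Return the most probable word list for `text` (a string without spaces)."""
         # best[i] holds (score, words) for the best segmentation of text[:i].
         best = [(0.0, [])] + [(float("-inf"), [])] * len(text)
         for i in range(1, len(text) + 1):
             for j in range(max(0, i - MAX_WORD_LEN), i):
                 word = text[j:i]
                 score = best[j][0] + word_logprob(word)
                 if score > best[i][0]:
                     best[i] = (score, best[j][1] + [word])
         return best[len(text)][1]

    For your sentence you would first drop the broken-up spaces and then segment the concatenated string; with a reasonably large count list this should recover the intended words:

     # punctuation stripped for simplicity
     sentence = "more recen t ly the develop ment wh ich is a po ten t"
     print(segment("".join(sentence.split())))
     # expected, given adequate frequency data:
     # ['more', 'recently', 'the', 'development', 'which', 'is', 'a', 'potent']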
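
    If you prefer a plain dictionary look-up without frequency data, one possible NLTK-based sketch (not necessarily the approach linked above) builds the lexicon from NLTK's bundled English word list and takes the longest dictionary word at each position:

     import nltk

     nltk.download("words")                 # one-time download of the word list
     from nltk.corpus import words

     LEXICON = set(w.lower() for w in words.words())
     MAX_LEN = max(len(w) for w in LEXICON)

     def segment_greedy(text):
         """Greedily take the longest dictionary word starting at each position."""
         result, i = [], 0
         while i < len(text):
             for j in range(min(len(text), i + MAX_LEN), i, -1):
                 # fall back to a single character if nothing matches
                 if text[i:j].lower() in LEXICON or j == i + 1:
                     result.append(text[i:j])
                     i = j
                     break
         return result

    Greedy longest-match is simple, but it can commit to a long word early and force a bad split later, which is why the frequency-based version above is usually more robust.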