python python-2.7 dictionary nltk text-segmentation

Fixing words with spurious spaces using a dictionary lookup in Python?

I have extracted a list of sentences from a document, and I am pre-processing this list to make it more sensible. I am faced with the following problem:

I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "

I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.

The final output should be "more recently the development, which is a potent "

I would assume that this is a straightforward task in text preprocessing. I need some pointers on where to look for such approaches. Thanks.

Solution

Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:

 thequickbrownfoxjumpsoverthelazydog

The most probable segmentation should, of course, be:

 the quick brown fox jumps over the lazy dog

Here's an article that includes prototypical source code for the problem, using the Google Ngram corpus:

http://jeremykun.com/2012/01/15/word-segmentation/

The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:

https://gist.github.com/miku/7279824

Example usage:

$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']

Given the right data, even this can be segmented:

$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']

Note that the algorithm is quite slow; it's only a prototype.
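For a sense of how such a segmenter works, here is a minimal sketch. It is not the code from the article or the gist: it uses a simple unigram model with dynamic programming, and the word counts in `COUNTS` are invented for illustration (a real system would load counts from a large corpus). The names `COUNTS`, `log_prob` and `segment` are my own.

```python
import math

# Toy word counts, made up for this example; real counts would come
# from a large corpus such as the Google Ngram data.
COUNTS = {
    "the": 500, "quick": 20, "brown": 15, "fox": 10,
    "jumps": 8, "over": 60, "lazy": 5, "dog": 25,
}
TOTAL = float(sum(COUNTS.values()))
MAX_WORD_LEN = max(len(w) for w in COUNTS)


def log_prob(word):
    """Log-probability of a word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Unknown words get a penalty that grows with their length.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(text):
    """Return the most probable split of text into words."""
    # best[i] holds (score, words) for the best split of text[:i].
    best = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score, words = best[start]
            word = text[start:end]
            candidates.append((score + log_prob(word), words + [word]))
        best.append(max(candidates, key=lambda c: c[0]))
    return best[-1][1]


print(segment("thequickbrownfoxjumpsoverthelazydog"))
```

Because each prefix is solved only once, the running time is roughly proportional to the text length times the maximum word length, which avoids the exponential blow-up of trying every possible split.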

Another approach using NLTK:

http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/

As for your problem, you could just concatenate all the string parts you have into a single string and then run a segmentation algorithm on it.
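Applied to the sentence from the question, that idea could look roughly like the sketch below. It makes several assumptions: `VOCAB` is a tiny stand-in word list (in practice you would load a real dictionary), the function names `split_known` and `repair` are hypothetical, punctuation is treated as a hard boundary, and text is lowercased before lookup, so the original capitalisation is lost. It uses a plain word list and prefers the split with the fewest words, which is a simpler criterion than the frequency-based one described above.

```python
import re

# Tiny stand-in vocabulary, an assumption for this example; replace it
# with a real word list for actual use.
VOCAB = {"more", "recently", "the", "development", "which", "is", "a", "potent"}
MAX_LEN = max(len(w) for w in VOCAB)


def split_known(text):
    """Split text into vocabulary words; return None if no full split exists."""
    # best[i] is the shortest list of vocabulary words covering text[:i].
    best = [[]] + [None] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - MAX_LEN), end):
            word = text[start:end]
            if best[start] is not None and word in VOCAB:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1]


def repair(sentence):
    """Remove spurious spaces within each punctuation-delimited chunk."""
    out = ""
    # The capturing group keeps the punctuation runs in the result list.
    for part in re.split(r"([^A-Za-z\s]+)", sentence):
        letters = "".join(part.split())
        if not letters:
            continue
        if letters.isalpha():
            words = split_known(letters.lower())
            # Fall back to the original fragments if the vocabulary
            # cannot cover the whole chunk.
            chunk = " ".join(words) if words else " ".join(part.split())
            out += (" " if out else "") + chunk
        else:
            # Simplification: punctuation attaches to the preceding word.
            out += letters
    return out


print(repair("more recen t ly the develop ment, wh ich is a po ten t "))
```

With a real dictionary many chunks will have more than one valid split, which is where word frequencies become necessary to pick the most plausible one.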