I have extracted the list of sentences from a document. I am pre-processing this list of sentences to make it more sensible. I am faced with the following problem
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a look up dictionary? to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straight forward task in preprocessing text? I need help with some pointers to look for such approaches. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
As for your problem, you could just concatenate all string parts you have to get a single string and the run a segmentation algorithm on it.