I have extracted a list of sentences from a document, and I am pre-processing this list to clean it up. I am faced with the following problem:
I have sentences such as: Java is a prog ramming lan guage. C is a gen eral purpose la nguage.
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be: Java is a programming language. C is a general purpose language.
I would appreciate some pointers to approaches for this kind of problem. How can I solve it?
I want to solve it using Python code. Thanks.
Here's a simple script that works for your example. Obviously you'd want a bigger corpus of valid words. Also, you'd probably want to have an elif
branch that looks back at the previous word if joining with the next word fails to fix a non-word.
from string import punctuation

# Tiny stand-in for a real dictionary of valid words.
word_list = "big list of words including a programming language is general purpose"
valid_words = set(word_list.split())

bad = "Java is a prog ramming lan guage. C is a gen eral purpose la nguage."
words = bad.split()
out_words = []
i = 0
while i < len(words):
    word = words[i]
    # Only try to repair tokens that are not already valid words.
    if word not in valid_words and i+1 < len(words):
        next_word = words[i+1]
        joined = word + next_word
        # Ignore trailing punctuation such as "." when checking the dictionary.
        if joined.strip(punctuation) in valid_words:
            word = joined
            i += 1  # skip the fragment that was just merged
    out_words.append(word)
    i += 1

good = " ".join(out_words)
print(good)
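
The look-back branch mentioned above could be sketched roughly as follows. This is only an illustration under a few assumptions: the names `is_word` and `merge_split_words` are made up for this example, the dictionary lookup is made case-insensitive (which the script above does not do), and the small word set again stands in for a real word list.

```python
from string import punctuation


def is_word(token, valid_words):
    """True if token, ignoring case and surrounding punctuation, is a known word."""
    return token.strip(punctuation).lower() in valid_words


def merge_split_words(text, valid_words):
    """Rejoin fragments of words that were split by stray spaces."""
    words = text.split()
    out_words = []
    i = 0
    while i < len(words):
        word = words[i]
        if not is_word(word, valid_words):
            if i + 1 < len(words) and is_word(word + words[i + 1], valid_words):
                # Look ahead: this fragment plus the next token is a word.
                word += words[i + 1]
                i += 1
            elif out_words and is_word(out_words[-1] + word, valid_words):
                # Look back: the previous token plus this fragment is a word.
                word = out_words.pop() + word
        out_words.append(word)
        i += 1
    return " ".join(out_words)


# Hypothetical word set; "gene" is included so that look-ahead alone is not enough.
valid_words = set("a programming language is general purpose gene".split())
print(merge_split_words("Java is a prog ramming lan guage.", valid_words))
print(merge_split_words("C is a gene ral purpose la nguage.", valid_words))
```

In the second sentence, "gene" is itself a valid word, so it is kept at first; "ral" then fails the look-ahead check and is merged with the previous word instead. A greedy pass like this can still merge wrongly when a fragment happens to combine into a different valid word, so the results are worth spot-checking.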