Tags: python, python-3.x, split, space

How can I fix words and remove the unwanted spaces inside split words using Python?


I have extracted a list of sentences from a document, and I am pre-processing these sentences to make them more sensible. I have run into the following problem.

I have sentences such as: Java is a prog ramming lan guage. C is a gen eral purpose la nguage.

I would like to correct such sentences by using a lookup dictionary to remove the unwanted spaces.

The final output should be: Java is a programming language. C is a general purpose language.

I would appreciate some pointers to approaches for this. How can I solve the above problem?

I want to solve it in Python. Thanks.


Solution

  • Here's a simple script that works for your example. You'd obviously want a much bigger corpus of valid words. You'd probably also want an elif branch that looks back at the previous word when joining with the next word fails to fix a non-word.

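    from string import punctuation

    # Toy vocabulary; in practice, load a much larger word list.
    word_list = "big list of words including a programming language is general purpose"
    valid_words = set(word_list.split())

    bad = "Java is a prog ramming lan guage. C is a gen eral purpose la nguage."
    words = bad.split()

    out_words = []
    i = 0
    while i < len(words):
        word = words[i]
        # Unknown token: try gluing it to the token that follows.
        if word not in valid_words and i+1 < len(words):
            next_word = words[i+1]
            joined = word + next_word
            # Ignore leading/trailing punctuation when checking the vocabulary.
            if joined.strip(punctuation) in valid_words:
                word = joined
                i += 1  # skip the token that was just consumed
        out_words.append(word)
        i += 1

    good = " ".join(out_words)
    print(good)  # Java is a programming language. C is a general purpose language.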
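
    The look-back branch mentioned above could be sketched roughly as follows. This is only an illustration, not a tested general solution: the names rejoin_fragments and is_word are hypothetical, the word set is a toy one, and it assumes that a fragment which cannot be joined forward should be tried against the word already emitted before it.

```python
from string import punctuation


def is_word(token, valid_words):
    # Hypothetical helper: vocabulary check that ignores
    # leading/trailing punctuation such as a sentence-final period.
    return token.strip(punctuation) in valid_words


def rejoin_fragments(text, valid_words):
    # Hypothetical function: greedily repair words split by stray spaces.
    words = text.split()
    out_words = []
    i = 0
    while i < len(words):
        word = words[i]
        if not is_word(word, valid_words):
            if i + 1 < len(words) and is_word(word + words[i + 1], valid_words):
                # Forward join: glue this fragment to the next token.
                word = word + words[i + 1]
                i += 1
            elif out_words and is_word(out_words[-1] + word, valid_words):
                # Look back: glue this fragment to the previously emitted token.
                word = out_words.pop() + word
        out_words.append(word)
        i += 1
    return " ".join(out_words)


# Toy vocabulary (an assumption for the demo). "gene" is itself a valid word,
# so the forward join never fires for it and the look-back branch is needed.
valid_words = set("a is gene general purpose programming language".split())

print(rejoin_fragments("C is a gene ral purpose la nguage.", valid_words))
# C is a general purpose language.
```

    The look-back matters when the first half of a split word happens to be a valid word on its own ("gene" + "ral"), since the forward join is only attempted for tokens that are not already in the vocabulary. A greedy pass like this can still mis-join fragments, so a larger word list and some manual checking of the output would be sensible.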