python, python-unicode

Removing words that contain non-ASCII characters using Python


I am using either of the following functions to strip out non-ASCII characters:

def removeNonAscii(s): 
    return "".join(filter(lambda x: ord(x)<128, s))

def removeNonAscii1(s): 
    return "".join(i for i in s if ord(i)<128)

I would now like to remove the entire word if it contains any non-ASCII characters. I thought of comparing the word's length before and after applying the function, but I am confident that there is a more efficient way. Any ideas?


Solution

  • If you define a word as a run of characters delimited by whitespace, something like this might work:

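    def containsNonAscii(s):
        # True if any character lies outside the 7-bit ASCII range (0-127)
        return any(ord(i) > 127 for i in s)

    # sentence is the input string to clean
    words = sentence.split()
    cleaned_words = [word for word in words if not containsNonAscii(word)]
    cleaned_sentence = ' '.join(cleaned_words)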
    

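    Note that this will collapse every run of whitespace (including tabs and newlines) into a single space and drop any leading or trailing whitespace, because split() with no arguments splits on any whitespace and discards empty strings.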
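
If the original spacing matters, one possible variation is to substitute in place rather than split and rejoin. The following is only a sketch under that assumption: the names contains_non_ascii and remove_non_ascii_words are hypothetical, and it treats a "word" as any run of non-whitespace characters (the regex \S+).

```python
import re

def contains_non_ascii(word):
    # True if any character lies outside the 7-bit ASCII range (0-127).
    return any(ord(ch) > 127 for ch in word)

def remove_non_ascii_words(sentence):
    # Replace each run of non-whitespace characters that contains a
    # non-ASCII character with an empty string. Whitespace is never
    # matched by \S+, so it is left exactly as it was.
    def drop_if_non_ascii(match):
        word = match.group()
        return "" if contains_non_ascii(word) else word
    return re.sub(r"\S+", drop_if_non_ascii, sentence)

print(repr(remove_non_ascii_words("plain  café\ttext")))
# prints 'plain  \ttext'
```

The whitespace that surrounded a removed word stays behind, so the result can contain adjacent separators where a word used to be. Whether that is preferable to the collapsed output of the split/join version depends on what the cleaned text is used for.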