I am using the following functions to strip out non-ASCII characters:
def removeNonAscii(s):
    return "".join(filter(lambda x: ord(x)<128, s))

def removeNonAscii1(s):
    return "".join(i for i in s if ord(i)<128)
I would now like to remove the entire word if it contains any non-ASCII characters. I thought of comparing the length of each word before and after applying the function, but I am confident that there is a more efficient way. Any ideas?
If you define a word as a run of characters separated by whitespace, something like this might work:
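def containsNonAscii(s):
    # True if any character is outside the 7-bit ASCII range (code point > 127)
    return any(ord(i)>127 for i in s)

# `sentence` is the input string
words = sentence.split()
cleaned_words = [word for word in words if not containsNonAscii(word)]
cleaned_sentence = ' '.join(cleaned_words)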
Note that this will collapse repeated whitespace into just one space.
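If the original spacing matters, one possible workaround is to split with a capturing group so the whitespace runs are kept and can be joined back unchanged. This is only a sketch: the names contains_non_ascii and remove_non_ascii_words are made up for illustration, and it assumes Python 3 with ASCII whitespace between words.

```python
import re

def contains_non_ascii(s):
    # True if any character is outside the 7-bit ASCII range.
    return any(ord(ch) > 127 for ch in s)

def remove_non_ascii_words(sentence):
    # The capturing group makes re.split return the whitespace runs too,
    # so the pieces alternate between words and separators.
    parts = re.split(r"(\s+)", sentence)
    # Drop only the pieces that contain a non-ASCII character;
    # ASCII whitespace pieces pass through untouched.
    return "".join(p for p in parts if not contains_non_ascii(p))

# "na\u00efve" is removed; the spaces around it are kept as they were.
print(repr(remove_non_ascii_words("na\u00efve  plain text")))  # '  plain text'
```

Note that the whitespace on either side of a removed word stays in place, so you may end up with longer runs of spaces than before. On Python 3.7 or later, the built-in str.isascii() could replace the helper (not word.isascii()).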