Tags: python, text, unicode, filter, ascii

How can I remove non-ASCII characters but leave periods and spaces?


I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code:

def onlyascii(char):
    if ord(char) < 48 or ord(char) > 127: return ''
    else: return char

def get_my_string(file_path):
    f=open(file_path,'r')
    data=f.read()
    f.close()
    filtered_data=filter(onlyascii, data)
    filtered_data = filtered_data.lower()
    return filtered_data

How should I modify onlyascii() to leave spaces and periods? I imagine it's not too complicated but I can't figure it out.


Solution

  • You can remove every non-printable character from the string by filtering against string.printable. In a Python 2 session, where filter on a str returns a str:

    >>> s = "some\x00string. with\x15 funny characters"
    >>> import string
    >>> printable = set(string.printable)
    >>> filter(lambda x: x in printable, s)
    'somestring. with funny characters'
    

    string.printable on my machine contains:

    0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
    !"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c
    

    EDIT: On Python 3, filter returns a lazy iterator rather than a string. To get a string back, join the result:

    ''.join(filter(lambda x: x in printable, s))
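    For reference, the original onlyascii() drops spaces and periods because ord(' ') is 32 and ord('.') is 46, both below its lower bound of 48. Below is a minimal Python 3 sketch of the whole read-and-filter step built on the string.printable approach. The helper name only_printable_ascii and the utf-8 / errors='replace' settings are assumptions for illustration, not part of the original code; adjust the encoding to match your file.

    ```python
    import string

    # Printable ASCII: digits, letters, punctuation (including '.') and
    # whitespace (including ' ').
    PRINTABLE = set(string.printable)


    def only_printable_ascii(text):
        """Return text with every character outside string.printable removed."""
        return ''.join(ch for ch in text if ch in PRINTABLE)


    def get_my_string(file_path):
        """Read a text file and return its lowercased, printable-ASCII contents."""
        # Assumption: the file is UTF-8. Undecodable bytes become U+FFFD,
        # which is not printable ASCII and is therefore filtered out as well.
        with open(file_path, 'r', encoding='utf-8', errors='replace') as f:
            data = f.read()
        return only_printable_ascii(data).lower()
    ```

    Note that string.printable also keeps tabs, newlines and other ASCII punctuation. If you want only letters, digits, spaces and periods, a narrower set such as set(string.ascii_letters + string.digits + ' .') could be used in place of PRINTABLE.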