Tags: python, data-science, data-cleaning

How to remove international alphanumeric characters in Python?


I have opened my dataset as follows in Python.

with open("page_titles.txt", encoding="utf8") as fg:
    all_concepts = []
    for line in fg:
        # Drop the trailing newline so each entry holds only the title.
        all_concepts.append(line.rstrip("\n"))

However, my titles contain some international alphanumeric characters, such as Ռեթէոս_Պէրպէրեան, 丘, (جامعة_جورجتاون_(قطر, and (കേരള_നിയമസഭ).

I only want to keep titles that are in English.

I tried the following. However, it does not solve my problem, because it reports the titles mentioned above as valid.

def remove_non_ascii(text):
    non_ascii = 0
    ascii_letter = 0
    for c in text:
        if 0 <= ord(c) <= 127:
            # This is an ASCII character.
            ascii_letter = ascii_letter + 1
        else:
            # This is a non-ASCII character. Do something.
            non_ascii = non_ascii + 1

    if len(text)==non_ascii:
        print("invalid")
    else:
        print("valid")

Please help me.


Solution

  • Your code currently excludes only strings that consist entirely of non-ASCII characters. However, all the example strings you've shown contain the underscore character, which is an ASCII character, and so makes the name valid according to your current code.

    If that's not the result you want, you need to change how your code works. For instance, you could reject any string with any non-ASCII characters (rather than only those that are all non-ASCII). Just change if len(text) == non_ascii to if non_ascii > 0.

    But I'd caution you that excluding all strings with non-ASCII characters may be a bad idea. Lots of English-language words (such as café) and names (such as Zoë) contain non-ASCII characters (at least in some spellings). It may be a better idea to support non-ASCII titles in your program, and fix whatever other issues they cause in other places (e.g. by properly encoding your inputs and outputs). If the non-ASCII titles are undesirable for other reasons (e.g. they describe things that are not in English) then you should filter them out on that other criterion (e.g. the language of the contents) rather than on the kinds of letters in the title.
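The two options discussed above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it. `is_ascii_title` is the strict check (equivalent to testing `non_ascii > 0` in the question's loop) and relies on `str.isascii()`, which requires Python 3.7 or later. `is_latin_title` is a hypothetical, more lenient variant that is not part of the original answer: it assumes that "Latin-script letters only" is an acceptable stand-in for "probably English", which is still only a heuristic and says nothing about the actual language of the title. The function names and the sample `titles` list are made up for the example.

```python
import unicodedata


def is_ascii_title(title: str) -> bool:
    """Return True only if every character in the title is ASCII."""
    # Strict check: a single non-ASCII character rejects the title.
    return title.isascii()


def is_latin_title(title: str) -> bool:
    """Return True if every letter in the title is a Latin-script letter.

    Hypothetical lenient check: accented letters such as those in
    "café" or "Zoë" pass, while Armenian, CJK, Arabic, or Malayalam
    letters fail. Non-letters (underscores, digits, brackets) are ignored.
    """
    for ch in title:
        # unicodedata.name() returns e.g. "LATIN SMALL LETTER E WITH ACUTE";
        # the empty-string default covers characters without a name.
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            return False
    return True


# Sample data for illustration only.
titles = ["Georgetown_University", "Zoë", "Ռեթէոս_Պէրպէրեան", "丘"]

ascii_only = [t for t in titles if is_ascii_title(t)]
latin_only = [t for t in titles if is_latin_title(t)]
```

With the sample list, the strict filter keeps only `Georgetown_University`, while the lenient one also keeps `Zoë`. Neither filter detects language; a title written in French or Turkish would pass the Latin-script check just as easily.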