Tags: python, regex, encoding, utf-8, isalpha

Does the isalpha() method in Python identify all non-alpha characters?


I have a file called messages.txt which contains many sentences, one per line. I am attempting to exclude the lines that contain non-alpha characters (I only want those made up of characters from A-Z).

import re

# Read every line and strip trailing whitespace
lines = [line.rstrip() for line in open('messages.txt', encoding='utf-8')]

# Strip the punctuation expected in the sentences
cleaned_lines = [s.replace("!", "").replace(".", "").replace("?", "").replace(",", "") for s in lines]

output_lines = []

# Keep only lines that are purely alphabetic once spaces are removed
for line in cleaned_lines:
    if line.replace(' ', '').isalpha():
        output_lines.append(re.sub(r'\W+', '', line.lower()))

chars = sorted(set(''.join(output_lines)))
print(chars)

Output:

['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'ª', 'â', 'ã', 'å', 'ð', 'ÿ', 'œ', 'š', 'ž', 'ƒ', 'ˆ']

As can be seen, the isalpha() method does not appear to be excluding the

'â', 'ã', 'å', 'ð', 'ÿ'

characters. I have a feeling this may be due to the encoding the file is being read with; however, I would have assumed that the isalpha() method, in conjunction with the \W+ regex, should be able to filter out these characters.

Is this intentional? If so, what methods can be used to remove these strange characters?
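
For reference, here is a quick check of the two calls involved (run with Python 3 defaults) that reproduces what I am seeing:

import re

text = "hållo"
print(text.isalpha())            # True  -- accented letters count as alphabetic
print(re.sub(r'\W+', '', text))  # hållo -- \w is Unicode-aware, so 'å' is not stripped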


Solution

  • Based on my local testing with Python 3, isalpha() returns True for inputs containing accented letters, because str.isalpha() is Unicode-aware and accepts any Unicode letter, not just A-Z. That is why your filter keeps lines containing characters such as 'å':

    inp1 = "Hello"
    inp2 = "Hållo"
    print(inp1.isalpha())  # True
    print(inp2.isalpha())  # True -- 'å' is a Unicode letter
    

    In any case, if you want to filter out any line containing a character other than an ASCII letter or digit, just use re.search in your initial list comprehension. Note that the test is applied to the stripped line (the trailing newline would otherwise match), and that it also rejects lines containing spaces or punctuation:

    lines = [line.rstrip() for line in open('messages.txt', encoding='utf-8') if not re.search(r'[^A-Za-z0-9]', line.rstrip())]
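
    If you want to keep lines containing spaces and the punctuation you strip later in your script, widen the character class accordingly, for example:

    # assumes `import re`, as in the original script
    lines = [line.rstrip() for line in open('messages.txt', encoding='utf-8')
             if not re.search(r'[^A-Za-z0-9\s!.?,]', line)]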