
How to delete some strings with non-English letters?


I am new to regular expressions and Python. For example, my string list is:

my_try = ['Aas','1Aasdf','cc)','ASD','.ASD','aaaa1','A']

Now I want to remove every string that contains anything other than English letters, so that only these remain:

['Aas','ASD','A']

I do not know how to use ^ or anything else in a regex to do this.

And, if my data is:

my_try=pd.DataFrame({'try':
                         ['Aas','1Aasdf','cc)','A2SD','.ASD',
                          'aaaa1','A','123%']})

Then I use:

[x for x in my_try if re.match(r'^[a-zA-Z]+$', x['try'])]

Why do I get this error:

Traceback (most recent call last):
  File "C:\feng\myCode\infoExtract\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3319, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-58-4bd95f31bd0c>", line 1, in <module>
    [x for x in my_try if re.match(r'^[a-zA-Z]+$', x['try'])]
  File "<ipython-input-58-4bd95f31bd0c>", line 1, in <listcomp>
    [x for x in my_try if re.match(r'^[a-zA-Z]+$', x['try'])]
TypeError: string indices must be integers

How can I fix this and why does it happen?


Solution

  • You have a list and want to filter it so that it only contains elements matching some condition. A list comprehension with an if clause is perfect for that:

    my_list = [1, 2, 3, 4, 5, 6]
    # just even numbers:
    print([x for x in my_list if x % 2 == 0])
    

    You want to keep only the strings that consist entirely of the letters 'a' through 'z' and 'A' through 'Z', which is where a regex is easy to use:

    import re
    my_try = ['Aas','1Aasdf','cc)','ASD','.ASD','aaaa1','A']
    print([x for x in my_try if re.match(r'^[a-zA-Z]+$', x)])
    

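    The regex starts with ^ and ends in $ so that the entire string has to match, from start to end. re.match() already anchors the match at the start of the string, so the ^ is redundant here but harmless; the $ is what stops a string like 'aaaa1' from being accepted on the strength of its valid prefix. [a-zA-Z] defines a character class containing the letters you're after. Often you'd use \w, but that also matches digits and the underscore. Finally, the + means there must be one or more of those characters in the string (as opposed to zero or more with *), so an empty string is rejected.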
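
The TypeError in the question is not caused by the regex. Iterating over a DataFrame yields its column labels, so x is the string 'try', and x['try'] then tries to index a string with a string. Below is a minimal sketch of two possible fixes, assuming pandas is installed. The names column_labels, kept_list, mask and kept_frame are illustrative and do not come from the original post.

```python
import re

import pandas as pd

# Same data as in the question.
my_try = pd.DataFrame({'try': ['Aas', '1Aasdf', 'cc)', 'A2SD', '.ASD',
                               'aaaa1', 'A', '123%']})

# Iterating over a DataFrame yields its column labels, not its rows.
# This is why x['try'] fails: x is the string 'try'.
column_labels = [x for x in my_try]
print(column_labels)  # ['try']

# Option 1: iterate over the values of the column instead.
pattern = re.compile(r'^[a-zA-Z]+$')
kept_list = [s for s in my_try['try'] if pattern.match(s)]
print(kept_list)  # ['Aas', 'A']

# Option 2: build a boolean mask with the vectorised string method
# and use it to select the matching rows.
mask = my_try['try'].str.match(r'^[a-zA-Z]+$')
kept_frame = my_try[mask]
print(kept_frame)
```

Option 1 returns a plain list, while option 2 returns a DataFrame that keeps the original row labels (0 and 6 for this data). If a fresh 0-based index is preferred, kept_frame.reset_index(drop=True) would provide one.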