I am new to regular expressions and Python. For example, my string list is:
my_try = ['Aas','1Aasdf','cc)','ASD','.ASD','aaaa1','A']
Now, I want to remove every string that contains anything other than English letters, so that I keep only:
['Aas','ASD','A']
I do not know how to use ^, or anything else, to do this.
And, if my data is:
my_try=pd.DataFrame({'try':
['Aas','1Aasdf','cc)','A2SD','.ASD',
'aaaa1','A','123%']})
Then I use:
[x for x in my_try if re.match(r'^[a-zA-Z]+$', x['try'])]
Why do I get this error?
Traceback (most recent call last):
  File "C:\feng\myCode\infoExtract\venv\lib\site-packages\IPython\core\interactiveshell.py", line 3319, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-58-4bd95f31bd0c>", line 1, in <module>
    [x for x in my_try if re.match(r'^[a-zA-Z]+$', x['try'])]
  File "<ipython-input-58-4bd95f31bd0c>", line 1, in <listcomp>
    [x for x in my_try if re.match(r'^[a-zA-Z]+$', x['try'])]
TypeError: string indices must be integers
How can I fix this and why does it happen?
You have a list and want to keep only the elements that match some condition. A list comprehension with an if clause is perfect for that:
my_list = [1, 2, 3, 4, 5, 6]
# just even numbers:
print([x for x in my_list if x % 2 == 0])
And you want to filter for anything that consists of only letters 'a' through 'z' and 'A' through 'Z', which is where a regex is easy to use:
import re
my_try = ['Aas','1Aasdf','cc)','ASD','.ASD','aaaa1','A']
print([x for x in my_try if re.match(r'^[a-zA-Z]+$', x)])
The regex starts with ^ and ends in $ to tell re.match() that it should match the entire string, from start to end. [a-zA-Z] defines a character class containing the letters you're after. Often you'd use \w, but that also includes digits and the underscore. And finally, the + means there needs to be 1 or more of the characters in the string (as opposed to 0 or more if you use *).
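As for the DataFrame version: iterating over a DataFrame yields its column labels, not its rows. So x in your comprehension is the string 'try', and x['try'] tries to index a string with another string, which raises the TypeError. Below is a minimal sketch of one way around it, assuming pandas is installed and reusing the data from the question (the names labels, mask and letters_only are only illustrative). It builds a boolean mask from the column with .str.match() and uses that mask to select rows:

```python
import pandas as pd

my_try = pd.DataFrame({'try': ['Aas', '1Aasdf', 'cc)', 'A2SD', '.ASD',
                               'aaaa1', 'A', '123%']})

# Iterating over a DataFrame gives the column labels, which is why
# x was the string 'try' in the failing comprehension.
labels = [x for x in my_try]
print(labels)  # ['try']

# Apply the pattern to every value in the column; the result is a
# boolean Series that is True where the whole value is letters only.
mask = my_try['try'].str.match(r'^[a-zA-Z]+$')

# Keep only the rows where the mask is True.
letters_only = my_try[mask]
print(letters_only['try'].tolist())  # ['Aas', 'A']
```

.str.match() applies re.match to each value, so the same pattern carries over unchanged from the list version.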