I have a dataset with 2.6 million rows and one column called msgText, which contains written messages.
Now, I want to filter out all messages that don't contain any letters. To do so I found the following code:
dataset = dataset[dataset['msgText'].astype(str).str.contains('[A-Za-z]')]
However, after 16 hours the code is still running.
Furthermore, based on "Does Python have a string 'contains' substring method?", I thought about creating a list of the 26 letters of the alphabet and then checking whether each cell contains any of them. But that does not seem efficient either.
Therefore, I am wondering if there is a faster way to find whether a cell contains letters.
EDIT: The code above works pretty well. Apparently, what I had in my (slow) code assigned the filtered DataFrame back to a single column instead of to dataset: dataset['msgText'] = dataset[dataset['msgText'].astype(str).str.contains('[A-Za-z]')]
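For reference, here is a minimal sketch of the working filter on a tiny frame. The sample messages and the names mask and filtered are made up for illustration; only msgText and the pattern come from the question.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 2.6-million-row dataset.
dataset = pd.DataFrame({'msgText': ['hello', '12345', '???', 'ok 42', ':-)']})

# Boolean mask: True where the message contains at least one ASCII letter.
mask = dataset['msgText'].astype(str).str.contains('[A-Za-z]', regex=True)

# Index the whole DataFrame with the mask and keep the result as a DataFrame;
# do not assign it back to a single column.
filtered = dataset[mask]
print(filtered['msgText'].tolist())
```

The mask is computed once for the whole column, and the row selection happens in a single indexing step.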
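import re
import pandas
# str.find() looks for the literal text '\w', not a regex, and "> 0" misses a match at index 0; use re.search() instead
dataset['columnName'].astype(str).apply(lambda x: re.search('[A-Za-z]', x) is not None)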