python regex pandas data-munging

Regular Expressions not working with Pandas Dataframe


I have a pandas DataFrame made up of emails that I need to clean using regex. However, my attempts to clean the column aren't actually being applied to the text.

Example data is below:

|subject          | description       |
---------------------------------------
|change email     | 'Hi, I'm trying...|
|how are you?     | 'Hi, how are...   |

The actual dataset has about 2500 rows.

The example code that I'm using is:

import re
import pandas as pd

data = pd.read_csv('file.csv', names=['subject', 'description'])
data['description'] = data['description'].str.lower().str.split()

# Text cleaning below:
data['description'] = data['description'].replace(r'<(.*?)\>', '')
data['description'] = data['description'].replace(r'www[a-z]+', '')
... # more regex

I'm running this code in an IPython notebook on Python 2.7. I would expect each regex to find its matches in the text and replace them with an empty string.

However, when running it, the text of the description does not change.

An alternative method I've tried with the same result is as follows:

for i in data['description']:
    re.sub(r'<(.*?)\>', '', i)
    re.sub(r'www[a-z]+', '', i)

Again, none of the text is removed.

Could you please advise or point me in the right direction?


Solution

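  • Series.replace without regex=True only swaps cells whose entire value equals the given string, and re.sub returns a new string that the loop throws away, so neither attempt changes the column. For regex substitution inside each string, use the .str accessor:

    data['description'] = data['description'].str.replace(r'www[a-z]+', '')

  • Drop the .str.split() call before cleaning. It turns each description into a list of words, and the string methods no longer apply to list values.

  • In pandas 2.0 and later, .str.replace treats the pattern as a literal string by default, so pass regex=True there.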
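For reference, here is a minimal end-to-end sketch of the .str.replace approach. The two sample rows are made up to stand in for file.csv, and the `cleaned` variable is only illustrative. The tag pattern is written as `<.*?>`, which should match the same text as the `<(.*?)\>` used in the question. `regex=True` is passed explicitly so the snippet does not depend on the pandas version's default.

```python
import pandas as pd

# Hypothetical sample rows standing in for file.csv (not the real data).
data = pd.DataFrame({
    "subject": ["change email", "how are you?"],
    "description": [
        "Hi, I'm trying <b>to</b> change my email, see wwwexample for details",
        "Hi, how are <i>you</i>?",
    ],
})

# Lowercase the text but keep each cell a plain string (no .str.split()),
# so the .str methods can still operate on it.
cleaned = data["description"].str.lower()

# Remove anything that looks like an HTML tag, then tokens starting with "www".
cleaned = cleaned.str.replace(r"<.*?>", "", regex=True)
cleaned = cleaned.str.replace(r"www[a-z]+", "", regex=True)

# Assign the result back; .str.replace returns a new Series.
data["description"] = cleaned

print(data["description"].tolist())
```

If the text is needed as a list of words afterwards, calling .str.split() as the last step, after all the replacements, keeps the regex steps working on strings.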