Tags: python, pandas, python-re

Extract part of a text and split into two columns


I am trying to extract part of each of the following sentences (all my rows follow a similar pattern):

Text
19 hours ago — Catch up on key developments an...
8 hour ago — Catch up on key developments an...
10 minutes ago — Catch up on key developments an...
1 day ago — Catch up on key developments an...

I would like to split the Text column into two columns, one holding the part before the — and one holding the part after it:

Text1          Text2
19 hours ago   Catch up on key developments an...
8 hour ago     Catch up on key developments an...
10 minutes ago Catch up on key developments an...
1 day ago      Catch up on key developments an...

I did this:

df[['Text1', 'Text2']] = df['Text'].str.extract(r"(\d+ \w+, \d{5})?\s*\—?\s*(.*)", expand=True)

However, it does not seem to work. If you have experience with re, could you please point out the mistake and suggest a solution? Thanks


Solution

  • You can use the pandas.Series.str.split function:

    df[['Text1', 'Text2']] = df['Text'].str.split(' — ', n=1, expand=True)
    

    Passing n=1 limits the operation to a single split, so any later ' — ' in the text stays in the second part. Setting expand=True returns the parts as separate DataFrame columns instead of a single column of lists.
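As for the mistake in the question: the first group, (\d+ \w+, \d{5}), requires a comma followed by a five-digit number, which never occurs in strings like "19 hours ago". Because the group is optional it matches nothing, and (.*) then captures the whole string. The sketch below rebuilds the sample rows from the question and shows both the split approach and one possible corrected pattern. The pattern and the variable names (pattern, extracted) are illustrative assumptions, and the pattern assumes every row has the form "<number> <unit> ago".

```python
import pandas as pd

# Sample rows copied from the question
df = pd.DataFrame({
    "Text": [
        "19 hours ago — Catch up on key developments an...",
        "8 hour ago — Catch up on key developments an...",
        "10 minutes ago — Catch up on key developments an...",
        "1 day ago — Catch up on key developments an...",
    ]
})

# Approach 1: split once on the separator and assign both parts
df[["Text1", "Text2"]] = df["Text"].str.split(" — ", n=1, expand=True)

# Approach 2 (assumed pattern): capture "<number> <unit> ago", skip the
# separator, then capture the rest of the line
pattern = r"^(\d+ \w+ ago)\s*—\s*(.*)$"
extracted = df["Text"].str.extract(pattern, expand=True)

print(df[["Text1", "Text2"]])
print(extracted)
```

If some rows might not follow the "<number> <unit> ago" form, the split approach is likely the safer choice, since rows that do not match the pattern come back from str.extract as NaN.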