Tags: pandas, python-re

re.search is not returning complete match


I have a dataframe with multiple columns of text. In one of the text columns I search a long string for a substring using wildcards and put the result in a new column. I do get a result, but the match does not appear to contain the complete string.

For example, I have this dataframe:

import pandas as pd
import re

df = pd.DataFrame(
    {
        'Col1': ['Some data 1', 'Some data 2', 'Some data 3'],
        'Col2': ['More data 1', 'More data 2', 'More data 3'],
        'Text': ['ID:12345;Description: This is a long piece of text containting some information I want to search for;Status=off;',
                    'Description: This is a another long piece of text containting some different information I want to search for;ID:abcde;Status=on;',
                    'Status=unknown;ID:abcde;Description: And this is a third piece of long piece of text which I want to search for;']
    }
)

What I tried is this:

df['Description'] = df['Text'].apply(lambda x: re.search('Description: (.*?);',x))

This results in the following dataframe:

Col1    Col2    Text    Description
0   Some data 1 More data 1 ID:12345;Description: This is a long piece of text containting some information I want to search for;Status=off;    <re.Match object; span=(9, 101), match='Description: This is a long piece of text contain>
1   Some data 2 More data 2 Description: This is a another long piece of text containting some different information I want to search for;ID:abcde;Status=on;   <re.Match object; span=(0, 110), match='Description: This is a another long piece of text>
2   Some data 3 More data 3 Status=unknown;ID:abcde;Description: And this is a third piece of long piece of text which I want to search for;    <re.Match object; span=(24, 112), match='Description: And this is a third piece of long pi>

It seems like the re.Match object is cutting off the result, since I am expecting a longer match. Can anybody explain what I am doing wrong?


Solution

  • You're not doing anything wrong.

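    You are seeing the "repr" of the match object, which shortens the matched text for display (in CPython, to about 50 characters; note the missing closing quote). The full match is still stored in the object.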

    text = 'ID:12345;Description: This is a long piece of text containting some information I want to search for;Status=off;'
    
    print(re.search(r'Description: (.*?);', text))
    
    # <re.Match object; span=(9, 101), match='Description: This is a long piece of text contain>
    

    If you access the actual matched data, e.g. with .group(1), you can see it's all there:

    print(re.search(r'Description: (.*?);', text).group(1))
    
    # This is a long piece of text containting some information I want to search for
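
Applied to the apply call from the question, that could look something like the sketch below. The shortened sample rows, the third row without a description, and the helper name extract_description are assumptions made for illustration and are not part of the question. The guard is there because re.search returns None when nothing matches, and calling .group(1) on None would raise an AttributeError.

```python
import re

import pandas as pd

# Two shortened rows modeled on the question, plus one hypothetical row
# without a "Description:" field to exercise the no-match case.
df = pd.DataFrame({
    'Text': [
        'ID:12345;Description: This is a long piece of text;Status=off;',
        'Status=unknown;ID:abcde;Description: A third piece of text;',
        'ID:99999;Status=on;',
    ]
})

pattern = re.compile(r'Description: (.*?);')


def extract_description(text):
    """Return the captured description, or None when the pattern is absent."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Store the captured string itself, not the re.Match object.
df['Description'] = df['Text'].apply(extract_description)
print(df['Description'].tolist())
```

The new column then holds plain strings, so nothing about the match object's repr is involved any more.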
    

    Similarly with pandas (note: pandas provides regex methods under the .str accessor):

    df['Text'].str.extract(r'Description: (.*?);')
    
    #                                                    0
    # 0  This is a long piece of text containting some ...
    # 1  This is a another long piece of text containti...
    # 2  And this is a third piece of long piece of tex...
    
    df['Text'].str.extract(r'Description: (.*?);').iloc[0]
    
    # 0    This is a long piece of text containting some ...
    # Name: 0, dtype: object
    

    The full string is actually there; only the display is shortened:

    df['Text'].str.extract(r'Description: (.*?);').iloc[0].item()
    
    # 'This is a long piece of text containting some information I want to search for'
    

    To see everything at once, it can help to output the result with .to_csv() or .to_json(), e.g.

    print(df['Text'].str.extract(r'Description: (.*?);').to_json(indent=4))
    
    {
        "0":{
            "0":"This is a long piece of text containting some information I want to search for",
            "1":"This is a another long piece of text containting some different information I want to search for",
            "2":"And this is a third piece of long piece of text which I want to search for"
        }
    }
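
If the goal is the new column from the question, one possible approach is to assign the result of .str.extract directly and lift the pandas display width limit only while printing. This is a sketch under assumptions: expand=False is used so that a single capture group yields a Series, the column name Description is taken from the question, and the two-row dataframe is a shortened copy of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    'Text': [
        'ID:12345;Description: This is a long piece of text containting some information I want to search for;Status=off;',
        'Status=unknown;ID:abcde;Description: And this is a third piece of long piece of text which I want to search for;',
    ]
})

# With a single capture group, expand=False returns a Series,
# which can be assigned straight to a new column.
df['Description'] = df['Text'].str.extract(r'Description: (.*?);', expand=False)

# Lift the column-width limit only inside this block, so the rendered
# text contains the full strings instead of ending them with "...".
with pd.option_context('display.max_colwidth', None):
    full_display = df['Description'].to_string()

print(full_display)
```

Using option_context keeps the change local; the default truncation applies again once the block exits.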