I have a dataframe with several columns of text. In one of those columns I search each long string for a substring using a wildcard pattern and put the result in a new column. I do get a match, but the match does not seem to contain the complete string.
For example, I have this dataframe:
import pandas as pd
import re
df = pd.DataFrame(
    {
        'Col1': ['Some data 1', 'Some data 2', 'Some data 3'],
        'Col2': ['More data 1', 'More data 2', 'More data 3'],
        'Text': ['ID:12345;Description: This is a long piece of text containting some information I want to search for;Status=off;',
                 'Description: This is a another long piece of text containting some different information I want to search for;ID:abcde;Status=on;',
                 'Status=unknown;ID:abcde;Description: And this is a third piece of long piece of text which I want to search for;']
    }
)
What I tried is this:
df['Description'] = df['Text'].apply(lambda x: re.search('Description: (.*?);',x))
This results in the following dataframe:
Col1 Col2 Text Description
0 Some data 1 More data 1 ID:12345;Description: This is a long piece of text containting some information I want to search for;Status=off; <re.Match object; span=(9, 101), match='Description: This is a long piece of text contain>
1 Some data 2 More data 2 Description: This is a another long piece of text containting some different information I want to search for;ID:abcde;Status=on; <re.Match object; span=(0, 110), match='Description: This is a another long piece of text>
2 Some data 3 More data 3 Status=unknown;ID:abcde;Description: And this is a third piece of long piece of text which I want to search for; <re.Match object; span=(24, 112), match='Description: And this is a third piece of long pi>
It seems like the re.Match object is cutting off the result, since I am expecting a longer match. Can anybody explain what I am doing wrong?
You're not doing anything wrong.
You are seeing the "repr" of the match object, which doesn't display the full data.
text = 'ID:12345;Description: This is a long piece of text containting some information I want to search for;Status=off;'
print(re.search(r'Description: (.*?);', text))
# <re.Match object; span=(9, 101), match='Description: This is a long piece of text contain>
If you access the matched data itself, e.g. with .group(1), you can see it's all there:
print(re.search(r'Description: (.*?);', text).group(1))
# This is a long piece of text containting some information I want to search for
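Applied to the original `apply` approach, that could look roughly like the sketch below. The shortened sample strings and the helper name `extract_description` are made up for illustration; the guard is there because `re.search` returns `None` when a row has no match, and calling `.group(1)` on `None` would raise an `AttributeError`.

```python
import re

import pandas as pd

# Shortened, made-up rows; the second one deliberately has no Description field.
df = pd.DataFrame({
    'Text': [
        'ID:12345;Description: first description;Status=off;',
        'Status=on;ID:abcde;',
    ]
})

pattern = re.compile(r'Description: (.*?);')


def extract_description(text):
    """Return the captured description, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Store the captured string itself rather than the re.Match object.
df['Description'] = df['Text'].apply(extract_description)
print(df)
```

Rows without a match end up as a missing value in the new column instead of raising an error.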
Similarly with pandas (note: pandas exposes regex methods under the .str accessor):
df['Text'].str.extract(r'Description: (.*?);')
# 0
# 0 This is a long piece of text containting some ...
# 1 This is a another long piece of text containti...
# 2 And this is a third piece of long piece of tex...
df['Text'].str.extract(r'Description: (.*?);').iloc[0]
# 0 This is a long piece of text containting some ...
# Name: 0, dtype: object
The match is actually there:
df['Text'].str.extract(r'Description: (.*?);').iloc[0].item()
# 'This is a long piece of text containting some information I want to search for'
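To put the result into a new column, as in the question, one option is to pass `expand=False`, which makes `.str.extract` return a Series when the pattern has a single capture group. A minimal sketch with shortened, made-up sample strings:

```python
import pandas as pd

# Shortened, made-up rows in the same key/value layout as the question.
df = pd.DataFrame({
    'Text': [
        'ID:12345;Description: first description;Status=off;',
        'Description: second description;ID:abcde;Status=on;',
    ]
})

# With one capture group and expand=False the result is a Series,
# so it can be assigned directly as a new column.
df['Description'] = df['Text'].str.extract(r'Description: (.*?);', expand=False)
print(df['Description'].tolist())
```

Rows where the pattern does not match get a missing value rather than raising an error.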
To see everything at once, it can help to output the result with .to_csv() or .to_json(), e.g.:
print(df['Text'].str.extract(r'Description: (.*?);').to_json(indent=4))
{
    "0":{
        "0":"This is a long piece of text containting some information I want to search for",
        "1":"This is a another long piece of text containting some different information I want to search for",
        "2":"And this is a third piece of long piece of text which I want to search for"
    }
}
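Another option, if you just want the normal dataframe display to stop shortening long cells, is pandas' `display.max_colwidth` option. A small sketch, reusing the first row from the question; the variable names are only for illustration:

```python
import pandas as pd

text = ('ID:12345;Description: This is a long piece of text containting '
        'some information I want to search for;Status=off;')
df = pd.DataFrame({'Text': [text]})

result = df['Text'].str.extract(r'Description: (.*?);')
description = result.iloc[0, 0]

# With the default settings the repr shortens long cells and ends them with "...".
default_repr = repr(result)

# Lift the column width limit temporarily; None means "no limit".
with pd.option_context('display.max_colwidth', None):
    full_repr = repr(result)

print(default_repr)
print(full_repr)
```

`pd.set_option('display.max_colwidth', None)` changes the same setting for the rest of the session instead of only inside the `with` block.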