I need to extract a name and a quote for a given text such as:
Homer Simpson said: "Okay, here we go..."
The returned values:
- extracted_person_name - The extracted person name, as appearing in the patterns explained above
- extracted_quotation - The extracted quoted text (without the surrounding quotation marks).
- Important Note: if the pattern is not found, return None values for both the
extracted person name and the extracted text.
You could expect the input text to look similar to the following pattern:
Person name said: "quoted text"
Variations of the above pattern:
The colon punctuation mark (:) is optional and might not appear in the input sentence. Instead of the word said you could also expect the words:
answered, responded, replied
This is what I've got so far:
def person_quotation_pattern_extraction(raw_text):
    name_pattern = "\w+\s\w+"
    quote_pattern = "["]\w+["]"
    quote = re.search(quote_pattern, raw_text)
    name = re.search(name_pattern, raw_text)
    if re.search(quote_pattern,raw_text):
        extracted_quotation = quote.group(0)
    else:
        extracted_quotation=None
    if re.search(name_pattern,raw_text):
        extracted_person_name = name.group(0)
    else:
        extracted_person_name=None
    return extracted_person_name, extracted_quotation
The problem is that it returns Null. I'm assuming the patterns are incorrect; can you tell me what's wrong with them?
The first pattern is all right. It could match "Homer Simpson" as well as "here we", but since re.search only returns the first match, this is fine.
The second pattern has some issues. Since you open the string with " and use the same " inside the string, Python thinks the string ended there. You can observe this from the colors of the characters changing from green (strings) to black (not strings) and back to green.
quote_pattern = "["]\w+["]"
You can prevent this by starting (and ending) your string with single quotation marks ' like this:
quote_pattern ='["]\w+["]'
However, this still does not match the provided quote. This is because \w matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text) but does not match the comma, the periods, or the whitespace.
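A quick check can make this visible. The following is only a small sketch using the example sentence from the question; the variable names no_match and one_word are made up for illustration.

```python
import re

text = 'Homer Simpson said: "Okay, here we go..."'

# Single quotes delimit the Python string, so the double quotes inside are literal.
# The r prefix keeps \w from being treated as a string escape.
quote_pattern = r'["]\w+["]'

# \w+ cannot cross the comma, the spaces, or the periods, so nothing matches.
no_match = re.search(quote_pattern, text)
print(no_match)

# A quote consisting of a single word does match.
one_word = re.search(quote_pattern, 'Homer Simpson said: "Okay"')
print(one_word.group(0))
```

So the pattern itself is now valid, it is just too narrow for quotes that contain more than one word.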
Therefore you could change the pattern to
quote_pattern ='["].*["]'
Where .* matches any sequence of characters (except line breaks, by default).
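As a small illustration (the second input sentence is invented for this example and is not from the question), the pattern now finds the whole quote. Note that .* is greedy, so if a text contained two quoted parts it would run from the first quotation mark to the last one.

```python
import re

quote_pattern = r'["].*["]'

text = 'Homer Simpson said: "Okay, here we go..."'
whole = re.search(quote_pattern, text).group(0)
print(whole)

# .* is greedy: it extends to the last quotation mark in the text.
two_quotes = 'He said: "one" and then "two"'
greedy = re.search(quote_pattern, two_quotes).group(0)
print(greedy)
```

For the single-quote inputs described in the question this greediness should not matter.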
You can further simplify the expression by removing the square brackets. They are not needed in this case since they contain only one element.
quote_pattern ='".*"'
You need to return the quote without the surrounding quotation marks. Therefore you can create a capture group in the expression using ():
quote_pattern ='"(.*)"'
This way the quotation marks are still needed to match, but a group is created which does not contain them. This group is going to have index 1 instead of the 0 you use at the moment:
extracted_quotation = quote.group(1)
This should lead to the desired result.
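If you also want to enforce the full pattern from your task (one of the four verbs and the optional colon), one possible approach is to combine both patterns into a single expression. This is only a sketch under a few assumptions: the name always consists of two words (as in your name_pattern), the combined regex and the constant name PATTERN are my own, and (None, None) is returned whenever the whole pattern is not found.

```python
import re

# Assumed combined pattern: two-word name, one of the allowed verbs,
# an optional colon, then the quoted text captured without its quotation marks.
PATTERN = re.compile(
    r'(\w+\s\w+)\s+(?:said|answered|responded|replied)\s*:?\s*"(.*)"'
)


def person_quotation_pattern_extraction(raw_text):
    match = PATTERN.search(raw_text)
    if match is None:
        # Pattern not found: return None for both values.
        return None, None
    # Group 1 is the name, group 2 is the quote without quotation marks.
    return match.group(1), match.group(2)


print(person_quotation_pattern_extraction('Homer Simpson said: "Okay, here we go..."'))
print(person_quotation_pattern_extraction('Marge Simpson replied "Sure"'))
print(person_quotation_pattern_extraction('Homer Simpson shouted: "Doh"'))
```

Using one expression means a name is only returned when it is actually followed by a verb and a quote, which two independent searches cannot guarantee.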
Check out this website for some interactive regex action: https://regex101.com/