I have sentences in the following form. I want to extract all numeric values occurring after any given token. For example, I want to extract all numeric values after the phrase "tangible net worth"
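Example sentences:
A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5
Minimum required tangible net worth the firm needs to maintain is $50000000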
From both of these sentences, I want to extract "$100000000" and "$50000000"
and create a dictionary like this:
{
"tangible net worth": "$100000000"
}
I am unsure how to use the Python re module to achieve this. Care is needed here, because a significant portion of the sentences contain multiple numeric values, so I want to extract only the value that immediately follows the match. I have tried the following expressions, but none of them gives the desired result:
re.search(r'net worth.*(\d+)', sent)
re.search(r'(net worth)(.*)(\d+)', sent)
re.search(r'(net worth)(.*)(\d?)', sent)
re.findall(r'tangible net worth (.*)?(\d* )', sent)
re.findall(r'tangible net worth (.*)?( \d* )', sent)
re.findall(r'tangible net worth (.*)?(\d)', sent)
A little help with the regular expression would be highly appreciated. Thanks.
You could use this regex:
tangible net worth\D*(\d+)
which will skip any non-digit characters after "tangible net worth" before capturing the first run of digits that occurs after it.
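To see why the attempts in the question pick up the wrong value, here is a small sketch comparing the first attempted pattern with the `\D*` version on one of the example sentences. The variable names `greedy` and `first` are only for illustration.

```python
import re

sent = ("A company must maintain a minimum tangible net worth "
        "of $100000000 and leverage ratio of 0.5")

# Greedy `.*` runs to the end of the string and backtracks only as far as
# needed for `\d+` to match, so the group captures the last digit.
greedy = re.search(r'net worth.*(\d+)', sent).group(1)

# `\D*` cannot cross a digit, so the group starts at the first digit
# that follows the phrase.
first = re.search(r'tangible net worth\D*(\d+)', sent).group(1)

print(greedy)  # 5
print(first)   # 100000000
```

The difference is that `.*` is allowed to swallow digits while `\D*` is not, which is what pins the capture to the number immediately after the phrase.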
You can then place the result into a dict. Note that I would recommend storing a number rather than a string, as you can always format it on output (adding $, comma thousands separators, etc.).
import re

strs = [
    "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5",
    "Minimum required tangible net worth the firm needs to maintain is $50000000"
]
result = []
for sent in strs:
    # \D* skips the non-digits, so the group is the first number after the phrase
    m = re.findall(r'tangible net worth\D*(\d+)', sent)
    if m:
        # store the value as an int; formatting can be applied on output
        result.append({'tangible net worth': int(m[0])})
print(result)
Output:
[
{'tangible net worth': 100000000},
{'tangible net worth': 50000000}
]
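Since the question asks about "any given token", the same idea could be wrapped in a small helper. This is only a sketch: the function name `first_number_after` and the token "current ratio" are made up for illustration, and the pattern captures whole digits only, so decimals or values written with comma separators would need a different pattern.

```python
import re

def first_number_after(token, sentence):
    """Return the first run of digits after `token` as an int, or None."""
    # re.escape guards against tokens that contain regex metacharacters.
    pattern = re.escape(token) + r'\D*(\d+)'
    m = re.search(pattern, sentence)
    return int(m.group(1)) if m else None

sent = ("A company must maintain a minimum tangible net worth "
        "of $100000000 and leverage ratio of 0.5")

# "current ratio" does not occur in the sentence, so it maps to None.
tokens = ["tangible net worth", "current ratio"]
values = {t: first_number_after(t, sent) for t in tokens}
print(values)  # {'tangible net worth': 100000000, 'current ratio': None}

# Formatting is applied only on output, as suggested above.
formatted = f"${values['tangible net worth']:,}"
print(formatted)  # $100,000,000
```

Keeping the value as an int and formatting it with `:,` at print time means the same dict can be used for arithmetic or comparisons without parsing strings again.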