python-3.x, regex, python-re

Extracting the first numerical value occurring after some token in text in Python


I have sentences of the following form, and I want to extract the numeric value that follows a given token. For example, I want the numeric value that comes after the phrase "tangible net worth".

Example sentences:

  1. "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5"
  2. "Minimum required tangible net worth the firm needs to maintain is $50000000".

From both of these sentences, I want to extract "$100000000" and "$50000000" and create a dictionary like this:

{
    "tangible net worth": "$100000000"
}

I am unsure how to use Python's re module to achieve this. Care is needed because a significant portion of the sentences contain multiple numeric values, so I want to extract only the value that immediately follows the match. I have tried the following expressions, but none of them gives the desired result:

re.search(r'net worth.*(\d+)', sent)
re.search(r'(net worth)(.*)(\d+)', sent)
re.search(r'(net worth)(.*)(\d?)', sent)
re.findall(r'tangible net worth (.*)?(\d* )', sent)
re.findall(r'tangible net worth (.*)?( \d* )', sent)
re.findall(r'tangible net worth (.*)?(\d)', sent)

A little help with the regular expression will be highly appreciated. Thanks.


Solution

  • You could use this regex:

    tangible net worth\D*(\d+)
    

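    which skips any non-digit characters after tangible net worth and then captures the first run of digits that follows. A greedy .*, as in net worth.*(\d+), consumes as much as possible and leaves only the last digit of the sentence for the group; \D* cannot run past a digit, so the match stops at the first number.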

    You can then place the result into a dict. I would recommend storing a number rather than a string, since you can always format it on output (adding $, comma thousands separators, etc.).

    import re
    
    strs = [
        "A company must maintain a minimum tangible net worth of $100000000 and leverage ratio of 0.5",
        "Minimum required tangible net worth the firm needs to maintain is $50000000"
    ]
    
    result = []
    for sent in strs:
        # \D* skips the non-digit text between the phrase and the first number
        m = re.search(r'tangible net worth\D*(\d+)', sent)
        if m:
            # store the captured digits as an int rather than a string
            result.append({'tangible net worth': int(m.group(1))})
    
    print(result)
    

    Output:

    [{'tangible net worth': 100000000}, {'tangible net worth': 50000000}]
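
Since the question asks about "any given token", here is a minimal sketch of how the same idea might be generalised to several phrases at once. The helper names first_number_after and extract_values are hypothetical, and the number pattern (optional comma-grouped thousands, optional decimal part) and the case-insensitive matching are assumptions about the input, not something the question specifies. Note that the plain (\d+) group above would capture only 0 from a value like 0.5, which is why the sketch widens the capture group.

```python
import re

# Assumed number shape: digits, optional comma-grouped thousands, optional decimals.
_NUMBER = r'(\d+(?:,\d{3})*(?:\.\d+)?)'

def first_number_after(token, sentence):
    """Return the first number that follows `token` in `sentence`, or None."""
    # re.escape keeps any regex metacharacters in the token literal;
    # \D* skips the non-digit text between the token and the number.
    pattern = re.escape(token) + r'\D*' + _NUMBER
    m = re.search(pattern, sentence, flags=re.IGNORECASE)  # case-insensitivity is an assumption
    if m is None:
        return None
    text = m.group(1).replace(',', '')
    # Keep whole numbers as int, anything with a decimal point as float.
    return float(text) if '.' in text else int(text)

def extract_values(tokens, sentence):
    """Map each token that is followed by a number to that number."""
    found = {}
    for token in tokens:
        value = first_number_after(token, sentence)
        if value is not None:
            found[token] = value
    return found

sent = ("A company must maintain a minimum tangible net worth of $100000000 "
        "and leverage ratio of 0.5")
values = extract_values(["tangible net worth", "leverage ratio", "current ratio"], sent)
print(values)
# Formatting is applied only on output, as suggested above.
print(f"${values['tangible net worth']:,}")
```

One caveat applies to both versions: \D* will skip any amount of non-digit text, so if a phrase is not followed by a number in its own clause, the match picks up the next number in the sentence even if it belongs to something else. If that matters for your data, you may want to limit how far the skip can reach, for example with a bounded quantifier such as \D{0,40}.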