re.findall between two strings (but dismiss numeric digits)


I am trying to parse many txt files. The following text is just part of a larger txt file.

<P STYLE="font: 10pt Times New Roman, Times, Serif; margin: 0; text-align: justify">Prior to this primary offering, there has
been no public market for our common stock. We anticipate that the public offering price of the shares will be between $5.00 and
$6.00. We have applied to list our common stock on the Nasdaq Capital Market (&ldquo;Nasdaq&rdquo;) under the symbol &ldquo;HYRE.&rdquo;
If our application is not approved or we otherwise determine that we will not be able to secure the listing of our common stock
on the Nasdaq, we will not complete this primary offering.</P>

My desired output: be between $5.00 and $6.00. So I need to extract everything from be between up to the following . (without stopping at the decimal point in 5.00!). I tried the following (Python 3.7):

shareprice = re.findall(r"be between\s\$.+?\.", text, re.DOTALL) 

But this code gives me: be between $5. (it stops at the decimal point). I initially added a \s at the end of the pattern to require whitespace after the ., which would keep the decimal point in 5.00, but many of the other txt files do not have whitespace right after the . that ends the sentence. Is there any way I can specify in my pattern that I want to "skip" a \. that is followed by numeric digits?

Thank you very much. I hope it was clear. Best


Solution

  • After parsing the plain text out of the HTML, you may consider matching any 0+ chars, as few as possible, followed by a . that is not followed by a digit:

    r"be between\s*\$.*?\.(?!\d)"
    

    See the regex demo.

    Alternatively, if you only want to ignore a dot that sits STRICTLY between two digits, you may use

    r"be between\s*\$.*?\.(?!(?<=\d\.)\d)"
    

    See this regex demo. The (?!(?<=\d\.)\d) lookahead makes sure that only a . inside a \d\.\d sequence is skipped on the way to the first matching ., and not every . that is merely followed by a digit (\.\d).
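
As a rough check, the two patterns can be compared in Python. This is only a sketch: the first sample string is the question's paragraph with the HTML tag stripped by hand, the second sample (ending in "note A.1") is made up to show where the patterns differ, and re.DOTALL is carried over from the question's code because the text contains line breaks.

```python
import re

# The question's paragraph, with the <P> tag stripped by hand (assumption:
# the plain text has already been parsed out of the HTML).
text = (
    "Prior to this primary offering, there has\n"
    "been no public market for our common stock. We anticipate that the "
    "public offering price of the shares will be between $5.00 and\n"
    "$6.00. We have applied to list our common stock on the Nasdaq."
)

# Hypothetical sample, invented to show where the two patterns differ.
tricky = "The price will be between $5.00 and $6.00 (see note A.1). More text."

# Pattern 1: stop at the first dot that is not followed by a digit.
not_before_digit = re.compile(r"be between\s*\$.*?\.(?!\d)", re.DOTALL)

# Pattern 2: skip a dot only when it sits strictly between two digits.
not_between_digits = re.compile(r"be between\s*\$.*?\.(?!(?<=\d\.)\d)", re.DOTALL)


def find_clean(pattern, s):
    """Return all matches with runs of whitespace collapsed to one space."""
    return [" ".join(m.split()) for m in pattern.findall(s)]


print(find_clean(not_before_digit, text))
print(find_clean(not_between_digits, text))
print(find_clean(not_before_digit, tricky))
print(find_clean(not_between_digits, tricky))
```

On the question's text both patterns return the full "be between $5.00 and $6.00." sentence fragment. On the made-up sample, the first pattern also skips the dot in "A.1" because a digit follows it, while the second stops there because no digit precedes that dot.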