Python extract text with line cuts


I am using Python 3.7 and have a test.txt file that looks like this:

<P align="left">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;and
$&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
&#147;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#148;.
</FONT>

I need to extract everything from the line containing "be between" (row 4) through "per share" (which is split across rows 6 and 7). Here is the code I run:

price = []
with open("test.txt", 'r') as f:
    for line in f:
        if "be between" in line:
            price.append(line.rstrip().replace('&nbsp;','')) #remove '\n' and '&nbsp;'
print(price)

Output:

['of our common stock is expected to be between']

I first locate the line containing "be between" and append it, but the rest of the sentence is lost because it sits on the following lines.

My desired output would be:

['of our common stock is expected to be between $ and $ per share']

How can I do it? Thank you very much in advance.


Solution

  • Read the whole file at once, decode the HTML entities with html.unescape, and let re.search match across line breaks:

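    import re
    from html import unescape

    price_texts = []
    with open("test.txt", 'r') as f:
        # Read the whole file so the match can span line breaks;
        # unescape turns &nbsp; into '\xa0', which \s also matches.
        content = unescape(f.read())
        # DOTALL lets '.' cross newlines: capture from after "price" through "per share".
        m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
        if m:
            # Collapse each whitespace run (newlines, '\xa0') into one space.
            price_texts.append(re.sub(r'\s+', ' ', m.group(1)).strip())

    print(price_texts)

    The output:

    ['of our common stock is expected to be between $ and $ per share']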
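
  • If you would rather keep the line-by-line loop from the question, a minimal sketch is to start collecting at the line that contains "be between" and stop once the joined text contains "per share". The function name extract_between and the inline SAMPLE string (an abbreviated copy of test.txt with fewer &nbsp; entities) are hypothetical and only there to make the example self-contained:

```python
from html import unescape

# Abbreviated stand-in for test.txt (assumption: same layout, fewer &nbsp;).
SAMPLE = """\
<P align="left">&nbsp;&nbsp;
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$&nbsp;&nbsp;&nbsp;&nbsp;and
$&nbsp;&nbsp;&nbsp;per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
&#147;&nbsp;&nbsp;&#148;.
</FONT>
"""


def extract_between(lines, start="be between", end="per share"):
    """Return the text from the line containing `start` through `end`.

    `lines` is any iterable of strings (an open file works too).
    Returns None if `start` or `end` is never found.
    """
    collected = []
    for line in lines:
        # Skip everything until the start marker shows up.
        if not collected and start not in line:
            continue
        collected.append(unescape(line))
        # str.split() with no arguments splits on any whitespace,
        # including newlines and the '\xa0' that &nbsp; decodes to.
        text = " ".join(" ".join(collected).split())
        idx = text.find(end)
        if idx != -1:
            # Cut right after the end marker.
            return text[: idx + len(end)]
    return None


print(extract_between(SAMPLE.splitlines()))
```

    With the real file, the same call would be extract_between(f) inside the with open("test.txt") as f: block. The joined text is re-normalized on every line, which is fine for a span of a few lines but would be wasteful over very long ranges.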