I am using Python 3.7 and have a test.txt file that looks like this:
<P align="left">
<FONT size="2">Prior to this offering, there has been no public
market for our common stock. The initial public offering price
of our common stock is expected to be between
$ and
$ per
share. We intend to list our common stock on the Nasdaq National
Market under the symbol
“ ”.
</FONT>
I need to extract everything that follows the "be between" (row 4) until "per share" (row 7). Here is the code I run:
price = []
with open("test.txt", 'r') as f:
for line in f:
if "be between" in line:
price.append(line.rstrip().replace(' ','')) #remove '\n' and ' '
print(price)
['of our common stock is expected to be between']
I first locate the "be between" and then ask to append the line, but the problem is that everything that comes next is cut because it is in the following lines.
My desired output would be:
['of our common stock is expected to be between $ and $ per share']
How can I do it? Thank you very much in advance.
The right way with html.unescape
and re.search
features:
import re
from html import unescape
price_texts = []
with open("test.txt", 'r') as f:
content = unescape(f.read())
m = re.search(r'price\b(.+\bper\s+share\b)', content, re.DOTALL)
if m:
price_texts.append(re.sub(r'\s{2,}|\n', ' ', m.group(1)))
print(price_texts)
The output:
[' of our common stock is expected to be between $ and $ per share']