I want to extract only the plain text from HTML using a regex in Python (not an HTML parser). My HTML looks like this:
<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>
How do I write a regex that extracts just the plain text?
You might be better off using a parser here:
import html
import xml.etree.ElementTree as ET
# the markup to parse
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
# decode HTML entities, then build the element tree
root = ET.fromstring(html.unescape(string))
# iterate over the direct children of the root <p> element
for child in root.findall("*"):
    print(child.text)
This yields
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
You will probably want to adjust the XPath expression for your real data, so have a look at the possibilities ElementTree supports.
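If the text is spread over several nested elements, one option is to collect every text node instead of printing a single element's .text. The following is only a minimal sketch, assuming the fragment is well-formed XML (ElementTree raises a ParseError on much real-world HTML); the helper name plain_text is made up for illustration:

```python
import xml.etree.ElementTree as ET


def plain_text(fragment):
    """Return all text inside a well-formed fragment, whitespace-normalised."""
    root = ET.fromstring(fragment)
    # itertext() yields every text node (element text and tails) in document order
    text = "".join(root.itertext())
    # collapse runs of whitespace, including the newlines around the sentence
    return " ".join(text.split())


print(plain_text('<p><span>\nHello <b>big</b> world.\n</span></p>'))
```

Joining the text nodes first and normalising whitespace afterwards keeps words separated by inline tags intact.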
It is possible to use a regular expression here, but this approach is really error-prone and not advisable:
import re
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""
rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')
print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']
The idea is to look for an uppercase letter at a word boundary and then match word characters, whitespace and commas up to the next dot. See a demo on regex101.com.
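Note that this pattern is tied to the shape of this one sentence: it misses text that starts with a lowercase letter or contains other punctuation. If a regex really is required, a somewhat more general sketch is to delete anything that looks like a tag and then decode the entities. This is still a shortcut and not a robust solution: it assumes no ">" appears inside attribute values and it does not handle comments or script/style content. The name strip_tags is hypothetical:

```python
import html
import re

# Treat anything from "<" up to the next ">" as a tag (a deliberate simplification).
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(markup):
    """Remove tag-like spans, decode entities, and collapse whitespace."""
    without_tags = TAG_RE.sub("", markup)
    decoded = html.unescape(without_tags)
    return " ".join(decoded.split())


print(strip_tags('<p style="x"><span>\nFish &amp; chips, please.\n</span></p>'))
```

Because the tags are replaced with an empty string, text from adjacent block elements ends up glued together; substitute a space instead if that matters for your input.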