Extract text from HTML tags using regex

My HTML text looks like the sample below. I want to extract only the plain text from it using a regex in Python (not an HTML parser).

&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;

How can I write a regex that returns just the plain text?

Solution

You might be better off using a parser here:

import html
import xml.etree.ElementTree as ET

# entity-encoded input
string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

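# decode the entities and construct the DOM
root = ET.fromstring(html.unescape(string))

# print the text of each direct child of the root (here: the <span>)
for child in root.findall("*"):
    print((child.text or "").strip())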

This yields

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

You will probably need to adjust the XPath expression for differently structured markup, so have a look at the XPath subset that ElementTree supports.
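If the markup nests more deeply than in the example, one option is to collect every text node with itertext() instead of tuning the XPath. This is only a minimal sketch, assuming the decoded fragment is well-formed XML with a single root element; the helper name plain_text and the sample string are made up for illustration.

```python
import html
import xml.etree.ElementTree as ET


def plain_text(encoded: str) -> str:
    """Return the text content of an entity-encoded, well-formed fragment."""
    # Turn &lt; / &gt; / &quot; back into real markup before parsing.
    root = ET.fromstring(html.unescape(encoded))
    # itertext() yields the text of the element and all of its descendants
    # in document order; split()/join() collapses runs of whitespace.
    return " ".join("".join(root.itertext()).split())


# Hypothetical sample with one extra level of nesting.
sample = "&lt;p&gt;&lt;span&gt;\nHello, &lt;b&gt;world&lt;/b&gt;.\n&lt;/span&gt;&lt;/p&gt;"
print(plain_text(sample))
```

ET.fromstring raises ParseError on markup that is not well-formed (unclosed tags, a bare & in the text), so real-world HTML may need a more lenient parser.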

Addendum:

It is possible to use a regular expression here, but this approach is really error-prone and not advisable:

import re

string = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, h elvetica, sans-serif;&quot;&gt;
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
&lt;/span&gt;&lt;/p&gt;"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

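The idea is to look for an uppercase letter at a word boundary and then match word characters, whitespace and commas up to the next dot. It stops working as soon as the text contains any other character, such as an apostrophe or a hyphen. See a demo on regex101.com.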
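For comparison, a more general (but still fragile) regex approach is to decode the entities and then delete anything that looks like a tag. The sketch below assumes simple markup: no > inside attribute values, no comments, and no script or style blocks. The names TAG_RX and strip_tags are hypothetical, not part of any library.

```python
import html
import re

# Anything from "<" up to the next ">" is treated as a tag.
TAG_RX = re.compile(r"<[^>]+>")


def strip_tags(encoded: str) -> str:
    """Decode entities, drop tag-like spans, and collapse whitespace."""
    decoded = html.unescape(encoded)
    # Replace each tag with a space so text from adjacent blocks stays apart.
    without_tags = TAG_RX.sub(" ", decoded)
    return " ".join(without_tags.split())


# Hypothetical, shortened version of the input from the question.
sample = (
    "&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span&gt;\n"
    "Some plain text.\n"
    "&lt;/span&gt;&lt;/p&gt;"
)
print(strip_tags(sample))
```

Replacing tags with a space keeps neighbouring paragraphs apart but also splits a word that has inline markup in the middle of it; substituting an empty string has the opposite trade-off. Either way, this is no substitute for a parser on arbitrary HTML.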