Extract text from HTML tags using regex


My HTML text looks like this. I want to extract only the plain text from it using a regex in Python (not using HTML parsers):

<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>

How do I find the exact regex to get the plain text?


Solution

  • You might be better off using a parser here:

    import html
    import xml.etree.ElementTree as ET
    
    # the HTML snippet from the question
    string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
    Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
    </span></p>"""
    
    # decode HTML entities, then parse the markup into an element tree
    # (this only works if the markup is well-formed)
    root = ET.fromstring(html.unescape(string))
    
    # print the text of each direct child of the root <p>, here the <span>
    for child in root.findall("*"):
        print(child.text)
    

    This yields

    Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
    

    You will probably need to adjust the XPath expression for other documents. ElementTree supports only a limited subset of XPath, so check which expressions are available.
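
If the nesting depth is not known in advance, collecting all text below the root can be simpler than tuning a path expression. The following is a minimal sketch, assuming the markup is well-formed XML (real-world HTML with unclosed tags or entities such as `&nbsp;` will make `ET.fromstring` raise an error). The helper name `element_text` and the shortened `style` attribute are illustrative and not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened version of the snippet from the question (assumption: well-formed markup)
snippet = (
    '<p style="text-align: justify;"><span style="font-size: small;">\n'
    "Irrespective of the kind of small business you own, using traditional "
    "sales and marketing tactics can prove to be expensive.\n"
    "</span></p>"
)


def element_text(markup: str) -> str:
    """Return all text below the root element with whitespace normalized."""
    root = ET.fromstring(markup)
    # itertext() walks the root and every descendant in document order
    # and yields their inner text, however deeply it is nested
    joined = "".join(root.itertext())
    # collapse newlines and runs of spaces into single spaces
    return " ".join(joined.split())


plain = element_text(snippet)
print(plain)
```

Because `itertext()` visits every descendant, the same call works whether the sentence sits directly in the `<p>` or several levels deeper.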


    Addendum:

    It is possible to use a regular expression here, but this approach is really error-prone and not advisable:

    import re
    
    string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, h elvetica, sans-serif;">
    Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
    </span></p>"""
    
    rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')
    
    print(rx.findall(string))
    # ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']
    

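    The idea is to look for an uppercase letter at a word boundary and then match word characters, whitespace and commas up to a dot. A sentence that contains any other character, such as a hyphen or an apostrophe, will not be matched in full. See a demo on regex101.com.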
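
A different regex approach, closer to what the question asks for, is to remove the tags instead of trying to match the sentence. The following is a rough sketch, not a robust solution: it assumes no `>` appears inside attribute values and that the input has no comments, `<script>` or `<style>` blocks, all of which would defeat the pattern. The names `TAG_RX` and `strip_tags` are made up for this example.

```python
import html
import re

# Anything from "<" up to the next ">" is treated as a tag (assumption: no ">" inside attributes)
TAG_RX = re.compile(r"<[^>]+>")


def strip_tags(markup: str) -> str:
    """Remove tag-like substrings, decode entities and normalize whitespace."""
    # replace each tag with a space so adjacent words do not get glued together
    without_tags = TAG_RX.sub(" ", markup)
    # turn entities such as &amp; into the characters they stand for
    unescaped = html.unescape(without_tags)
    # collapse newlines and runs of spaces into single spaces
    return " ".join(unescaped.split())


sample = (
    '<p style="text-align: justify;"><span style="font-size: small;">\n'
    "Irrespective of the kind of small business you own, using traditional "
    "sales and marketing tactics can prove to be expensive.\n"
    "</span></p>"
)

result = strip_tags(sample)
print(result)
```

Unlike the sentence-matching pattern, this does not depend on which characters the text contains, but it remains fragile on real-world HTML, which is why a parser is still the safer choice.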