python list text-extraction

Python: using re.search to extract a match from each item in a list


I have a list of HTML tags from Beautiful Soup output. I want to extract the text within each tag and place it into a list (spec_names).

li_tags = [<li>Brand: STIHL</li>, <li>Product: Chainsaw</li>, <li>Bar Length: 18 inch</li>, <li>Chain Brake: Yes</li>, <li>Weight: 14 pound</li>, <li>PoweredBy: Gas</li>]

I thought this would do it:

pattern = r'(?<=\<li\>).+?(?=\:)'
spec_names=[]
for x  in li_tags:
    spec_names.append(re.search(pattern,x))

I also thought this would do it:

pattern = r'(?<=\<li\>).+?(?=\:)'
spec_names=[]
spec_names= [re.search(pattern,x) for x in li_tags]

There is a lot of online help on checking whether each list item is a match, but I want to extract the match from inside each list item. The end result would have spec_names as:

['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']

I am not looking for a function, but procedural steps. Thank you in advance.


Solution

  • You usually don't need regex to extract text from BeautifulSoup tags. Use the .text property and split on the colon:

    from bs4 import BeautifulSoup
    
    html_doc = """\
    <li>Brand: STIHL</li>
    <li>Product: Chainsaw</li>
    <li>Bar Length: 18 inch</li>
    <li>Chain Brake: Yes</li>
    <li>Weight: 14 pound</li>
    <li>PoweredBy: Gas</li>"""
    
    soup = BeautifulSoup(html_doc, "html.parser")
    
    li_tags = soup.select("li")
    
    spec_names = [tag.text.split(":")[0] for tag in li_tags]
    print(spec_names)
    

    Prints:

    ['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
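
If you do want to stay with regex, here is a minimal sketch of what was likely missing from the attempts in the question. re.search returns a match object (or None), not the matched text, so you need to call .group() on the result. It also expects a string, so, assuming li_tags holds BeautifulSoup Tag objects, each one would have to be converted with str(tag) first. The li_strings list below is a stand-in I made up for those converted tags, so the example runs without bs4.

```python
import re

# Stand-in for [str(tag) for tag in li_tags]; assumed to match the markup in the question.
li_strings = [
    "<li>Brand: STIHL</li>",
    "<li>Product: Chainsaw</li>",
    "<li>Bar Length: 18 inch</li>",
    "<li>Chain Brake: Yes</li>",
    "<li>Weight: 14 pound</li>",
    "<li>PoweredBy: Gas</li>",
]

# Same idea as the pattern in the question: text after "<li>" up to the first colon.
pattern = re.compile(r"(?<=<li>).+?(?=:)")

spec_names = []
for item in li_strings:
    match = pattern.search(item)   # a match object, or None if nothing matched
    if match is not None:
        spec_names.append(match.group())   # .group() returns the matched text

print(spec_names)
```

Checking for None before calling .group() keeps the loop from failing on an item that has no colon. The .text approach above is still simpler when the tags come straight from BeautifulSoup.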