I have a list of HTML tags from Beautiful Soup output. I want to extract the text within each tag and place it into a list (spec_names).
li_tags = [<li>Brand: STIHL</li>, <li>Product: Chainsaw</li>,<li>Bar Length: 18 inch</li>, <li>Chain Brake: Yes</li>, <li>Weight: 14 pound</li>, <li>PoweredBy: Gas</li>]
I thought this would do it:
pattern = r'(?<=\<li\>).+?(?=\:)'
spec_names=[]
for x in li_tags:
    spec_names.append(re.search(pattern,x))
Also thought this would do it:
pattern = r'(?<=\<li\>).+?(?=\:)'
spec_names=[]
spec_names= [re.search(pattern,x) for x in li_tags]
There is a lot of online help on checking whether each list item is a match, but I want to extract the match from inside each list item. The end result would have spec_names as:
['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
I am not looking for a function, but procedural steps. Thank you in advance.
You don't usually use regex to extract text from Beautiful Soup tags. Use the .text property:
from bs4 import BeautifulSoup
html_doc = """\
<li>Brand: STIHL</li>
<li>Product: Chainsaw</li>
<li>Bar Length: 18 inch</li>
<li>Chain Brake: Yes</li>
<li>Weight: 14 pound</li>
<li>PoweredBy: Gas</li>"""
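soup = BeautifulSoup(html_doc, "html.parser")
# select() returns a list-like ResultSet of Tag objects, one per <li>
li_tags = soup.select("li")
# tag.text is the tag's inner text; keep only the part before the first colon
spec_names = [tag.text.split(":")[0] for tag in li_tags]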
print(spec_names)
Prints:
['Brand', 'Product', 'Bar Length', 'Chain Brake', 'Weight', 'PoweredBy']
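If you would still rather go the regex route, the attempts in the question fall short for two reasons: re.search expects a string rather than a Tag object, and it returns a match object (or None) rather than the matched text. Here is a minimal procedural sketch of that fix. To keep it runnable without Beautiful Soup, it assumes the tags have already been converted to strings; with real Tag objects you would pass str(x) to re.search instead. The name li_strings is only illustrative.

```python
import re

# Stand-in for [str(tag) for tag in li_tags]; illustrative data from the question
li_strings = [
    "<li>Brand: STIHL</li>",
    "<li>Product: Chainsaw</li>",
    "<li>Bar Length: 18 inch</li>",
    "<li>Chain Brake: Yes</li>",
    "<li>Weight: 14 pound</li>",
    "<li>PoweredBy: Gas</li>",
]

# Same idea as the question's pattern, without the unneeded escapes
pattern = r"(?<=<li>).+?(?=:)"

spec_names = []
for x in li_strings:
    match = re.search(pattern, x)  # a match object, or None if there is no colon
    if match:
        spec_names.append(match.group())  # group() returns the matched text

print(spec_names)
```

The `if match` guard matters because an item without a colon makes re.search return None, and calling .group() on None raises an AttributeError.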