python html beautifulsoup html-parsing

Is there a Python package that converts a given HTML structure into JSON/YAML format?


For example, given this HTML:

<p>Example of a paragraph element.</p> 
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>

it needs to be represented as follows (shown here in a YAML-like format, but JSON is also fine):

p: Example of a paragraph element.
ul:
   li:Coffee
   li:Tea
   li:Milk

Solution

  • I'm not sure there is a package for this, but you could iterate through each tag in the HTML, use .name and .text to build each line, and write the result to a file:

    html = '''<p>Example of a paragraph element.</p> 
    <ul>
      <li>Coffee</li>
      <li>Tea</li>
      <li>Milk</li>
    </ul>'''
    
    
    from bs4 import BeautifulSoup
    
    soup = BeautifulSoup(html, 'html.parser')
    
    # find_all() with no arguments returns every tag, including nested ones
    for tag in soup.find_all():
        print(tag.name + ':' + tag.text)
    

    Output:

    p:Example of a paragraph element.
    ul:
    Coffee
    Tea
    Milk
    
    li:Coffee
    li:Tea
    li:Milk
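
The flat loop above reports each list item twice (once inside the text of ul, once as its own li), because .text includes the text of all descendants. If a nested result is wanted, one possible approach is to build a tree and dump it as JSON. The following is only a minimal sketch: it uses the standard library's html.parser and json modules instead of BeautifulSoup so that it runs without extra installs, and the names TreeBuilder, simplify and html_to_tree are hypothetical helpers, not part of any package. It assumes well-formed HTML and ignores attributes.

```python
import json
from html.parser import HTMLParser

# Tags that never have a closing tag; assumed sufficient for this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Hypothetical helper: collects tags into nested lists of {tag: children}."""

    def __init__(self):
        super().__init__()
        self.root = []             # top-level nodes
        self.stack = [self.root]   # child lists of the currently open tags

    def handle_starttag(self, tag, attrs):
        children = []
        self.stack[-1].append({tag: children})
        if tag not in VOID_TAGS:
            self.stack.append(children)

    def handle_endtag(self, tag):
        # Assumes well-formed input: every end tag closes the innermost open tag.
        if tag not in VOID_TAGS and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:                   # skip whitespace between tags
            self.stack[-1].append(text)


def simplify(nodes):
    """Turn {tag: ["text"]} into {tag: "text"}, recursing into children."""
    result = []
    for node in nodes:
        if isinstance(node, str):
            result.append(node)
            continue
        tag, children = next(iter(node.items()))
        children = simplify(children)
        if len(children) == 1 and isinstance(children[0], str):
            children = children[0]
        result.append({tag: children})
    return result


def html_to_tree(html):
    parser = TreeBuilder()
    parser.feed(html)
    parser.close()
    return simplify(parser.root)


if __name__ == "__main__":
    html = '''<p>Example of a paragraph element.</p>
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>'''
    print(json.dumps(html_to_tree(html), indent=2))
```

Each element becomes a single-key dict inside a list rather than a key in one big dict, because a plain dict cannot hold three separate li keys. The same structure could be written as YAML with a YAML library such as PyYAML, if one is installed.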