python, html, python-re

How to find the text between <h3> and </h3> in an HTML page with Python


I have an HTML page and need to collect, as a list, the text contained between the <h3> and </h3> tags:

<h3 id="basics">1. Creating a Web Page</h3>
<p>

Once you've made your "home page" (index.html) you can add more pages to
your site, and your home page can link to them.

<h3 id="syntax">>2. HTML Syntax</h3>

I don't know how to write a pattern for this. Please help me get the values "1. Creating a Web Page" and ">2. HTML Syntax".


Solution

  • You can use a library such as BeautifulSoup, which parses the HTML for you, instead of writing a pattern yourself.

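    import requests
    from bs4 import BeautifulSoup

    # Replace the placeholder with the address of your page.
    response = requests.get('url to your page')
    response.encoding = 'utf-8'

    # "html5lib" is a separate package; the built-in "html.parser" also works.
    soup = BeautifulSoup(response.text, "html5lib")

    # Collect every <h3> element on the page and print its text.
    for h3 in soup.find_all('h3'):
        print(h3.text)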
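
Since the question asks for a pattern, here is a minimal sketch of a regex-only approach using the standard `re` module, run against the sample markup from the question. The names `SAMPLE_HTML`, `H3_PATTERN` and `extract_h3_texts` are made up for this illustration. It assumes the headings contain plain text only; any nested tags inside an `<h3>` would be returned as-is, which is one reason a real parser is usually the safer choice.

```python
import re
from html import unescape

# Sample markup copied from the question.
SAMPLE_HTML = """<h3 id="basics">1. Creating a Web Page</h3>
<p>

Once you've made your "home page" (index.html) you can add more pages to
your site, and your home page can link to them.

<h3 id="syntax">>2. HTML Syntax</h3>
"""

# <h3\b[^>]*>  opening tag, with or without attributes
# (.*?)        non-greedy capture up to the nearest closing tag
# DOTALL lets a heading span several lines; IGNORECASE also accepts <H3>.
H3_PATTERN = re.compile(r"<h3\b[^>]*>(.*?)</h3\s*>", re.DOTALL | re.IGNORECASE)


def extract_h3_texts(html_text):
    """Return the inner text of every <h3>...</h3> pair in html_text."""
    # unescape() turns entities such as &amp; back into characters.
    return [unescape(inner.strip()) for inner in H3_PATTERN.findall(html_text)]


headings = extract_h3_texts(SAMPLE_HTML)
print(headings)
```

The non-greedy `(.*?)` matters here: a greedy `(.*)` with `DOTALL` would run from the first `<h3>` to the last `</h3>` and return one long string instead of two headings.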