python html web-scraping beautifulsoup html-parsing

beautifulsoup parse html content


I need to extract the date from each of my HTML files. I tried find_siblings('p'), but it returns None.

The date usually sits in its own p tag (mostly the third one) inside the div with id="a-body", but sometimes it is in the first p tag of that div:

<div class="sa-art article-width" id="a-body" itemprop="articleBody">
    <p class="p p1">text1</p>
    <p class="p p1">text2</p>
    <p class="p p1">
    January 6, 2009  8:00 am ET
    </p>
    ..
    ..
    ..
</div>

or

inside the first p tag, together with other information:

<div class="sa-art article-width" id="a-body" itemprop="articleBody">
    <p class="p p1">
      participant text1 text2 text3 January  8, 2009  5:00 PM ET
    </p>
    <p class="p p1">text</p>
    <p class="p p1">text</p>
    ..
    ..
</div>

My code simply takes the third p, but I don't know how to handle the case where the date is inside the first p along with other content:

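import bs4

# Parse one saved article page.
with open('C:/Users/output1/4069369.html', "r") as fo:
    soup = bs4.BeautifulSoup(fo, "lxml")

# Assumes the date is always in the third <p>, which is not always true.
d_date = soup.find_all('p')[2]
print(d_date.get_text(strip=True))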

Solution

  • You have to find the p element that contains the date. One way is to check each p against a list of month names and keep the text from the month name onward, like this:

    from bs4 import BeautifulSoup

    # Sample markup: the first date sits in the second <p>, after other text.
    div_test = """<div class="sa-art article-width" id="a-body" itemprop="articleBody">
    <p class="p p1">text1</p>
    <p class="p p1">
      participant text1 text2 text3 January  8, 2009  5:00 a.m. EST
    </p>
    <p class="p p1">text2</p>
    <p class="p p1">
    January 6, 2009  8:00 pm ET
    </p></div>"""
    soup = BeautifulSoup(div_test, "lxml")
    month_list = ['January', 'February', 'March', 'April', 'May', 'June',
                  'July', 'August', 'September', 'October', 'November', 'December']

    def first_date_p():
        # Walk the paragraphs in document order and stop at the first one
        # whose text contains a month name.
        for p in soup.find_all('p', {"class": "p p1"}):
            text = p.get_text()
            for month in month_list:
                if month in text:
                    # Keep everything from the month name onward.
                    return text[text.index(month):].strip()
        return None  # no paragraph mentions a month

    print(first_date_p())

    It prints the date text from the first p element that contains a month name, regardless of that element's position:

    January  8, 2009  5:00 a.m. EST
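
    The month-list check treats any paragraph that mentions a month as the date, and it returns everything after the month name. A regular expression can be stricter. The following is only a sketch under a few assumptions: dates always look like "Month D, YYYY H:MM am/pm" as in the samples above, the names MONTHS, DATE_RE and extract_date are made up for illustration, the trailing time zone (ET, EST) is ignored, and "html.parser" is used instead of "lxml" only to avoid an extra dependency.

```python
import re
from datetime import datetime

from bs4 import BeautifulSoup

MONTHS = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
          'August', 'September', 'October', 'November', 'December']

# Month name, day, year, then a 12-hour time such as "8:00 am" or "5:00 a.m.".
DATE_RE = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),\s+(\d{4})\s+"
    r"(\d{1,2}):(\d{2})\s*([ap])\.?m\.?",
    re.IGNORECASE,
)


def extract_date(html):
    """Return the first date inside div#a-body as a naive datetime, or None."""
    soup = BeautifulSoup(html, "html.parser")
    body = soup.find("div", id="a-body")
    if body is None:
        return None
    for p in body.find_all("p"):
        match = DATE_RE.search(p.get_text(" ", strip=True))
        if match:
            month, day, year, hour, minute, meridiem = match.groups()
            month_num = MONTHS.index(month.capitalize()) + 1
            # Convert the 12-hour clock to 24-hour (12 am -> 0, 12 pm -> 12).
            hour24 = int(hour) % 12 + (12 if meridiem.lower() == "p" else 0)
            return datetime(int(year), month_num, int(day), hour24, int(minute))
    return None


sample = """<div class="sa-art article-width" id="a-body" itemprop="articleBody">
<p class="p p1">
  participant text1 text2 text3 January  8, 2009  5:00 PM ET
</p>
<p class="p p1">text</p>
</div>"""
print(extract_date(sample))  # 2009-01-08 17:00:00
```

    Returning a datetime rather than a string makes it easier to sort or compare the dates collected from many files. If the files use other date formats, the pattern would need to be extended.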