Tags: python, html, web-scraping, beautifulsoup, python-beautifultable

Beautiful Soup can't find the part of the HTML I want


I've been using BeautifulSoup for web scraping for a while, and this is the first time I've encountered a problem like this. I am trying to select the number 101,172 from the HTML below, but whether I use .find or .select, the output is always only the tag, not the number. I have done similar data collection before without any problems.

<div class="legend-block legend-block--pageviews">
      <h5>Pageviews</h5><hr>
      <div class="legend-block--body">
        <div class="linear-legend--counts">
          Pageviews:
          <span class="pull-right">
            101,172
          </span>
        </div>
        <div class="linear-legend--counts">
          Daily average:
          <span class="pull-right">
            4,818
          </span>
        </div></div></div>

I used:

import bs4
import requests

# wiki_page holds the URL of the page being scraped
res = requests.get(wiki_page, timeout=None)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
ab = soup.select('span[class="pull-right"]')
print(ab)

output:

[<span class="pull-right">\n<label class="logarithmic-scale">\n<input class="logarithmic-scale-option" type="checkbox"/>\n        Logarithmic scale      </label>\n</span>,
 <span class="pull-right">\n<label class="begin-at-zero">\n<input class="begin-at-zero-option" type="checkbox"/>\n        Begin at zero      </label>\n</span>,
 <span class="pull-right">\n<label class="show-labels">\n<input class="show-labels-option" type="checkbox"/>\n        Show values      </label>\n</span>]

Additionally, the number I am looking for is dynamic, so I am not sure whether JavaScript would affect BeautifulSoup.


Solution

  • Try this:

    from bs4 import BeautifulSoup as bs
    
    html='''<div class="legend-block legend-block--pageviews">
          <h5>Pageviews</h5><hr>
          <div class="legend-block--body">
            <div class="linear-legend--counts">
              Pageviews:
              <span class="pull-right">101,172
              </span>
            </div>
            <div class="linear-legend--counts">
              Daily average:
              <span class="pull-right">
                4,818
              </span>
            </div></div></div>'''
    soup = bs(html, 'html.parser')
    div = soup.find("div", {"class": "linear-legend--counts"})
    span = div.find('span')
    text = span.get_text()
    print(text)
    

    output:

    101,172
    

    Or, in a single line:

    soup = bs(html, 'html.parser')
    result = soup.find("div", {"class": "linear-legend--counts"}).find('span').get_text()
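
If both figures are needed, the same idea extends to every row of the legend. The following is only a sketch against the HTML snippet from the question; the `counts` dictionary and the way the label is taken from each row are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

html = """<div class="legend-block legend-block--pageviews">
  <h5>Pageviews</h5><hr>
  <div class="legend-block--body">
    <div class="linear-legend--counts">
      Pageviews:
      <span class="pull-right">
        101,172
      </span>
    </div>
    <div class="linear-legend--counts">
      Daily average:
      <span class="pull-right">
        4,818
      </span>
    </div></div></div>"""

soup = BeautifulSoup(html, "html.parser")

counts = {}  # illustrative name: maps each label to its displayed value
for row in soup.select("div.linear-legend--counts"):
    # strip=True removes the newlines and indentation around the number
    value = row.find("span", class_="pull-right").get_text(strip=True)
    # The first non-empty string in the row is the label, e.g. "Pageviews:"
    label = next(row.stripped_strings).rstrip(":")
    counts[label] = value

print(counts)
```

Scoping the search to `div.linear-legend--counts` avoids the unrelated `span.pull-right` elements (the checkbox labels) that the question's selector picked up.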
    

    EDIT:

    The OP has since posted another question, a possible duplicate of this one, and found an answer there. For anyone looking for an answer to a similar question, I am posting the accepted answer from that question. It can be found here.

    The JavaScript code is not executed when you retrieve the page with requests.get, so Selenium should be used instead. It mimics user behaviour by opening the page in a real browser, so the JS code runs and the page is fully rendered.
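
A small sketch of why the static approach comes back empty. The `static_html` string is a made-up stand-in for what `requests.get(...).text` might return from a script-driven page; it is not the real markup of the pageviews tool.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw response of a JavaScript-driven page:
# the container is empty and a script would fill it in a real browser.
static_html = """
<div class="legend-block--body"></div>
<script>
  document.querySelector('.legend-block--body').innerHTML =
    '<div class="linear-legend--counts">Pageviews: <span class="pull-right">101,172</span></div>';
</script>
"""

soup = BeautifulSoup(static_html, "html.parser")

# BeautifulSoup only parses markup. The script body stays inert text,
# so the element a browser would create is not in the parsed tree.
matches = soup.select(".linear-legend--counts span.pull-right")
print(matches)  # []

# The number exists only inside the script's source text.
print("101,172" in static_html)  # True
```

A quick check along these lines (searching the raw response for the value, or for the expected container) can help tell a selector mistake apart from content that is only rendered client-side.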

    To get started with Selenium, install it with pip install selenium. Then use the code below to retrieve your item:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    
    browser = webdriver.Firefox()
    # Each entry pairs a page URL with the CSS selector of the element to retrieve.
    wiki_pages = [("https://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&range=latest-20&pages=Star_Wars:_The_Last_Jedi",
                   ".summary-column--container .legend-block--pageviews .linear-legend--counts:first-child span.pull-right"),]
    for url, selector in wiki_pages:
        # Load the URL itself, not the whole (url, selector) tuple.
        browser.get(url)
        # find_element(By.CSS_SELECTOR, ...) replaces find_element_by_css_selector,
        # which was removed in Selenium 4.
        page_views_count = browser.find_element(By.CSS_SELECTOR, selector)
        print(page_views_count.text)
    browser.quit()
    

    NOTE: If you need to run a headless browser, consider using PyVirtualDisplay (a wrapper for Xvfb) to run headless WebDriver tests; see 'How do I run Selenium in Xvfb?' for more information.