Tags: python, xpath, lxml, urllib2, xml.etree

Web elements extraction from websites using Python


I want to extract various elements from tables and paragraph texts from this website.

https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655

This is the code I am using:

import lxml
from lxml import html
from lxml import etree
import urllib2
source = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30656&SSO=1').read()
x = etree.HTML(source)
growth = x.xpath('//*[@id="home_feature_container"]/div/div[2]/div/table[2]/tbody/tr[3]/td[2]/p')
growth

What is the best way to extract these elements without having to update the XPath in the code every time? They publish new data on the same page every month, but the XPath sometimes changes slightly.


Solution

  • If the position of the items you want changes regularly, try retrieving them by name instead. For example, here is how to extract the elements from the "New Orders" row of the table.

    import requests  # generally more convenient than urllib2
    from lxml import html, etree
    
    url = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
    page = requests.get(url)
    tree = html.fromstring(page.content)
    
    neworders = tree.xpath('//strong[text()="New Orders"]/../../following-sibling::td/p/text()')
    
    print(neworders)
    
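    If the page's whitespace or nesting shifts slightly from month to month, `normalize-space()` and an `ancestor::` step can make the match more tolerant. Here is a sketch of that idea against a small stand-in snippet (the snippet itself is hypothetical; on the live page you would parse `page.content` as above):

```python
from lxml import html

# Hypothetical stand-in for the fetched page
snippet = """
<table><tbody>
  <tr><td><p><strong> New Orders </strong></p></td><td><p>Growing</p></td></tr>
</tbody></table>
"""
tree = html.fromstring(snippet)

# normalize-space() ignores stray whitespace around the label, and
# ancestor::td climbs out of whatever wrappers sit around the <strong>
values = tree.xpath(
    '//strong[normalize-space()="New Orders"]/ancestor::td/following-sibling::td//text()'
)
print([v.strip() for v in values if v.strip()])
```

    This way the query is keyed on the row label rather than on positional indices like `tr[3]/td[2]`.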

    Or if you want the whole HTML table:

    data = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/../..')
    
    for elements in data:
        print(etree.tostring(elements, pretty_print=True).decode())  # tostring returns bytes in Python 3
    
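    If you would rather have the table as Python lists than as raw HTML, you can walk the matched table's rows with `text_content()`. A sketch, again using a hypothetical stand-in snippet in place of the live page:

```python
from lxml import html

# Hypothetical stand-in for the page; the real table is matched by its header text
snippet = """
<table>
  <tr><th colspan="2">MANUFACTURING AT A GLANCE</th></tr>
  <tr><td>Index</td><td>PMI</td></tr>
  <tr><td>Value</td><td>54.8</td></tr>
</table>
"""
tree = html.fromstring(snippet)
table = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/ancestor::table')[0]

# one list per row, taking the text of every cell (th or td)
rows = [
    [cell.text_content().strip() for cell in row.xpath('./td | ./th')]
    for row in table.xpath('.//tr')
]
print(rows)
```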

    Another example, using BeautifulSoup:

    from bs4 import BeautifulSoup
    import requests
    
    url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
    
    content = requests.get(url).content
    
    soup = BeautifulSoup(content, "lxml")
    
    table = soup.find_all('table')[1]
    
    table_body = table.find('tbody')
    
    data = []
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        cols = [ele.text.strip() for ele in cols]
        data.append([ele for ele in cols if ele])
    
    print(data)
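    The hard-coded `find_all('table')[1]` index is the same kind of positional assumption that broke the original XPath. With BeautifulSoup you can instead locate the table by its header text and climb to the enclosing `<table>`. A sketch with a hypothetical stand-in snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page content
content = """
<table><tr><th>Other</th></tr></table>
<table>
  <tr><th>MANUFACTURING AT A GLANCE</th></tr>
  <tr><td>PMI</td><td>54.8</td></tr>
</table>
"""
soup = BeautifulSoup(content, "html.parser")

# find the <th> by its exact text, then climb to the table that contains it
header = soup.find("th", string="MANUFACTURING AT A GLANCE")
table = header.find_parent("table")

data = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
data = [row for row in data if row]  # drop header-only rows with no <td>
print(data)
```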