I want to extract various elements from the tables and paragraph text on this website:
https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655
This is the code I am using:
import lxml
from lxml import html
from lxml import etree
import urllib2
source = urllib2.urlopen('https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30656&SSO=1').read()
x = etree.HTML(source)
growth = x.xpath('//*[@id="home_feature_container"]/div/div[2]/div/table[2]/tbody/tr[3]/td[2]/p')
growth
What is the best way to extract the elements I want without having to change the XPath in the code every time? They publish new data on the same page every month, but the XPath sometimes changes slightly.
If the position of the items you want changes regularly, try retrieving them by name instead of by position. Here is, for example, how to extract the cell next to the "New Orders" label.
import requests  # better suited to this than urllib2
from lxml import html, etree
url = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
page = requests.get(url)
tree = html.fromstring(page.content)
neworders = tree.xpath('//strong[text()="New Orders"]/../../following-sibling::td/p/text()')
print(neworders)
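The same look-up-by-label idea can be wrapped in a small helper so each monthly release only needs the row names, not fresh XPaths. A minimal sketch, tested on invented stand-in markup (the `SAMPLE` string and `cell_for` helper are assumptions for illustration, not the real ISM page structure):

```python
from lxml import html

# Invented stand-in for the ISM "MANUFACTURING AT A GLANCE" table layout.
SAMPLE = """
<table>
  <tr><th>MANUFACTURING AT A GLANCE</th></tr>
  <tr><td><p><strong>New Orders</strong></p></td><td><p>60.3</p></td></tr>
  <tr><td><p><strong>Production</strong></p></td><td><p>62.4</p></td></tr>
</table>
"""

def cell_for(tree, label):
    # Find the <strong> holding the row label, climb to its enclosing <td>,
    # then read the text of the next cell in the same row. Passing the label
    # as an XPath variable ($label) avoids quoting problems.
    hits = tree.xpath(
        '//strong[normalize-space()=$label]/ancestor::td[1]'
        '/following-sibling::td[1]//text()',
        label=label)
    return ''.join(hits).strip()

tree = html.fromstring(SAMPLE)
print(cell_for(tree, 'New Orders'))
print(cell_for(tree, 'Production'))
```

Because the lookup is anchored on the row's text rather than its position, a new row being inserted above "New Orders" does not break it.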
Or, if you want the whole HTML table:
data = tree.xpath('//th[text()="MANUFACTURING AT A GLANCE"]/../..')
for elements in data:
    print(etree.tostring(elements, pretty_print=True))
Another example, using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
content = requests.get(url).content
soup = BeautifulSoup(content, "lxml")
table = soup.find_all('table')[1]
table_body = table.find('tbody')
data = []
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])
print(data)
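To make the scraped rows robust against reordering, you can key them by their first cell instead of keeping a positional list. A hedged sketch on invented stand-in markup (the `SAMPLE` table and `by_label` dict are assumptions, not the real page):

```python
from bs4 import BeautifulSoup

# Invented stand-in rows mimicking the ISM table shape.
SAMPLE = """
<table><tbody>
  <tr><td>New Orders</td><td>60.3</td><td>65.1</td></tr>
  <tr><td>Production</td><td>62.4</td><td>61.4</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
by_label = {}
for row in soup.find("tbody").find_all("tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if cells:
        # First cell is the row label; the rest are its values.
        by_label[cells[0]] = cells[1:]

print(by_label["New Orders"])
```

Looking rows up by label means next month's release can add or move rows without breaking the code, which is exactly the problem with hard-coded XPaths like `tr[3]/td[2]`.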