I am fairly new to data scraping, so I apologize in advance if I am missing something basic.
My goal is to extract data from this database: https://ilthermo.boulder.nist.gov/
From what I've read, Beautiful Soup should be handy for this kind of task. Unfortunately, the database's URL doesn't change as one navigates, and before I get to my pages of interest I need to enter a series of inputs.
Namely, I need to:
(1) Click on "Search ILThermo"
(2) Enter a component name
(3) Select a number of components
(4) Select a property
Up to this point, I have used Selenium and everything seems to work as it should.
Here is an example:
`from selenium import webdriver
import time
chromedriver = "/users/Me/downloads/chromedriver"
driver = webdriver.Chrome(chromedriver)
driver.get('https://ilthermo.boulder.nist.gov/')
SearchILThermo = '//*[@id="sbutton_label"]'
IonicLiquid = '//*[@id="cmp"]'
NumberComp = '//*[@id="ncmp"]'
Property = '//*[@id="prp"]'
Submit = '//*[@id="sDialog"]/div[2]/div[2]/span[1]'
driver.find_element_by_xpath(SearchILThermo).click()
time.sleep(1)
driver.find_element_by_xpath(IonicLiquid).send_keys("1-Butyl-3-methylimidazolium tetrafluoroborate")
driver.find_element_by_xpath(NumberComp).send_keys('1 - pure compound')
driver.find_element_by_xpath(Property).send_keys('Viscosity')
time.sleep(1)
driver.find_element_by_xpath(Submit).click()`
These lines get me on the following page:
At this point, clicking on a row in the left panel makes a table available in the right panel. These right-panel tables are what I want to extract (probably using Beautiful Soup), starting with the first row in the left panel and cycling over all rows of all pages for the set of inputs given above.
So, I need to:
(5) Click on a row in the left panel (sequentially)
(6) Extract the table in the right panel using Beautiful Soup (haven't tried this yet)
(7) Cycle over all rows and all pages in the left panel
Clicking on a row in the left panel is where I am currently stuck.
From the developer tools, it seems that the XPath for each row in the left panel is something like:
1st row:
//*[@id="dsgrid-row-MOByt"]/table/tr
2nd row:
//*[@id="dsgrid-row-hkDds"]/table/tr
If I do:
Row = '//*[@id="dsgrid-row-MOByt"]/table/tr'
driver.find_element_by_xpath(Row).click()
It clicks on the first row as it should. But since I need to cycle over all rows, I would need to access the rows in a loop. Ideally something like:
Row = '//*[@id="dsgrid-row-i"]/table/tr'
But of course this doesn't work.
Are there other ways of getting Selenium to click on a specific row (given its number) in the left panel?
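
One possible way around the random id suffixes, sketched below, is to match every row by the shared `dsgrid-row-` prefix and then pick rows by position. The helper name `row_xpath` is made up for illustration, the XPath assumes every row id starts with `dsgrid-row-` as in the two examples above, and the Selenium calls in the trailing comment use the `find_elements(By.XPATH, ...)` form rather than `find_element_by_xpath`.

```python
ROW_PREFIX = "dsgrid-row-"  # assumed common prefix of the row ids


def row_xpath(index=None, prefix=ROW_PREFIX):
    """Build an XPath for all rows, or for the index-th row (1-based)."""
    all_rows = f'//*[starts-with(@id, "{prefix}")]/table/tr'
    if index is None:
        return all_rows
    if index < 1:
        raise ValueError("XPath positions are 1-based")
    # Parentheses make [index] apply to the whole result set,
    # not to each parent element separately.
    return f"({all_rows})[{index}]"


# Usage with the driver from the question (not executed here):
#   from selenium.webdriver.common.by import By
#   rows = driver.find_elements(By.XPATH, row_xpath())
#   for i in range(1, len(rows) + 1):
#       driver.find_element(By.XPATH, row_xpath(i)).click()
#       # ... read the right-hand table here ...
```

Looking each row up again by position inside the loop may help if the grid re-renders after a click. If the grid only renders the rows currently in view, only those rows would be matched.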
I'm not sure you really need Selenium here. `requests` could also be an option if you call the site's API directly.
This URL returns the search results for your keyword:
f'https://ilthermo.boulder.nist.gov/ILT2/ilsearch?cmp=&ncmp=0&year=&auth=&keyw={keyword}&prp=0'
This one returns the corresponding data for each set:
f"https://ilthermo.boulder.nist.gov/ILT2/ilset?set={d['setid']}"
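
The same query string can be assembled from a dictionary, which makes it easier to vary the inputs. This is only a sketch: `search_url` and `SEARCH_ENDPOINT` are names made up here, the parameter names are copied from the search URL above, and which values `cmp`, `ncmp` or `prp` accept is not confirmed by anything in this post.

```python
from urllib.parse import urlencode

# Endpoint taken from the search URL above.
SEARCH_ENDPOINT = "https://ilthermo.boulder.nist.gov/ILT2/ilsearch"


def search_url(keyw="", cmp="", ncmp=0, year="", auth="", prp=0):
    """Return the search URL; defaults mirror the values used above."""
    params = {
        "cmp": cmp,
        "ncmp": ncmp,
        "year": year,
        "auth": auth,
        "keyw": keyw,
        "prp": prp,
    }
    # urlencode also escapes spaces and other special characters.
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


print(search_url(keyw="Viscosity"))
```

With `keyw="Viscosity"` this reproduces the search URL shown above. Passing the same dictionary as `params=` to `requests.get` should be a similar route.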
The example below is limited to the first five results for demonstration; adjust the slice to fit your needs.
import requests
import pandas as pd

keyword = 'Viscosity'
url = f'https://ilthermo.boulder.nist.gov/ILT2/ilsearch?cmp=&ncmp=0&year=&auth=&keyw={keyword}&prp=0'
ref_data = requests.get(url).json()

data = []

# Only the first five search results are processed here.
for e in ref_data['res'][:5]:
    # Pair the search header with this result's values.
    d = dict(zip(ref_data['header'], e))
    set_data = requests.get(f"https://ilthermo.boulder.nist.gov/ILT2/ilset?set={d['setid']}").json()
    # Flatten the nested column header, moving 'Liquid' to the end.
    header = [item for items in set_data['dhead'] for item in items if item and item != 'Liquid']
    header.append('Liquid')

    # Flatten each data row and add its values to the record.
    for x in [[item for items in sublist for item in items] for sublist in set_data['data']]:
        # d is updated in place, so the same dict object is appended each time.
        d.update(
            dict(
                zip(header, x)
            )
        )
        data.append(d)

pd.DataFrame(data)
| | setid | ref | prp | phases | cmp1 | cmp2 | cmp3 | np | nm1 | nm2 | Temperature, K | Mole fraction of 1-butyl-1-methylpyrrolidinium dicyanamide | Pressure, kPa | Frequency, MHz | Electrical conductivity, S/m | Liquid | Mole fraction of 1-butyl-3-methylimidazolium bis(trifluoromethylsulfonyl)imide | Viscosity, Pa•s | Mole fraction of 1-butyl-3-methylimidazolium methanesulfonate | Molar volume, m3/mol |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
| 1 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
| 2 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
| 3 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
| 4 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
| 4251 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
| 4252 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
| 4253 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
| 4254 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
| 4255 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
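
Rows from the same set look identical in the output above, which is consistent with one dictionary being updated in place and the same object being appended on every pass. A minimal sketch of a copy-per-row variant follows; `sample_meta`, `sample_header` and `sample_rows` are invented stand-ins for `d`, `header` and the flattened `set_data['data']`, not real ILThermo values.

```python
# Invented stand-ins for d, header and the flattened data rows.
sample_meta = {"setid": "demo", "prp": "Viscosity"}
sample_header = ["Temperature, K", "Pressure, kPa"]
sample_rows = [["293.15", "100"], ["303.15", "100"]]

# Pattern from the loop above: every list entry is the same dict object.
shared = []
d = dict(sample_meta)
for x in sample_rows:
    d.update(dict(zip(sample_header, x)))
    shared.append(d)

# Copy-per-row variant: build a fresh dict for each data point.
copied = []
for x in sample_rows:
    copied.append({**sample_meta, **dict(zip(sample_header, x))})

print([row["Temperature, K"] for row in shared])  # ['303.15', '303.15']
print([row["Temperature, K"] for row in copied])  # ['293.15', '303.15']
```

In the original loop, the equivalent change would be to append a new dictionary built from `d` and `zip(header, x)` instead of appending `d` itself.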