python selenium web-scraping data-mining

How to use python/selenium to click on a row of a table to scrape set data?


I am fairly new to data scraping, so I apologize in advance if I am missing something basic.

My goal is to extract data from this database: https://ilthermo.boulder.nist.gov/


From what I've read, Beautiful Soup should be handy for this kind of task. Unfortunately, the database's URL doesn't change as one navigates, and before I get to my pages of interest I need to input a series of commands.

Namely, I need to:

(1) Click on "Search ILThermo"
(2) Enter a component name
(3) Select a number of components
(4) Select a property

Up to this point, I have used Selenium and everything seems to work as it should.

Here is an example:

`from selenium import webdriver
import time

chromedriver = "/users/Me/downloads/chromedriver"
driver = webdriver.Chrome(chromedriver)
driver.get('https://ilthermo.boulder.nist.gov/')

SearchILThermo = '//*[@id="sbutton_label"]'

IonicLiquid = '//*[@id="cmp"]'
NumberComp = '//*[@id="ncmp"]'
Property = '//*[@id="prp"]'
Submit = '//*[@id="sDialog"]/div[2]/div[2]/span[1]'

driver.find_element_by_xpath(SearchILThermo).click()
time.sleep(1)

driver.find_element_by_xpath(IonicLiquid).send_keys("1-Butyl-3-methylimidazolium tetrafluoroborate")
driver.find_element_by_xpath(NumberComp).send_keys('1 - pure compound')
driver.find_element_by_xpath(Property).send_keys('Viscosity')
time.sleep(1)

driver.find_element_by_xpath(Submit).click()`

These lines get me on the following page:

[screenshot of the results page]

At this point, clicking on a row in the left panel makes a table available in the right panel. These tables in the right panel are what I want to extract (probably using Beautiful Soup), starting with the first row in the left panel and cycling over all rows of all pages for the set of inputs given above.

So, I need to:

(5) Click on a row in the left panel (sequentially)
(6) Extract the table in the right panel using Beautiful Soup (haven't tried this yet)
(7) Cycle over all rows and all pages in the left panel

Clicking on a row in the left panel is where I am currently stuck.

From the developer tools, it seems that the XPath for each row in the left panel is something like:

1st row: //*[@id="dsgrid-row-MOByt"]/table/tr

2nd row: //*[@id="dsgrid-row-hkDds"]/table/tr

If I do:

Row = '//*[@id="dsgrid-row-MOByt"]/table/tr'
driver.find_element_by_xpath(Row).click()

It clicks on the first row as it should. But since I need to cycle over all rows, I would need to access the rows in a loop. Ideally something like:

Row = '//*[@id="dsgrid-row-i"]/table/tr'

But of course this doesn't work.

Are there other ways of getting Selenium to click on a specific row (given its number) in the left panel?


Solution

  • Selenium may not be necessary here: the site exposes a JSON API that you can query directly with requests.

    This URL returns the search results for your keyword:

    f'https://ilthermo.boulder.nist.gov/ILT2/ilsearch?cmp=&ncmp=0&year=&auth=&keyw={keyword}&prp=0'
    

    This URL returns the corresponding data for each set:

    f"https://ilthermo.boulder.nist.gov/ILT2/ilset?set={d['setid']}"
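
The query strings above can also be assembled with `urllib.parse.urlencode`, which escapes spaces in compound names. This is a minimal sketch, assuming only the parameter names visible in the two URLs above. `build_search_url` and `build_set_url` are hypothetical helper names, and whether the server matches a full compound name passed as `cmp` is untested here.

```python
from urllib.parse import urlencode

BASE = "https://ilthermo.boulder.nist.gov/ILT2"


def build_search_url(cmp="", ncmp=0, year="", auth="", keyw="", prp=0):
    """Build an ilsearch URL; parameter names are copied from the URL above."""
    params = {"cmp": cmp, "ncmp": ncmp, "year": year,
              "auth": auth, "keyw": keyw, "prp": prp}
    return f"{BASE}/ilsearch?{urlencode(params)}"


def build_set_url(setid):
    """Build the ilset URL for a single data set id."""
    query = urlencode({"set": setid})
    return f"{BASE}/ilset?{query}"


print(build_search_url(keyw="Viscosity"))
print(build_search_url(cmp="1-Butyl-3-methylimidazolium tetrafluoroborate"))
print(build_set_url("KyTBg"))
```

With requests, the same effect is available by passing the dictionary as `params=` to `requests.get`, which performs the encoding itself.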
    

    Example

    Limited to the first five results for the demo; adjust the slice to fit your needs.

    import requests
    import pandas as pd
    
    keyword = 'Viscosity'
    url = f'https://ilthermo.boulder.nist.gov/ILT2/ilsearch?cmp=&ncmp=0&year=&auth=&keyw={keyword}&prp=0'
    
    ref_data = requests.get(url).json()
    
    data = []
    
    for e in ref_data['res'][:5]:
    
        d = dict(zip(ref_data['header'],e))
        set_data = requests.get(f"https://ilthermo.boulder.nist.gov/ILT2/ilset?set={d['setid']}").json()
        header = [item for items in set_data['dhead'] for item in items if item and item != 'Liquid']
        header.append('Liquid')
    
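        for x in [[item for items in sublist for item in items] for sublist in set_data['data']]:
            # Build a new dict per row: updating and appending the shared `d`
            # would make every row of a set reference the same object.
            data.append({**d, **dict(zip(header, x))})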
    
    pd.DataFrame(data)
    

    Output

    | | setid | ref | prp | phases | cmp1 | cmp2 | cmp3 | np | nm1 | nm2 | Temperature, K | Mole fraction of 1-butyl-1-methylpyrrolidinium dicyanamide | Pressure, kPa | Frequency, MHz | Electrical conductivity, S/m | Liquid | Mole fraction of 1-butyl-3-methylimidazolium bis(trifluoromethylsulfonyl)imide | Viscosity, Pa•s | Mole fraction of 1-butyl-3-methylimidazolium methanesulfonate | Molar volume, m3/mol |
    | 0 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
    | 1 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
    | 2 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
    | 3 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
    | 4 | KyTBg | Zec et al. (2015) | Electrical conductivity | Liquid | AAQcoP | AAoDJf | | 1540 | .gamma.-butyrolactone | 1-butyl-1-methylpyrrolidinium dicyanamide | 323.15 | 1 | 100 | 0.01 | 2.529 | 0.025 | nan | nan | nan | nan |
    | ... |
    | 4251 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
    | 4252 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
    | 4253 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
    | 4254 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
    | 4255 | oyqwN | Safarov et al. (2018c) | Viscosity | Liquid | ABNWKs | | | 394 | 1-octyl-3-methylimidazolium hexafluorophosphate | nan | 413.82 | nan | 101.325 | nan | nan | 0.0004 | nan | 0.0097 | nan | nan |
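
If Selenium is kept for the row-clicking step from the question, the random suffix in each row id can be sidestepped by matching on the shared prefix and selecting rows by position. This is only a sketch under some assumptions. It assumes every row id starts with `dsgrid-row-`, as the two example ids in the question do. It covers only rows currently present in the DOM, so paging is not handled. `row_xpath` and `iter_row_pages` are hypothetical helper names. The string `"xpath"` is the value of Selenium's `By.XPATH`, written out so the block has no import beyond the standard library.

```python
import time

ROW_PREFIX = "dsgrid-row-"  # id prefix seen in the question's example rows
BY_XPATH = "xpath"          # same string value as selenium's By.XPATH


def row_xpath(n):
    """XPath of the n-th (1-based) row whose id starts with ROW_PREFIX."""
    return f'(//*[starts-with(@id, "{ROW_PREFIX}")])[{n}]/table/tr'


def iter_row_pages(driver, pause=1.0):
    """Click each currently rendered row in turn and yield the page source."""
    all_rows = f'//*[starts-with(@id, "{ROW_PREFIX}")]'
    count = len(driver.find_elements(BY_XPATH, all_rows))
    for n in range(1, count + 1):
        # Re-locate the row by position on every pass, so a re-rendered
        # grid does not leave us holding a stale element reference.
        driver.find_element(BY_XPATH, row_xpath(n)).click()
        time.sleep(pause)  # crude wait for the right panel to update
        yield driver.page_source
```

Each yielded page source could then be handed to Beautiful Soup to pull out the right-panel table, for example `for html in iter_row_pages(driver): ...`.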