dataframe, selenium, css-selectors, webdriver, webdriverwait

Web Scraping: Pandas DataFrame.read_html(url_address) returns an empty DataFrame?


I want to scrape the information from the table on this page, which is spread across many pages of results.


I wrote the following code:

import pandas as pd

url = 'https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value='
pep_table = pd.read_html(url)

But the output was this:

pep_table
[Empty DataFrame
Columns: [ID, Name, N terminus, Sequence, C terminus, View]
Index: []]

I also tried to get it with Selenium WebDriver:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chromedriver = '/usr/local/bin/chromedriver'
driver = webdriver.Chrome(chromedriver)
driver.get(url)
table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0_info")))
tableRows = table.get_attribute("outerHTML")
df = pd.read_html(tableRows)[0]

But it raises a Selenium WebDriver timeout error:

File "/home/es/anaconda3/envs/pyg-env/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 80, in until
    raise TimeoutException(message, screen, stacktrace)
selenium.common.exceptions.TimeoutException: Message: 
  1. Am I using the wrong selector?
  2. This page shows search results. Do I need to add more selectors?
  3. How can I solve this issue?

Solution

  • Your table locator was wrong; I have modified it. table#DataTables_Table_0_info points at the DataTables info element, not the table itself, so the wait timed out. The rows are also filled in by JavaScript after the page loads, which is why pd.read_html on the plain URL returned only the column headers. The easiest way to collect every page is to change the offset in the URL instead of clicking through the pagination buttons.

    You can use this URL template, where you change the offset value in steps of the page size (limit=30): 0, 30, 60, and so on.

    url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
    

    You need to create an empty DataFrame and concatenate each page's table onto it.

    Use time.sleep() so each page has time to load; otherwise the loop moves on too quickly and you will not capture all pages. (A more robust explicit-wait variant is sketched after the output below.)

    Code:

    url="https://dbaasp.org/search?id.value=&name.value=&sequence.value=&sequence.option=full&sequenceLength.value=&complexity.value=&synthesisType.value=Nonribosomal&uniprot.value=&nTerminus.value=&cTerminus.value=&unusualAminoAcid.value=&intraChainBond.value=&interChainBond.value=&coordinationBond.value=&threeDStructure.value=&kingdom.value=&source.value=&hemolyticAndCytotoxicActivitie.value=on&synergy.value=&articleAuthor.value=&articleJournal.value=&articleYear.value=&articleVolume.value=&articlePages.value=&articleTitle.value=&limit=30&offset={}"
    counter=0
    df=pd.DataFrame()
    while counter <150:
       driver.get(url.format(counter))
       time.sleep(2)
       table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#DataTables_Table_0")))  
       tableRows = table.get_attribute("outerHTML")
       df1 = pd.read_html(tableRows)[0]
       df = pd.concat([df,df1], ignore_index=True)
       counter=counter+30
    print(df)
    

    Output:

            ID              Name N terminus      Sequence C terminus  View
    0     1688  Gramicidin S, GS        NaN    VXLfPVXLfP        NaN  View
    1     3314      Gratisin, GR        NaN  VXLfPyVXLfPy        NaN  View
    2     3316  Tyrocidine A, TA        NaN    fPFfNQYVXL        NaN  View
    3     4876   Trichogin GA IV         C8   XGLXGGLXGIX        NaN  View
    4     5374         Baceridin        NaN        WaXVlL        NaN  View
    ..     ...               ...        ...           ...        ...   ...
    137  19210  Burkholdine-1215        NaN      xxGNSXXs        NaN  View
    138  19212  Burkholdine-1213        NaN      xnGNSNXs        NaN  View
    139  19548      Hirsutatin A        NaN        XTSXXF        NaN  View
    140  19549      Hirsutatin B        NaN        XTSXXX        NaN  View
    141  19554      Hirsutellide        NaN        XxIXxI        NaN  View
    
    [142 rows x 6 columns]
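
    If the fixed time.sleep(2) ever proves flaky, a slightly more robust variant is to wait explicitly for at least one data row before grabbing the table HTML. This is only a sketch and assumes DataTables renders the rows into the table's tbody:

    # Sketch: replace the fixed sleep with an explicit wait for rendered rows.
    WebDriverWait(driver, 10).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "table#DataTables_Table_0 tbody tr")))
    table = driver.find_element(By.CSS_SELECTOR, "table#DataTables_Table_0")
    tableRows = table.get_attribute("outerHTML")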
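
    Once the loop finishes, you may also want to save the combined table and close the browser. The CSV file name here is only an example:

    df.to_csv("dbaasp_nonribosomal_peptides.csv", index=False)  # example file name
    driver.quit()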