Python web scraping for JavaScript-generated content


I am trying to use Python 3 to return the BibTeX citation generated by http://www.doi2bib.org/. The URLs are predictable, so the script can work out the URL without having to interact with the web page. I have tried using selenium, bs4, etc., but can't get the text inside the box.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
# Name the parser explicitly so bs4 does not warn about guessing one
text = BeautifulSoup(urllib.request.urlopen(url).read(), "html.parser")
print(text)

Can anyone suggest a way of returning the BibTeX citation as a string (or whatever) in Python?


Solution

  • You don't need BeautifulSoup here. The page sends an additional XHR request to the server to fetch the BibTeX citation. You can simulate that request, for example with requests:

    import requests
    
    bibtex_id = '10.1007/s00425-007-0544-9'
    
    url = "http://www.doi2bib.org/#/doi/{id}".format(id=bibtex_id)
    xhr_url = 'http://www.doi2bib.org/doi2bib'
    
    with requests.Session() as session:
        session.get(url)
    
        response = session.get(xhr_url, params={'id': bibtex_id})
        print(response.text)  # .text is a decoded str; .content would print a bytes literal in Python 3
    

    Prints:

    @article{Burgert_2007,
        doi = {10.1007/s00425-007-0544-9},
        url = {http://dx.doi.org/10.1007/s00425-007-0544-9},
        year = 2007,
        month = {jun},
        publisher = {Springer Science $\mathplus$ Business Media},
        volume = {226},
        number = {4},
        pages = {981--987},
        author = {Ingo Burgert and Michaela Eder and Notburga Gierlinger and Peter Fratzl},
        title = {Tensile and compressive stresses in tracheids are induced by swelling based on geometrical constraints of the wood cell},
        journal = {Planta}
    }
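
    Since the question asks for the citation as a string, the XHR request above can be wrapped in a small function. This is only a sketch: it uses the standard library's urllib instead of requests, and it assumes that the endpoint answers without the preliminary page visit and that the response is UTF-8, neither of which is confirmed here. The names build_xhr_url and fetch_bibtex are made up for illustration.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the requests example above
XHR_URL = "http://www.doi2bib.org/doi2bib"


def build_xhr_url(doi):
    """Return the XHR URL with the DOI passed as the 'id' query parameter."""
    return XHR_URL + "?" + urllib.parse.urlencode({"id": doi})


def fetch_bibtex(doi, timeout=10):
    """Fetch the citation for a DOI and return it as a str.

    Assumption: the server responds without a prior session and sends UTF-8.
    """
    with urllib.request.urlopen(build_xhr_url(doi), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (needs network access):
# bibtex = fetch_bibtex("10.1007/s00425-007-0544-9")
```

    If the server turns out to need the session from the first page load, the requests.Session version above is the safer choice.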
    

    You can also solve it with selenium. The key trick here is to use an Explicit Wait to wait for the citation to become visible:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Firefox()
    driver.get('http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9')
    
    element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//pre[@ng-show="bib"]')))
    print(element.text)
    
    driver.quit()  # end the whole browser session, not just the current window
    

    Prints the same as the above solution.
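
    As for why the urllib attempt in the question never sees the citation: everything after the # in a URL is a fragment, which the client keeps to itself and does not send to the server. The server therefore receives a request for / regardless of the DOI, and the citation is filled in later by JavaScript in the browser (the ng-show attribute in the XPath above suggests an AngularJS page, though that is an inference). A quick way to see the split:

```python
from urllib.parse import urlsplit

url = "http://www.doi2bib.org/#/doi/10.1007/s00425-007-0544-9"
parts = urlsplit(url)

# Only the path (plus any query string) is part of the HTTP request
print(parts.path)      # /
# The fragment stays on the client and is handled by the page's JavaScript
print(parts.fragment)  # /doi/10.1007/s00425-007-0544-9
```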