
How can I get the content of a URL and write it to a new file using HTMLSession in Python?


When scraping with requests and BeautifulSoup, we use response.content to get the body of the URL and write it to a new file. What should we use instead if we fetch the page with HTMLSession from requests_html?

For example,

import requests
from urllib.parse import urlparse
from requests_html import HTMLSession

session = HTMLSession()

# Specify the article URL and the output file name here
pdf_title = "xyz.pdf"
URL = "https://academic.oup.com/qje/article/126/4/1593/17089543/qjr041.pdf"
r = session.get(URL, allow_redirects=True)
with open(pdf_title, "wb") as new_pdf:
    print(f"Begin writing to {pdf_title}")
    new_pdf.write(r.html.content) # This line is not working

Solution

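  • You don't need HTMLSession for this. Plain requests is enough, because r.content holds the raw bytes of the response body. As far as I know, the response returned by HTMLSession.get() is a subclass of requests.Response, so r.content should work there too, whereas r.html is the parsed-HTML wrapper and has no content attribute. When I run the code below myself, the server answers "request forbidden by administrative rules", so presumably you have the credentials or access needed to get past that.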

    import requests

    pdf_title = "xyz.pdf"
    URL = "https://academic.oup.com/qje/article/126/4/1593/17089543/qjr041.pdf"
    r = requests.get(URL, allow_redirects=True)
    # r.content is the raw response body as bytes, so open the file in binary mode
    with open(pdf_title, "wb") as new_pdf:
        new_pdf.write(r.content)
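If you would rather keep using a session (HTMLSession or a plain requests.Session), here is a minimal sketch of the same idea wrapped in a helper. The function name download_to_file and its chunk_size parameter are my own and hypothetical, not part of either library. It assumes the session's get() returns a requests-style response with raise_for_status() and iter_content(), which I believe holds for both session types. It also refuses to write anything when the server returns an error such as the 403 mentioned above, so you don't end up with an error page saved as a PDF.

```python
from pathlib import Path


def download_to_file(session, url, path, chunk_size=8192):
    """Fetch url with a requests-style session and write the raw bytes to path.

    Hypothetical helper: `session` is assumed to be a requests.Session or a
    requests_html.HTMLSession, i.e. anything whose get() returns a response
    exposing raise_for_status() and iter_content().
    Returns the number of bytes written.
    """
    # stream=True defers downloading the body until iter_content() is consumed
    response = session.get(url, allow_redirects=True, stream=True)
    # Raise on 4xx/5xx (e.g. 403) before the output file is even created
    response.raise_for_status()
    target = Path(path)
    with target.open("wb") as out:
        for chunk in response.iter_content(chunk_size=chunk_size):
            out.write(chunk)
    return target.stat().st_size


# Usage (needs network access and the requests_html package):
#     from requests_html import HTMLSession
#     size = download_to_file(HTMLSession(), URL, "xyz.pdf")
```

Streaming in chunks is optional for a single article PDF; calling session.get(URL).content and writing it in one go, as in the answer above, works just as well for small files.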