python excel web-scraping python-requests python-requests-html

How to get the filename of an .xls file on a website with Python (requests-html)


I'm trying to scrape Excel files from the website of the Finnish drug price agency.

I'm using requests-html to find the links to the Excel files:

from requests_html import HTMLSession
import urllib.request
url = 'http://www.hila.fi/fi/hakeminen_ja_ilmoitukset/viitehintajarjestelma/ryhmat_ja_hinnat/viitehintapaatokset2009'
session = HTMLSession()
r = session.get(url)
sel = 'a[href*=".xls"]'
reference_datas = r.html.find(sel)

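for reference_data in reference_datas:
    # Each matched <a> element has exactly one link; resolve it to an absolute URL.
    file_url = reference_data.absolute_links.pop()
    response = urllib.request.urlopen(file_url)
    # Placeholder name: every download overwrites the same file.
    with open('test.xls', 'wb') as f:
        f.write(response.read())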

This works fine for the content of the Excel files, but the selected elements carry no information about the file names. The file names indicate the period in which the prices in the files apply. For example, the link http://www.hila.fi/c/document_library/get_file?folderId=792534&name=DLFE-4531.xls downloads the file Viitehintaluettelo Q4_2009_paivitetty.xls.

How can I get this filename as a string so that I can extract the time information Q4_2009 from it?


Solution

  • The server sends the filename in the Content-Disposition response header, which you can read from the response's headers.

    from requests_html import HTMLSession
    session = HTMLSession()
    r = session.get('http://www.hila.fi/c/document_library/get_file?folderId=792534&name=DLFE-4531.xls')
    content_disposition = r.headers.get('Content-Disposition')
    print(content_disposition)
    # attachment; filename="Viitehintaluettelo Q4_2009_paivitetty.xls"
    

    Then parse the filename out of content_disposition. The Content-Disposition header for HTTP is specified in RFC 6266.
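
One way to do the parsing, sketched under a few assumptions: the header has the plain quoted `filename="..."` form shown above, and the period always looks like `Q<quarter>_<year>`. The helper names `filename_from_content_disposition` and `extract_period` are hypothetical, and the standard library's `email.message.Message` is used here only as a convenient header parser; it is not something requests-html provides.

```python
import re
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    # Message.get_filename() reads the "filename" parameter of the
    # Content-Disposition header and removes the surrounding quotes.
    msg = Message()
    msg["Content-Disposition"] = header_value
    return msg.get_filename()


def extract_period(filename):
    """Return a period such as 'Q4_2009' from a filename, or None.

    The Q<quarter>_<year> pattern is an assumption based on the one
    example filename in the question.
    """
    if not filename:
        return None
    match = re.search(r"Q[1-4]_\d{4}", filename)
    return match.group(0) if match else None


# Header value copied from the answer above (no network access needed here).
header = 'attachment; filename="Viitehintaluettelo Q4_2009_paivitetty.xls"'
filename = filename_from_content_disposition(header)
period = extract_period(filename)
print(filename)  # Viitehintaluettelo Q4_2009_paivitetty.xls
print(period)    # Q4_2009
```

In the question's loop, the header value would come from `r.headers.get('Content-Disposition')` on the response returned by `session.get(url)`, and `r.content` holds the file bytes, so each file could be saved under its own `filename` instead of `test.xls`. Both helpers return `None` when the header or the pattern is missing, which lets the caller fall back to a default name.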