I'm trying to scrape Excel files from the Finnish drug price agency.
I'm using requests-html to find the links to the Excel files:
from requests_html import HTMLSession
import urllib.request
url = 'http://www.hila.fi/fi/hakeminen_ja_ilmoitukset/viitehintajarjestelma/ryhmat_ja_hinnat/viitehintapaatokset2009'
session = HTMLSession()
r = session.get(url)
sel = 'a[href*=".xls"]'
reference_datas = r.html.find(sel)
for reference_data in reference_datas:
    url = reference_data.absolute_links.pop()
    response = urllib.request.urlopen(url)
    with open('test.xls', 'wb') as f:
        f.write(response.read())
This works fine for the content of the Excel files, but the selected elements carry no information about the names of the files. The filenames encode the period in which the prices in the files apply. For example, the link http://www.hila.fi/c/document_library/get_file?folderId=792534&name=DLFE-4531.xls returns the file Viitehintaluettelo Q4_2009_paivitetty.xls.
How can I get this filename as a string so that I can extract the time information Q4_2009 from it?
You can get it from the response headers. The server sends the filename in the Content-Disposition header:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://www.hila.fi/c/document_library/get_file?folderId=792534&name=DLFE-4531.xls')
content_disposition = r.headers.get('Content-Disposition')
print(content_disposition)
# 'attachment; filename="Viitehintaluettelo Q4_2009_paivitetty.xls"'
Then just parse the filename out of content_disposition. The header's full syntax is defined in the Content-Disposition spec (RFC 6266).
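One way to do that parsing, as a minimal sketch: the standard library's email.message.Message can read the filename parameter from a Content-Disposition value, and a regular expression can then pick out the period. The helper names filename_from_content_disposition and extract_period are hypothetical, and the pattern assumes every filename contains a quarter marker of the form Q<1-4>_<year>, which is only confirmed for the one example above.

```python
import re
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    if not header_value:
        return None
    msg = Message()
    msg['Content-Disposition'] = header_value
    # get_filename() reads the 'filename' parameter and strips the quotes.
    return msg.get_filename()


def extract_period(filename):
    """Return the period, e.g. 'Q4_2009', or None if no such marker is found.

    Assumption: the period always looks like Q<1-4>_<four-digit year>.
    """
    if not filename:
        return None
    match = re.search(r'Q[1-4]_\d{4}', filename)
    return match.group(0) if match else None


# Header value taken from the example response above.
header = 'attachment; filename="Viitehintaluettelo Q4_2009_paivitetty.xls"'
name = filename_from_content_disposition(header)
print(name)                  # Viitehintaluettelo Q4_2009_paivitetty.xls
print(extract_period(name))  # Q4_2009
```

In the loop from the question, you could fetch each link with session.get(url), pass r.headers.get('Content-Disposition') to the first helper, and write r.content to a file under the returned name instead of test.xls. Both helpers return None when the header or the marker is missing, so check for that before using the result.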