
Downloading an Excel file from a URL using Python


The URL for the Excel file is: https://www.gso.gov.vn/wp-content/uploads/2024/03/IIP-ENG.xlsx

I have this code:

from datetime import datetime, timedelta

url = 'https://www.gso.gov.vn/wp-content/uploads/' + datetime.strftime(datetime.now() - timedelta(30), '%y') +'/' + datetime.strftime(datetime.now() - timedelta(30), '%m') + '/IIP-ENG.xlsx'

import requests
resp = requests.get(url, verify=False)
output = open('IIP.xlsx', 'wb')
output.write(resp.content)
output.close()

I can see a file being downloaded but I can't open it in Office Excel. The file is corrupted.

resp

<Response [404]>

I also can't open it using this code:

import pandas as pd
df = pd.read_excel(open('IIP.xlsx', 'rb'),sheet_name=0, engine='openpyxl')
print(df.head(5)) 

This raises a BadZipFile error: the file is not a zip file.

How can I fix this?


Solution

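  • The issue is the year format: '%y' gives the two-digit year (24), but the URL needs the four-digit year, so use '%Y' to get 2024. With '%y' the generated URL does not exist, the server answers 404, and the bytes saved to IIP.xlsx are the error page rather than a workbook, which is why Excel and openpyxl both reject the file.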

    datetime.strftime(datetime.now() - timedelta(30), '%Y')
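
Putting the fix together, a minimal sketch of the whole download might look like the following. The helper names build_url, download and looks_like_xlsx are hypothetical, introduced here only for illustration. The sketch keeps the question's "30 days ago" logic, leaves certificate verification on by default (the question passed verify=False), and assumes the requests package is installed.

```python
from datetime import datetime, timedelta
import zipfile

BASE = "https://www.gso.gov.vn/wp-content/uploads/"


def build_url(today=None):
    """Build the IIP-ENG.xlsx URL from the date 30 days before `today`."""
    ref = (today or datetime.now()) - timedelta(days=30)
    # %Y is the four-digit year (2024); %y would give 24 and lead to a 404.
    return BASE + ref.strftime("%Y/%m") + "/IIP-ENG.xlsx"


def download(url, path="IIP.xlsx", verify=True):
    """Save `url` to `path`, raising instead of writing an error page."""
    import requests  # third-party; imported here so build_url works without it

    resp = requests.get(url, verify=verify, timeout=30)
    resp.raise_for_status()  # raises requests.HTTPError on 404 and other 4xx/5xx
    with open(path, "wb") as output:
        output.write(resp.content)
    return path


def looks_like_xlsx(path):
    """An .xlsx workbook is a zip archive, so a saved HTML error page fails this check."""
    return zipfile.is_zipfile(path)
```

Calling download(build_url()) would then either save the workbook or raise on a bad status code, so a 404 shows up immediately instead of as a BadZipFile error later. looks_like_xlsx('IIP.xlsx') is an optional sanity check before handing the file to pandas.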