python, csv, web-scraping, beautifulsoup, urllib

My Python code downloads HTML instead of a CSV file from a URL


I created some code to download a CSV file from a URL, but it downloads the HTML of the page instead. When I paste the URL that I built into a browser it works, but it does not work in the code.

I tried os, response, and urllib, but all of these options gave the same result.

This is the link that I ultimately want to download as CSV: https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf/1506575576011.ajax?fileType=csv&fileName=IAPD_holdings&dataType=fund

import requests
#this is the url where the csv is
url='https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf?switchLocale=y&siteEntryPassthrough=true'
r = requests.get(url, allow_redirects=True)
response = requests.get(url)
if response.status_code == 200:
    print("Success")
else:
    print("Failure")

#find the url for the CSV
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content,'lxml')
for i in soup.find_all('a',{'class':"icon-xls-export"}):
    print(i.get('href'))

# I get two types of files, one CSV and the other xls. 
link_list=[]
for i in soup.find_all('a', {'class':"icon-xls-export"}):
    link_list.append(i.get('href'))

# I create the link with the CSV
url_csv = "https://www.ishares.com//"+link_list[0]
response_csv = requests.get(url_csv)
if response_csv.status_code == 200:
    print("Success")
else:
    print("Failure")

#Here I want to download the file
import urllib.request
with urllib.request.urlopen(url_csv) as holdings1, open('dataset.csv', 'w') as f:
    f.write(holdings1.read().decode())

I would like to get the CSV data downloaded.


Solution

  • The download needs cookies to work correctly.

    I use requests.Session() to get and keep cookies automatically.

    I write response_csv.content to the file because I already have it after the second request, so I don't have to make another one. If I used urllib.request instead, it would create a request without the session's cookies and it might not work.

    import requests
    from bs4 import BeautifulSoup
    
    s = requests.Session()
    
    url='https://www.ishares.com/uk/individual/en/products/251567/ishares-asia-pacific-dividend-ucits-etf?switchLocale=y&siteEntryPassthrough=true'
    
    response = s.get(url, allow_redirects=True)
    
    if response.status_code == 200:
        print("Success")
    else:
        print("Failure")
    
    #find the url for the CSV
    soup = BeautifulSoup(response.content,'lxml')
    
    for i in soup.find_all('a',{'class':"icon-xls-export"}):
        print(i.get('href'))
    
    # I get two types of files, one CSV and the other xls. 
    link_list=[]
    for i in soup.find_all('a', {'class':"icon-xls-export"}):
        link_list.append(i.get('href'))
    
    # I create the link with the CSV
    url_csv = "https://www.ishares.com//"+link_list[0]
    
    response_csv = s.get(url_csv)
    
    if response_csv.status_code == 200:
        print("Success")
        with open('dataset.csv', 'wb') as f:
            f.write(response_csv.content)
    else:
        print("Failure")
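
    If you still wanted to use urllib.request for the final download, you would have to pass the session's cookies along yourself, because urlopen() on its own sends none. A minimal sketch, assuming the s and url_csv from the code above, puts them in a Cookie header:

    import urllib.request

    # build a Cookie header from the cookies the requests.Session collected
    cookie_header = "; ".join(f"{name}={value}" for name, value in s.cookies.get_dict().items())

    req = urllib.request.Request(url_csv, headers={"Cookie": cookie_header})
    with urllib.request.urlopen(req) as holdings, open('dataset.csv', 'wb') as f:
        f.write(holdings.read())

    To check the result you can load the saved file, for example with pandas. This is only a usage sketch: the holdings file may start with a few metadata rows before the real header, so skiprows=2 is an assumption to adjust after inspecting dataset.csv.

    import pandas as pd

    # skiprows=2 is an assumption - adjust it after looking at the first lines of dataset.csv
    df = pd.read_csv('dataset.csv', skiprows=2)
    print(df.head())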