Search code examples
pythonpandashttpweb-scrapingurllib

Scrape a table from a website and store as pandas


In Python, I want to scrape the table in a website(it's a Japanese option trading information), and store it as a pandas dataframe.

The website is here, and you need to click "Options Quotes" in order to access the page where I want to scrape the table. The final URL is https://svc.qri.jp/jpx/english/nkopm/ but you cannot directly access this page.

Here is my attempt:

pd.read_html("https://svc.qri.jp/jpx/english/nkopm/")
...HTTPError: HTTP Error 400: Bad Request

So I thought I need to add a user agent. Here is my another attempt:

url = "https://svc.qri.jp/jpx/english/nkopm/"
pd.read_html(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).text)
...ValueError: No tables found

Another attempt

import urllib
url = 'https://svc.qri.jp/jpx/english/nkopm/'
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open(url)
tables = pd.read_html(response.read(), attrs={"class":"price-table"})[0]
...HTTPError: HTTP Error 400: Bad Request

I know how to play with pandas, so it doesn't have to be imported in a neat dataframe at first place. I just need to first import the table in pandas, but I'm not sure why I cannot even read the page. Any help would be appreciated!

By the way, if you click gray arrows in the middle column, enter image description here,

it will add another row like this. enter image description here

And it can be all opened and closed by clicking these buttons. enter image description here

It would be nice if I can import these rows as well, but it not really a must.


Solution

  • Reading the documentation of the pandas function read_html it says

    Read HTML tables into a list of DataFrame objects.

    So the function expects structured input in form of an html table. I actually can't access the website you're linking to but I'm guessing it will give you back an entire website.

    You need to extract the data in a structured format in order for pandas to make sense of it. You need to scrape it. There's a bunch of tools for that, and one popular one is BeautifulSoup.

    Tl;dr: So what you need to do is download the website with requests, pass it into BeautifulSoup and then use BeautifulSoup to extract the data in a structured format.


    Updated answer:

    Seems like the reason why the requests is returning a 400 is because the website is expecting some additional headers - I just dumped the request my browser does into requests and it works!

    import requests
    
    headers = {
        'Connection': 'keep-alive',
        'Cache-Control': 'max-age=0',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Sec-Fetch-Site': 'cross-site',
        'Sec-Fetch-Mode': 'navigate',
        'Sec-Fetch-User': '?1',
        'Sec-Fetch-Dest': 'document',
        'Referer': 'https://www.jpx.co.jp/english/markets/index.html',
        'Accept-Language': 'de-DE,de;q=0.9,en-US;q=0.8,en;q=0.7,it;q=0.6,la;q=0.5',
    }
    
    response = requests.get('https://svc.qri.jp/jpx/english/nkopm/', headers=headers, cookies=cookies)