python pandas ssl python-requests urllib

Pandas raises ssl.CertificateError when using method read_html for HTTPS resources

I have some code that reads the contents of a web page from a URL.

The code used to work well, but now there is a problem with the site's security certificate. In IE, I solved this by importing the certificate into the trusted sites.

But when I run this code:

df = pd.read_html(i,header=0)[0]

I get an error:

Traceback (most recent call last):
  File "D:\Distrib\Load_Data_from_Flat_ver_1.py", line 95, in <module>
    df = pd.read_html(i,header=0)[0]
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 915, in read_html
    keep_default_na=keep_default_na)
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 749, in _parse
    raise_with_traceback(retained)
  File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 385, in raise_with_traceback
    raise exc.with_traceback(traceback)
ssl.CertificateError: hostname '10.89.174.12' doesn't match 'localhost'

Can anyone help me with this problem?

Solution

What is the error

The Python Standard Library documentation for the ssl package contains an example where this specific error occurs:

>>> cert = {'subject': ((('commonName', 'example.com'),),)}
>>> ssl.match_hostname(cert, "example.com")
>>> ssl.match_hostname(cert, "example.org")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/py3k/Lib/ssl.py", line 130, in match_hostname
ssl.CertificateError: hostname 'example.org' doesn't match 'example.com'

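The second call fails because the requested hostname does not match the Common Name in the server certificate. This is exactly what happens in your case: you connect to 10.89.174.12, but the certificate was issued for localhost.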
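
To make the check more concrete, here is a minimal sketch of the comparison, written against the same certificate dictionary format as the documentation example. It is a simplified illustration, not the standard library implementation: the helper names names_in_cert and hostname_matches are hypothetical, and wildcard handling is deliberately left out.

```python
def names_in_cert(cert):
    """Collect the host names a certificate dict claims to cover."""
    # Prefer subjectAltName entries when the certificate has any.
    san = [value
           for kind, value in cert.get("subjectAltName", ())
           if kind in ("DNS", "IP Address")]
    if san:
        return san
    # Otherwise fall back to the commonName entries of the subject.
    return [value
            for rdn in cert.get("subject", ())
            for key, value in rdn
            if key == "commonName"]


def hostname_matches(cert, hostname):
    """Simplified exact-match check (assumption: no wildcard support)."""
    return hostname.lower() in (name.lower() for name in names_in_cert(cert))


# Same shape as the certificate in the question: issued for localhost.
cert = {"subject": ((("commonName", "localhost"),),)}
print(hostname_matches(cert, "localhost"))      # True
print(hostname_matches(cert, "10.89.174.12"))   # False
```

The second call returns False for the same reason the real check raises ssl.CertificateError: the address you requested is not among the names the certificate covers.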

Python path

Referring to the Pandas documentation:

io : str or file-like A URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

According to this note, you cannot rely on read_html to read HTTPS resources directly.

To circumvent this problem, first download the resource over HTTPS using the Python Standard Library, without verifying the SSL context:

from urllib import request
import ssl

url = "https://example.com/data.html"
# Disables certificate and hostname verification: testing only
context = ssl._create_unverified_context()
response = request.urlopen(url, context=context)
html = response.read()

And then process it with Pandas:

import pandas as pd
# read_html returns a list of DataFrames; take the first table
df = pd.read_html(html, header=0)[0]

Create a Valid Context

As pointed out by @AlastairMcCormack:

context = ssl._create_unverified_context() should only be used for localhost or testing.

If accessing the resource without verifying the SSL context solves your problem, then it is time to create a valid context (intro, snippets) in order to safely fetch your resource.
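
As a rough sketch of what a verified context could look like, the standard library's ssl.create_default_context turns on both certificate and hostname verification. The helper names build_verified_context and fetch_html are hypothetical, and the cafile argument assumes you have exported the server or CA certificate to a PEM file, which the question does not confirm.

```python
import ssl
from urllib import request


def build_verified_context(cafile=None):
    """Return a context that verifies the certificate chain and hostname.

    cafile is an optional path to a PEM file with extra trusted
    certificates (assumption: exported from the server or its CA).
    """
    return ssl.create_default_context(cafile=cafile)


def fetch_html(url, context):
    """Download the page body as bytes using the given SSL context."""
    with request.urlopen(url, context=context) as response:
        return response.read()


context = build_verified_context()
# Both checks are enabled by default on this context.
print(context.verify_mode == ssl.CERT_REQUIRED, context.check_hostname)
```

Note that trusting the certificate does not by itself fix a hostname mismatch: with hostname checking enabled, a certificate issued for localhost is still rejected when you connect to 10.89.174.12.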

Server path

You can also create a new certificate whose Common Name matches the server's domain (or its IP address). Here, localhost seems to come from a development certificate that was deployed to a production server, which cannot work properly.

In any case, this does not change the fact that read_html does not handle HTTPS connections.