I have code that reads the contents of a web page from a URL.
It used to work, but now there is a problem with the site's security certificate. In IE, I fixed this by importing the certificate into the trusted sites, and the problem went away there.
But when I run this code:
df = pd.read_html(i,header=0)[0]
I get an error:
Traceback (most recent call last):
File "D:\Distrib\Load_Data_from_Flat_ver_1.py", line 95, in <module>
df = pd.read_html(i,header=0)[0]
File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 915, in read_html
File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 749, in _parse
File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 385, in raise_with_traceback
raise exc.with_traceback(traceback)
ssl.CertificateError: hostname '' doesn't match 'localhost'
Can anyone help me with this problem?
What is the error
The Python Standard Library (PSL) documentation for the ssl package includes an example where this specific error occurs:
>>> cert = {'subject': ((('commonName', 'example.com'),),)}
>>> ssl.match_hostname(cert, "example.com")
>>> ssl.match_hostname(cert, "example.org")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/py3k/Lib/ssl.py", line 130, in match_hostname
ssl.CertificateError: hostname 'example.org' doesn't match 'example.com'
The second check fails because the requested hostname does not match the Common Name in the server certificate. This is exactly what happens in your case.
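Because ssl.match_hostname was removed in Python 3.12, the comparison can also be illustrated with a small stand-alone sketch. The helpers names_in_cert and hostname_matches below are hypothetical names, not part of the ssl module, and the check is deliberately simplified: it ignores wildcards and IP addresses, so it only shows the idea behind the error and is not the real verification logic.

```python
def names_in_cert(cert):
    """Collect the DNS names claimed by a getpeercert()-style dict (simplified)."""
    # subjectAltName entries look like ('DNS', 'example.com')
    san = [value for kind, value in cert.get('subjectAltName', ()) if kind == 'DNS']
    if san:
        return san
    # Fall back to the commonName entries of the subject
    names = []
    for rdn in cert.get('subject', ()):
        for key, value in rdn:
            if key == 'commonName':
                names.append(value)
    return names


def hostname_matches(cert, hostname):
    """Return True if hostname equals one of the certificate names (no wildcards)."""
    return hostname.lower() in (name.lower() for name in names_in_cert(cert))


# A certificate issued for localhost, as in the error message of the question
cert = {'subject': ((('commonName', 'localhost'),),)}
print(hostname_matches(cert, 'localhost'))    # True
print(hostname_matches(cert, 'example.org'))  # False
```

Any hostname other than localhost fails this comparison, which is what the CertificateError in the question reports.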
Python path
Referring to the Pandas documentation:
io : str or file-like A URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.
In other words, read_html cannot be relied on to read from HTTPS.
To circumvent this problem, first download the resource over HTTPS with the standard library, without verifying the certificate:
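from urllib import request
import ssl
# url: the HTTPS address of the page to read
# Skip certificate verification (unsafe outside of testing)
context = ssl._create_unverified_context()
response = request.urlopen(url, context=context)
html = response.read()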
And then process it with Pandas:
import pandas as pd
# read_html returns a list of DataFrames; take the first table, as in the question
df = pd.read_html(html, header=0)[0]
Create a Valid Context
As pointed out by @AlastairMcCormack:
context = ssl._create_unverified_context()
should only be used for localhost or testing.
If accessing the resource without verifying the certificate solves your problem, the next step is to create a valid context so that you can fetch the resource safely.
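A verified context could look like the minimal sketch below. It relies on ssl.create_default_context, which enables certificate and hostname verification by default. The helper names make_verified_context and fetch_html, and the optional cafile argument (a path to a PEM file containing the certificate or CA you trust), are assumptions made for illustration.

```python
import ssl
from urllib import request


def make_verified_context(cafile=None):
    """Build an SSL context that verifies the certificate and the hostname.

    cafile is an optional path to a PEM bundle to trust in addition to
    the system defaults being replaced (hypothetical parameter for this sketch).
    """
    # create_default_context sets verify_mode=CERT_REQUIRED and check_hostname=True
    return ssl.create_default_context(cafile=cafile)


def fetch_html(url, cafile=None):
    """Download url over HTTPS with verification enabled and return the raw bytes."""
    context = make_verified_context(cafile)
    with request.urlopen(url, context=context) as response:
        return response.read()
```

The bytes returned by fetch_html can then be passed to pd.read_html as before. If the server still presents a certificate issued for localhost, this call keeps failing with the same hostname mismatch, which is the intended behaviour of a verified context.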
Server path
You can also create a new certificate whose Common Name matches the server domain (or its IP address). Here, localhost seems to come from a development certificate that was deployed to the production server, which cannot work properly.
Note that this does not change the fact that read_html does not handle HTTPS connections.