JSoup URL connection errors in ColdFusion

I am trying to work with JSoup. My code is below.

Application.cfc:

<cfcomponent>
    <cfset this.name = "jsoupApp11111">
    <cfset this.javaSettings = { loadPaths = [expandPath("./jsoup-1.12.1.jar")], reloadOnChange = true }>
</cfcomponent>

CFM file:


<!--- Create the JSoup entry point and fetch the page --->
<cfset jsoupObj = createObject("java", "org.jsoup.Jsoup")>
<cfset testURL = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)">
<cfset connectUrlSite = jsoupObj.connect(testURL).get()>
<cfset getUrlSiteBody = connectUrlSite.body()>

<cfoutput>
    #connectUrlSite.title()#
</cfoutput>

<!--- Output every img element found in the body --->
<cfloop array="#getUrlSiteBody.select('img')#" index="i">
    <cfoutput>
        #i#
    </cfoutput>
</cfloop>

It works fine for the Wikipedia URL above. When I try the same thing with some other websites, I get an error message like "Received fatal alert: handshake_failure", and other sites throw an error like "PKIX path validation failed" during the jsoupObj.connect(testURL) call. I'm not sure what I have missed or where I can find more detail about these kinds of errors.

I also get an error message after using http instead of https.

Any answer would be appreciated.

Thanks in advance!

Solution

You left out some pertinent information (such as your Java version), but generally speaking, those https errors mean JSoup was unable to establish a secure connection with the target server.

Received fatal alert: handshake_failure

I was able to reproduce the error with Java 1.8.0_72. Enabling debugging, i.e. -Djavax.net.debug=all, confirmed it is caused by an SNI bug: the ClientHello is sent without the server_name extension. Updating the JVM used by CF to version 1.8.0_141 or later resolved the issue. Compare the two debug logs:

Java 1.8.0_144 (fixed)

*** ClientHello, TLSv1.2
...
Extension signature_algorithms, signature_algorithms: ...
Extension server_name, server_name: [type=host_name (0), value=trycf.com]
***

Java 1.8.0_72 (server_name missing)

*** ClientHello, TLSv1.2
...
Extension signature_algorithms, signature_algorithms: ...
***
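
Since the fix depends on which JVM ColdFusion is actually running on, it may help to confirm the version from inside CF before and after the update. A minimal sketch using only the standard java.lang.System class (the variable name javaSystem is just for illustration):

```cfml
<!--- Report which JVM the ColdFusion server is running on --->
<cfset javaSystem = createObject("java", "java.lang.System")>
<cfoutput>
    Java version: #javaSystem.getProperty("java.version")#<br>
    Java home: #javaSystem.getProperty("java.home")#
</cfoutput>
```

If the reported version is older than 1.8.0_141, the missing server_name extension described above is a likely cause of the handshake_failure.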

HTTP error fetching URL. Status=403

HTTP Status code 403 means the request is forbidden. In this case the request is being rejected because the user-agent value is empty. See the documentation on adding a user agent.

Just keep in mind some sites deliberately reject such requests to prevent screen scraping. So check the site's terms and conditions first, to see if programmatic access is prohibited.
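
As a rough sketch of setting a user agent, JSoup's Connection object has a userAgent() method that can be chained before get(). The URL and the user-agent string below are placeholder values chosen for illustration, not something taken from the question:

```cfml
<cfset jsoupObj = createObject("java", "org.jsoup.Jsoup")>
<!--- Placeholder URL; substitute the site that returned the 403 --->
<cfset testURL = "https://www.example.com/">
<!--- Example user-agent value only; use one that honestly identifies your application --->
<cfset connectUrlSite = jsoupObj.connect(testURL).userAgent("Mozilla/5.0 (compatible; MyCFApp/1.0)").get()>
<cfoutput>
    #connectUrlSite.title()#
</cfoutput>
```

Whether this is enough depends on the site; a server that blocks scraping on purpose may still answer with a 403.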

PKIX path validation failed

You would need to supply the URL causing the error for us to be more specific, but generally it indicates a problem with missing or invalid certificates. See also How to Resolve Java HTTPS Exceptions.
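
When a certificate is missing, it usually has to be added to the truststore of the JVM that CF runs on, so the first step is finding that file. The sketch below makes some assumptions: it checks the standard javax.net.ssl.trustStore system property and otherwise falls back to the default cacerts location under java.home. It ignores a jssecacerts file, which the JVM would prefer if one exists in the same directory. The variable names are illustrative only.

```cfml
<cfset javaSystem = createObject("java", "java.lang.System")>
<!--- Set only when the JVM was started with -Djavax.net.ssl.trustStore=...; otherwise null --->
<cfset customStore = javaSystem.getProperty("javax.net.ssl.trustStore")>
<cfif isNull(customStore)>
    <!--- Assumed default location of the JVM's CA certificate store --->
    <cfset trustStorePath = javaSystem.getProperty("java.home") & "/lib/security/cacerts">
<cfelse>
    <cfset trustStorePath = customStore>
</cfif>
<cfoutput>
    Truststore (assumed): #trustStorePath#
</cfoutput>
```

The certificate of the failing site, or of its issuing CA, would then typically be imported into that file with the JDK's keytool utility, followed by a restart of ColdFusion.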