Including variables in URL, returns error page


I'm trying to access a URL in Java using HtmlUnit. For search results, the website first renders the first page of results and then switches to the selected page. I want to access a specific page, say 21. To do that, the URL needs a variable appended to it (e.g. http://www.thomsonlocal.com/Electricians/UK/#||25). Opening that URL in my browser loads the first page initially, and then some method kicks in (JavaScript or jQuery?) and shows page 25.

I have tried encoding the URL to escape the vertical bar characters, but that returns an error page from the site:

page = webClient.getPage("http://www.thomsonlocal.com/Electricians/UK/"+URLEncoder.encode("#||" , "UTF-8")+ 21);

What am I doing wrong here? And is there a way to find out which method the variables in the URL are passed to?


Solution

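  • The part after the # is a URI fragment. It does not follow the same escaping rules as form data, and form data encoding is what URLEncoder.encode() performs (so, contrary to popular belief, it is not meant for building URLs). In your code the # itself is encoded as %23, so it stops being a fragment delimiter and becomes part of the path.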

    What you want is a URI template here (RFC 6570). Sample using this library:

    public static void main(final String... args)
        throws URITemplateException, MalformedURLException
    {
        final URITemplate template 
            = new URITemplate("http://www.thomsonlocal.com/Electricians/UK/#{+var}");
    
        final VariableMap map = VariableMap.newBuilder()
            .addScalarValue("var", "||25")
            .freeze();
    
        System.out.println(template.toURL(map));
    }
    

    This will (correctly) print:

    http://www.thomsonlocal.com/Electricians/UK/#%7C%7C25
    

    Another solution, though not as flexible, is to use the URI constructor:

    final URI uri = new URI("http", "www.thomsonlocal.com",
        "/Electricians/UK/", "||25");
    
    System.out.println(uri.toURL());
    

    This will also print the correct result.
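
To make the difference concrete, here is a minimal sketch that uses only the JDK and compares the string built in the question with the one produced by the URI constructor. The class and method names (FragmentEncodingDemo, withUrlEncoder, withUriConstructor) are made up for illustration and are not part of the original answer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

// Hypothetical demo class; the names here are illustrative only.
public class FragmentEncodingDemo {
    static final String BASE = "http://www.thomsonlocal.com/Electricians/UK/";

    // The approach from the question: form-encoding also escapes the '#',
    // so the page marker ends up in the path instead of the fragment.
    static String withUrlEncoder(int page) {
        try {
            return BASE + URLEncoder.encode("#||", "UTF-8") + page;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // The approach from the answer: the four-argument URI constructor
    // (scheme, host, path, fragment) quotes only the illegal '|' characters.
    static String withUriConstructor(int page) {
        try {
            return new URI("http", "www.thomsonlocal.com",
                "/Electricians/UK/", "||" + page).toString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // http://www.thomsonlocal.com/Electricians/UK/%23%7C%7C21
        System.out.println(withUrlEncoder(21));
        // http://www.thomsonlocal.com/Electricians/UK/#%7C%7C21
        System.out.println(withUriConstructor(21));
    }
}
```

The first string asks the server for a path ending in %23%7C%7C21, which is presumably why the site answers with an error page. The second keeps the # as the fragment delimiter, so the request path is still /Electricians/UK/.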