
Crawler4j seed URL gets encoded and an error page is crawled instead of the actual page


I am using crawler4j to crawl user profiles on GitHub. For instance, I want to crawl the URL https://github.com/search?q=java+location:India&p=1. For now I am adding this hard-coded URL in my crawler controller like this:

String url = "https://github.com/search?q=java+location:India&p=1";
controller.addSeed(url);

When crawler4j starts, the URL it actually crawls is: https://github.com/search?q=java%2Blocation%3AIndia&p=1

which gives me an error page. What should I do? I have tried giving an already encoded URL, but that doesn't work either.


Solution

  • I eventually had to make a very small change to the crawler4j source code. File: URLCanonicalizer.java, method: percentEncodeRfc3986.

    I commented out the first line of this method, and after that I was able to crawl the page and fetch my results:

    //string = string.replace("+", "%2B");

    My URL contained a + character, which was being replaced by %2B, so I was getting an error page. I wonder why they specifically replace the + character before encoding the entire URL.
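
As a possible explanation for the difference between the two URLs, here is a minimal sketch using only the JDK (not crawler4j). It assumes the server decodes the query string with the usual form-encoding rules, where a raw + means a space and %2B means a literal plus sign. The class name PlusEncodingDemo and the helper decodeQueryValue are made up for illustration, and URLDecoder.decode(String, Charset) needs Java 10 or later.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class PlusEncodingDemo {

    // Hypothetical helper: decodes a raw query value the way a form-decoding
    // server typically would ('+' becomes a space, %XX becomes the byte XX).
    static String decodeQueryValue(String raw) {
        return URLDecoder.decode(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Query value from the seed URL as written in the question.
        System.out.println(decodeQueryValue("java+location:India"));
        // prints "java location:India"

        // Query value after the '+' has been rewritten to %2B.
        System.out.println(decodeQueryValue("java%2Blocation%3AIndia"));
        // prints "java+location:India"
    }
}
```

Under that assumption, the original seed searches for two terms separated by a space, while the rewritten URL searches for a single term containing a literal plus sign, which would be consistent with the error page described above.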