web-scraping, web-crawler, jaunt-api

Jaunt web crawler API doesn't resolve relative URLs correctly


I implemented a crawler that does something like:

repeat
 Visit each page and get all links that have not been visited.
until no new links
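
The loop above can be sketched in Java roughly as follows. This is a minimal illustration, not the actual crawler: `crawl` and `linksOf` are hypothetical names, and `linksOf` stands in for "fetch the page and return its absolute links", here backed by a small in-memory map so the sketch runs without network access.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class CrawlLoop {
    // Visits every URL reachable from start exactly once.
    // linksOf is a hypothetical stand-in for fetching a page and extracting its links.
    static Set<String> crawl(String start, Function<String, List<String>> linksOf) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(start);
        while (!pending.isEmpty()) {
            String url = pending.poll();
            if (!visited.add(url)) {
                continue; // already visited, skip
            }
            for (String link : linksOf.apply(url)) {
                if (!visited.contains(link)) {
                    pending.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Tiny fake site: each key is a page, each value is the list of links on it.
        Map<String, List<String>> site = Map.of(
            "/", List.of("/a", "/b"),
            "/a", List.of("/", "/b"),
            "/b", List.of("/a"));
        System.out.println(crawl("/", url -> site.getOrDefault(url, List.of())));
    }
}
```

The loop only terminates if the set of distinct URLs is finite, which is why wrongly resolved links that keep producing new URLs make it run forever.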

The page it is crawling is

https://www.mercadoribeirao.com.br

I'm getting all links like:

<a href="produtos.php?id_sub=104&amp;fruta-nacional" class="new_sub_menu">Fruta Nacional</a>

On the first page it gets the links correctly, like:

https://www.mercadoribeirao.com.br/produtos.php?id_sub=253&espumante-nacional
https://www.mercadoribeirao.com.br/produtos.php?id_sub=245&frances
https://www.mercadoribeirao.com.br/produtos.php?id_sub=246&italiano
https://www.mercadoribeirao.com.br/produtos.php?id_sub=248&nacional
https://www.mercadoribeirao.com.br/produtos.php?id_sub=414&outros

But when it visits subpages, the URLs are concatenated incorrectly:

https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=197&salgadinho-e-snack
https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=198&sardinha,-atum-e-cia
https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=199&sopas-e-cremes

Perhaps this is because Jaunt, when completing relative links automatically, appends the link to the end of the current URL instead of resolving it against the base of the page.

For some reason these newly created links are still accepted as valid, so there are always new links to visit and the process never ends.

Is there any workaround for this problem?


Solution

  • The bug is related to URLs that end with "/". Jaunt misinterprets such URLs when it has to convert relative links to absolute links in those documents.

    The bug was recognized and fixed in version 1.1.3 of Jaunt API: http://jaunt-api.com/Jaunt%201.1.3%20Release%20Notes.txt
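
If upgrading is not an option, one possible workaround is to resolve each link yourself instead of relying on the library's automatic completion. The sketch below assumes the raw `href` value is available as a plain string and uses the JDK's `java.net.URI`; the class and method names (`ResolveHref`, `resolve`, `naiveConcat`) are hypothetical. `naiveConcat` only imitates the shape of the malformed URLs in the question and is not taken from Jaunt's code.

```java
import java.net.URI;

public class ResolveHref {
    // Resolves a raw href against the URL of the page it was found on,
    // using the JDK's standard relative-reference resolution.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    // Imitates the faulty behaviour seen in the question:
    // the href is appended to the full page URL, query string included.
    static String naiveConcat(String pageUrl, String href) {
        return pageUrl + "/" + href;
    }

    public static void main(String[] args) {
        String page = "https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos";
        String href = "produtos.php?id_sub=197&salgadinho-e-snack";
        System.out.println(naiveConcat(page, href)); // grows with every visited subpage
        System.out.println(resolve(page, href));     // stays relative to the site root
    }
}
```

With proper resolution the query string of the current page is discarded and only the path up to its last "/" is kept, so the resolved URL does not grow from one page to the next.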