I implemented a crawler that does something like this:
repeat
Visit each page and get all links that have not been visited.
until no new links
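A minimal Java sketch of that loop, for illustration only (it is not the exact crawler code). The `CrawlLoop` class and the `fetchLinks` function are hypothetical stand-ins: `fetchLinks` represents whatever loads a page and returns the absolute URLs found on it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class CrawlLoop {
    // fetchLinks is a hypothetical callback: given a page URL, it returns
    // the absolute links found on that page.
    public static Set<String> crawl(String start,
                                    Function<String, List<String>> fetchLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(start);

        // Repeat until there are no new links left to visit.
        while (!pending.isEmpty()) {
            String url = pending.poll();
            if (!visited.add(url)) {
                continue; // already visited, skip it
            }
            for (String link : fetchLinks.apply(url)) {
                if (!visited.contains(link)) {
                    pending.add(link);
                }
            }
        }
        return visited;
    }
}
```

The loop only terminates if the set of distinct URLs is finite, which is why malformed URLs that keep growing make it run forever.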
The site it is crawling is:
https://www.mercadoribeirao.com.br
I'm collecting all links of this form:
<a href="produtos.php?id_sub=104&fruta-nacional" class="new_sub_menu">
Fruta Nacional </a>
On the first page it gets the links correctly, for example:
https://www.mercadoribeirao.com.br/produtos.php?id_sub=253&espumante-nacional
https://www.mercadoribeirao.com.br/produtos.php?id_sub=245&frances
https://www.mercadoribeirao.com.br/produtos.php?id_sub=246&italiano
https://www.mercadoribeirao.com.br/produtos.php?id_sub=248&nacional
https://www.mercadoribeirao.com.br/produtos.php?id_sub=414&outros
But when it visits subpages, the URLs are concatenated incorrectly:
https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=197&salgadinho-e-snack
https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=198&sardinha,-atum-e-cia
https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=388&micoses/calos/produtos.php?id_sub=199&sopas-e-cremes
Perhaps this happens because Jaunt, when it completes relative links automatically, appends the link to the end of the current URL instead of resolving it against the base page.
For some reason these newly created links still work, so there are always new links to visit and the process never ends.
Is there any workaround for this problem?
The bug is related to URLs that end with "/". Jaunt misinterprets them when it has to convert relative links to absolute links in those documents.
The bug was acknowledged and fixed in version 1.1.3 of the Jaunt API: http://jaunt-api.com/Jaunt%201.1.3%20Release%20Notes.txt
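If upgrading is not an option, one possible workaround is to read the raw `href` value and resolve it yourself with the standard `java.net.URI` class instead of relying on the library's automatic resolution. This is only a sketch under assumptions: the `LinkResolver` class and `resolve` method are hypothetical names, and it assumes the `href` values contain no characters that `URI.create` rejects (such as unescaped spaces).

```java
import java.net.URI;

public class LinkResolver {
    // Resolves a raw href against the URL of the page it was found on,
    // using the JDK's standard relative-reference resolution.
    public static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        // Guard for base URLs without any path (e.g. "https://host"):
        // give them the root path "/" before resolving.
        if (base.getPath() == null || base.getPath().isEmpty()) {
            base = base.resolve("/");
        }
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        // The page URL ends with "/" inside its query string,
        // which is the case that triggered the bug.
        String page = "https://www.mercadoribeirao.com.br/produtos.php?id_sub=388&micoses/calos/";
        String href = "produtos.php?id_sub=197&salgadinho-e-snack";
        System.out.println(resolve(page, href));
    }
}
```

Because the trailing "/" belongs to the query string and not to the path, the relative link replaces `produtos.php` and its query rather than being appended, so the crawler sees a finite set of URLs again.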