web-scraping jsoup html-parsing

Format in which Amazon Wishlist pages are shown


This is a conceptual question rather than a technical one and might seem silly but anyway here goes.

I'm trying to parse a public Amazon wishlist, given in this link, using jsoup. I am currently able to do that.

As you can see in the above link, there are a total of 9 pages in that wishlist, shown in the format

   1 2 3 4 5 6 7 .. 9

If a public wishlist had n pages, would the format be the following?

   1 2 3 4 5 6 7 .. n
      

I need to know how all the pages in an Amazon Wishlist are represented so that I can code accordingly.

Links to public Amazon Wishlists containing 2, 5, 10, or 20 pages are welcome, to help understand how the pages are shown.


Solution

  • There are a few ways to find out how many pages are in the list:

    1. All the links to the other pages have the same format: http://www.amazon.com/gp/registry/wishlist/3C96S5RO2A5A9/ref=cm_wl_sortbar_v_page_X/182-3573734-9320732?ie=UTF8&page=X (X is the page number, and it appears twice in the URL), so you can loop over X starting from 2. You should get a 200 OK response for every existing page, until you hit one that does not exist.
    2. Download the first page and do:

      Elements e = document.select("#wishlistPagination > span:nth-child(1) > div:nth-child(1)");
      String s = e.text();
      

      The string s now contains "Previous 1 2 3 4 5 6 7 … 9 Next", so find the number after the ellipsis or before "Next" and you're done.
      EDIT
      On second thought: if the list contains 7 pages or fewer, there won't be "Next" in the string, so the first method (fetching the URLs while incrementing the page number X) is more robust.
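
Both options can be sketched in plain Java. This is only a rough sketch under some assumptions: the class name WishlistPages and the helpers lastPage and pageUrl are made up for illustration, the URL pattern is simply the one copied from option 1, and the only digits in the pagination text are assumed to be page numbers. jsoup is left out so the snippet runs on its own; in practice the string passed to lastPage would be the result of e.text().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of jsoup or of any Amazon API.
final class WishlistPages {

    private static final Pattern NUMBER = Pattern.compile("\\d+");

    // Option 2: return the largest number in the pagination text,
    // e.g. "Previous 1 2 3 4 5 6 7 ... 9 Next". Returns 1 when the text has no digits.
    // Assumes every number in the text is a page number.
    static int lastPage(String paginationText) {
        Matcher m = NUMBER.matcher(paginationText);
        int max = 1;
        while (m.find()) {
            max = Math.max(max, Integer.parseInt(m.group()));
        }
        return max;
    }

    // Option 1: build the URL of page X; the page number appears twice in the URL.
    // The wishlist and session ids are the ones from the example link.
    static String pageUrl(int page) {
        return "http://www.amazon.com/gp/registry/wishlist/3C96S5RO2A5A9/ref=cm_wl_sortbar_v_page_"
                + page + "/182-3573734-9320732?ie=UTF8&page=" + page;
    }

    public static void main(String[] args) {
        // Sample pagination text; \u2026 is the ellipsis character.
        String s = "Previous 1 2 3 4 5 6 7 \u2026 9 Next";
        int pages = lastPage(s);
        System.out.println("pages = " + pages);
        // Page 1 is already downloaded, so start from 2.
        for (int x = 2; x <= pages; x++) {
            System.out.println(pageUrl(x));
        }
    }
}
```

Taking the largest number rather than searching for the ellipsis or "Next" is just one way to avoid depending on those markers; if the page layout differs from the sample string, looping over the URLs until a request fails remains the safer fallback.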