python beautifulsoup scrapy web-crawler scrapy-shell

Scrapy shell sees a different page than the browser shows; part of the website is not scrapeable


The problem is that I cannot scrape one part of a website. Using Chrome DevTools, I cannot copy a correct path to it, either as an XPath or as a CSS selector.

I get a correct path for other tabs or divs, for example the body header: body > div.header.home-header > div

When I try to get the tab with the information I want, however, I only get: #htmlContent. Written out manually, the selector should be: body > div.main.main-top.seach-boxstyle > div > div > div.recommend-product-wrap.produc-text > div > div.recommend-product, but that returns an empty list.

I wonder whether this whole section has been deliberately set up so that it cannot be scraped, or whether it is a different issue. Note that the site is in Chinese: http://www.usewealth.com/Product/More.aspx?productDisplay=isArticle

I'm trying to help a company scrape its own list of recommended swaps, but the list does not show up in the scraped page at all.


Solution

  • The problem is that the page renders its content dynamically using JavaScript. Scrapy itself doesn't run JavaScript; it only downloads the HTML source of the page, so the dynamic content is not there. There are basically two options in such a case. The first is to render the page with a browser-based tool (e.g. Selenium driving a headless browser, or Splash) and let Scrapy parse the rendered result. From my experience, I would recommend Splash because it's more reliable and it integrates with Scrapy seamlessly through the scrapy-splash library.

    The other option is to use the browser's developer tools to check whether the page uses an API to get the data (which JavaScript then renders on the page). This seems to be the case with the website you are trying to scrape. Looking in Chrome developer tools (Network tab, filtered to XHR requests), I see POST requests to this URL:

    http://www.usewealth.com/Action/ProductAJAX.ashx
    

    It returns a JSON response that appears to contain all the needed data, which you can parse with Python's standard json library.
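
The second option could be sketched roughly as follows, using only the Python standard library. This is a minimal sketch under assumptions, not a tested scraper for this site: the form fields the endpoint expects are not given above, so `form_fields` is a hypothetical placeholder to be filled in from the request payload shown in the Network tab, and the helper names `build_request`, `parse_response` and `fetch_json` are made up for illustration. The structure of the returned JSON is also unknown here, so the code only decodes it.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint seen in the browser's Network tab (XHR requests).
API_URL = "http://www.usewealth.com/Action/ProductAJAX.ashx"


def build_request(form_fields):
    """Build a form-encoded POST request that mimics the page's XHR call.

    form_fields is a dict of the POST parameters; the real names and
    values are an assumption and must be copied from the browser.
    """
    body = urlencode(form_fields).encode("utf-8")
    return Request(API_URL, data=body, method="POST")


def parse_response(raw_bytes):
    """Decode the response body and parse it as JSON."""
    return json.loads(raw_bytes.decode("utf-8"))


def fetch_json(form_fields, timeout=10):
    """Send the POST request and return the parsed JSON payload."""
    with urlopen(build_request(form_fields), timeout=timeout) as resp:
        return parse_response(resp.read())
```

Calling `fetch_json({...})` with the parameters copied from the browser should return a Python dict or list that can be inspected to find the product entries. Inside a Scrapy spider, the same POST could probably be issued with `scrapy.FormRequest(url, formdata=..., callback=...)` and the body parsed with `json.loads(response.text)` in the callback.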