Tags: python, node.js, google-chrome, selenium, screen-scraping

Node.js scraping with chrome-remote-interface


I have been trying to scrape a website protected by Distil Networks, and using Selenium (with Python) always fails.

I did a few searches, and my conclusion is that the site can detect Selenium through some sort of JavaScript. I then took a look at chrome-remote-interface, which looks like the thing I want, but then I got stuck.

What I would like to do is automate the following steps:

  1. Open a Chrome instance
  2. Navigate to a page
  3. Run some javascript
  4. Collect data and save to file
  5. Repeat steps 2 - 4

I know that I can open an instance of Chrome for debugging with:

google-chrome --remote-debugging-port=9222

And I can open a console in Node with:

chrome-remote-interface -t 127.0.0.1 -p 9222 inspect -r

I can also run simple scripts like

Page.navigate({url:"https://google.com"})
Runtime.evaluate({expression:"1+1"})

But I can't access the DOM directly from Node.js the way I can in the Chrome Developer Tools console. Basically, I want to run scripts from Node the same way I would in the Chrome Developer Tools console.

Also, there is not enough documentation on chrome-remote-interface for scraping. Are there any good links for that?


Solution

  • I know this was asked two years ago, but let me write it here for documentation purposes.

    -- Tools of the trade --
    I tried the same technique as you did (using the remote debugger for scraping), but instead of Python I used Node.js because of its asynchronous nature, which makes it easier to work with the websockets that the remote debugger relies on.

    -- Runtime.evaluate --
    One thing I noticed is that Runtime.evaluate, called with just an expression, isn't a valid option for recovering data if the expression involves asynchronous calls, because it returns the result of the calling function and not of the callback. Called that way, you have to stick with synchronous expressions.
    Example:

    Array.from(document.getElementsByTagName('tr'))
        .map((e)=>e.children[2].innerHTML)
        .filter((e)=>e.length>0)
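
If the expression has to be asynchronous, the DevTools protocol's Runtime.evaluate also accepts an awaitPromise parameter that makes the debugger wait for a returned Promise to settle before replying. The following is only a sketch of that option, not something from the answer above: evaluateAsync is a hypothetical helper name, and Runtime is assumed to be the Runtime domain of an already connected chrome-remote-interface client.

```javascript
// Hypothetical helper: evaluate an expression that yields a Promise and
// return the settled value. `Runtime` is assumed to be client.Runtime from
// a connected chrome-remote-interface client.
async function evaluateAsync(Runtime, expression) {
    const {result, exceptionDetails} = await Runtime.evaluate({
        expression,
        awaitPromise: true, // wait for the Promise to settle inside the page
    });
    if (exceptionDetails) {
        // The page threw, or the Promise was rejected.
        throw new Error(exceptionDetails.text);
    }
    // Primitive results (strings, numbers, booleans) arrive in result.value.
    return result.value;
}

// Possible usage (the expression resolves to a string):
//   const html = await evaluateAsync(
//       Runtime, "fetch(location.href).then((r) => r.text())");
```

Keeping the resolved value a primitive, such as a string, avoids the array problem described next.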
    

    Another thing is that when your expression returns an array, Runtime.evaluate only reports that the expression returned an array, not the array itself! (Infuriating, I know.) I got around it by simply encoding the arrays as JSON strings in the page context, then decoding them back into objects when they arrive in Node.js. For example, the above expression would need to be:

    JSON.stringify(
        Array.from(document.getElementsByTagName('tr'))
            .map((e)=>e.children[2].innerHTML)
            .filter((e)=>e.length>0)
    )
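
On the Node.js side, the round trip could be wrapped in a small helper along these lines. This is a minimal sketch: evaluateJson is a hypothetical name, and Runtime is assumed to be the Runtime domain of a connected chrome-remote-interface client. It wraps the expression in JSON.stringify inside the page and parses the string once it arrives.

```javascript
// Hypothetical helper: run `expression` in the page, serialize its result to
// JSON there, and decode it in Node.js. `expression` must be a single
// expression (no trailing semicolon), because it is spliced into a call.
async function evaluateJson(Runtime, expression) {
    const {result, exceptionDetails} = await Runtime.evaluate({
        expression: `JSON.stringify(${expression})`,
    });
    if (exceptionDetails) {
        throw new Error(exceptionDetails.text);
    }
    if (result.type === 'undefined') {
        // JSON.stringify(undefined) yields undefined rather than a string.
        return undefined;
    }
    return JSON.parse(result.value);
}

// Possible usage:
//   const cells = await evaluateJson(Runtime,
//       "Array.from(document.getElementsByTagName('tr'))" +
//       ".map((e) => e.children[2].innerHTML)" +
//       ".filter((e) => e.length > 0)");
```

The protocol also defines a returnByValue parameter for Runtime.evaluate, which asks the debugger to send the result by value instead of as an object reference; the JSON approach above is simply the one described in this answer.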
    

    -- Navigation --
    When you trigger a page load using "Page.navigate", ".click()", ".submit()", "window.location.href=..." or any other way, it's important to know when the next page has completely loaded before sending more instructions with Runtime.evaluate. I did the trick by asking the debugger to send me the page loading events (look for the Page.enable method in the documentation), then waiting for the "Page.loadEventFired" event before sending more expressions.
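
Put together, the navigate-then-wait pattern might look like the sketch below. The function names navigateAndWait and scrapeAll are hypothetical, and client is assumed to be a connected chrome-remote-interface client. It relies on the library's documented behaviour that calling an event method such as Page.loadEventFired() without a callback returns a Promise for the next occurrence of that event.

```javascript
// Hypothetical helper: navigate and resolve once the load event has fired.
// The listener is registered before navigating so the event cannot be missed.
async function navigateAndWait(Page, url) {
    const loaded = Page.loadEventFired(); // Promise for the next load event
    await Page.navigate({url});
    await loaded;
}

// Hypothetical helper: visit each URL in turn and evaluate one synchronous
// expression on it, collecting the (primitive) results.
async function scrapeAll(client, urls, expression) {
    const {Page, Runtime} = client;
    await Page.enable(); // ask the debugger to send page loading events
    const rows = [];
    for (const url of urls) {
        await navigateAndWait(Page, url);
        const {result} = await Runtime.evaluate({expression});
        rows.push({url, value: result.value});
    }
    return rows;
}

// Possible usage, assuming Chrome was started with
// --remote-debugging-port=9222 and chrome-remote-interface is installed:
//   const CDP = require('chrome-remote-interface');
//   const fs = require('fs');
//   const client = await CDP({port: 9222});
//   const rows = await scrapeAll(client, ['https://google.com'], 'document.title');
//   fs.writeFileSync('out.json', JSON.stringify(rows, null, 2));
//   await client.close();
```

Registering the listener before calling Page.navigate is a deliberate choice here: if the wait were set up only after navigation had started, a fast page could fire its load event first and the script would hang.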