web-scraping, extract, screen-scraping

Are there other ways of implementing a visual web scraper besides loading the data inside a local iframe?


I saw the video for Portia and was thinking about how such a tool could be implemented. Basically, you would have a web app where you input a URL; the page would load (as if you had opened it in a standalone browser tab), and you could then click on elements in the page to visually select the data you want to extract.

An idea I currently have is this:

  1. retrieve the website content using a headless browser
  2. have a route in the web app that serves the scraped content
  3. embed that route in an iframe on the data-selection page, to bypass the same-origin policy
  4. integrate a JavaScript element-inspector library, to visually mark the elements meant to be scraped
  5. generate a set of selectors
  6. use the selectors to extract the data
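
For steps 1 and 2, here is a rough sketch of what I had in mind (Puppeteer and Express are just the tools I happen to know; the /snapshot route name is made up):

    const express = require('express');
    const puppeteer = require('puppeteer');

    const app = express();

    // Step 1: fetch the fully rendered page with a headless browser.
    async function fetchRendered(url) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'networkidle2' });
      const html = await page.content();  // DOM after scripts have run
      await browser.close();
      return html;
    }

    // Step 2: a route serving the scraped content. Because it comes
    // from my own origin, the iframe in step 3 is same-origin.
    app.get('/snapshot', async (req, res) => {
      res.send(await fetchRendered(req.query.url));
    });

    app.listen(3000);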

I'm interested in what other approaches there are for handling this, specifically steps 1 to 3.


Solution

  • Consider that the objects you're going to want to scrape probably aren't active (that is, they don't respond to clicks or keypresses).

    Even if they do, they probably won't handle meta keys such as Ctrl or Shift.

    So what you could do is build your system exactly like a proxy, rewriting internal URLs (which you'd need to do regardless), except that you would also inject JavaScript code to react to, say, clicks.
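
    A minimal sketch of such a rewriting-and-injecting proxy, assuming Express, axios and cheerio (the /p/ path prefix and the inject.js name are invented for the example):

    const express = require('express');
    const axios = require('axios');
    const cheerio = require('cheerio');

    const TARGET = 'http://www.site-to-scrape.com';
    const app = express();

    // Proxy every path under /p/, rewriting site-internal URLs so
    // navigation stays inside the proxy, and injecting the code
    // that will react to Ctrl-Click (sketched further down).
    app.get('/p/*', async (req, res) => {
      const { data } = await axios.get(TARGET + '/' + (req.params[0] || ''));
      const $ = cheerio.load(data);
      $('a[href], link[href]').each((i, el) => {
        const href = $(el).attr('href');
        if (href && href.startsWith('/')) $(el).attr('href', '/p' + href);
      });
      $('img[src], script[src]').each((i, el) => {
        const src = $(el).attr('src');
        if (src && src.startsWith('/')) $(el).attr('src', '/p' + src);
      });
      $('body').append('<script src="/inject.js"></script>');
      res.send($.html());
    });

    app.listen(3000);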

    Then you would need no iframe. The user would simply navigate to www.your-scraper.com, request www.site-to-scrape.com in a form, get assigned a random token such as dab3b19f, and be redirected to dab3b19f.your-scraper.com -- where they would see a version of www.site-to-scrape.com in which all (text?) objects react to Ctrl-Click.
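
    The entry point could be equally small (a sketch; mapping the random subdomain back to its session is then a DNS and virtual-host concern):

    const express = require('express');
    const crypto = require('crypto');

    const app = express();
    const sessions = {};  // token -> site to scrape

    // The form posts the target URL here; the user continues on a
    // per-session subdomain served by a proxy like the one above.
    app.post('/start', express.urlencoded({ extended: false }), (req, res) => {
      const token = crypto.randomBytes(4).toString('hex');  // e.g. dab3b19f
      sessions[token] = req.body.url;
      res.redirect('http://' + token + '.your-scraper.com/');
    });

    app.listen(80);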

    The user would then be able to move around the site normally, except that holding, say, the Ctrl key while clicking would not pass the click to the clicked object but to a handler, which could identify the event target and calculate its CSS path, then pop up a scraping menu in a fixed DIV appended to the DOM on demand and removed on close.
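
    The injected handler could be plain DOM code along these lines (a sketch; registering in the capture phase is what keeps Ctrl-clicks away from the page's own handlers):

    // inject.js -- runs inside the proxied page.
    function cssPath(el) {
      const parts = [];
      while (el && el.nodeType === Node.ELEMENT_NODE) {
        if (el.id) { parts.unshift('#' + el.id); break; }
        let nth = 1, sib = el;
        while ((sib = sib.previousElementSibling)) {
          if (sib.nodeName === el.nodeName) nth++;
        }
        parts.unshift(el.nodeName.toLowerCase() + ':nth-of-type(' + nth + ')');
        el = el.parentElement;
      }
      return parts.join(' > ');
    }

    document.addEventListener('click', (e) => {
      if (!e.ctrlKey) return;            // normal clicks pass through
      e.preventDefault();                // this one never reaches the page
      e.stopImmediatePropagation();
      const menu = document.createElement('div');
      menu.style.cssText = 'position:fixed;top:10px;right:10px;padding:8px;' +
        'background:#fff;border:1px solid #333;z-index:2147483647';
      menu.textContent = 'Scrape: ' + cssPath(e.target);
      menu.onclick = () => menu.remove();  // appended on demand, removed on close
      document.body.appendChild(menu);
    }, true);  // capture phase: runs before the site's own handlers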

    This implies that you'd need to detect and hijack the various JavaScript libraries that the site might be loading. If the project goes further, you might also need to defang some anti-scraping code (e.g. the site might check DOM integrity, or try to reset event handlers to their default state).

    At the same time, you could also intercept and record the user's normal clicks, in order to replay them later, up to a point (how far depends on how dynamic the site is, and on how you can interact with your headless browser). This would allow you to automatically re-navigate the site, changing pages and so on, to reach the various objects. You would then end up with a series of selectors and navigational hints that could be used to extract data from the navigated pages:

    start
    click        #menu ul[2] li[1] span
    click        .right.sidebar[1] ul[1] li[5] input[type="checkbox"]
    click        .right.sidebar[1] ul[1] li[5] button
    scrape(TICK) #prices div div[2] div div span p
    scrape(PRIC) #prices div div[2] div div span div span[2] p
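
    A replay driver for such a listing can stay small; here is a sketch using Puppeteer (it assumes the recorder emits real CSS selectors rather than the shorthand above, and that field names like TICK and PRIC come from the scraping menu):

    const puppeteer = require('puppeteer');

    async function replay(startUrl, script) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(startUrl);  // the 'start' line
      const results = {};
      for (const line of script) {
        const [cmd, ...rest] = line.trim().split(/\s+/);
        const selector = rest.join(' ');
        if (cmd === 'click') {
          await page.click(selector);
        } else if (cmd.startsWith('scrape(')) {
          const field = cmd.slice(7, -1);  // e.g. TICK, PRIC
          results[field] = await page.$eval(selector, el => el.textContent);
        }
      }
      await browser.close();
      return results;
    }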
    

    The scraping script could then be modified to add, say, loops. This comes later, though.

    You would also end up with something not unlike Selenium. In fact, you might want to consider turning Selenium itself to your purpose.
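
    For instance, the recorded steps map almost one-to-one onto WebDriver calls (a sketch with the selenium-webdriver package for Node; selectors shortened for the example):

    const { Builder, By } = require('selenium-webdriver');

    async function run() {
      const driver = await new Builder().forBrowser('chrome').build();
      await driver.get('http://www.site-to-scrape.com/');
      await driver.findElement(By.css('#menu span')).click();
      const price = await driver.findElement(By.css('#prices span p')).getText();
      await driver.quit();
      return price;
    }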