javascript ajax perl sockets tor

Execute JavaScript over the Tor network without human interaction


In a nutshell

I want to load html content through the Tor network and execute its JavaScript, so that additional content is also loaded through this network via AJAX. This must be automated by a script that runs on a Linux server without any human interaction. I can't find a combination of tools that allows automated execution of JavaScript that was fetched through the Tor network.

In detail

I want to write an application with these characteristics:

environment

  • run autonomously (without any human interaction)
  • run on a non-GUI ("headless") Linux server (Ubuntu 12.04)

features

  • uses the Tor network to anonymously load web content (html documents, images, ...)
  • execute JavaScript that is embedded in or attached to html documents (to load additional content via AJAX or similar techniques)
  • once everything has loaded: convert the html document into a DOM tree and extract specific items from that tree.

The environment constraints rule out the use of a web browser. Everything must be done by programs or scripts. The feature constraints require that the JavaScript is executed in a way that connects to the internet through the Tor network rather than directly.

Tor

To use the Tor network I can run a Tor client that provides a socket on my machine. Then I write a Perl script that connects to this socket. The Perl script sends http and https requests through this socket to the Tor client, which then routes them through the Tor network. All responses come back the same way.
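To make the socket conversation above more concrete, here is a minimal sketch of the first two messages a client sends to a SOCKS5 proxy such as a local Tor client, following RFC 1928. It is written in JavaScript for Node.js rather than Perl, and it only builds the bytes; it does not open a connection. The function names `socks5Greeting` and `socks5ConnectRequest` are made up for this illustration, and port 9050 in the comment is Tor's usual default, which may differ on your machine.

```javascript
// Sketch: the two messages a SOCKS5 client sends to a proxy (RFC 1928).
// A local Tor client usually accepts these on 127.0.0.1:9050 (assumption).

// Greeting: version 5, one auth method offered, method 0x00 (no authentication).
function socks5Greeting() {
  return Buffer.from([0x05, 0x01, 0x00]);
}

// CONNECT request with a domain name (address type 0x03), so the name is
// resolved by the proxy instead of by a local DNS lookup.
function socks5ConnectRequest(host, port) {
  const name = Buffer.from(host, 'ascii');
  if (name.length === 0 || name.length > 255) {
    throw new RangeError('host name must be 1..255 bytes');
  }
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError('port must be 1..65535');
  }
  // version 5, command 1 (CONNECT), reserved 0, address type 3, name length
  const header = Buffer.from([0x05, 0x01, 0x00, 0x03, name.length]);
  const portBytes = Buffer.alloc(2);
  portBytes.writeUInt16BE(port, 0); // port in network byte order
  return Buffer.concat([header, name, portBytes]);
}
```

After the proxy has accepted both messages, the same socket carries the plain http or https traffic, which is why a script that can write to the socket needs no further Tor-specific code.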

I've tested this and it works fine. But in a Perl script it is really hard to execute the JavaScript that comes with the received html documents. I would have to write a JavaScript emulator in Perl to make this possible, and that is far beyond my available time and my skills.

JavaScript

To execute embedded or attached JavaScript I can use a tool like phantomJS or slimerJS (phantomJS does not work properly on Ubuntu 12.04, so I use slimerJS, which offers almost the same features). With these tools I can load html documents and have all the JavaScript that comes with them executed automatically, so I also receive all content that is not part of the initial html document but gets loaded later by AJAX or similar techniques. In addition, I can easily analyze the document's DOM tree to extract the items I am interested in.

I've tested this too and it also works fine, but the tools I know (phantomJS and slimerJS) use their own mechanisms to connect to the internet. There seems to be no way to tell them to connect to a socket and communicate with the internet through it.

My question

Is there a way to automatically execute AJAX calls through the Tor network?

To me there seem to be two possible ways:

  1. Get the JavaScript code executed within a Perl script. This could be done by a module, but I couldn't find any CPAN module that emulates a JavaScript interpreter. Instead of connecting directly to the internet, the interpreter should call Perl functions that I would have to write.
  2. Force slimerJS (or phantomJS or any other tool) to connect to a socket on localhost and send all requests through this socket. Maybe it is possible to start slimerJS in an environment that pretends to offer direct access to the internet but in fact redirects all communication to the socket of the Tor client?

Solution

  • If you have a Tor client running, you can use the address it is listening on for the proxy settings. Check the docs of your tool for the proxy options you need to pass.

    The proxy type will be SOCKS. Remember that you need the local address the Tor socket is bound to.
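As a rough illustration of the answer above, the following Node.js sketch assembles a command line that points a headless browser at the local Tor socket. It assumes that Tor listens on 127.0.0.1:9050 (its usual default) and that your tool accepts the `--proxy` and `--proxy-type` options as documented for phantomJS; check the slimerJS docs for the version you run. The helper `buildProxyArgs` and the script name `scrape.js` are hypothetical.

```javascript
// Sketch: build the arguments that tell a headless browser to send all
// traffic through a local SOCKS proxy. Host and port defaults are
// assumptions (Tor's usual 127.0.0.1:9050); scrape.js is a placeholder.
function buildProxyArgs(script, options) {
  const opts = Object.assign({ host: '127.0.0.1', port: 9050 }, options);
  if (!Number.isInteger(opts.port) || opts.port < 1 || opts.port > 65535) {
    throw new RangeError('port must be 1..65535');
  }
  return [
    '--proxy=' + opts.host + ':' + opts.port, // address the Tor socket is bound to
    '--proxy-type=socks5',                    // Tor speaks SOCKS, not http
    script,                                   // options go before the script name
  ];
}

const args = buildProxyArgs('scrape.js');
console.log(['slimerjs'].concat(args).join(' '));
// prints: slimerjs --proxy=127.0.0.1:9050 --proxy-type=socks5 scrape.js
```

With the proxy set at this level, the page's own AJAX requests should go through the same proxy as the initial document, so no change to the page's JavaScript is needed.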