javascript screen-scraping greasemonkey

Sending Source Code to an External Server


I'm interested in writing a script, preferably one that is easy to add to browsers with tools such as Greasemonkey, that sends a page's HTML source code to an external server, where it will later be parsed and the useful data stored in a database.

However, I haven't seen anything like that and I'm not sure how to approach this task. I would imagine some sort of HTTP POST would be the best approach, but I'm completely new to those ideas, and I'm not even sure exactly where to send the data to parse it (it doesn't make sense to send an entire HTML document to a database, for instance).

So basically, my overall goal is something that works like this (note that I only need help with steps 1 and 2. I am familiar with data parsing techniques, I've just never applied them to the web):

  1. User views a particular page
  2. Source code is sent via Greasemonkey or some other tool to a server
  3. The code is parsed into meaningful data that is stored in a MySQL database.

Any tips or help is greatly appreciated, thank you!

Edit: Code

ihtml = document.body.innerHTML;
GM_xmlhttpRequest({
method:'POST',
url:'http://www.myURL.com/getData.php',
data:"SomeData=" + escape(ihtml)
});

Edit: Current JS Log:

Namespace/GMScriptName: Server Response: 200
OK
4
Date: Sun, 19 Dec 2010 02:41:55 GMT
Server: Apache/1.3.42 (Unix) mod_gzip/1.3.26.1a mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_ssl/2.8.31 OpenSSL/0.9.8e-fips-rhel5 PHP-CGI/0.9
Connection: close
Transfer-Encoding: chunked
Content-Type: text/html

Array
(
)

http://www.url.com/getData.php

Solution

  • As mentioned in the comment on your question, I'm not convinced this is a good idea, and personally I'd avoid any extension that did this like the plague, but...

    You can use the innerHTML property, available on all HTML elements, to get the HTML inside that node (e.g. the body element). You could then use an AJAX request, preferably over HTTPS, to post the data.

    You might also want to consider some form of compression as some pages can be very large and most users have better download speeds than upload speeds.

    NB: innerHTML gets a representation of the source code that would display the page in its current state, NOT the actual source that was sent from the web server. For example, if you used JS to add an element, the source for that element would be included in innerHTML even though it was never sent across the web.

    An alternative would be to use an AJAX request to GET the current URL and send yourself the response. This would be exactly what was sent to the client, but the server in question will be aware the page was served twice (and in some web applications that may cause problems, e.g. by "pressing" a delete button twice).
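A rough sketch of this re-fetch approach, assuming the same `getData.php` endpoint and `PageContents` field used elsewhere in this answer. `forwardOriginalSource` is a hypothetical helper name, not part of Greasemonkey; the request function is passed in as an argument so that `GM_xmlhttpRequest` can be supplied from inside a userscript.

```javascript
// Hypothetical helper: re-download pageUrl, then POST the raw response to serverUrl.
// `request` is assumed to behave like GM_xmlhttpRequest (a details object with
// method, url, data, headers and an onload callback).
function forwardOriginalSource(request, pageUrl, serverUrl) {
  request({
    method: 'GET',
    url: pageUrl,
    onload: function (response) {
      // responseText is the markup as the server sent it, not the live DOM.
      request({
        method: 'POST',
        url: serverUrl,
        data: 'PageContents=' + encodeURIComponent(response.responseText),
        headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
      });
    }
  });
}

// Inside a userscript this might be called roughly as:
// forwardOriginalSource(GM_xmlhttpRequest, location.href, 'http://example.com/getData.php');
```

Keep in mind that the GET here is the second request for the page, so the caveat about pages with side effects still applies.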

    One final suggestion would be to simply send the current URL to yourself and do the download on your own servers. This would also mitigate some of the security risks, as you wouldn't be able to retrieve the content of pages which aren't public.
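A minimal sketch of the URL-only variant. The `PageUrl` field name and the `buildUrlReport` helper are made up for illustration, and the endpoint is the same placeholder used elsewhere in this answer; the server side would need to read that field and fetch the page itself.

```javascript
// Hypothetical helper: build the GM_xmlhttpRequest details for reporting a URL.
function buildUrlReport(pageUrl, serverUrl) {
  return {
    method: 'POST',
    url: serverUrl,
    // Only the address is sent; the server downloads the page on its own.
    data: 'PageUrl=' + encodeURIComponent(pageUrl),
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' }
  };
}

// Only fire the request when actually running as a userscript.
if (typeof GM_xmlhttpRequest === 'function') {
  GM_xmlhttpRequest(buildUrlReport(location.href, 'http://example.com/getData.php'));
}
```

The payload stays tiny regardless of page size, which also sidesteps the upload-speed concern mentioned earlier.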

    EDIT:

    NB: I've deleted much spurious information which was used in tracking down the problem, check the edit logs if you want full details

    PHP Code:

    <?php
        // Read the form field posted by the userscript
        $PageContents = $_POST['PageContents'];
    ?>
    

    Greasemonkey script:

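     var ihtml = document.body.innerHTML;
     GM_xmlhttpRequest({
      method: 'POST',
      url: 'http://example.com/getData.php',
      // encodeURIComponent (unlike escape) also encodes "+" and emits UTF-8
      data: 'PageContents=' + encodeURIComponent(ihtml),
      // PHP only populates $_POST for form-encoded request bodies
      headers: {'Content-Type': 'application/x-www-form-urlencoded'}
     });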
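One detail worth checking when form-encoding page markup is the choice of encoding function. The comparison below is only an illustration with a made-up sample string: the legacy `escape()` leaves `+` untouched (PHP then decodes it as a space) and writes characters above U+00FF as `%uXXXX`, while `encodeURIComponent()` percent-encodes `+` and uses UTF-8 byte sequences.

```javascript
// Sample markup containing a plus sign and a non-ASCII character.
var sample = '<b>1+1 = 2 €</b>';

// Legacy encoding: "+" survives as-is and "€" becomes a %uXXXX sequence,
// which is not valid percent-encoding for a form body.
var legacy = 'PageContents=' + escape(sample);

// Form-safe encoding: "+" becomes %2B and "€" becomes its UTF-8 bytes.
var safe = 'PageContents=' + encodeURIComponent(sample);

console.log(legacy);
console.log(safe);
```

Running this in a browser console or Node prints both bodies side by side, which makes it easy to see what the server would actually receive.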