Search code examples
javaspringheritrix

In Heritrix crawler tool how to extract the contents from crawled urls


Am new to heritrix tool and now i am able to crawl the web pages from www and now want to extract the contents of the crawled urls.

please help me any one.please.Thanks in Advance.


Solution

  •  1.first download the file  wget http://python.org/ftp/python/3.3.0/Python-3.3.0.tgz or higher version as root user.
     2. change the directory to installed python
     3. example /opt/python3.3/;
     4. configure the files ./configure --prefix=/opt/python3.3
     5.make
     6. sudo make install
     7. /opt/python3.3/bin/python3
     8.opt/python3.3/bin/pyvenv ~/py33
     9.source ~/py33/bin/activate
     10. wget http://python-distribute.org/distribute_setup.py
     11.python distribute_setup.py  
     12. easy_install pip
     13. pip install bottle
     14. pip install warcat 
     15. if successfully installed warcat then check whether your warcat is installed or not.
     16. python3 -m warcat --help after enter then we can see some help commands like, list,concat,extract etc..
     17.python3 -m warcat list example/at.warc.gz
     this is worked for me ..enjoy