java - In the Heritrix crawler tool, how to extract the contents from crawled URLs

- April 15, 2010

I am new to the Heritrix tool. I am able to crawl web pages from the WWW, and now I want to extract the contents of the crawled URLs.

Could anyone please help me? Thanks in advance.

1. First download the source as the root user: wget http://python.org/ftp/python/3.3.0/python-3.3.0.tgz (or a higher version).
2. Change to the directory where Python is to be installed,
3. for example /opt/python3.3/
4. Configure the build: ./configure --prefix=/opt/python3.3
5. make
6. sudo make install
7. /opt/python3.3/bin/python3
8. /opt/python3.3/bin/pyvenv ~/py33
9. source ~/py33/bin/activate
10. wget http://python-distribute.org/distribute_setup.py
11. python distribute_setup.py
12. easy_install pip
13. pip install bottle
14. pip install warcat
15. Once warcat is installed, check whether the installation succeeded:
16. python3 -m warcat --help (after pressing Enter you can see commands such as list, concat, extract, etc.)
17. python3 -m warcat list example/at.warc.gz

This worked for me. Enjoy.
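If you would rather read the crawled pages from your own script than through the warcat command line, the idea can be sketched with the Python standard library alone. This is a minimal sketch, not warcat's API: it assumes the crawl was written as gzip-compressed WARC files (like the example/at.warc.gz above), and the function names iter_warc_records and extract_responses are hypothetical names chosen for illustration.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError("not a WARC record: %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the size of the record body in bytes
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def extract_responses(source):
    """Return {url: payload bytes} for the response records of a .warc.gz.

    `source` may be a file name or an open binary file object.
    """
    pages = {}
    with gzip.open(source, "rb") as stream:
        for headers, body in iter_warc_records(stream):
            if headers.get("warc-type") != "response":
                continue  # skip warcinfo, request and metadata records
            # The body is the raw HTTP response: status line and headers,
            # an empty line, then the payload. Chunked transfer encoding
            # and content compression are NOT decoded in this sketch.
            _, _, payload = body.partition(b"\r\n\r\n")
            pages[headers.get("warc-target-uri")] = payload
    return pages
```

Calling extract_responses("example/at.warc.gz") would then give a dictionary mapping each crawled URL to the bytes that were fetched for it. For large crawls it is probably better to process the records one at a time instead of collecting them all in memory.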
