I am looking for a program that collects the entire text corpus of a website and writes it to a single text file. This is the code I have right now:
#!/usr/bin/perl
use strict;
use warnings;
use CGI;
use Cwd;

print "Content-Type: text/html; charset=utf-8\n\n";

my $q = CGI->new;
my $file = $q->param('file');   # $a is reserved for sort(), so use a normal name
chomp($file);
print "$file<br>";

my $ftpname = "www.kuvempu.com";
# Mirror the site, converting links and skipping GIF images.
system("wget --mirror -p --convert-links -x --reject=gif $ftpname");
But this only gives me the .html files of the website. How can I extract just the text from those files and write it to a single text file?
You can do something like the following:
use strict;
use warnings;
use HTML::Strip;
use LWP::Simple qw/get/;

# Fetch the page given as the first command-line argument.
my $html = get(shift @ARGV) or die "Unable to get web content.";

# Strip the HTML tags and print the plain text to STDOUT.
my $stripper = HTML::Strip->new();
print $stripper->parse($html);
$stripper->eof;
Command-line usage:
perl script.pl http://www.website.com > outFile.txt
outFile.txt will then contain the extracted text of that page.
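That handles a single page. To cover the rest of your question, i.e. turning all of the .html files already mirrored by wget into one text file, here is a minimal sketch that walks the download directory with File::Find and strips each page with HTML::Strip. The directory name www.kuvempu.com and the output file corpus.txt are assumptions based on your wget call; adjust them to your setup.

use strict;
use warnings;
use File::Find;
use HTML::Strip;

# Directory created by the wget --mirror call in the question (assumed name).
my $mirror_dir = 'www.kuvempu.com';

# All extracted text goes into this single file (name chosen for illustration).
open my $out, '>:encoding(UTF-8)', 'corpus.txt'
    or die "Cannot open corpus.txt: $!";

find(sub {
    return unless /\.html?$/;                  # only process .html/.htm files
    open my $in, '<:encoding(UTF-8)', $_
        or do { warn "Skipping $File::Find::name: $!"; return };
    local $/;                                  # slurp the whole file at once
    my $html = <$in>;
    close $in;

    my $hs = HTML::Strip->new();               # strip tags, keep the text
    print {$out} $hs->parse($html), "\n";
    $hs->eof;
}, $mirror_dir);

close $out;

Run it from the directory where you ran wget, and corpus.txt should end up with the plain text of every mirrored page.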
Hope this helps!