Tags: perl, cgi-bin

How to collect corpus from a website using Perl


I am looking for a program that collects all of the text (the corpus) from a website and writes it to a single text file. Here is the code I have right now:

#!/usr/bin/perl
use strict;
use warnings;
use CGI;

# CGI header must be printed before any other output
print "Content-Type: text/html; charset=utf-8\n\n";

my $q    = CGI->new;
my $file = $q->param('file');
chomp($file);
print "$file<br>";

# Mirror the site, skipping .gif files
my $ftpname = "www.kuvempu.com";
system("wget --mirror -p --convert-links -x --reject=gif $ftpname");

But that only gives me the website's .html files. How can I extract just the text from those files and write it all to a single text file?


Solution

  • You can do something like the following:

    use strict;
    use warnings;
    use HTML::Strip;
    use LWP::Simple qw/get/;

    # Fetch the page whose URL was given on the command line
    my $html = get(shift) or die "Unable to get web content.";

    # Strip the HTML tags, leaving only the text
    print HTML::Strip->new()->parse($html);


    Command-line usage: perl script.pl http://www.website.com > outFile.txt

    outFile.txt will then contain the text of that one page.
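
    Since your wget --mirror run already saved the whole site's .html files locally, you can extend the same idea to build the full corpus: walk the mirror directory, strip each page, and append everything to one file. A minimal sketch (assuming wget created a directory named after the host, as it does by default; corpus.txt and collect.pl are made-up names):

    use strict;
    use warnings;
    use File::Find;
    use HTML::Strip;

    # Assumption: wget --mirror created a directory named after the host
    my $mirror_dir = shift // 'www.kuvempu.com';

    open my $out, '>:encoding(UTF-8)', 'corpus.txt'
        or die "Cannot open corpus.txt: $!";

    find(sub {
        return unless /\.html?$/;          # only process .html / .htm files
        open my $in, '<:encoding(UTF-8)', $_
            or die "Cannot open $File::Find::name: $!";
        my $html = do { local $/; <$in> }; # slurp the whole file
        close $in;

        my $hs = HTML::Strip->new;         # fresh parser state per file
        print {$out} $hs->parse($html), "\n";
        $hs->eof;
    }, $mirror_dir);

    close $out;

    Run it after your mirroring step, e.g. perl collect.pl www.kuvempu.com, and corpus.txt will hold the combined text of every mirrored page.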

    Hope this helps!