A perl script that scrapes static html pages from a website and writes them to individual files appears to work, but also prints many instances of wide character in print at ./script.pl line n
to console: one for each page scraped.
However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (@urls) {
$mech->get($_);
print FILE $mech->content; #MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
For the future, you could use $mech->response->decoded_content
which should give you UTF-8 regardless of what encoding the web server used. The you would binmode(FILE, ':utf8')
before writing to it, to ensure that Perl's internal string representation is converted to strict UTF-8 bytes on output.