I am retrieving an ´ISO-8859-1´ encoded website using ´LWP::UserAgent´ with the code below.
The problem is that special characters are not displayed correctly; in particular, the "€" sign comes out wrong.
The content encoding is detected as ´ISO-8859-1´, which is correct.
To inspect the retrieved text, I save it to a file and open it in Notepad++.
Question: How can I retrieve ´ISO-8859-1´ encoded special characters correctly?
#SENDING REQUEST
my $ua = LWP::UserAgent->new();
$ua->agent('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'); # pretend we are very capable browser
my $req = HTTP::Request->new(GET => $url);
#add some header fields
$req->header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
$req->header('Accept-Language', 'en;q=0.5');
$req->header('Connection', 'keep-alive');
$req->header('Host', 'www.url.com');
#SEND
my $response = $ua->request($req);
#decode trial1
print $response->content_charset(); # gives ISO-8859-1 which is right
my $content = $response->decoded_content(); #special chars are displayed wrong
#decode trial2
my $decContent = decode('ISO-8859-1', $response->content());
my $utf8Content = encode( 'utf-8', $decContent ); #special char € is displayed as Â
#decode trial3
Encode::from_to($content, 'iso-8859-1', 'utf8'); #special char € is displayed as Â too
#example on writing data to file
open(MYOUTFILE, ">>D:\\encodingperl.html"); #open for append (">>" adds to the file, it does not overwrite)
print MYOUTFILE "$utf8Content"; #write text
close(MYOUTFILE);
The same way as for any other encoding:
my $content = $response->decoded_content();
That said, the iso-8859-1 charset does not include the Euro sign. You probably actually have cp1252. You can fix that as follows:
my $content = $response->decoded_content( charset => 'cp1252' );
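The difference between the two charsets can be checked without any network access. The following is a minimal sketch using only the core Encode module; it assumes the server sends the single byte 0x80 for the euro sign, which is what a cp1252 page does. The variable names are illustrative and not part of the original code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: 0x80 is the byte the page sends for the euro sign.
my $latin1_char = decode('ISO-8859-1', "\x80");  # a C1 control character
my $cp1252_char = decode('cp1252',     "\x80");  # the euro sign

printf "ISO-8859-1: U+%04X\n", ord $latin1_char;  # prints U+0080
printf "cp1252:     U+%04X\n", ord $cp1252_char;  # prints U+20AC

# Encoding the correctly decoded character as UTF-8 yields three bytes.
my $utf8_bytes = encode('UTF-8', $cp1252_char);
printf "UTF-8 bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $utf8_bytes;  # E2 82 AC
```

Decoding as ISO-8859-1 turns the byte into the invisible control character U+0080, which is why re-encoding it as UTF-8 in the question produced garbage instead of "€".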
Your second problem is that you don't encode your output. This is how you'd do it.
open(my $MYOUTFILE, '>>:encoding(cp1252)', 'D:\\encodingperl.html')
    or die $!;
print $MYOUTFILE $content;
Use the encoding that's appropriate for you (e.g. UTF-8) if it's not cp1252 you want. If you want the original file in the original encoding, use
my $content = $response->decoded_content( charset => 'none' );
and
open(my $MYOUTFILE, '>>', 'D:\\encodingperl.html')
    or die $!;
binmode($MYOUTFILE);
print $MYOUTFILE $content;
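To see what the output layer actually does, the sketch below writes one decoded euro character through different layers and reads the raw bytes back. It is an illustration under stated assumptions, not part of the original answer: bytes_written is a hypothetical helper, and a temporary file from the core File::Temp module stands in for D:\encodingperl.html.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text through the given layer and return
# the bytes that ended up in the file, formatted as hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open(my $out, ">$layer", $path) or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    # Read the file back without any decoding.
    open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;                      # slurp the whole file
    my $raw = <$in>;
    close $in;
    return join ' ', map { sprintf '%02X', ord } split //, $raw;
}

my $euro = "\x{20AC}";   # decoded text: a single character

print bytes_written(':encoding(cp1252)', $euro), "\n";   # 80
print bytes_written(':encoding(UTF-8)',  $euro), "\n";   # E2 82 AC
```

The same decoded string produces different bytes depending on the layer, so the layer should match the encoding the viewer (here Notepad++) is set to use.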