I am currently using Perl's HTML::Strip to extract text from my HTML file, but I have run into a minor problem with HTML-specific spaces, i.e. the "&nbsp;" entity. For some reason HTML::Strip->parse() does not seem to handle it. I know I can run a substitution afterwards, but I wanted to check whether there is another way to accomplish this by tweaking the new() constructor. Thanks in advance.
Perl Code:
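use HTML::Strip;
# @htmlSource holds the lines of the HTML file (read elsewhere)
my $hs = HTML::Strip->new();
my $line = join('', @htmlSource);
my $clean_text = $hs->parse($line);
# keep only the non-blank lines of the stripped text
push @processedLines, grep { /\S/ } split(/\n/, $clean_text);
foreach my $f (@processedLines) {
    print "$f\n";
}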
Sample Output:
CBD_UnitTest
MtrTempEst
MtrTempEst_Init1 (C1-Coverage: 100.00 %, 1 out of 1 Testcases passed)
LeadLagFilt (C1-Coverage: 100.00 %, 1 out of 1 Testcases failed)
AssMechFiltInit (C1-Coverage: 100.00 %, 1 out of 1 Testcases passed)
Sample Dataset:
<table bgcolor="white" width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td align="center">
<table width="100%" cellspacing="0" cellpadding="1" bgcolor="white" border="0">
<tr bgcolor="#dcdcdc">
<td width="1%" bgcolor="white">
<img border="0" src="pictures/batch_module_notok.jpg"/>
</td>
<td colspan="3" width="1%">
<font face="tahoma" size="-2" color="black">
CBD_UnitTest
</font>
</td>
<td width="1%">
</td>
<td width="1%">
</td>
<td width="1%">
<img border="0" src="pictures/batch_check_notok.gif"/>
</td>
</tr>
<tr bgcolor="white">
<td width="1%" bgcolor="white">
</td>
<td width="1%" bgcolor="white">
<img border="0" src="pictures/batch_module_notok.jpg"/>
</td>
<td colspan="2">
<font face="tahoma" size="-2" color="black">
MtrTempEst
</font>
</td>
<td width="1%">
</td>
<td width="1%">
</td>
<td width="1%">
<img border="0" src="pictures/batch_check_notok.gif"/>
</td>
</tr>
<tr bgcolor="#dcdcdc">
<td width="1%" bgcolor="white">
</td>
<td width="1%" bgcolor="white">
</td>
<td width="1%" bgcolor="white">
<img border="0" src="pictures/batch_ok.jpg"/>
</td>
<td>
<a href="#CBD_UnitTest:MtrTempEst:ts_MtrTempEst_Init1"><font face="tahoma" size="-2" color="black">
MtrTempEst_Init1 (C1-Coverage: 100.00 %, 1 out of 1 Testcases passed)
</font></a>
</td>
<td width="1%">
</td>
<td width="1%">
</td>
<td width="1%">
<img border="0" src="pictures/batch_check_ok.gif"/>
</td>
</tr>
<tr bgcolor="#FF0000">
<td width="1%" bgcolor="white">
</td>
<td width="1%" bgcolor="white">
</td>
<td width="1%" bgcolor="white">
<img border="0" src="pictures/batch_notok.jpg"/>
</td>
<td>
<a href="#CBD_UnitTest:MtrTempEst:ts_LeadLagFilt"><font face="tahoma" size="-2" color="white">
<b>LeadLagFilt (C1-Coverage: 100.00 %, 1 out of 1 Testcases failed)</b>
</font></a>
</td>
<td width="1%">
<a name="LeadLagFilt_0"></a>
</td>
<td width="1%">
</td>
<td width="1%">
<img border="0" src="pictures/batch_check_notok.gif"/>
</td>
</tr>
<tr bgcolor="#dcdcdc">
<td width="1%" bgcolor="white">
</td>
<td width="1%" bgcolor="white">
</td>
<td width="1%" bgcolor="white">
<img border="0" src="pictures/batch_ok.jpg"/>
</td>
<td>
<a href="#CBD_UnitTest:MtrTempEst:ts_AssMechFiltInit"><font face="tahoma" size="-2" color="black">
AssMechFiltInit (C1-Coverage: 100.00 %, 1 out of 1 Testcases passed)
</font></a>
</td>
<td width="1%">
</td>
<td width="1%">
</td>
<td width="1%">
<img border="0" src="pictures/batch_check_ok.gif"/>
</td>
</tr>
</table>
</td>
</tr>
</table>
I figured out the answer from the link to HTML::Entities above. Thanks, @edibleEnergy.
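use HTML::Strip;
use HTML::Entities;    # exports _decode_entities by default
my $hs = HTML::Strip->new();
my $line = join('', @htmlSource);
# Decode in place before stripping: map the nbsp entity to the empty string.
# The key has no ";" suffix, so "&nbsp;" and "&nbsp" both match; the final 1
# ($expand_prefix) also matches it as a prefix of a longer unknown name.
_decode_entities($line, { nbsp => "" }, 1);
my $clean_text = $hs->parse($line);
# keep only the non-blank lines of the stripped text
push @processedLines, grep { /\S/ } split(/\n/, $clean_text);
foreach my $f (@processedLines) {
    print "$f\n";
}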
I understand that we could just use a simple substitution here (i.e. s/&nbsp;//g), but the approach above also works whether or not the entity ends with the trailing ";". Please check the link provided in @edibleEnergy's answer.
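To make the with-or-without-";" behaviour concrete, here is a minimal sketch, assuming HTML::Entities is installed. The helper name strip_nbsp is hypothetical and only wraps the same _decode_entities call on a copy of the string; the sample inputs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Entities;    # exports _decode_entities by default

# Hypothetical helper (not part of HTML::Entities): returns a copy of
# $text with every nbsp entity removed.
sub strip_nbsp {
    my ($text) = @_;
    # _decode_entities edits its first argument in place. Named entities
    # missing from the hash (e.g. &lt;) are left alone. The key "nbsp" has
    # no ";" suffix, so it matches with or without the terminator; the
    # final 1 ($expand_prefix) also matches it as a prefix of a longer
    # unrecognised name such as "&nbspbar".
    _decode_entities($text, { nbsp => '' }, 1);
    return $text;
}

print strip_nbsp('foo&nbsp;bar'), "\n";      # foobar
print strip_nbsp('foo&nbsp bar'), "\n";      # foo bar
print strip_nbsp('foo&nbspbar'), "\n";       # foobar
print strip_nbsp('1 &lt; 2&nbsp;'), "\n";    # 1 &lt; 2
```

A plain s/&nbsp;//g would only catch the first and last of these inputs. Note that _decode_entities also expands numeric references such as &#160; regardless of the hash, so a literal non-breaking space character can still reach HTML::Strip and may need separate handling.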