
How to extract the content between two tags using a regex in a Perl script


I want to extract the content between <ix:hidden> and </ix:hidden>. Please advise how to do this.

<ix:hidden>
<ix:nonNumeric contextRef="Duration_4_1_2021_To_3_31_2022_IlKaMcQ2N0C41UxW3xo4zg" name="dei:DocumentType" id="Tc_evMsUKdlCEyCZtbxEMZIxg_1_1">DEF 14A</ix:nonNumeric>
<ix:nonNumeric contextRef="Duration_4_1_2021_To_3_31_2022_IlKaMcQ2N0C41UxW3xo4zg" name="dei:AmendmentFlag" id="Tc_nHcapE52UUqrWD0pLkbdag_2_1">false</ix:nonNumeric>
<ix:nonNumeric contextRef="Duration_4_1_2021_To_3_31_2022_IlKaMcQ2N0C41UxW3xo4zg" name="dei:EntityRegistrantName" id="Tc_ZXMW19KSmk2TfvdhMCMr_A_3_1">Walter Hamscher Co Number One</ix:nonNumeric>
<ix:nonNumeric contextRef="Duration_4_1_2021_To_3_31_2022_IlKaMcQ2N0C41UxW3xo4zg" name="dei:EntityCentralIndexKey" id="Tc_MybzAywpbUCU3LEGZc_Ftg_4_1">0000990667</ix:nonNumeric>
</ix:hidden>

My script so far:

use strict;
use warnings;

my @ar_sp;
my $string;
my @ar_out;

# Source File 
my $src = 'iXBRL-Tagged_tm213138-13_def14a.htm';

# open source file for reading
open(FHR, '<', $src);
  
# Destination File
my $des = 'output.txt';

# Open new file to write
open(FHW, '>', $des);
  
  
print("Copying content from $src to $des\n");
@ar_sp = <FHR>;

# Copy data from one file to another.
foreach $string ( @ar_sp ) 
{
    if ($string =~ m/<ix:hidden>(.*?)<\/ix:hidden>/)
    {
        print "Yes" . "\n";
        $string =~ m/<ix:hidden>(.*?)<\/ix:hidden>/;
        print FHW $string;
    }
    
}

# Closing the filehandles
close(FHR);
close(FHW);
   
print "File content copied successfully!\n";

===========================================

The regex defined in the script never matches.



Solution

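  • There are good XML parsers out there. Don't reinvent the wheel, poorly. (Your regex never matches because the script tests one line at a time, while <ix:hidden> and </ix:hidden> sit on different lines.)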

    use strict;
    use warnings;
    use XML::LibXML               qw( );
    use XML::LibXML::XPathContext qw( );
    
    my $doc = XML::LibXML->new->parse_file( 'iXBRL-Tagged_tm213138-13_def14a.htm' );
    
    # Bind the "ix" prefix for XPath; use the namespace URI declared in the document.
    my $xpc = XML::LibXML::XPathContext->new();
    $xpc->registerNs( ix => 'http://...' );
    
    # Print the serialized children of every ix:hidden element.
    for my $hidden_node ( $xpc->findnodes( '//ix:hidden', $doc ) ) {
       print $_->toString() for $hidden_node->childNodes();
    }
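
  • If a regex is still preferred for a quick one-off, the file has to be read into a single string and the pattern needs the /s modifier so that "." also matches newlines. The following is only a minimal sketch under those assumptions, not a replacement for a parser: the helper names extract_hidden and slurp and the shortened inline sample are made up for illustration, and the pattern assumes the opening tag is written exactly as <ix:hidden> with no attributes and that the elements are not nested.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text inside every <ix:hidden> element.
sub extract_hidden {
    my ($content) = @_;
    # /s lets "." match newlines, /g collects every block,
    # and the non-greedy ".*?" stops at the first closing tag.
    my @blocks = $content =~ m{<ix:hidden>(.*?)</ix:hidden>}sg;
    return @blocks;
}

# Hypothetical helper: reads a whole file into one string
# instead of a list of lines.
sub slurp {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!";
    local $/;    # no record separator: one read returns the entire file
    my $content = <$fh>;
    close($fh);
    return $content;
}

# Shortened inline sample (illustrative) so the sketch runs without the original file.
my $sample = <<'END';
<p>before</p>
<ix:hidden>
<ix:nonNumeric name="dei:DocumentType">DEF 14A</ix:nonNumeric>
<ix:nonNumeric name="dei:AmendmentFlag">false</ix:nonNumeric>
</ix:hidden>
<p>after</p>
END

# Use the file named on the command line if there is one, else the sample.
my $content = @ARGV ? slurp($ARGV[0]) : $sample;
print for extract_hidden($content);
```

    Run it with the .htm file as the first argument to process the real document. Anything the pattern does not anticipate (attributes on the tag, a different prefix, nesting) will silently produce no match, which is the main reason to prefer the parser.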