Tags: regex, xml, perl, parsing, xmlreader

Generic solution for removing the XML declaration using Perl


Hi, I want to remove the declaration from my XML file. The problem is that the declaration is sometimes on the same line as the root element.

XML looks as follows

Case 1:

<?xml version="1.0" encoding="UTF-8"?> <document> This is a document root
<child>----</child>
</document>

Case 2:

<?xml version="1.0" encoding="UTF-8"?> 
<document> This is a document root
<child>----</child>
</document>

The function should work in both cases, whether the root element is on the same line as the declaration or on the next line.

My function works only for case 2:

sub getXMLData {
  my ($xml) = @_;
  my @data = ();
  open(FILE,"<$xml");
  while(<FILE>) {
    chomp;
    if(/\<\?xml\sversion/) {next;}
    push(@data, $_);    
  }
  close(FILE);
  return join("\n",@data);

}

Please note that the encoding is not always the same.


Solution

  • OK, so the problem here is that you're trying to parse XML line by line, and that DOESN'T WORK. You should avoid doing it, because it makes for brittle code that will one day break, as you've noted, on perfectly valid changes to the source XML. Both your documents are semantically identical, so the fact that your code handles one and not the other is an example of exactly why processing XML this way is a bad idea.

    More importantly though - why are you trying to remove the XML declaration from your XML? What are you trying to accomplish?

    Generically reformatting XML can be done like this:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use XML::Twig;
    
    my $twig = XML::Twig->new(
        pretty_print  => 'indented',
    );
    $twig->parsefile('your_xml_file');
    $twig->print;
    

    This will parse your XML and reformat it in one of the valid ways XML may be formatted. However I would strongly urge you not to just discard your XML declaration, and instead carry on with something like XML::Twig to process it. (Open a new question with what you're trying to accomplish, and I'll happily give you a solution that doesn't trip up with different valid formats of XML).
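
    If the declaration really does have to go, the parser can still do the work. The following is only a sketch, assuming XML::Twig is installed: it parses the text and serialises just the root element with `sprint`, so the prolog is never written out. The helper name `xml_without_declaration` and the inline sample strings are made up for illustration; with a file you would call `parsefile` instead of `parse`.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Hypothetical helper: returns the document as a string, minus the XML declaration.
sub xml_without_declaration {
    my ($xml_text) = @_;

    my $twig = XML::Twig->new;
    $twig->parse($xml_text);       # dies if the XML is not well-formed

    # Serialise only the root element; the declaration belongs to the
    # document prolog, not to the root, so it is not part of this string.
    return $twig->root->sprint;
}

# Case 1: declaration and root element on the same line.
my $case1 = qq{<?xml version="1.0" encoding="UTF-8"?> <document> This is a document root\n}
          . qq{<child>----</child>\n</document>};

# Case 2: root element on the next line, and a different declared encoding.
my $case2 = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<document> This is a document root\n}
          . qq{<child>----</child>\n</document>};

print xml_without_declaration($_), "\n" for $case1, $case2;
```

    Because the parser works on the document tree rather than on lines, it makes no difference where the line breaks fall or which encoding the declaration names. Bear in mind that XML::Twig writes UTF-8 by default, so once the declaration is gone nothing in the output states the encoding any more.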

    When it comes to merging XML documents, XML::Twig can do this too - and still check and validate your XML as it goes.

    So you might do something like this (extending the reformatting script above, where $twig has already parsed a file):

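    # @file_list holds the paths of the XML files to merge
    foreach my $file ( @file_list ) {
      my $child = XML::Twig -> new (); 
      $child -> parsefile ( $file );
    
      # detach the child's root element and graft it under $twig's root
      my $child_doc = $child -> root -> cut;
      $child_doc -> paste ( $twig -> root );
    }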
    
    $twig -> print;
    

    Exactly what you'd need to do depends a little on your desired output structure; you'd need to wrap everything in a single root element anyway. Open a new question with some sample input and desired output, and I'll happily take a crack at it.

    As an example - if you feed the above your sample input twice, you get:

    <?xml version="1.0" encoding="UTF-8"?>
    <document><document> This is a document root
    <child>----</child></document> This is a document root
    <child>----</child></document>
    

    Which I know isn't likely to be what you want, but hopefully illustrates a parser-based way of restructuring XML.
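
    The wrapping in a root element mentioned earlier could look something like the sketch below. It assumes XML::Twig is installed; the wrapper name `documents` and the inline sample strings are invented for the example, and real code would most likely read each input with `parsefile`. Each parsed root is cut out of its own twig and pasted as the last child of the wrapper, so the inputs keep their order and none of their declarations are carried across.

```perl
#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

# Hypothetical inputs; the declarations differ in layout and encoding on purpose.
my @documents = (
    qq{<?xml version="1.0" encoding="UTF-8"?> <document> first\n<child>----</child>\n</document>},
    qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<document> second\n<child>----</child>\n</document>},
);

# Start from an empty wrapper element ('documents' is a made-up name).
my $merged = XML::Twig->new;
$merged->parse('<documents/>');

foreach my $xml_text (@documents) {
    my $child = XML::Twig->new;
    $child->parse($xml_text);

    # Detach the root from its own twig, then append it to the wrapper.
    my $child_root = $child->root;
    $child_root->cut;
    $child_root->paste( last_child => $merged->root );
}

# The wrapper was parsed without a declaration, so none is printed here.
my $output = $merged->sprint;
print $output, "\n";
```

    Using `last_child` rather than the default paste position keeps the documents side by side under the wrapper instead of nesting one inside the other.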