perl, xml-twig

XML::Twig: delete tags without leaving whitespace behind


I have a huge XML file containing a run of elements that I want to remove. I can use a handler to delete them (one is replaced and all the others are removed), but this leaves whitespace behind in the document. How can I replace the elements with 'nothing' so that no blank lines remain? To be clear, the XML files I am dealing with are not optimized: they are indented and span multiple lines.

Here is a small example of my problem.

XML data:

<data>
    <Metadata>
        <Name>Test</Name>
        <Company>Acme</Company>
    </Metadata>
    <Info>
        <tag r="1"/>
        <tag r="2"/>
        <tag r="3"/>
        <tag r="4"/>
        <tag r="5"/>
        <tag r="6"/>
        <tag r="7"/>
        <tag r="8"/>
        <tag r="9"/>
        <tag r="10"/>
        <tag r="11"/>
        <tag r="12"/>
        <tag r="13"/>
        <tag r="14"/>
        <tag r="15"/>
        <tag r="16"/>
        <tag r="17"/>
        <tag r="18"/>
        <tag r="19"/>
        <tag r="20"/>
    </Info>
</data>

After the removal I would like to see

<data>
    <Metadata>
        <Name>Test</Name>
        <Company>Acme</Company>
    </Metadata>
    <Info>
        <Newtag>abc</Newtag>
    </Info>
</data>

but instead I get

<data>
    <Metadata>
        <Name>Test</Name>
        <Company>Acme</Company>
    </Metadata>
    <Info>
        
        
        
        
        
        
        
        
        
        
        
        
        
        
        <Newtag>abc</Newtag>
        
        
        
        
        
    </Info>
</data>

How should my code (below) be modified so that this whitespace is not left behind?

use strict;
use warnings;
use XML::Twig;

my $START_Number=1;
my $END_Number=20;
my $fh1A_file='file containing the XML to modify';

my $twig = XML::Twig->new(
            pretty_print => 'none',
            twig_roots => {'tag' => sub{modify_datatagX2_TEST1(@_,$START_Number,$END_Number)}},twig_print_outside_roots => 1);
        $twig->parsefile_inplace($fh1A_file);
        $twig->flush;
#
#
#
sub modify_datatagX2_TEST1 {
    my ( $twig, $datatag, $START_Number, $END_Number) = @_;
    my $Match_Found;
    #                      
    if(int($datatag -> att('r'))>$END_Number || int($datatag -> att('r'))<$START_Number){
        $twig->flush;
    } else {
        $Match_Found=0;
        if(int($datatag -> att('r'))>=$START_Number && int($datatag -> att('r'))<=$END_Number){
            $datatag->delete;
            $Match_Found++;
        }
        print '<Newtag>abc</Newtag>' if $Match_Found==1 and int($datatag -> att('r'))==15;
        $twig->flush if $Match_Found==0;
        #END1:
    }
}

Solution

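  • Whitespace between tags shouldn't matter to anything else that consumes the XML, and it is not so much created as simply not removed: with tag as the only twig root, the whitespace between the tag elements lies outside the roots, so twig_print_outside_roots prints it verbatim. If you want it gone for cosmetic reasons, without reformatting the entire file afterwards, the trick is to call flush only after all the tag elements have been processed, by adding a new handler for Info. Inside a root, XML::Twig discards whitespace-only text between elements by default: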

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use XML::Twig;
    
    my $START_Number=1;
    my $END_Number=20;
    my $fh1A_file='file containing the XML to modify';
    
    my $twig = XML::Twig->new(
        pretty_print => 'none',
        twig_roots => {
            'Info' => sub { $_[0]->flush },
            'tag' => sub { modify_datatagX2_TEST1(@_, $START_Number, $END_Number) }
        },
        twig_print_outside_roots => 1
        );
    $twig->parsefile_inplace($fh1A_file);
    $twig->flush;
    
    # Note a lot of cleanup here.
    sub modify_datatagX2_TEST1 {
        my ($twig, $datatag, $START_Number, $END_Number) = @_;
        # Only fetch the attribute once; no need for int()
        my $r = $datatag->att('r'); 
        if ($r >= $START_Number && $r <= $END_Number){
            # Don't just blindly print text, replace the element in the XML tree on match
            if ($r == 15) {
                my $newtag = XML::Twig::Elt->new(Newtag => 'abc');
                $newtag->replace($datatag);
            } else {
                $datatag->delete;
            }
        }
    }
    

    This will produce the following, with Info and its new child collapsed onto a single line:

    <data>
        <Metadata>
            <Name>Test</Name>
            <Company>Acme</Company>
        </Metadata>
        <Info><Newtag>abc</Newtag></Info>
    </data>
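
As far as I recall, the XML::Twig documentation cautions against nesting twig_roots elements, so a variant worth trying is to make Info the only root and register the tag callback under twig_handlers instead. The following is a minimal sketch of that arrangement, not a drop-in replacement: the sample XML is shortened, the values of $start, $end and $replace_at are assumptions chosen for the sample, and the output is collected in an in-memory string (rather than written in place) purely so the result can be inspected.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Shortened sample input, modelled on the data in the question.
my $xml = <<'XML';
<data>
    <Metadata>
        <Name>Test</Name>
    </Metadata>
    <Info>
        <tag r="1"/>
        <tag r="2"/>
        <tag r="3"/>
    </Info>
</data>
XML

# Assumed values for this sketch: remove tags 1..3, replace tag 2.
my ($start, $end, $replace_at) = (1, 3, 2);

# Collect all output in a string through an in-memory filehandle.
my $output = '';
open my $out, '>', \$output or die "Cannot open in-memory handle: $!";

my $twig = XML::Twig->new(
    pretty_print => 'none',
    # Info is the only root: print it once it is complete, then free it.
    twig_roots => {
        Info => sub {
            my ($t, $info) = @_;
            $info->print($out);
            $t->purge;
        },
    },
    # Called for every tag element built inside the Info root.
    twig_handlers => {
        tag => sub {
            my ($t, $tag) = @_;
            my $r = $tag->att('r');
            return unless $r >= $start && $r <= $end;
            if ($r == $replace_at) {
                XML::Twig::Elt->new(Newtag => 'abc')->replace($tag);
            } else {
                $tag->delete;
            }
        },
    },
    # Everything outside Info is copied through untouched.
    twig_print_outside_roots => $out,
);

$twig->parse($xml);
close $out;

print $output;
```

For a real file, the in-memory handle could be dropped in favour of twig_print_outside_roots => 1 together with parsefile_inplace, as in the answer above. Only the content of Info is held in memory at any time, which is the point of using twig_roots on a huge file.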