Fix multiple lines of an xml file without id to separate


I have a large externally generated xml file that has some invalid characters, a backslash in my case. I know what to replace these fields with, so I can gedit a single file and fix it manually. However there are many of these files, all with the same problem. I would like to write a bash script to fix them all.

Problem

The problematic section looks like this.

<root>
 <array>
  <dimension> dim="1">gridpoints</dimension>
  <field> a </field>
  <field> b </field>
  <field> c </field>
  <field> \00\00\00 </field>
  <field> \00\00\00 </field>
  <field> \00\00\00 </field>
  <set> 
   All the data 
  </set>
 </array>
</root>

Desired output

<root>
 <array>
  <dimension> dim="1">gridpoints</dimension>
  <dimension> dim="2">morepoints</dimension>
  <dimension> dim="3">evenmorepoints</dimension>
  <field> a </field>
  <field> b </field>
  <field> c </field>
  <field> d </field>
  <field> e </field>
  <field> f </field>
  <set> 
   All the data 
  </set>
 </array>
</root>

Fix so far

I have already found a way to remove the offending backslashes using perl, but I can't figure out how to edit the fields individually. The code below produces the desired structure, but with every field set to "a".

#!/bin/bash
perl -CSDA -pe'
   s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
' file.xml > temp.xml
xmlstarlet ed -u "/root/array/field" -v "a" temp.xml > file_fixed.xml

I will also gladly take any advice on how to do this more efficiently. Thank you.
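
On why every field ends up as "a": `xmlstarlet ed -u` writes the value to every node the XPath matches, and `/root/array/field` matches all six fields. A positional predicate such as `field[4]` narrows the match to a single node. Below is a minimal sketch, assuming the broken fields always sit at known positions (here the 4th to 6th `<field>`) and that the replacement values are known in advance. The function name `fix_fields_by_position` and its arguments are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: update individual <field> elements by position.
# Usage: fix_fields_by_position INPUT.xml ARRAY_XPATH FIRST_INDEX VALUE...
fix_fields_by_position() {
    input=$1
    array_xpath=$2
    index=$3
    shift 3

    # Turn each remaining VALUE into one "-u XPATH -v VALUE" pair, so that
    # a single xmlstarlet run applies all updates.
    count=$#
    while [ "$count" -gt 0 ]; do
        value=$1
        shift
        # [N] selects only the N-th <field> child of the array element.
        set -- "$@" -u "${array_xpath}/field[${index}]" -v "$value"
        index=$((index + 1))
        count=$((count - 1))
    done

    xmlstarlet ed "$@" "$input"
}
```

It would be called as `fix_fields_by_position temp.xml /root/array 4 d e f > file_fixed.xml` after the perl cleanup step. For the larger example in the edit below, the path would be `/root/path1/array` instead.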

Edit

As requested by zdim, I have added an example that is more representative of the full file I am dealing with.

<root>
 <path1>
  <array>
   <dimension> dim="1">gridpoints</dimension>
   <field> a </field>
   <field> b </field>
   <field> c </field>
   <field> \00\00\00 </field>
   <field> \00\00\00 </field>
   <field> \00\00\00 </field>
   <set> 
    All the data 
   </set>
  </array>
 </path1>
 <path2>
  <array>
   <dimension> dim="1">gridpoints</dimension>
   <field> Behaves Correctly </field>
  </array>
 </path2>
</root>

It should be noted that I receive these files as output from another program and then need to fix them before feeding them into the next one. I am nowhere near experienced with xml, which is why I may have missed some obvious solutions.


Solution

  • Use a proper XML parser.

    With XML::LibXML, one way

    use warnings;
    use strict;
    use feature 'say';
    
    use XML::LibXML;
    
    my $filename = shift // die "Usage: $0 file.xml\n";
    
    my $doc = XML::LibXML->load_xml(location => $filename);
    
    # Remove unwanted nodes
    foreach my $node ($doc->findnodes('//field')) { 
        #say $node->toString;   
        if ($node->toString =~ m{\\00\\00\\00}) {
            say "Removing $node";
            $node->parentNode->removeChild($node);
        }   
    }
    
    # Add desired new nodes (right after the last <field> node)
    my $last_field_node = ( $doc->findnodes('//field') )[-1];
    my $field_node_name = $last_field_node->nodeName;
    my $parent = $last_field_node->parentNode;
    
    for my $text ("d".."f") {
        my $new_elem = $doc->createElement( $field_node_name );
        $new_elem->appendText($text);
        $parent->insertAfter($new_elem, $last_field_node);
        $last_field_node = $new_elem;  # next one goes after this, keeping d, e, f order
    }
    
    # Add other nodes (like the mentioned "dimension") the same way
    
    print $doc->toString;
    

    I use a basic regex to recognize a node to remove, as given in the example. Please adjust the code as suitable to your actual input.

    This adds new nodes after the last <field> node. But if they need to go right after the removed nodes, while there may be further <field> nodes later on, then first add them after the last <field> node that needs to be removed, and only then remove those nodes.
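
The add-first, remove-second order described above could look like the following with XML::LibXML, wrapped in a shell function so it can be called from a bash script. This is only a sketch: it assumes the broken fields contain a literal backslash followed by `00`, as in the posted example, and that they all share one parent. The name `replace_broken_fields` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: insert replacement <field> nodes right after the last
# broken <field>, then remove the broken ones. Needs perl with XML::LibXML.
# Usage: replace_broken_fields FILE.xml VALUE...
replace_broken_fields() {
    perl - "$@" <<'PERL'
use strict;
use warnings;
use XML::LibXML;

my ($file, @values) = @ARGV;
my $doc = XML::LibXML->load_xml(location => $file);

# Assumption: a broken field contains a literal backslash followed by 00.
my @bad = grep { $_->textContent =~ m{\\00} } $doc->findnodes('//field');

if (@bad) {
    my $anchor = $bad[-1];              # last node that will be removed
    my $parent = $anchor->parentNode;
    for my $value (@values) {
        my $new = $doc->createElement('field');
        $new->appendText($value);
        $parent->insertAfter($new, $anchor);
        $anchor = $new;                 # next one goes after this one
    }
    # Only now drop the broken nodes; the new ones stay in their place.
    $_->parentNode->removeChild($_) for @bad;
}

print $doc->toString;
PERL
}
```

For example, `replace_broken_fields file.xml d e f > file_fixed.xml` would leave any later, well-behaved <field> nodes where they are.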

    Or perhaps you simply need to replace the content of the <field> nodes that contain '\00\00\00':

    my @replacements = "AA" .. "ZZ";  # li'l list of token replacements 
    
    foreach my $node ($doc->findnodes('//field')) { 
        if ($node->toString =~ m{\\00\\00\\00}) {
            say "Change $node -- remove child (text) nodes, add new";
            $node->removeChildNodes;
            $node->appendText(shift @replacements);
        }
    }
    

    An element's "value" is really a child text node, which in turn has a value. Instead of replacing that text node's value directly, it is better to drop all of the element's child nodes and then add the desired new text.

    This code then takes care of \00\00\00 if those simply need to be replaced, drawing from some list of replacements. To also add <dimension> nodes, use insertAfter as above.

    There are modules for prettier printing, like XML::LibXML::PrettyPrint.


    With Mojo::DOM, one way

    use warnings;
    use strict;
    use feature 'say';
    
    use Path::Tiny;  # convenience, for "slurp"-ing a file
    use Mojo::DOM;
    
    my $filename = shift // die "Usage: $0 file.xml\n";
    
    my $dom = Mojo::DOM->new( path($filename)->slurp );
    # my $dom = Mojo::DOM->new->xml(1)->parse(path($filename)->slurp);
    
    # Remove unwanted, by filtering them first
    $dom->find("field")
        -> grep( sub { $_->text =~ m{\\00\\00\\00} } )
        -> each( sub { $_[0]->remove } );
    
    # Or directly while iterating
    # $dom->find("field")->each(
    #     sub { $_[0]->remove if $_[0]->text =~ m{\\00} } );
    
    # Add new ones, after last 'field'
    foreach my $content ("d".."f") {
        my $tag = $dom->new_tag('field', $content);
        $dom->find('field')->last->append($tag);
    }
    
    say $dom;
    

    Again, please adjust to the actual document structure.

    An example. If new field nodes need be added right after the field nodes to be deleted (and not after some other field nodes further down), one way would be to first add after those nodes, while we can still identify those places, and only then delete them.

    # Add new ones, after last 'field' that has \00\00\00 text in it.
    # Appending them in one go keeps them in order (d, e, f).
    my $new_tags = join '', map { $dom->new_tag('field', $_) } "d".."f";
    $dom->find('field')->grep(sub { m{\\00\\00\\00} })->last->append($new_tags);
    
    # Only now remove those 'field' nodes with \00\00\00
    $dom->find("field")->each( 
        sub { $_[0]->remove if $_[0] =~ m{\\00\\00\\00} } );
    

    With this library it is also easy to replace the content of a node if that is desired (rather than add-and-remove).
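
Since the question asks about fixing many files at once, either script can be driven from a small shell loop. Below is a minimal sketch, assuming the chosen Perl code is saved as a script that takes a filename and prints the fixed document to standard output, as the examples above do. The function name `fix_all` and the directory layout are assumptions.

```shell
#!/bin/bash
# Hypothetical driver: run a fixer command on every .xml file in a directory,
# writing results to a separate directory so the originals stay untouched.
# Usage: fix_all IN_DIR OUT_DIR FIXER [FIXER_ARGS...]
fix_all() {
    in_dir=$1
    out_dir=$2
    shift 2
    mkdir -p "$out_dir" || return 1

    status=0
    for file in "$in_dir"/*.xml; do
        [ -e "$file" ] || continue      # no matches: the glob stays literal
        name=$(basename "$file")
        # Write to a temporary name first, so a failed run never leaves a
        # half-written file under the final name.
        if "$@" "$file" > "$out_dir/$name.tmp"; then
            mv "$out_dir/$name.tmp" "$out_dir/$name"
        else
            echo "failed to fix $file" >&2
            rm -f "$out_dir/$name.tmp"
            status=1
        fi
    done
    return "$status"
}
```

With the LibXML script saved under a hypothetical name such as `fix_fields.pl`, a run would be `fix_all ./raw ./fixed perl fix_fields.pl`. Files that fail to parse are reported on standard error and skipped, and the function then returns a non-zero status.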