How to rename XML element names, decided at run-time?


I've tried a bit myself and searched a lot, but could not find a way to do this efficiently in Perl (I guess the solution would be somewhat similar to https://stackoverflow.com/a/22119220/6607497).

I have some bad XML input files (i.e. they are claimed to conform to a specific XML content model, but the casing of the element names is inconsistent) that I want to fix if necessary. To do that I would have to compare each element name with a list of valid names, and if a bad element name matches a valid element name when case is ignored, the bad name should be changed to the corresponding valid name.

For example, <Bad>...</Bad> (wrong case) should be converted to <bad>...</bad> (correct case). In reality it's more complex, of course. Also, it's not always true that the bad tags use mixed case and the good ones use only lower case; it could be any combination...

I have created a list of all valid element names, but I'm missing (e.g.) how to use XML::Twig to set a handler for "any node" (wanting to use set_tag in the handler to fix the name).

Creating a list of any case permutation of any of the tags would be doable, but it seems inefficient, as only a minor portion of all those possible bad spellings would occur in reality.

Fancy Example

Here is a fun example. Assume the valid element names are:

use constant GOOD_TAGS => qw(ABBA beard Elvis set ZZTop);

And the sample bad input looks like this:

<Set>
  <Beard type="none">
    <elvis />
  </Beard>
  <Beard type="long">
    <ZZtop />
  </Beard>
  <Beard type="mixed">
    <Abba />
  </Beard>
</Set>

Then the fixed output should be:

<set>
  <beard type="none">
    <Elvis />
  </beard>
  <beard type="long">
    <ZZTop />
  </beard>
  <beard type="mixed">
    <ABBA />
  </beard>
</set>

I didn't know that you can use compiled regular expressions as hash keys, but it seems to work, so you might assume this starting scenario as well:

#!/usr/bin/perl
use strict;
use warnings;

use constant GOOD_TAGS => qw(ABBA beard Elvis set ZZTop);
my %fixes;

foreach (GOOD_TAGS) {
    $fixes{qr/^${_}$/i} = $_;
}

my @matchers = keys %fixes;

So all elements matching an item in @matchers should be renamed to the corresponding hash value.
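
One detail about this starting scenario: Perl hash keys are always strings, so each qr// object is stored in its stringified form, and a string used on the right-hand side of =~ is compiled as a pattern again. The following is a minimal sketch of how a lookup through those keys could work. The helper fix_name is hypothetical and not part of the original question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use constant GOOD_TAGS => qw(ABBA beard Elvis set ZZTop);
my %fixes;

foreach (GOOD_TAGS) {
    # The qr// object is stringified when it is used as a hash key
    $fixes{qr/^${_}$/i} = $_;
}

# fix_name is a hypothetical helper (an assumption for illustration).
# It tries every stored pattern in turn, so each lookup is linear in
# the number of valid names.
sub fix_name {
    my ($name) = @_;
    foreach my $pattern (keys %fixes) {
        # A plain string on the right of =~ is compiled as a regex
        return $fixes{$pattern} if $name =~ $pattern;
    }
    return $name;    # names that match nothing are left unchanged
}

# Prints set, ZZTop, ABBA and other, one per line
print fix_name($_), "\n" for qw(Set ZZtop Abba other);
```

This works, but every element name is tested against every pattern, which is the cost a direct hash lookup on a normalized name avoids.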


Solution

  • how to use XML::Twig to set a handler for "any node"

    Use _all_ as the key in the hash passed as twig_handlers when creating the XML::Twig object.

  • Creating a list of any case permutation of any of the tags would be doable, but it seems inefficient

    Indeed. Instead, I would suggest normalizing the good tags and checking whether the normalized form of each tag you see in the XML matches one of them. Here, normalizing can be done by converting to lower case. Something like:

    my %normalized_good_tags = map { lc($_) => $_ } GOOD_TAGS;
    my $bad_tag = "BEaRD";
    my $fixed_bad_tag = $normalized_good_tags{lc $bad_tag};
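
To make the three possible outcomes of that lookup concrete, here is a small sketch that runs it over a few names. It assumes that no two valid names differ only in case (otherwise one would overwrite the other in the hash); the sample names in the loop are made up for illustration.

```perl
use strict;
use warnings;

use constant GOOD_TAGS => qw(ABBA beard Elvis set ZZTop);
# Map each lower-cased name to its correctly cased spelling
my %normalized_good_tags = map { lc($_) => $_ } GOOD_TAGS;

for my $seen (qw(BEaRD zztop set Unknown)) {
    my $fixed = $normalized_good_tags{lc $seen};
    if (!defined $fixed) {
        print "$seen: not a known tag, left alone\n";
    } elsif ($fixed eq $seen) {
        print "$seen: already correct\n";
    } else {
        print "$seen: rename to $fixed\n";
    }
}
```

Each lookup is a single hash access, regardless of how many valid names there are.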
    

    Putting it all together, this gives us:

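    use strict;
    use warnings;
    use XML::Twig;

    use constant GOOD_TAGS => qw(ABBA beard Elvis set ZZTop);
    # Map each lower-cased tag name to its correctly cased spelling
    my %normalized_good_tags = map { lc($_) => $_ } GOOD_TAGS;

    # File to fix; taking it from the command line is an assumption
    my $xml_file = shift @ARGV or die "Usage: $0 file.xml\n";

    my $twig = XML::Twig->new(
      keep_spaces => 1,
      twig_handlers => {
        # _all_ is called for every element; $_ is set to the element
        _all_ => sub {
          my $corrected_tag = $normalized_good_tags{lc $_->tag};
          if (defined $corrected_tag) {
            $_->set_tag($corrected_tag);
          } # else, the tag doesn't need to be changed
        }
      });
    $twig->parsefile($xml_file);
    $twig->print(1);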
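
To try the handler without a file on disk, a sketch along the following lines parses the sample input from the question as a string. It assumes XML::Twig is installed from CPAN; $input and $output are names introduced here for illustration only.

```perl
use strict;
use warnings;
use XML::Twig;    # CPAN module, assumed to be installed

use constant GOOD_TAGS => qw(ABBA beard Elvis set ZZTop);
my %normalized_good_tags = map { lc($_) => $_ } GOOD_TAGS;

# Sample input from the question ($input is an illustrative name)
my $input = <<'XML';
<Set>
  <Beard type="none">
    <elvis />
  </Beard>
  <Beard type="long">
    <ZZtop />
  </Beard>
  <Beard type="mixed">
    <Abba />
  </Beard>
</Set>
XML

my $twig = XML::Twig->new(
    keep_spaces   => 1,
    twig_handlers => {
        _all_ => sub {
            my ($t, $elt) = @_;    # handlers receive the twig and the element
            my $corrected_tag = $normalized_good_tags{ lc $elt->tag };
            $elt->set_tag($corrected_tag) if defined $corrected_tag;
        },
    },
);
$twig->parse($input);          # parse a string instead of a file
my $output = $twig->sprint;    # serialize the document to a string
print $output;
```

The element names come out with the casing from GOOD_TAGS. Empty elements may be serialized as <Elvis/> rather than <Elvis />, depending on how XML::Twig is configured to write empty tags.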
    

    You could also simplify

    my $corrected_tag = $normalized_good_tags{lc $_->tag};
    if (defined $corrected_tag) {
        $_->set_tag($corrected_tag);
    }
    

    to

    $_->set_tag($normalized_good_tags{lc $_->tag} // $_->tag);
    

    I don't feel strongly either way, so pick the one you like the most. The shorter one is a bit more expensive (because it will sometimes call $_->set_tag($_->tag)), but this might not matter.
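
A rough way to see the difference is to count the set_tag calls each variant makes. The sketch below uses MockElt, a hypothetical stand-in for an XML::Twig element that only records calls; it is not part of XML::Twig, and the tag names in the loop are made up. Note that the // (defined-or) operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use v5.10;    # the // (defined-or) operator requires Perl 5.10 or later

{
    # MockElt is a hypothetical stand-in for an XML::Twig element;
    # it only counts how often set_tag is called.
    package MockElt;
    our $set_tag_calls = 0;
    sub new     { my ($class, $tag) = @_; return bless { tag => $tag }, $class }
    sub tag     { return $_[0]{tag} }
    sub set_tag { my ($self, $tag) = @_; $set_tag_calls++; $self->{tag} = $tag; return $self }
}

my %normalized_good_tags = map { lc($_) => $_ } qw(ABBA beard Elvis set ZZTop);

# Longer variant: calls set_tag only when the name is in the list
sub fix_long {
    my ($elt) = @_;
    my $corrected_tag = $normalized_good_tags{ lc $elt->tag };
    $elt->set_tag($corrected_tag) if defined $corrected_tag;
    return;
}

# Shorter variant: always calls set_tag, falling back to the current name
sub fix_short {
    my ($elt) = @_;
    $elt->set_tag( $normalized_good_tags{ lc $elt->tag } // $elt->tag );
    return;
}

# Run one variant over four names, two of which are not in the list
sub count_calls {
    my ($fix) = @_;
    $MockElt::set_tag_calls = 0;
    $fix->( MockElt->new($_) ) for qw(Set Beard comment note);
    return $MockElt::set_tag_calls;
}

print "long:  ", count_calls(\&fix_long),  " set_tag calls\n";    # 2
print "short: ", count_calls(\&fix_short), " set_tag calls\n";    # 4
```

The extra calls only happen for names outside the list, so the gap depends on how many such elements the input contains.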