XML::Twig - identifying blobs that do not contain an element


I am using XML::Twig to parse output of Azure's list-blob REST API.

In particular, I am looking to identify and delete uncommitted orphan blobs, and I am unsure how best to go about using XML::Twig efficiently to do this. I don't even know where to start.

Ultimately I need to retrieve the <Name> element of the orphaned blobs.

The Azure documentation states:

Uncommitted Blobs in the Response

Uncommitted blobs are listed in the response only if the include=uncommittedblobs parameter was specified on the URI. Uncommitted blobs listed in the response do not include any of the following elements:

Last-Modified
Etag
Content-Type
Content-Encoding
Content-Language
Content-MD5
Cache-Control
Metadata

Therefore, in the following simplified example, you can see an orphan blob called "test" because the <Blob></Blob> block does not contain any of the above elements.

<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ServiceEndpoint="https://my**account.blob.core.windows.net/"
  ContainerName="testonly">
  <Blobs>
    <Blob>
      <Name>test</Name>
      <Properties>
        <Content-Length>0</Content-Length>
        <BlobType>BlockBlob</BlobType>
        <LeaseStatus>unlocked</LeaseStatus>
        <LeaseState>available</LeaseState>
      </Properties>
    </Blob>
  </Blobs>
  <NextMarker/>
</EnumerationResults>

UPDATE:

Actually, I might have oversimplified. The accepted answer does not appear to work with the input below; it prints everything:

<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ServiceEndpoint="https://my**account.blob.core.windows.net/" ContainerName="testonly">
<Blobs>
    <Blob>
        <Name>data/users/docx</Name>
        <Properties>
            <Last-Modified>Wed, 10 May 2017 20:21:25 GMT</Last-Modified>
            <Etag>0x8D497E221E7A5AF</Etag>
            <Content-Length>125632</Content-Length>
            <Content-Type>application/octet-stream</Content-Type>
            <Content-Encoding/>
            <Content-Language/>
            <Content-MD5/>
            <Cache-Control/>
            <Content-Disposition/>
            <BlobType>BlockBlob</BlobType>
            <LeaseStatus>unlocked</LeaseStatus>
            <LeaseState>available</LeaseState>
        </Properties>
    </Blob>
    <Blob>
        <Name>test</Name>
        <Properties>
            <Content-Length>0</Content-Length>
            <BlobType>BlockBlob</BlobType>
            <LeaseStatus>unlocked</LeaseStatus>
            <LeaseState>available</LeaseState>
        </Properties>
    </Blob>
</Blobs>
<NextMarker/>
</EnumerationResults>

My code:

sub blob_parse {
        my $blob = $_;
        $blob->first_child($_) and return
        for qw( Last-Modified Etag Content-Type Content-Encoding
                Content-Language Content-MD5 Cache-Control Metadata);
        say "orph: ".$blob->first_child('Name')->text;
}

sub parseAndDelete {
        ### ORPHAN
        $twig_handlers = {'Blobs/Blob' => \&blob_parse};
        $twig = new XML::Twig(twig_handlers=>$twig_handlers);
        $twig->parse($message);
}

Solution

  • Create a handler for Blob that does nothing if any of the listed elements is present and otherwise prints the name. The elements are children of Properties rather than of Blob itself, so fetch Properties first and use the first_child method to look inside it.

    #! /usr/bin/perl
    use warnings;
    use strict;
    use feature qw{ say };
    
    use XML::Twig;
    
    my $xml = '...';
    
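    my $twig = 'XML::Twig'->new(twig_handlers => {
        Blob => sub {
            # $_ is the current <Blob> element. The elements to test for
            # sit under <Properties>, not directly under <Blob>.
            my $properties = $_->first_child('Properties');
            # Inside the statement-modifier loop, $_ is the element name
            # being tested. Return as soon as one is found.
            $properties->first_child($_) and return
                for qw( Last-Modified Etag Content-Type Content-Encoding
                        Content-Language Content-MD5 Cache-Control Metadata
                      );
            # None of the elements was found: print the blob's name.
            say $_->first_child('Name')->text;
        },
    });
    $twig->parse($xml);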
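
    The following is a minimal, self-contained sketch of the same idea, run against a shortened copy of the two-blob listing from the question's update. It may also explain why a handler that calls first_child directly on the Blob element prints every name: first_child only looks at direct children, and in this listing the tell-tale elements sit one level down, under Properties. The names collect_orphan, @committed_only and @orphans are made up for illustration, and checking both Blob and Properties is an assumption made here to be safe about where each element appears, not something the original answer does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature qw{ say };

use XML::Twig;

# Shortened copy of the listing from the question's update.
my $xml = <<'XML';
<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ServiceEndpoint="https://my**account.blob.core.windows.net/" ContainerName="testonly">
<Blobs>
    <Blob>
        <Name>data/users/docx</Name>
        <Properties>
            <Last-Modified>Wed, 10 May 2017 20:21:25 GMT</Last-Modified>
            <Etag>0x8D497E221E7A5AF</Etag>
            <Content-Length>125632</Content-Length>
            <Content-Type>application/octet-stream</Content-Type>
            <BlobType>BlockBlob</BlobType>
        </Properties>
    </Blob>
    <Blob>
        <Name>test</Name>
        <Properties>
            <Content-Length>0</Content-Length>
            <BlobType>BlockBlob</BlobType>
        </Properties>
    </Blob>
</Blobs>
<NextMarker/>
</EnumerationResults>
XML

# Elements that, per the quoted Azure documentation, uncommitted blobs lack.
my @committed_only = qw( Last-Modified Etag Content-Type Content-Encoding
                         Content-Language Content-MD5 Cache-Control Metadata );

my @orphans;    # names of blobs with none of the elements above

# Hypothetical handler name; XML::Twig passes the twig and the element.
sub collect_orphan {
    my ($twig, $blob) = @_;
    my $properties = $blob->first_child('Properties');

    # Assumption: look under both <Blob> and <Properties> (if present),
    # since first_child only inspects direct children.
    for my $parent (grep { defined } $blob, $properties) {
        for my $name (@committed_only) {
            return if $parent->first_child($name);
        }
    }
    push @orphans, $blob->first_child_text('Name');
    return;
}

XML::Twig->new(twig_handlers => { 'Blobs/Blob' => \&collect_orphan })
         ->parse($xml);

say "orph: $_" for @orphans;    # prints "orph: test"
```

    Collecting the names in an array rather than printing them inside the handler keeps the parsing step separate from whatever is done with the orphans afterwards, such as issuing the delete requests.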