Search code examples
xmlxsltfilterxmlstarletxidel

How to filter a XML file and save filtered results as new XML file using XMLStarlet / XMLint / XSLT / Xidel / Grep


I have been searching for a solution for a very simple task : filter XML results based on multiple criteria and save it as a new XML file. By filtering I mean, to select the values for the output. So, only output the XML whom meet the conditions of the filter. Not just these values of the filter, that doesn't make sense, but all elements of the XML where it matches the filter. Those who don't meet the filter criteria should be removed from the XML. Basically, the filter should remove data from the input and save it to a new and smaller output XML.

But until now I did not manage to find any solution for this. I did see a lot of writing about various tools, such as XMLstarlet (is this abandonned?), XMLint, XSLT and Xidel. But honestly I have no idea where to start. Somehow most questions are about extracting one value out of the XML instead of filtering it - by throwing away the elements that don't match the filter - and make a new, more compact version, of the XML.

This is what I want :

  • Filter XML on specific values - if it contains this value, use the whole output of all elements of the XML that matches that filter.

Take this single XML entry of the an example XML.

<item>
<g_id>5e4e8249-fb12-43e6-8f7e-ccef1b242097</g_id>
<g_title>A Bathing Ape Court Sta Beige</g_title>
<g_description>A Bathing Ape Court Sta Beige</g_description>
<g_google_product_category>Apparel & Accessories > Shoes</g_google_product_category>
<g_condition>new</g_condition>
<g_availability>in stock</g_availability>
<g_price>425.00 EUR</g_price>
<g_gtin>747883771947</g_gtin>
<g_brand>BAPE</g_brand>
<g_identifier_exists>TRUE</g_identifier_exists>
<g_gender>male</g_gender>
<g_age_group>adult</g_age_group>
<g_color>Beige/Light Brown/Blue</g_color>
<g_size>8</g_size>
<g_is_bundle>FALSE</g_is_bundle>
<g_adult>FALSE</g_adult>
<g_custom_label_0>sneakers</g_custom_label_0>
<g_custom_label_1>Other Brands A Bathing Ape</g_custom_label_1>
<g_custom_label_2>0ZXSHM191X30GBGK</g_custom_label_2>
</item>

Imagine the XML has 10.000 entries / records and the g_custom_label_0 can contain different values. Also g_custom_label_2 could be empty. Now I want this entire XML - with those 10.000 entries - to be filtered on these conditions :

g_custom_label_0 = sneakers g_custom_label_3 = not empty

How can I do this using a bash script and save the filtered results to a new XML file?

And if possible... there is a relation between the g_gtin and g_custom_label_2 (many:one). Is it also possible to only save in the new XML the first unique g_custom_label_2? –

The XML file is just too big (200MB - compressed with GZ), I only want to use the relevant stuff, this could be a great way to reduce the XML file size.

Help is really appreaciated!

Mark

PS. I would prefer I could use a bash script after downloading the original XML to filter it and save to a new one.

--- UPDATE ---

Based on the suggestions XSLT should be the way to go. I created this template :


<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="@*|item()">
        <xsl:copy>
            <xsl:apply-templates select="@*|item()" />
        </xsl:copy>
    </xsl:template>
    <xsl:template match="item[g_custom_label_0 = 'sneakers']" />
</xsl:stylesheet>

Just try to get the filtering on the g_custom_label_0 right first. Unfortunately after executing this command :

xsltproc --output output.xml template.xslt test.xml

I get an error :

error xsltCompileIdKeyPattern : expecting 'key' or 'id' or node type compilation error: file template.xslt line 2 element template xsltCompilePattern : failed to compile 'item()'

The original XML looks like this :

<?xml version='1.0' encoding='UTF-8'?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
    <channel>
        <title>Some title</title>
        <link></link>
        <description>Some description</description>
    <item>
        <g:id>74cce63d-1523-460f-b59f-4a42a94be350</g:id>
        <g:title>100 Thieves 2019 Hoodie Cream</g:title>
        <g:description>Released in late 2019, this 100 Thieves 2019 Hoodie Cream is one of the hottest pieces of gaming merch out there. The clean, all white hoodie features a stitched 100 Thieves logo on the chest in black. This hoodie released on November 23, 2019, and sold out immediately upon it's release, generating considerable interest on StockX. It carried an original retail price of $90. Shop now by placing a Bid. Looking to get your gaming setup in check? Shop the coveted PS5 console &lt;a href= "  https://stockx.com/sony-ps5-playstation-5-blu-ray-edition-console-white">here&lt;/a> on StockX.</g:description>
        <g:google_product_category>Apparel &amp; Accessories</g:google_product_category>
        <g:condition>new</g:condition>
        <g:availability>in stock</g:availability>
        <g:price>179.00 GBP</g:price>
        <g:gtin>765208957340</g:gtin>
        <g:brand>100 Thieves</g:brand>
        <g:age_group>adult</g:age_group>
        <g:color>Cream</g:color>
        <g:size>M</g:size>
        <g:size_system>US</g:size_system>
        <g:shipping>
            <g:country>GB</g:country>
            <g:price>11.41 GBP</g:price>
        </g:shipping>
        <g:is_bundle>FALSE</g:is_bundle>
        <g:custom_label_0>streetwear</g:custom_label_0>
        <g:custom_label_1>Other Brands 100 Thieves</g:custom_label_1>
    </item>
    <item>
        <g:id>6576ba88-a874-48b2-a8a9-e4b82db0ac64</g:id>
        <g:title>100% Soft Dumpster Fire Kawaii Galaxy Trash Vinyl Figure</g:title>
        <g:google_product_category>Arts &amp; Entertainment > Hobbies &amp; Creative Arts > Collectibles</g:google_product_category>
        <g:condition>new</g:condition>
        <g:availability>in stock</g:availability>
        <g:price>178.00 GBP</g:price>
        <g:gtin>747883533613</g:gtin>
        <g:brand>100% Soft</g:brand>
        <g:gender>unisex</g:gender>
        <g:age_group>adult</g:age_group>
        <g:size_system>US</g:size_system>
        <g:item_group_id>6060eb7c-a3aa-4b52-a4a3-834b5da1ba22</g:item_group_id>
        <g:shipping>
            <g:country>GB</g:country>
            <g:price>11.41 GBP</g:price>
        </g:shipping>
        <g:is_bundle>FALSE</g:is_bundle>
        <g:custom_label_0>collectibles</g:custom_label_0>
        <g:custom_label_1>Other Artists Other</g:custom_label_1>
    </item>
    <item>
        <g:id>c3885e7e-7cfc-426c-922c-c7fde307da73</g:id>
        <g:title>100% Soft Dumpster Fire Figure Chrome</g:title>

--- UPDATE 2 ---

Made some (bad) progress here. I adjusted the template to this :

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="item[g_custom_label_0('sneaker')]" />
</xsl:stylesheet>

and got a file that has no XML markers in it anymore. Totally wrong. Don't do this..


<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="item[g_custom_label_0 = 'sneakers']">
  <xsl:copy>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>
</xsl:stylesheet>

Leads to the same results.. pfff..

--- UPDATE 3 ---

The suggestion made by Yiztak was perfect in getting rid of the elements that do not match the "sneakers" criteria. Thanks for that.

He proposed the following XSLT template :


<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:g="http://base.google.com/ns/1.0">
    <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!--remove items where g:custom_label_0 != 'sneakers'-->
    <xsl:template match="item[not(g:custom_label_0='sneakers')]"/>
</xsl:stylesheet>

Next step is to remove duplicates on the g_custom_label_3 values (not present in the XML examples) and remove empty or missing values of the g_custom_label_3. I will do some experiments on this.

<xsl:if test=..."> would be the next step. Have no clue how to combine with previous statements. Any further ideas?

So g:custom_label_3 should be present and have a value to be in the newly exported XML.


Solution

  • Please try the following solution.

    Notable points:

    • Input XML has a namespace. It should be declared and used in the XSLT too.
    • 2nd template is eliminating what is not needed via [not(...)] predicate.

    Input XML

    <?xml version="1.0" encoding="UTF-8"?>
    <rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
        <channel>
            <title>Some title</title>
            <link></link>
            <description>Some description</description>
            <item>
                <g:id>74cce63d-1523-460f-b59f-4a42a94be350</g:id>
                <g:title>100 Thieves 2019 Hoodie Cream</g:title>
                <g:description>Released in late 2019, this 100 Thieves 2019 Hoodie Cream is one of the hottest pieces of gaming merch out there. The clean, all white hoodie features a stitched 100 Thieves logo on the chest in black. This hoodie released on November 23, 2019, and sold out immediately upon it's release, generating considerable interest on StockX. It carried an original retail price of $90. Shop now by placing a Bid. Looking to get your gaming setup in check? Shop the coveted PS5 console &lt;a href= "  https://stockx.com/sony-ps5-playstation-5-blu-ray-edition-console-white">here&lt;/a> on StockX.</g:description>
                <g:google_product_category>Apparel &amp; Accessories</g:google_product_category>
                <g:condition>new</g:condition>
                <g:availability>in stock</g:availability>
                <g:price>179.00 GBP</g:price>
                <g:gtin>765208957340</g:gtin>
                <g:brand>100 Thieves</g:brand>
                <g:age_group>adult</g:age_group>
                <g:color>Cream</g:color>
                <g:size>M</g:size>
                <g:size_system>US</g:size_system>
                <g:shipping>
                    <g:country>GB</g:country>
                    <g:price>11.41 GBP</g:price>
                </g:shipping>
                <g:is_bundle>FALSE</g:is_bundle>
                <g:custom_label_0>streetwear</g:custom_label_0>
                <g:custom_label_1>Other Brands 100 Thieves</g:custom_label_1>
            </item>
            <item>
                <g:id>6576ba88-a874-48b2-a8a9-e4b82db0ac64</g:id>
                <g:title>100% Soft Dumpster Fire Kawaii Galaxy Trash Vinyl Figure</g:title>
                <g:google_product_category>Arts &amp; Entertainment > Hobbies &amp; Creative Arts > Collectibles</g:google_product_category>
                <g:condition>new</g:condition>
                <g:availability>in stock</g:availability>
                <g:price>178.00 GBP</g:price>
                <g:gtin>747883533613</g:gtin>
                <g:brand>100% Soft</g:brand>
                <g:gender>unisex</g:gender>
                <g:age_group>adult</g:age_group>
                <g:size_system>US</g:size_system>
                <g:item_group_id>6060eb7c-a3aa-4b52-a4a3-834b5da1ba22</g:item_group_id>
                <g:shipping>
                    <g:country>GB</g:country>
                    <g:price>11.41 GBP</g:price>
                </g:shipping>
                <g:is_bundle>FALSE</g:is_bundle>
                <g:custom_label_0>sneakers</g:custom_label_0>
                <g:custom_label_1>Other Artists Other</g:custom_label_1>
            </item>
        </channel>
    </rss>
    

    XSLT

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:g="http://base.google.com/ns/1.0">
        <xsl:output method="xml" encoding="utf-8" indent="yes" omit-xml-declaration="yes"/>
        <xsl:strip-space elements="*"/>
    
        <xsl:template match="@*|node()">
            <xsl:copy>
                <xsl:apply-templates select="@*|node()"/>
            </xsl:copy>
        </xsl:template>
    
        <!--remove items where g:custom_label_0 != 'sneakers'-->
        <xsl:template match="item[not(g:custom_label_0='sneakers')]"/>
    </xsl:stylesheet>
    

    Output XML

    <rss xmlns:g="http://base.google.com/ns/1.0" version="2.0">
      <channel>
        <title>Some title</title>
        <link/>
        <description>Some description</description>
        <item>
          <g:id>6576ba88-a874-48b2-a8a9-e4b82db0ac64</g:id>
          <g:title>100% Soft Dumpster Fire Kawaii Galaxy Trash Vinyl Figure</g:title>
          <g:google_product_category>Arts &amp; Entertainment &gt; Hobbies &amp; Creative Arts &gt; Collectibles</g:google_product_category>
          <g:condition>new</g:condition>
          <g:availability>in stock</g:availability>
          <g:price>178.00 GBP</g:price>
          <g:gtin>747883533613</g:gtin>
          <g:brand>100% Soft</g:brand>
          <g:gender>unisex</g:gender>
          <g:age_group>adult</g:age_group>
          <g:size_system>US</g:size_system>
          <g:item_group_id>6060eb7c-a3aa-4b52-a4a3-834b5da1ba22</g:item_group_id>
          <g:shipping>
            <g:country>GB</g:country>
            <g:price>11.41 GBP</g:price>
          </g:shipping>
          <g:is_bundle>FALSE</g:is_bundle>
          <g:custom_label_0>sneakers</g:custom_label_0>
          <g:custom_label_1>Other Artists Other</g:custom_label_1>
        </item>
      </channel>
    </rss>