Computing XML document similarity based on tags


Tag-based similarity is one way to measure how alike XML documents are (usually across a collection, but here just two documents), and it has several applications. How can such a method be implemented in XSLT?

My idea is this: extract the tags from both documents into two lists, then check for exact or partial matches between the lists.

Does XSLT provide a function or operation for comparing strings (here, tag names)? Any ideas on the concept or the implementation are welcome.

Simple Example:

Take this XML document (only a portion is shown),

<book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
   </book>

and this one,

  <books>
      <authorname>Ralls, Kim</authorname>
      <booktitle>Midnight Rain</booktitle>
      <genre>Fantasy</genre>
      <cost>5.95</cost>
      <date>2000-12-16</date>
      <abstract>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</abstract>
   </books>

Each document has six child elements (tags). Of these, genre appears in both, while title resembles booktitle, author resembles authorname, and publish_date resembles date. So the two documents are similar (1 exact match, 3 partial matches).


Solution

  • Assuming XSLT 2.0, the following stylesheet takes the first XML document as its input and the URL of the second document as a parameter. For each element name in the first document, it outputs the names from the second document that contain it or are contained in it:

    <xsl:stylesheet
      version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs">
    
    <xsl:output method="text"/>  
    
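    <!-- URL of the second document; override it when invoking the transformation -->
    <xsl:param name="doc2-url" as="xs:string" select="'test2015012102.xml'"/>
    <xsl:variable name="doc2" as="document-node()" select="doc($doc2-url)"/>
    <!-- Distinct element names (without namespace prefix) in the second document -->
    <xsl:variable name="doc2-names" as="xs:string*" select="distinct-values($doc2//*/local-name())"/>
    
    <xsl:template match="/">
      <!-- One line per distinct element name in the input document, followed by
           every name from the second document that is a substring of it or
           contains it as a substring -->
      <xsl:value-of select="for $name in distinct-values(//*/local-name())
                            return concat($name, ': ', string-join($doc2-names[contains($name, .) or contains(., $name)], ', '))"
                    separator="&#10;"/>
    </xsl:template>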
    
    </xsl:stylesheet>
    

    For your sample, the output is:

    book: books, booktitle
    author: authorname
    title: booktitle
    genre: genre
    price:
    publish_date: date
    description:
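
    If you want counts rather than a listing, the same name lists can be split into exact and partial matches. The following is only a sketch under the same XSLT 2.0 assumption. The "score" it prints (matched names divided by all distinct names in the first document) is a ratio made up for illustration, not an established similarity measure, so adjust it to whatever definition you need:

    <xsl:stylesheet
      version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      exclude-result-prefixes="xs">

    <xsl:output method="text"/>

    <!-- Same parameter as in the stylesheet above: URL of the second document -->
    <xsl:param name="doc2-url" as="xs:string" select="'test2015012102.xml'"/>
    <xsl:variable name="doc2-names" as="xs:string*"
                  select="distinct-values(doc($doc2-url)//*/local-name())"/>

    <xsl:template match="/">
      <xsl:variable name="doc1-names" as="xs:string*"
                    select="distinct-values(//*/local-name())"/>
      <!-- Names that occur unchanged in both documents -->
      <xsl:variable name="exact" as="xs:string*"
                    select="$doc1-names[. = $doc2-names]"/>
      <!-- Names without an exact match that contain, or are contained in,
           at least one name from the second document -->
      <xsl:variable name="partial" as="xs:string*"
                    select="$doc1-names[not(. = $doc2-names)]
                                       [some $n in $doc2-names
                                        satisfies (contains(., $n) or contains($n, .))]"/>
      <!-- Illustrative score only (an assumption): share of matched names -->
      <xsl:value-of select="concat('exact: ', count($exact)),
                            concat('partial: ', count($partial)),
                            concat('score: ', (count($exact) + count($partial)) div count($doc1-names))"
                    separator="&#10;"/>
    </xsl:template>

    </xsl:stylesheet>

    Note that this counts the root elements as well. If the two fragments in the question are the complete documents, book is a partial match for books, so you get 1 exact and 4 partial matches rather than the 3 partial matches counted by hand. Also keep in mind that contains() is a plain, case-sensitive substring test; names such as price and cost, which are related only in meaning, will never match this way.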