Computing XML document similarity based on tags

Tag-based similarity is one way to compare XML documents (usually several, but in this case just two), and it has several applications. How can I implement such a method using XSLT?

My idea is this: extract the tags from both documents and list them. Then check for exact or partial matches between the two lists.

In this regard, does XSLT provide any function or operation for comparing strings (tags)? Any ideas on the concept and implementation are welcome.

Simple Example:

For these XML documents (only portions are shown),

<book id="bk101">
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <description>An in-depth look at creating applications 
      with XML.</description>

and this one,

      <authorname>Ralls, Kim</authorname>
      <booktitle>Midnight Rain</booktitle>
      <abstract>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</abstract>

Both documents have six elements (tags). Among them, genre appears in both, title is similar to booktitle, author to authorname, and publish_date to date. So the two documents are similar (1 exact match, 3 partial matches).
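The counting described above can also be sketched outside XSLT. The following Python snippet is an illustration only: the function name classify_matches is hypothetical, "partial" is assumed to mean that one name is a substring of the other, and the tag lists are typed in by hand from the names mentioned above, since the full files are not shown.

```python
def classify_matches(names1, names2):
    """Split name pairs into exact matches and partial (substring) matches."""
    exact, partial = [], []
    for a in names1:
        for b in names2:
            if a == b:
                exact.append((a, b))
            elif a in b or b in a:
                # Assumed rule: one name contains the other.
                partial.append((a, b))
    return exact, partial


# Tag names mentioned in the question, typed in by hand (an assumption;
# the complete documents are not available).
doc1_names = ["book", "author", "title", "genre", "publish_date", "description"]
doc2_names = ["authorname", "booktitle", "abstract", "genre", "date"]

exact, partial = classify_matches(doc1_names, doc2_names)
print("exact:", exact)
print("partial:", partial)
```

With these names the sketch reports one exact match and four partial pairs, because a plain substring rule also pairs book with booktitle. A stricter partial-match rule would be needed to arrive at the three counted by hand.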


  • Assuming XSLT 2.0, the following stylesheet takes the first XML document as its input and the URL of the second document as a parameter. For each element name in the first document, it outputs the names from the second document that contain it or are contained in it:

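    <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xsl:output method="text"/>
    <xsl:param name="doc2-url" as="xs:string" select="'test2015012102.xml'"/>
    <xsl:variable name="doc2" as="document-node()" select="doc($doc2-url)"/>
    <!-- Distinct element names of the second document -->
    <xsl:variable name="doc2-names" as="xs:string*" select="distinct-values($doc2//*/local-name())"/>
    <xsl:template match="/">
      <!-- One line per element name of the input: the name, then its matches in doc2 -->
      <xsl:value-of select="for $name in distinct-values(//*/local-name())
                            return concat($name, ': ', string-join($doc2-names[contains($name, .) or contains(., $name)], ', '))"
                    separator="&#10;"/>
    </xsl:template>
    </xsl:stylesheet>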

    So for your sample, the output is:

    book: books, booktitle
    author: authorname
    title: booktitle
    genre: genre
    publish_date: date
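For readers who want to try the same matching rule outside an XSLT 2.0 processor, here is a rough Python sketch using only the standard library. It is not part of the stylesheet above: the function names local_names and name_matches are hypothetical, and the two inline documents are stand-ins that carry only element names, since the real files are not shown.

```python
import xml.etree.ElementTree as ET


def local_names(xml_text):
    """Return the distinct element names of a document, in first-seen order."""
    names = []
    for element in ET.fromstring(xml_text).iter():
        name = element.tag.rsplit("}", 1)[-1]  # drop a "{namespace}" prefix, if any
        if name not in names:
            names.append(name)
    return names


def name_matches(xml1, xml2):
    """Map each name in xml1 to the names in xml2 that contain it or are contained in it."""
    names2 = local_names(xml2)
    return {n1: [n2 for n2 in names2 if n1 in n2 or n2 in n1]
            for n1 in local_names(xml1)}


# Hypothetical stand-in documents; only the element names matter here.
doc1 = "<book><author/><title/><genre/><publish_date/></book>"
doc2 = "<books><authorname/><booktitle/><genre/><date/></books>"

for name, matches in name_matches(doc1, doc2).items():
    print(f"{name}: {', '.join(matches)}")
```

For these stand-ins the loop prints the same five lines as listed above. Like the stylesheet, it treats any substring relation as a match, so short names such as book can match several longer ones.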