
Compare the accuracy of a single XSLT 2.0 sequence against a database of XSLT 2.0 sequences


I'm trying to do an accuracy match of one sequence of recipe ingredients against a recipe ingredients database in XML. For example:

<Ingredients list="salt, rice, eggs, flour, water"/>

The above represents the source comparison with 5 ingredients that I can tokenize by comma.

I have a database of ingredients in XML in similar format:

<Ingredients list="salt"/>
<Ingredients list="salt, sugar"/>
<Ingredients list="salt, rice, eggs"/>
<Ingredients list="salt, rice, eggs, flour"/>
<Ingredients list="salt, rice, flour, water"/>

The accuracy is rarely going to be 100%, but I'd like to compare my tokenized source list against every database ingredients list (also tokenized) that contains at least one of the source ingredients, and find the closest match. In this case, the closest matches are the last two items in the database, each of which matches 4 of the 5 source ingredients. I would favor a 4-of-5 match like that over the preceding entries.

So far I've been able to read both the source ingredient list and the target database set, but I'm stuck on working out comparison logic that is not overly complex.

Given an XML structure:

<Ingredients list="salt, rice, eggs, flour, water"/>

I use:

<xsl:variable name="sourceIngredients" select="tokenize(@list, ',\s*')"/>

to produce a sequence:

"salt","rice","eggs","flour","water"

To get to the larger data set, I've tried a loop but this seems needlessly expensive on the system:

<xsl:for-each select="$data/Ingredients[some $i in $sourceIngredients satisfies contains(@list, $i)]">

because each source term is compared against about 3500 records just to build a subset in which at least one term matches, without telling me how many terms match. Once I had that subset of database ingredients, I tokenized each entry the same way, but then realized the comparison is more complicated than that.

Appreciate any insight.


Solution

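  • I think you can do something like the following. Note that the let expression and the ! map operator are XPath 3.0 features, so this needs an XSLT 3.0 processor:

      <!-- Sample database; each entry holds a comma-separated ingredient list -->
      <xsl:param name="ingredients-db">
        <Ingredients list="salt"/>
        <Ingredients list="salt, sugar"/>
        <Ingredients list="salt, rice, eggs"/>
        <Ingredients list="salt, rice, eggs, flour"/>
        <Ingredients list="salt, rice, flour, water"/>
      </xsl:param>

      <xsl:template match="Ingredients">
        <xsl:variable name="ingredients" select="tokenize(@list, ',\s*')"/>
        <!-- $max is the highest number of shared ingredients over all database
             entries; db lists every entry that reaches that count -->
        <closest-match
           input="{@list}"
           db="{let $max :=
                  max(
                    $ingredients-db/Ingredients ! count(tokenize(@list, ',\s*')[. = $ingredients])
                  )
                return
                  string-join($ingredients-db/Ingredients[count(tokenize(@list, ',\s*')[. = $ingredients]) = $max]/@list, ' | ')}"/>
      </xsl:template>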
    

    which then gives e.g.

    <closest-match input="salt, rice, eggs, flour, water" db="salt, rice, eggs, flour | salt, rice, flour, water"/>
    

    for the input <Ingredients list="salt, rice, eggs, flour, water"/>. Two database entries each share 4 of the 5 input ingredients, so both are reported.
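
    Since the question asks about XSLT 2.0, which has neither let nor the ! operator, here is a minimal sketch of the same logic for a 2.0 processor. It assumes the same sample database as above; the function name f:match-count and the namespace urn:example:functions are made up for illustration and are not part of any library. It counts each database token that also occurs in the source list, so a database entry that repeats an ingredient would be counted more than once.

    ```xslt
    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:xs="http://www.w3.org/2001/XMLSchema"
        xmlns:f="urn:example:functions"
        exclude-result-prefixes="xs f">

      <!-- Sample database (assumption: in practice this would be the real
           data set of about 3500 records, e.g. loaded from a file) -->
      <xsl:param name="ingredients-db">
        <Ingredients list="salt"/>
        <Ingredients list="salt, sugar"/>
        <Ingredients list="salt, rice, eggs"/>
        <Ingredients list="salt, rice, eggs, flour"/>
        <Ingredients list="salt, rice, flour, water"/>
      </xsl:param>

      <!-- Hypothetical helper: how many tokens of $candidate also occur in $source -->
      <xsl:function name="f:match-count" as="xs:integer">
        <xsl:param name="candidate" as="xs:string"/>
        <xsl:param name="source" as="xs:string*"/>
        <xsl:sequence select="count(tokenize($candidate, ',\s*')[. = $source])"/>
      </xsl:function>

      <xsl:template match="Ingredients">
        <!-- Tokenize the source list once -->
        <xsl:variable name="source" select="tokenize(@list, ',\s*')"/>
        <!-- Highest match count over all database entries -->
        <xsl:variable name="max"
            select="max(for $i in $ingredients-db/Ingredients
                        return f:match-count($i/@list, $source))"/>
        <!-- Report every database entry that reaches the highest count -->
        <closest-match
            input="{@list}"
            db="{string-join($ingredients-db/Ingredients[f:match-count(@list, $source) = $max]/@list, ' | ')}"/>
      </xsl:template>

    </xsl:stylesheet>
    ```

    Run against the same input, this should produce the same closest-match element as shown above. Each database entry is tokenized twice (once for the maximum, once for the filter), which is probably acceptable for a few thousand records; if it turns out to be too slow, the counts could be computed once into a temporary tree and reused.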