
Endeca - Where should stemming update files be located?


Chapter 6 of the Endeca MDEX Engine Advanced Development Guide (6.2.2 version) describes how to construct a stemming update XML file to supplement the default Endeca-provided dictionary of stemming terms.

However, the documentation doesn't appear to specify where the new stemming update file should be placed on the filesystem.

Is this XML file supposed to be placed:

  • In the endeca/MDEX/version/conf/stemming folder?
  • In the endeca/MDEX/version/conf/stemming/custom folder?
  • Anywhere on the filesystem, with the fully specified path to the XML file passed to Dgidx via the --stemming-updates flag in DataIngest.xml?

Solution

  • After some trial and error, I got this working.

    The correct approach appears to be to include the fully-specified path to the custom stemming update XML file as the argument for the --stemming-updates parameter for Dgidx.

    Here's the relevant portion of my endeca/apps/MyAppen/config/script/DataIngest.xml:

    <dgidx id="Dgidx" host-id="ITLHost">
      ...
      <args>
        ...
        <arg>--stemming-updates</arg>
        <arg>/full/path/to/endeca/apps/MyAppen/config/script/stemmingExtension.en.xml</arg>
      </args>
    </dgidx>
    

    I found that the --stemming-updates flag and the fully specified path need to be in separate <arg> tags; an error occurs if you put them both in the same <arg> tag separated by a space.
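    As a rough pre-flight check for the two pitfalls above (flag and path sharing one <arg>, or a path that is not fully specified), something like the following Python sketch could be run against the script file before a baseline update. The function names check_stemming_args and local_name are hypothetical helpers written for this answer, not part of Endeca, and the sketch assumes the file can be parsed as plain XML.

```python
import os
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def check_stemming_args(xml_text, flag="--stemming-updates"):
    """Return a list of problems found in the <arg> lists of any <dgidx> element.

    Hypothetical helper: it only inspects the XML structure and does not
    run Dgidx or check that the file exists.
    """
    problems = []
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) != "dgidx":
            continue
        # Collect the text of every <arg> under this <dgidx>, in document order.
        args = [(el.text or "").strip()
                for el in element.iter() if local_name(el.tag) == "arg"]
        for i, value in enumerate(args):
            if value != flag and value.startswith(flag):
                # e.g. "<arg>--stemming-updates /some/path.xml</arg>"
                problems.append("flag and path share one <arg>: %r" % value)
            elif value == flag:
                if i + 1 >= len(args):
                    problems.append("flag has no following <arg> with a path")
                elif not os.path.isabs(args[i + 1]):
                    problems.append("path is not absolute: %r" % args[i + 1])
    return problems
```

    Calling check_stemming_args(open(path).read()) returns an empty list when the flag sits in its own <arg> and is followed by an absolute path, and a list of messages otherwise.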

    There may be a particular folder where the stemming update XML file can be placed so that the path doesn't need to be fully specified, but neither the endeca/MDEX/version/conf/stemming nor the stemming/conf folders worked for me. With only the file name given, I got an error in the dgidx log like:

    ERROR   08/20/13 13:48:46.810 UTC (1377006526810)       DGIDX   {dgidx,baseline}        InputFileStream can't open file "stemmingExtension.en.xml" for reading        [err=`No such file or directory',errno=2]
    

    I also found an error in the sample XML provided in the Endeca MDEX Engine Advanced Development Guide (6.2.2 version). The documentation gives the first two lines of the XML file as:

    <!DOCTYPE WORD_FORMS_COLLECTION SYSTEM "word_forms_collection_updates.dtd">
      <WORD_FORMS_COLLECTION_UPDATES>
    

    This is incorrect: the DOCTYPE names WORD_FORMS_COLLECTION, but the root element is WORD_FORMS_COLLECTION_UPDATES. Using the file in this form produces the following error in the dgidx log:

    FATAL   08/20/13 13:56:33.533 UTC (1377006993533)       DGIDX   {dgidx,baseline}        Errors while parsing word forms updates from file "full/path/to/endeca/apps/MyAppen/config/script/stemmingExtension.en.xml": Errors while trying to parse config stream "full/path/to/endeca/apps/MyAppen/config/script/stemmingExtension.en.xml": Error at file full/path/to/endeca/apps/MPen/config/script/stemmingExtension.en.xml, line 2, column 31; Message: Root element different from DOCTYPE
    

    The fix for this is to change the DOCTYPE line in the XML file to match the root tag, like this:

    <!DOCTYPE WORD_FORMS_COLLECTION_UPDATES SYSTEM "word_forms_collection_updates.dtd">
      <WORD_FORMS_COLLECTION_UPDATES>
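
    To catch this kind of mismatch before running Dgidx, a small Python sketch like the one below could compare the DOCTYPE name with the root element name. It uses the standard library's non-validating expat parser, so it does not load the DTD or validate the file's contents; doctype_and_root and doctype_matches_root are hypothetical helper names, not anything shipped with Endeca.

```python
import xml.parsers.expat


def doctype_and_root(xml_text):
    """Return (doctype_name, root_element_name) for an XML document string.

    Either value is None if the document has no DOCTYPE or no element.
    """
    found = {"doctype": None, "root": None}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found["doctype"] = name

    def on_start(name, attrs):
        # Only the first start tag is the root element.
        if found["root"] is None:
            found["root"] = name

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found["doctype"], found["root"]


def doctype_matches_root(xml_text):
    """True when the DOCTYPE name equals the root element name."""
    doctype, root = doctype_and_root(xml_text)
    return doctype is not None and doctype == root
```

    For the documentation's sample header, doctype_matches_root returns False; with the corrected DOCTYPE line it returns True.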
    

    I have opened a ticket with Oracle support for this (apparent) bug.