New synonyms aren't being used


I'm having some difficulty getting new synonyms to work with Solr. Oddly, the sample entries in the synonyms.txt file that ships with the distribution work, but anything new that I add does not.

For instance, synonyms.txt had the following example:

GB,gib,gigabyte,gigabytes

I am then querying a field called "subject" using one of the above terms:

subject:gb

subject:gib

etc...

Regardless of which of those terms I use in my query, I get the same result set back as expected.

Next, I added the following line to synonyms.txt:

ibm, i.b.m., international business machine

I also made sure that in schema.xml, the fieldType text_general (the type used by the field "subject") has the SynonymFilterFactory enabled in its index analyzer, as follows:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

Finally, since my data is in a MySQL database, I re-imported all of it with a dataimport, assuming that this is what I need to do to reindex.

However, while a query for "subject:ibm" returns multiple results, a query for "subject:i.b.m." returns nothing.

What am I doing wrong?


Solution

  • Okay, I believe I figured it out and it now seems to be working the way I intended.

    I replaced StandardTokenizerFactory with ClassicTokenizerFactory and also added ClassicFilterFactory to the chain. The net result is that I end up with tokens with the periods stripped out, and that seems to work.

    So, here is my updated definition for text_general:

        <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.ClassicTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ClassicFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.ClassicTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.ClassicFilterFactory"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            </analyzer>
        </fieldType>
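
    For anyone who would rather keep StandardTokenizerFactory, one alternative that might be worth trying is the tokenizerFactory attribute of SynonymFilterFactory, which controls how the entries in synonyms.txt are tokenized when the file is parsed (whitespace splitting is used if it is omitted). The sketch below is untested against this schema. It rests on the assumption that the mismatch comes from the trailing period: the field's tokenizer appears to emit "i.b.m", while the synonym entry is read as "i.b.m." and so never lines up with the query. Only the index analyzer is shown; the query analyzer would stay as in the question.

        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <!-- Assumption: parsing synonyms.txt with the same tokenizer as the field
                 makes "i.b.m." in the file match the token produced from the text. -->
            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"
                    tokenizerFactory="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>

    As with any change to index-time analysis, the data has to be re-imported before the effect shows up in queries.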