Tags: solr, lucene, punctuation, edismax, dismax

Solr dismax behaviour - punctuation and white space splitting


I have a Solr 4.7.0 instance with 200,000 documents in the index (one document per file on a filesystem), used by several users. Documents are identified by keywords, which are indexed and stored in a single field called "signature_1". At index time, I replace every punctuation character with whitespace (using a ScriptUpdateProcessor), so the keywords are separated by whitespace in both the indexed and the stored parts of the field signature_1 (field type signature):

<fieldType name="signature" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1000" consumeAllTokens="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers_secteurs.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
</fieldType>

I would like the same behaviour at query time: if somebody searches for

A-B-C

I would like Solr to run the following search (with an OR operator, dismax):

A B C

So basically, I simply want Solr to search among the documents' keywords, with punctuation removed.

The example above works well, but in some cases it does not behave this way. With a query of

A B-C

Dismax parses the query into

(+(DisjunctionMaxQuery((signature_1:a)) DisjunctionMaxQuery((signature_1:"b c"))) ())/no_coord

That is, B-C becomes the phrase query "b c" instead of two separate terms, and this messes up the relevancy (i.e. the order) of my results. I tried using autoGeneratePhraseQueries="True", but it had no effect.

So I would like Dismax either to always split on whitespace AND punctuation, or never to split at all (the results would be the same). Any idea how I can do this (without having to write my own Java Dismax class)?

The following posts are related to my problem :


Solution

  • I finally found a solution. It's a bit "quick and dirty", but it works: in Velocity, I created a JavaScript function that edits the q field, and this function is called through the onsubmit attribute of a GET form (as described in stackoverflow.com/questions/5763055/edit-value-of-a-html-input-form-by-javascript).

    However, this solution requires Velocity (or, more generally, an HTML interface): if you query a request handler directly, without Velocity, it does not work.
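
The function used in this solution is not shown in the post, so the following is only a minimal sketch of what such an onsubmit handler might look like. The names stripPunctuation and normalizeQueryField are hypothetical, the form input is assumed to be named q, and the character class is copied from the PatternReplaceCharFilterFactory pattern in the question.

```javascript
// Hypothetical helper: replace every run of characters outside the keyword
// alphabet with a single space, then drop leading and trailing spaces.
// The character class mirrors the charFilter pattern of the "signature" field type.
var NON_KEYWORD_CHARS = /[^a-zA-Z0-9éèàùêâûôîäëöüï]+/g;

function stripPunctuation(text) {
  return String(text)
    .replace(NON_KEYWORD_CHARS, " ")
    .replace(/^ +| +$/g, "");
}

// Hypothetical onsubmit handler: rewrites the "q" input in place before the
// browser sends the GET request. Assumes the input element is named "q".
function normalizeQueryField(form) {
  var input = form.elements["q"];
  input.value = stripPunctuation(input.value);
  return true; // returning true lets the form submission continue
}
```

It could be attached to the search form with something like onsubmit="return normalizeQueryField(this)". With this in place, a query typed as A B-C would reach Solr as A B C, so dismax only ever sees whitespace-separated terms. Note that this approach also strips the quotes and the + and - operators that dismax would otherwise interpret, which may or may not be acceptable depending on what users are expected to type.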