Solr charFilter doesn't allow Regex query [solr 7.6.0]


I'm trying to run a regex query on a Solr solr.TextField field. Is this meant to be supported on that field type?

For example, curl -g 'http://localhost:8983/solr/shard/select?rows=0&q=body:/hello/' returns more than 0 results.

But when I switch it to curl -g 'http://localhost:8983/solr/shard/select?rows=0&q=body:/h[aeiou]llo/', I get 0 results.

<fieldType name="body_text" class="solr.TextField" positionIncrementGap="100" multiValued="false">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9_@-]+" replacement=" "/>
      <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
      <filter class="solr.LengthFilterFactory" min="2" max="45"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    </analyzer>
</fieldType>

<field name="body" type="body_text" uninvertible="true" indexed="true" stored="false"/>

When I add debugQuery=true, I see that my charFilter replacement is not allowing regex characters through:

"debug":{
    "rawquerystring":"body:/h[aeiou]llo/",
    "querystring":"body:/h[aeiou]llo/",
    "parsedquery":"RegexpQuery(body:/h aeiou llo/)",
    "parsedquery_toString":"body:/h aeiou llo/",
    "explain":{},
    "QParser":"LuceneQParser",

Solution

  • The PatternReplaceCharFilterFactory is also applied to the regex term at query time, and it replaces every character that matches your pattern, including regex metacharacters. The "[" and "]" are therefore replaced with spaces, so the query h[aeiou]llo becomes h aeiou llo and matches no documents.

    One way to keep both the pattern replacement and regex queries is to move the replacement out of the charFilter and into a token filter, solr.PatternReplaceFilterFactory, placed after the tokenizer:

    <fieldType name="body_text" class="solr.TextField" positionIncrementGap="100" multiValued="false">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory" rule="java" />
          <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9_@-]+" replacement=" "/>
          <filter class="solr.LengthFilterFactory" min="2" max="45"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
          <filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
        </analyzer>
    </fieldType>
    

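    Check that this still works for your use case: as a token filter, the replacement runs after tokenization, so a replaced run of characters becomes a space inside the token rather than a token boundary.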
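
The effect of the pattern can be reproduced outside Solr. The following is a minimal Python sketch, not Solr code: it reuses the pattern from the schema, and the function names are hypothetical helpers that only approximate the order of operations in each analyzer chain (the length, lowercase, stopword, and synonym filters are not modeled).

```python
import re

# Same character class as the pattern attribute in the schema above.
PATTERN = re.compile(r"[^a-zA-Z0-9_@-]+")


def normalize_regex_term(term):
    """Apply the replacement to a regex term, as the charFilter does in the debug output."""
    return PATTERN.sub(" ", term)


def char_filter_then_tokenize(text):
    """Approximate charFilter -> whitespace tokenizer: replace first, then split."""
    return PATTERN.sub(" ", text).split()


def tokenize_then_token_filter(text):
    """Approximate whitespace tokenizer -> token filter: split first, then replace per token."""
    return [PATTERN.sub(" ", token) for token in text.split()]


if __name__ == "__main__":
    # The brackets are replaced with spaces, matching parsedquery in the question.
    print(normalize_regex_term("h[aeiou]llo"))
    # With the charFilter, the comma becomes a token boundary.
    print(char_filter_then_tokenize("hello,world foo"))
    # With the token filter, the comma becomes a space inside a single token.
    print(tokenize_then_token_filter("hello,world foo"))
```

If the two chains produce different tokens for your data, as in the last two calls, the index contents change when you switch from the charFilter to the token filter, so reindex and re-test before relying on it.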