
DisMax Solr query parser working very poorly


I have a very large database of 4.5M documents. When using the default query parser, the document I want to find appears in the results as it should.

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"\"I predict a riot\"",
      "rows":"1"}},
  "response":{
    "numFound":15,"start":0,"docs":[
      {
        "artist":"Kaiser Chiefs",
        "text":"<p>Oh, watchin' the people get lairy<br>It's not very pretty, I tell thee<br>Walkin' through town is quite scary<br>And not very sensible either<br>A friend of a friend he got beaten<br>He looked the wrong way at a policeman<br>Would never have happened to Smeaton<br>An old Leodiensian<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>Oh, I try to get to my taxi<br>A man in a tracksuit attacks me<br>He said that he saw it before me<br>Wants to get things a bit gory<br>Girls scrabble round with no clothes on<br>To borrow a pound for a condom<br>If it wasn't for chip fat, they'd be frozen<br>They're not very sensible<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>And if there's anybody left in here<br>That doesn't want to be out there<br><br>Ow!<br><br>Oh, watchin' the people get lairy<br>It's not very pretty, I tell thee<br>Walkin' through town is quite scary<br>Not very sensible<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>And if there's anybody left in here<br>That doesn't want to be out there<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot</p>",
        "_ts":6341730138387906561,
        "title":"I predict a riot",
        "id":"redacted"}]
  }}

However, when I switch to the DisMax query parser with the parameters shown below, this is what I get:

{
  "responseHeader": {
  "status": 0,
  "QTime": 1,
  "params": {
    "q": "\"I predict a riot\"",
    "defType": "dismax",
    "ps": "0",
    "qf": "text",
    "echoParams": "all",
    "pf": "text^5",
    "wt": "json"
  }
},
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

Nothing. If I remove the quotes, it returns very irrelevant results (songs by an artist called "I"). In case it isn't clear, "I predict a riot" is present in the text field of this document, several times even.

I'm a Solr newbie and I don't understand what is wrong with this query. I tried changing qf and pf to "artist text title", but that made no difference.

Ideally, the goal is to find matches in all three fields, with a huge bonus if all the words are found in the same order in the title, the artist or the text. But even this simple test doesn't seem to work. :-/

Thanks!

Edit: With these params

"params": {
"q": "I predict a riot",
"defType": "dismax",
"qf": "text artist title",
"echoParams": "all",
"pf": "text^5",
"rows": "100",
"wt": "json"
}

which is giving me this debug query:

"debug": {
"rawquerystring": "I predict a riot",
"querystring": "I predict a riot",
"parsedquery": "(+(DisjunctionMaxQuery((text:I | title:I | artist:I)) DisjunctionMaxQuery((text:predict | title:predict | artist:predict)) DisjunctionMaxQuery((text:a | title:a | artist:a)) DisjunctionMaxQuery((text:riot | title:riot | artist:riot))) DisjunctionMaxQuery(((text:I predict a riot)^5.0)))/no_coord",
"parsedquery_toString": "+((text:I | title:I | artist:I) (text:predict | title:predict | artist:predict) (text:a | title:a | artist:a) (text:riot | title:riot | artist:riot)) ((text:I predict a riot)^5.0)",
"QParser": "DisMaxQParser",
"altquerystring": null,
"boostfuncs": null
}

I'm getting awful results, e.g. an artist called "I", but not the Kaiser Chiefs song, which has the query as its title and several times in its text.

Definitions:

 <field name="title" type="string" indexed="true" stored="true"/>
 <field name="artist" type="string" indexed="true" stored="true"/>   
 <field name="text" type="string" indexed="true" stored="true"/>

Solution

  • A string field matches only on the exact value of the field, including capitalization and whitespace.

    To get the kind of match you're expecting, you'll want a text field instead. The text_general or text_en field type in the example schema is usable, at least as a starting point, but you may want to tune its analysis chain based on how you want to query the field. If you don't have synonyms or don't want stopwords removed, remove those lines and keep only the tokenizer and the lowercase filter:

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <!-- The tokenizer and filters must be wrapped in an analyzer element. -->
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
            <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
    

    You'll need to reindex the data after changing the field type.

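    But I do have a field in qf that contains the complete sentence? Yes, but the dismax query parser splits the input into terms according to its own rules (on whitespace here) and then builds a new, internal query from those terms. You can see in the parsedquery output that the query string is expanded into a list of optional clauses, one per term, each searched for separately in every qf field. Since no indexed value equals any of these terms by itself (apart from the artist literally named "I"), you don't get the hits you expect. The pf clause doesn't help either: on a string field it looks for the whole phrase as a single exact value of text, and that field holds the complete lyrics.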
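The difference between the two field types can be sketched with a toy model. This is not how Solr is implemented; the function names are made up for illustration, and whitespace splitting plus lowercasing only roughly stands in for a tokenizer and a lowercase filter.

```python
# Toy model of term matching; an illustration, not Solr's actual implementation.

def index_as_string(value):
    """A string field indexes the whole value as one unmodified token."""
    return {value}

def index_as_text(value):
    """Rough stand-in for a tokenizer followed by a lowercase filter."""
    return {token.lower() for token in value.split()}

def matches_any(query, indexed_tokens, lowercase):
    """Split the query on whitespace and look up each term separately."""
    terms = query.split()
    if lowercase:
        terms = [term.lower() for term in terms]
    return any(term in indexed_tokens for term in terms)

query = "I predict a riot"
title = "I predict a riot"

# String field: the only indexed token is the full title, so no single term matches.
string_hit = matches_any(query, index_as_string(title), lowercase=False)

# Text field: every word is its own token, so the terms match.
text_hit = matches_any(query, index_as_text(title), lowercase=True)

# An artist literally named "I" matches the term "I" even as a string field.
artist_i_hit = matches_any(query, index_as_string("I"), lowercase=False)

print(string_hit, text_hit, artist_i_hit)
```

In this model the title indexed as a string never matches an individual term, while the artist named "I" does, which mirrors the irrelevant results described in the question.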

    If you had used the edismax query parser, which also supports the Lucene query syntax, you could have used title:"I predict a riot" to get at least one hit. It still wouldn't behave as you expect, though: you would only get documents whose title matches the query character for character.
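For the goal stated in the question (match in all three fields, with a bonus when the words appear in order), a request could look roughly like the sketch below. It assumes the three fields have already been switched to a tokenized text type and reindexed. The host, the core name "lyrics" and the boost values are placeholders chosen for illustration, not values from the question.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; replace the host and the core name with your own.
SOLR_SELECT = "http://localhost:8983/solr/lyrics/select"

params = {
    "defType": "edismax",
    "q": "I predict a riot",
    "qf": "title artist text",         # fields searched term by term
    "pf": "title^10 artist^5 text^5",  # phrase boost; weights are illustrative
    "ps": "0",                         # phrase slop 0: words adjacent and in order
    "debugQuery": "true",              # include the parsed query in the response
    "wt": "json",
}

# Build the request URL only; nothing is sent to a server here.
url = SOLR_SELECT + "?" + urlencode(params)
print(url)
```

Checking the parsedquery in the debug output after such a change is a quick way to confirm that the phrase clause is built per field as intended.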