I have a Solr MoreLikeThis query that is returning clearly unrelated results. When I look at the debug output for the query, I can see that it is matching on what appear to be newline characters.
Here's the query:
mlt?q=is_lesson_id:49029&start=0&rows=3&fl=*,score&wt=json&fq={!tag=sites}sm_sitename:(FCM OR BCM OR CCM)&mlt.interestingTerms=details&mlt.match.include=false&mlt.match.offset=0&mlt.fl=title, body&mlt.mintf=2&mlt.mindf=1&mlt.minwl=4&mlt.boost=true&mlt.qf=title^1000 body&indent=on&debugQuery=on
Here's the debug output, including the interesting terms and the explain:
"interestingTerms":[
"body:rabbit",1.0,
"body:bunni",0.8582874,
"body:easter",0.7999738,
"body: ",0.5719101,
"body:ampampnbsp",0.51804715,
"body:nbsp",0.36014518],
"debug":{
"rawquerystring":"is_lesson_id:49029",
"querystring":"is_lesson_id:49029",
"parsedquery":"body:rabbit body:bunni^0.8582874
body:easter^0.7999738
body: ^0.5719101
body:ampampnbsp^0.51804715
body:nbsp^0.36014518",
"parsedquery_toString":"body:rabbit
body:bunni^0.8582874
body:easter^0.7999738
body: ^0.5719101
body:ampampnbsp^0.51804715
body:nbsp^0.36014518",
"explain":{
"p5zqzz/node/681":"\n0.14956066 = (MATCH) product of:\n 0.44868195 = (MATCH) sum of:\n 0.20911716 = (MATCH) weight(body:bunni^0.8582874 in 327), product of:\n 0.5523649 = queryWeight(body:bunni^0.8582874), product of:\n 0.8582874 = boost\n 6.9227004 = idf(docFreq=116, maxDocs=43690)\n 0.09296464 = queryNorm\n 0.3785852 = (MATCH) fieldWeight(body:bunni in 327), product of:\n 1.0 = tf(termFreq(body:bunni)=1)\n 6.9227004 = idf(docFreq=116, maxDocs=43690)\n 0.0546875 = fieldNorm(field=body, doc=327)\n 0.2395648 = (MATCH) weight(body:easter^0.7999738 in 327), product of:\n 0.4799619 = queryWeight(body:easter^0.7999738), product of:\n 0.7999738 = boost\n 6.453766 = idf(docFreq=186, maxDocs=43690)\n 0.09296464 = queryNorm\n 0.49913296 = (MATCH) fieldWeight(body:easter in 327), product of:\n 1.4142135 = tf(termFreq(body:easter)=2)\n 6.453766 = idf(docFreq=186, maxDocs=43690)\n 0.0546875 = fieldNorm(field=body, doc=327)\n 0.33333334 = coord(2/6)\n",
"p5zqzz/node/621":"\n0.14027193 = (MATCH) product of:\n 0.42081577 = (MATCH) sum of:\n 0.21124022 = (MATCH) weight(body:bunni^0.8582874 in 328), product of:\n 0.5523649 = queryWeight(body:bunni^0.8582874), product of:\n 0.8582874 = boost\n 6.9227004 = idf(docFreq=116, maxDocs=43690)\n 0.09296464 = queryNorm\n 0.38242877 = (MATCH) fieldWeight(body:bunni in 328), product of:\n 1.4142135 = tf(termFreq(body:bunni)=2)\n 6.9227004 = idf(docFreq=116, maxDocs=43690)\n 0.0390625 = fieldNorm(field=body, doc=328)\n 0.20957555 = (MATCH) weight(body:easter^0.7999738 in 328), product of:\n 0.4799619 = queryWeight(body:easter^0.7999738), product of:\n 0.7999738 = boost\n 6.453766 = idf(docFreq=186, maxDocs=43690)\n 0.09296464 = queryNorm\n 0.4366504 = (MATCH) fieldWeight(body:easter in 328), product of:\n 1.7320508 = tf(termFreq(body:easter)=3)\n 6.453766 = idf(docFreq=186, maxDocs=43690)\n 0.0390625 = fieldNorm(field=body, doc=328)\n 0.33333334 = coord(2/6)\n",
"p5zqzz/node/1204":"\n0.10955032 = (MATCH) product of:\n 0.32865095 = (MATCH) sum of:\n 0.10455858 = (MATCH) weight(body:bunni^0.8582874 in 432), product of:\n 0.5523649 = queryWeight(body:bunni^0.8582874), product of:\n 0.8582874 = boost\n 6.9227004 = idf(docFreq=116, maxDocs=43690)\n 0.09296464 = queryNorm\n 0.1892926 = (MATCH) fieldWeight(body:bunni in 432), product of:\n 1.0 = tf(termFreq(body:bunni)=1)\n 6.9227004 = idf(docFreq=116, maxDocs=43690)\n 0.02734375 = fieldNorm(field=body, doc=432)\n 0.22409238 = (MATCH) weight(body:easter^0.7999738 in 432), product of:\n 0.4799619 = queryWeight(body:easter^0.7999738), product of:\n 0.7999738 = boost\n 6.453766 = idf(docFreq=186, maxDocs=43690)\n 0.09296464 = queryNorm\n 0.46689618 = (MATCH) fieldWeight(body:easter in 432), product of:\n 2.6457512 = tf(termFreq(body:easter)=7)\n 6.453766 = idf(docFreq=186, maxDocs=43690)\n 0.02734375 = fieldNorm(field=body, doc=432)\n 0.33333334 = coord(2/6)\n"},
"filter_queries":["{!tag=sites}sm_sitename:(FCM OR BCM OR CCM)"],
"parsed_filter_queries":["sm_sitename:FCM sm_sitename:BCM sm_sitename:CCM"]}}
Is this indicative of a misconfiguration on the server, or is the content being indexed improperly, or does the query need to be changed?
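Are you indexing HTML? The interesting terms nbsp and ampampnbsp look like leftovers from &nbsp; entities, and the whitespace-only term may be a non-breaking space rather than a newline. You may want to strip the HTML markup out of the text at the beginning of your analysis chain, before the tokenizer runs. See HTMLStripCharFilterFactory on this page for more info: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory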
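As a rough sketch of what that could look like in schema.xml: the field type name text_html, the stopwords.txt file, and the particular tokenizer and stemmer below are assumptions for illustration, not taken from your setup, so adapt them to whatever your body field currently uses. The relevant part is the charFilter line, which runs before the tokenizer.

```xml
<!-- Hypothetical field type; the name and filter chain are illustrative only -->
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer: strips tags and decodes entities such as &nbsp; -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumes a stopwords.txt exists in the core's conf directory -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Your index already stems (e.g. "bunni"); keep whichever stemmer you use now -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>

<!-- Point the body field at the new type (other attributes as in your existing definition) -->
<field name="body" type="text_html" indexed="true" stored="true"/>
```

You would need to reindex after changing the analyzer for the new terms to take effect. Note also that a term like ampampnbsp suggests the entity may have been escaped more than once before it reached Solr; if so, the char filter will probably only undo one level, and the double escaping would be better fixed wherever the content is prepared for indexing.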