
Searching and match count for phrase with Solr


I am using Solr to index documents, and now I need to search those documents for an exact phrase and sort the results by the number of times the phrase appears in each document. I also have to show the user how many times the phrase matched in each document.

I was using the following query (here I am searching for the single word SAP):

{
    :params => {
            :wt => "json",
        :indent => "on",
          :rows => 100,
         :start => 0,
             :q => "((content:SAP) AND (doc_type:ClientContact) AND (environment:production))",
          :sort => "termfreq(content,SAP) desc",
            :fl => "id,termfreq(content,SAP)"
    }
}

This is only a representation of the actual query; at runtime the hash is transformed into a query string.
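For reference, a minimal sketch of that transformation using only the Ruby standard library. It assumes Ruby 1.9.2 or later (for URI.encode_www_form); the base URL and the solr_url helper are hypothetical names, not part of the original application.

```ruby
require "uri"

# Hypothetical base URL; adjust host, port, and core to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

# Turn a params hash into a full Solr request URL.
# URI.encode_www_form percent-encodes every key and value,
# so parentheses, commas, and colons in function queries are safe.
def solr_url(params, base = SOLR_SELECT_URL)
  uri = URI.parse(base)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

params = {
  :wt   => "json",
  :rows => 100,
  :q    => "content:SAP",
  :sort => "termfreq(content,SAP) desc",
  :fl   => "id,termfreq(content,SAP)"
}

puts solr_url(params)
```

The resulting URL can then be fetched with Net::HTTP.get(URI.parse(url)), which is also part of the standard library.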

I managed to get the phrase search itself working by using content:"the query here" instead of content:the query here, but the hard part is returning the term frequency for the phrase and sorting by it.

Any ideas on how I could make this work?

Note: I am using Ruby, but this is a legacy application and I can't use any RubyGems, so I am talking to Solr through its HTTP interface.


Solution

  • I was able to make it work by adding a ShingleFilter to my schema.xml.

    In my case I started using Sunspot, so I just had to make the following change:

    <!-- *** This fieldType is used by Sunspot! *** -->
    <fieldType name="text" class="solr.TextField" omitNorms="false">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- This is the line I added -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
      </analyzer>
    </fieldType>
    

    After making that change, restarting Solr, and reindexing, I was able to use termfreq(content, "the query here") in the query (q=), in the returned fields (fl=), and even for sorting (sort=).
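
A rough sketch of how the request parameters might be built for a phrase once the shingled field is in place. The phrase_params helper and its escaping rule are assumptions for illustration, not something taken from the original application.

```ruby
# Build Solr params that match a phrase, sort by how often it occurs,
# and return that count alongside each document id.
def phrase_params(field, phrase)
  # Escape backslashes and double quotes so the phrase stays a single
  # quoted argument (assumed escaping convention).
  quoted = '"' + phrase.gsub(/(["\\])/) { "\\#{$1}" } + '"'
  tf = "termfreq(#{field},#{quoted})"
  {
    :wt   => "json",
    :q    => "#{field}:#{quoted}",
    :sort => "#{tf} desc",
    :fl   => "id,#{tf}"
  }
end

params = phrase_params("content", "the query here")
puts params[:sort]
```

Since maxShingleSize is 4 in the schema above, a phrase longer than four words is not indexed as a single shingle, so termfreq would presumably not count it; raising maxShingleSize and reindexing would be the likely remedy, at the cost of a larger index.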