solr, django-haystack

Find duplicate objects with Solr 4 and Haystack


I use Solr's faceting to find duplicates. It works pretty well, but I can't figure out how to get the objects' ids.

>>> from haystack.query import SearchQuerySet
>>> sqs = SearchQuerySet().facet('text_string', limit=-1)
>>> sqs.facet_counts()
{
    'dates': {},
    'fields': {
        'text_string': [
            ('the red ballon', 4),
            ('my grand pa is an alien', 2),
            ('be kind rewind', 12),
        ],
    },
    'queries': {}
}

How can I get the ids of the objects behind 'the red ballon', 'my grand pa is an alien', etc.? Do I have to add an id field to Solr's schema.xml?

I'm expecting something like this:

>>> sqs.facet_counts()
{
    'dates': {},
    'fields': {
        'text_string': [
            (object_id, 'the red ballon', 4),
            (object_id, 'my grand pa is an alien', 2),
            (object_id, 'be kind rewind', 12),
        ],
    },
    'queries': {}
}

EDIT: Added schema.xml and search_indexes.py

schema.xml for solr

...
  <fields>
    <!-- general -->
    <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="django_ct" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="django_id" type="string" indexed="true" stored="true" multiValued="false"/>
    <field name="_version_" type="long" indexed="true" stored="true"/>
    <dynamicField name="*_i"  type="int"    indexed="true"  stored="true"/>
    <dynamicField name="*_s"  type="string"  indexed="true"  stored="true"/>
    <dynamicField name="*_l"  type="long"   indexed="true"  stored="true"/>
    <dynamicField name="*_t"  type="text_en"    indexed="true"  stored="true"/>
    <dynamicField name="*_b"  type="boolean" indexed="true"  stored="true"/>
    <dynamicField name="*_f"  type="float"  indexed="true"  stored="true"/>
    <dynamicField name="*_d"  type="double" indexed="true"  stored="true"/>
    <dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
    <dynamicField name="*_p" type="location" indexed="true" stored="true"/>
    <dynamicField name="*_coordinate"  type="tdouble" indexed="true"  stored="false"/>

    <field name="text" type="text_en" indexed="true" stored="true" multiValued="false"  termVectors="true" />
    <field name="title" type="text_en" indexed="true" stored="true" multiValued="false"  />

    <!-- Used for duplicate content detection --> 
    <copyField source="title" dest="text_string" />
    <field name="text_string" type="string" indexed="true" stored="true" multiValued="false" />
    <field name="pk" type="long" indexed="true" stored="true" multiValued="false" />

  </fields>

  <!-- field to use to determine and enforce document uniqueness. -->
  <uniqueKey>id</uniqueKey>

  <!-- field for the QueryParser to use when an explicit fieldname is absent -->
  <defaultSearchField>text</defaultSearchField>

  <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
  <solrQueryParser defaultOperator="AND"/>
...

search_indexes.py

from haystack import indexes
# Video is the project's model; import it from your app's models module.


class VideoIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)
    pk = indexes.IntegerField(model_attr='pk')
    title = indexes.CharField(model_attr='title', boost=1.125)

    def index_queryset(self, using=None):
        # Only index videos that belong to the current site.
        return Video.on_site.all()

    def get_model(self):
        return Video

Solution

  • Faceting is the arrangement of search results into categories based on indexed terms. Within each category, Solr reports the number of hits for each relevant term, which is called a facet constraint. Faceting makes it easy for users to explore search results on sites such as movie sites and product review sites, where there are many categories and many items within a category.

    Here are some good examples:

    faceting example by Yonik

    faceting example on solr wiki

    Facet counts only contain the term and the number of hits, not the matching documents. In your case you will probably need to run a second query for each duplicated value to get the ids and other details.
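
The "query again" step could look roughly like the sketch below. It is a minimal illustration, not a tested recipe: the helper names (duplicated_values, ids_for_duplicates, fetch_ids_with_haystack) are made up for this example, and it assumes that filtering on text_string with an exact lookup matches the same values the facet reported, which should hold for a Solr string field but is worth checking against your own index.

```python
def duplicated_values(facet_counts, field, min_count=2):
    """Return the facet values of `field` seen at least `min_count` times."""
    pairs = facet_counts.get('fields', {}).get(field, [])
    return [value for value, count in pairs if count >= min_count]


def ids_for_duplicates(facet_counts, field, fetch_ids, min_count=2):
    """Map each duplicated value to whatever `fetch_ids(value)` returns."""
    return {
        value: fetch_ids(value)
        for value in duplicated_values(facet_counts, field, min_count)
    }


def fetch_ids_with_haystack(value, field='text_string'):
    """Hypothetical helper: run one extra search per duplicated value."""
    # Imported lazily so the helpers above can be used without Haystack.
    from haystack.query import SearchQuerySet

    results = SearchQuerySet().filter(**{field + '__exact': value})
    # SearchResult.pk is the primary key of the indexed Django object.
    return [result.pk for result in results]
```

With the question's data this would be called as ids_for_duplicates(sqs.facet_counts(), 'text_string', fetch_ids_with_haystack). It issues one additional query per duplicated value, so it may be slow when there are many duplicates. Passing fetch_ids as a parameter keeps the facet handling separate from the search backend, which also makes it easy to try the logic with a stub.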