
Get list of words from Redisearch index, sorted by most common occurrence


I have a simple redisearch index which I create in Python with:

>>> from redisearch import Client, TextField
>>> c = Client('common_words')
>>> c.create_index((TextField('body'),))
b'OK'
>>> c.add_document('ibiza', body='kevin paul dad')
b'OK'
>>> c.add_document('england', body='kevin dad')
b'OK'
>>> c.add_document('bank', body='kevin robber')
b'OK'

I can then search a particular word, which works great:

>>> c.search('kevin')
Result{3 total, docs:
   [Document {'id': 'bank', 'payload': None, 'body': 'kevin robber'},
    Document {'id': 'england', 'payload': None, 'body': 'kevin dad'},
    Document {'id': 'ibiza', 'payload': None, 'body': 'kevin paul dad'}
   ]}

Is there a quick way to pull a list of words along with their occurrence counts? I'm aiming for a result like:

Result{4 total, counts:
   [ Word { 'word': 'kevin', 'count': 3 },
     Word { 'word': 'dad', 'count': 2 },
     Word { 'word': 'paul', 'count': 1 },
     Word { 'word': 'robber', 'count': 1 } ] }

I've looked at this example of how to make a word count using nltk and zincrby, but wondered if there was already a way to get this natively from RediSearch.


Solution

  • The only way you can currently do this is with aggregations (https://oss.redislabs.com/redisearch/Aggregations.html). You ask for all the results, load the field you're interested in, split its text on spaces, and count how many times each word appears. The query looks like this:

    127.0.0.1:6379> FT.AGGREGATE common_words * LOAD 1 @body APPLY "split(@body, ' ')" as s 
    GROUPBY 1 @s REDUCE count 0 as count
    1) (integer) 4
    2) 1) s
       2) "paul"
       3) count
       4) "1"
    3) 1) s
       2) "kevin"
       3) count
       4) "3"
    4) 1) s
       2) "dad"
       3) count
       4) "2"
    5) 1) s
       2) "robber"
       3) count
       4) "1"
    

    Notice the following: the purpose of aggregation is to aggregate a result set, and there are configuration variables that limit the size of that set. Once you hit such a limit, the search will not return all the results, and the aggregation phase will not process all of them. It is possible to raise some of those limits (MAXEXPANSIONS, for example), but if you intend to process millions of results you will eventually hit them anyway, and your query will take a long time to finish. The correct approach here is to reduce the result set with a more specific query than '*', and then use aggregation to do the extra calculation on that smaller result set.
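For reference, the APPLY/GROUPBY/REDUCE pipeline above computes the same thing as this plain-Python word count over the question's three documents (a standalone sketch that does not touch RediSearch at all):

```python
from collections import Counter

# The three documents indexed in the question
docs = {
    'ibiza': 'kevin paul dad',
    'england': 'kevin dad',
    'bank': 'kevin robber',
}

# split(@body, ' ') + GROUPBY/REDUCE count, done client-side:
# split each body on spaces and tally every word
counts = Counter(word for body in docs.values() for word in body.split(' '))

for word, count in counts.most_common():
    print(word, count)
# kevin 3
# dad 2
# paul 1
# robber 1
```

The difference, of course, is that FT.AGGREGATE does this server-side without shipping every document body to the client, which is why the answer's note about result-set limits matters once the index grows.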