I am trying to calculate the total number of times a particular term occurs throughout an entire index (term collection frequency). I have attempted to do so through the use of term vectors, however this is restricted to a single document. Even in the case of terms that exist within a specified document, the response seems to max out at a certain doc_count (within field_statistics) which makes me doubtful of its accuracy.
Request:
http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true
The document id being used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.
Response:
This is what I get for the term "cancer" for one of the fields:
"cancer" : {
"doc_freq" : 5297,
"ttf" : 10587,
"term_freq" : 1,
"tokens" : [
{
"position" : 15,
"start_offset" : 115,
"end_offset" : 121
}
]
},
If I total the ttf for all fields, I get 18915. However, the actual total term frequency for "cancer" is in fact 542829. This leads me to believe that it is limiting the term_vector stats to a subset of documents within the index.
Any advice here would be greatly appreciated.
The reason for the difference in the count is because term vectors are not accurate unless the index in question has a single shard. For indexes with multiple shards, the documents are distributed all over the shards, hence the frequency returned isn't the total but from a randomly selected shard.
Thus, the returned frequency is just a relative measure and not the absolute value you expect. see the Behaviour section. To test this, you can create a single shard index and request the frequency (it should give you the actual total).