How to get word count of SOLR document?


I have the binary content of a pdf file, and I want to upload it to SOLR and index its content:

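    // Imports belong at the top of the file:
    // import org.apache.solr.client.solrj.request.AbstractUpdateRequest
    // import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest

    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest('/update/extract')
    up.setParam("literal.id", map.id)
    // Write the PDF bytes to a temporary file so SolrJ can stream it
    def tmpFile = File.createTempFile(map.id, ".tmp")
    try {
        tmpFile.append(binary)
        // The second argument is the content type, not a file extension
        up.addFile(tmpFile, "application/pdf")
        // Do the SOLR stuff here
        def solr = getSolrServer()
        // Commit right away, waiting for the flush and for a new searcher
        up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true)
        return solr.request(up)
    } finally {
        // Remove the temporary file even if the request fails
        tmpFile.delete()
    }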

When I query SOLR, I can retrieve the SOLR document. How can I get the actual content of the file? Basically, I need to find the word count of the document I've uploaded, so I was planning to call size() on the returned string (if that's even possible).

I'm very new to SOLR so am probably on the wrong track... any assistance greatly appreciated :)


Solution

  • I am assuming you want to count the number of words in the PDF you have indexed. Make sure that:

    1. The entire extracted content of the PDF is indexed into a single field.
    2. This field uses at least a whitespace tokenizer, so that the text is split into words on whitespace.

    Once this is in place, you can find the number of words using either facets or the Term Vector Component. The SO answer below might be helpful:

    https://stackoverflow.com/a/26933126/689625
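
For a quick sanity check on the client side, the sketch below approximates what a whitespace tokenizer does: it splits a string on runs of whitespace and counts the tokens. It assumes the extracted text has already been read back from a stored field into a plain string. The class and method names (`WordCount`, `countWords`, `termFrequencies`) are made up for illustration and are not part of Solr or SolrJ. Solr's own analysis chain may apply further filters, so the numbers can differ from what facets or the Term Vector Component report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCount {

    // Split on runs of whitespace, roughly what a whitespace tokenizer does.
    static String[] tokenize(String text) {
        String trimmed = (text == null) ? "" : text.trim();
        // "".split(...) would yield one empty token, so handle blank input explicitly.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Total number of whitespace-separated tokens.
    static int countWords(String text) {
        return tokenize(text).length;
    }

    // Number of occurrences of each token, in order of first appearance.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : tokenize(text)) {
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the text extracted from the PDF.
        String content = "the quick brown fox jumps over the lazy dog";
        System.out.println(countWords(content));                 // 9
        System.out.println(termFrequencies(content).get("the")); // 2
    }
}
```

Because this is plain Java, it can be called directly from the Groovy code in the question. Note that tokens are compared case-sensitively and punctuation stays attached to words, just as with a bare whitespace tokenizer.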