Search code examples
pythonpylucene

How to use highlighter in pyLucene?


I have read some tutorials about highlighting search terms in Lucene, and came up with a piece of code like this:

(...)
query = parser.parse(query_string)

for scoreDoc in  searcher.search(query, 50).scoreDocs:
    doc = searcher.doc(scoreDoc.doc)
    filename = doc.get("filename")
    print filename
    found_paraghaph = fetch_from_my_text_library(filename)

    stream = lucene.TokenSources.getTokenStream("contents", found_paraghaph, analyzer);
    scorer = lucene.Scorer(query, "contents", lucene.CachingTokenFilter(stream))
    highligter = lucene.Highligter(scorer)
    fragment = highligter.getBestFragment(analyzer, "contents", found_paraghaph)
    print '>>>' + fragment

But it all ends with an error:

Traceback (most recent call last):
  File "./search.py", line 76, in <module>
    scorer = lucene.Scorer(query, "contents", lucene.CachingTokenFilter(stream))
NotImplementedError: ('instantiating java class', <type 'Scorer'>)

So, I guess, that this part of Lucene insn't iplemented yet in pyLucene. Is there any other way to do it?


Solution

  • I too got similar error. I think this class's wrapper is not yet implemented for Pylucene v3.6.

    You might want to try the following:

    analyzer = StandardAnalyzer(Version.LUCENE_CURRENT)
    
    # Constructs a query parser.
    queryParser = QueryParser(Version.LUCENE_CURRENT, FIELD_CONTENTS, analyzer)
    
    # Create a query
    query = queryParser.parse(QUERY_STRING)
    
    topDocs = searcher.search(query, 50)
    
    # Get top hits
    scoreDocs = topDocs.scoreDocs
    print "%s total matching documents." % len(scoreDocs)
    
    HighlightFormatter = SimpleHTMLFormatter();
    highlighter = Highlighter(HighlightFormatter, QueryScorer (query))
    
    for scoreDoc in scoreDocs:
        doc = searcher.doc(scoreDoc.doc)
        text = doc.get(FIELD_CONTENTS)
        ts = analyzer.tokenStream(FIELD_CONTENTS, StringReader(text))
        print doc.get(FIELD_PATH)
        print highlighter.getBestFragments(ts, text, 3, "...")
        print ""
    

    Please note that we create token stream for each item in the search result.