
PyLucene Custom TokenStream using PythonTokenStream


I am attempting to build a TokenStream from a Python Sequence. Just for fun I want to be able to pass my own Tokens directly to

pylucene.Field("MyField", MyTokenStream)

I tried to make "MyTokenStream" by...

terms = ['pant', 'on', 'ground', 'look', 'like', 'fool']
stream = pylucene.PythonTokenStream()
for t in terms:
  stream.addAttribute(pylucene.TermAttribute(t))

Unfortunately, no wrapper exists for "TermAttribute", or for any of the other Attribute classes in Lucene, so I get a NotImplemented error when calling them.

This doesn't raise an exception, but I'm not sure whether it actually sets the terms:

PythonTokenStream(terms)

Solution

  • The Python* classes are designed to customize behavior by subclassing. In the case of TokenStream, the incrementToken method needs to be overridden.

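    import lucene  # call lucene.initVM() once before using any of these classes

    class MyTokenStream(lucene.PythonTokenStream):
        def __init__(self, terms):
            lucene.PythonTokenStream.__init__(self)
            # One shared iterator, so each incrementToken call resumes where the last stopped.
            self.terms = iter(terms)
            self.addAttribute(lucene.TermAttribute.class_)
        def incrementToken(self):
            # The loop body runs at most once per call: take the next term and return.
            for term in self.terms:
                self.getAttribute(lucene.TermAttribute.class_).setTermBuffer(term)
                return True
            # Iterator exhausted: signal the end of the stream.
            return False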
    
    mts = MyTokenStream(['pant', 'on', 'ground', 'look', 'like', 'fool'])
    while mts.incrementToken():
        print mts
    
    <MyTokenStream: (pant)>
    <MyTokenStream: (on)>
    <MyTokenStream: (ground)>
    <MyTokenStream: (look)>
    <MyTokenStream: (like)>
    <MyTokenStream: (fool)>
    

    The result of addAttribute could also be stored, obviating the need for getAttribute. My lupyne project has an example of that.
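
To make the last point concrete, here is a minimal sketch of the same stream that keeps the attribute returned by addAttribute instead of looking it up on every call. It does not use PyLucene: `StubTermAttribute`, `StubTokenStream`, and `collect_terms` are hypothetical stand-ins invented for illustration, so the control flow can be tried without a JVM. With the real library the base class would be lucene.PythonTokenStream and the argument would be lucene.TermAttribute.class_, as in the answer above.

```python
# Stand-ins for illustration only; these are NOT PyLucene classes.
class StubTermAttribute(object):
    """Hypothetical replacement for lucene.TermAttribute: holds one term."""
    def __init__(self):
        self.term = None

    def setTermBuffer(self, term):
        self.term = term


class StubTokenStream(object):
    """Hypothetical, simplified replacement for lucene.PythonTokenStream."""
    def addAttribute(self, attribute_class):
        # Simplification: always build a fresh attribute and hand it back.
        return attribute_class()


class MyTokenStream(StubTokenStream):
    def __init__(self, terms):
        StubTokenStream.__init__(self)
        self.terms = iter(terms)
        # Keep the returned attribute so incrementToken needs no lookup.
        self.termAtt = self.addAttribute(StubTermAttribute)

    def incrementToken(self):
        # Pull at most one term per call from the shared iterator.
        for term in self.terms:
            self.termAtt.setTermBuffer(term)
            return True
        # Iterator exhausted: end of stream.
        return False


def collect_terms(stream):
    """Drain the stream as a consumer would and return the terms seen."""
    seen = []
    while stream.incrementToken():
        seen.append(stream.termAtt.term)
    return seen
```

Calling `collect_terms(MyTokenStream(['pant', 'on', 'ground']))` walks the stream one token at a time and gathers each term from the stored attribute. The stub's addAttribute is deliberately simpler than the real method, so treat this as a sketch of the pattern rather than of Lucene's behavior.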