Tags: python, io, nlp, nltk, machine-translation

How to save Python NLTK alignment models for later use?


In Python, I'm using NLTK's alignment module to create word alignments between parallel texts. Aligning bitexts can be time-consuming, especially over large corpora. It would be nice to compute the alignments in a batch one day and reuse them later on.

from nltk import IBMModel1 as ibm
biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file

Once I create a model, how can I (1) save it to disk and (2) reuse it later?


Solution

  • The immediate answer is to pickle it; see https://wiki.python.org/moin/UsingPickle

    However, an IBMModel1 object holds a lambda function created inside its train() method, so it cannot be pickled with the default pickle / cPickle (see https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L74 and https://github.com/nltk/nltk/blob/develop/nltk/align/ibm1.py#L104).

    So we'll use dill instead. First, install dill (see the related question "Can Python pickle lambda functions?"):

    $ pip install dill
    $ python
    >>> import dill as pickle
    

    Then:

    >>> import dill as pickle
    >>> from nltk.corpus import comtrans
    >>> from nltk.align import IBMModel1
    >>> bitexts = comtrans.aligned_sents()[:100]
    >>> ibm = IBMModel1(bitexts, 20)
    >>> with open('model1.pk', 'wb') as fout:
    ...     pickle.dump(ibm, fout)
    ...
    >>> exit()
    

    To use the pickled model:

    >>> import dill as pickle
    >>> from nltk.corpus import comtrans
    >>> bitexts = comtrans.aligned_sents()[:100]
    >>> with open('model1.pk', 'rb') as fin:
    ...     ibm = pickle.load(fin)
    ... 
    >>> aligned_sent = ibm.align(bitexts[0])
    >>> aligned_sent.words
    ['Wiederaufnahme', 'der', 'Sitzungsperiode']
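
The save and load steps above can be wrapped in two small helpers. This is only a sketch: `save_model`, `load_model` and the `serializer` parameter are hypothetical names, not part of NLTK or dill, and a plain dict stands in for a trained model so the example runs without either library. The important details are the binary file modes ('wb' / 'rb') and dumping the model object itself rather than the return value of a method.

```python
import os
import pickle
import tempfile


def save_model(model, path, serializer=pickle):
    """Write `model` to `path` in binary mode using `serializer.dump`."""
    with open(path, "wb") as fout:
        serializer.dump(model, fout)


def load_model(path, serializer=pickle):
    """Read a model back from `path` using `serializer.load`."""
    with open(path, "rb") as fin:
        return serializer.load(fin)


# Toy stand-in for a trained model (an assumption for illustration only).
toy_model = {"haus": {"house": 0.9}}

model_path = os.path.join(tempfile.mkdtemp(), "toy_model.pk")
save_model(toy_model, model_path)
reloaded = load_model(model_path)
print(reloaded == toy_model)
```

For the IBMModel1 object, the same helpers would presumably be called with dill as the serializer, e.g. `save_model(ibm, 'model1.pk', serializer=dill)`, since dill exposes the same `dump` / `load` calls used in the transcript. As with pickle, only load files you trust: unpickling can execute arbitrary code.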
    

    If you try to pickle the IBMModel1 object, which holds a lambda function, with the standard cPickle, you'll end up with this:

    >>> import cPickle as pickle
    >>> from nltk.corpus import comtrans
    >>> from nltk.align import IBMModel1
    >>> bitexts = comtrans.aligned_sents()[:100]
    >>> ibm = IBMModel1(bitexts, 20)
    >>> with open('model1.pk', 'wb') as fout:
    ...     pickle.dump(ibm, fout)
    ... 
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
        raise TypeError, "can't pickle %s objects" % base.__name__
    TypeError: can't pickle function objects
    

    (Note: the above code snippet comes from NLTK version 3.0.0)
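
The failure can be reproduced without NLTK. The sketch below assumes the model keeps its probabilities in a defaultdict whose default factory is a lambda defined inside a function; `build_table` and `to_plain_dict` are hypothetical names used only for illustration. It also shows one possible workaround when dill is not an option: copying the table into ordinary nested dicts, which the standard pickle handles.

```python
import pickle
from collections import defaultdict


def build_table(init_prob=0.25):
    """Hypothetical stand-in for a model's probability table."""
    # The lambdas are created inside this function, so the standard
    # pickle cannot find them by name when it tries to serialize them.
    table = defaultdict(lambda: defaultdict(lambda: init_prob))
    table["haus"]["house"] = 0.9
    return table


def to_plain_dict(table):
    """Copy a two-level defaultdict into ordinary nested dicts."""
    return {outer: dict(inner) for outer, inner in table.items()}


table = build_table()

try:
    pickle.dumps(table)
    pickled_ok = True
except (pickle.PicklingError, AttributeError, TypeError) as err:
    # The exact exception type depends on the Python version.
    pickled_ok = False
    print("standard pickle failed:", type(err).__name__)

# Plain dicts contain no lambda, so they round-trip without dill.
plain = to_plain_dict(table)
restored = pickle.loads(pickle.dumps(plain))
print(restored["haus"]["house"])
```

The trade-off is that the plain dicts lose the default value: looking up an unseen word pair raises KeyError instead of returning the initial probability, so this only suits cases where the stored entries are all you need.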

    In Python 3 with NLTK 3.0.0, you will face the same problem with the standard pickle, because the IBMModel1 object holds a lambda function; dill works there as well:

    alvas@ubi:~$ python3
    Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
    [GCC 4.8.2] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import pickle
    >>> from nltk.corpus import comtrans
    >>> from nltk.align import IBMModel1
    >>> bitexts = comtrans.aligned_sents()[:100]
    >>> ibm = IBMModel1(bitexts, 20)
    >>> with open('model1.pk', 'wb') as fout:
    ...     pickle.dump(ibm, fout)
    ... 
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
    _pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed
    
    >>> import dill
    >>> with open('model1.pk', 'wb') as fout:
    ...     dill.dump(ibm, fout)
    ... 
    >>> exit()
    
    alvas@ubi:~$ python3
    Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
    [GCC 4.8.2] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import dill
    >>> from nltk.corpus import comtrans
    >>> with open('model1.pk', 'rb') as fin:
    ...     ibm = dill.load(fin)
    ... 
    >>> bitexts = comtrans.aligned_sents()[:100]
    >>> aligned_sent = ibm.align(bitexts[0])
    >>> aligned_sent.words
    ['Wiederaufnahme', 'der', 'Sitzungsperiode']
    

    (Note: In Python 3, pickle uses the C implementation that was called cPickle in Python 2, see http://docs.pythonsprints.com/python3_porting/py-porting.html)