I have trained the PunktSentenceTokenizer in NLTK and obtained a pickle file "learnt.pickle":
ccopy_reg
_reconstructor
p0
(cnltk.tokenize.punkt
PunktSentenceTokenizer
p1
c__builtin__
object
p2
Ntp3
Rp4
(dp5
S'_Token'
p6
cnltk.tokenize.punkt
PunktToken
p7
sS'_lang_vars'
p8
g0
(cnltk.tokenize.punkt
PunktLanguageVars
p9
g2
Ntp10
Rp11
I1
bsS'_params'
p12
g0
(cnltk.tokenize.punkt
PunktParameters
p13
g2
Ntp14
Rp15
(dp16
S'sent_starters'
p17
c__builtin__
set
p18
((lp19
tp20
Rp21
sS'collocations'
p22
g18
((lp23
tp24
Rp25
sS'abbrev_types'
p26
g18
((lp27
Vago
p28
aVgca
p29
aVe.g
I have another pickle file "english.pickle":
ccopy_reg
_reconstructor
p0
(cnltk.tokenize.punkt
PunktSentenceTokenizer
p1
c__builtin__
object
p2
Ntp3
Rp4
(dp5
S'_Token'
p6
cnltk.tokenize.punkt
PunktToken
p7
sS'_lang_vars'
p8
g0
(cnltk.tokenize.punkt
PunktLanguageVars
p9
g2
Ntp10
Rp11
I1
bsS'_params'
p12
g0
(cnltk.tokenize.punkt
PunktParameters
p13
g2
Ntp14
Rp15
(dp16
S'sent_starters'
p17
c__builtin__
set
p18
((lp19
Vamong
p20
aVsince
p21
aVthey
p22
aVindeed
p23
aVsome
p24
aVsales
p25
aVin
p26
aVmoreover
p27
aVyet
I want to merge these into a single .pickle file (which must still be usable with the PunktSentenceTokenizer).
I'm using the following code:
import pickle

my_dict_final = {}
with open('english.pickle', 'rb') as f:
    my_dict_final.update(pickle.load(f))
with open('learnt.pickle', 'rb') as f:
    my_dict_final.update(pickle.load(f))

out = open("finaldict.pickle", "wb")
pickle.dump(my_dict_final, out)
out.close()
But it raises this error:
TypeError: 'PunktSentenceTokenizer' object is not iterable
I have no idea what this means (I'm not very good with programming), but I really need a solution.
You can't "merge two pickle files". Pickling is just a file ("serialization") format, so what you can do with the contents depends entirely on the structure of the objects you pickled. In your case, you seem to assume that the (un)pickled objects are dictionaries; in fact they are PunktSentenceTokenizer objects, including their internal frequency tables. That accounts for the TypeError: dict.update() must iterate over its argument, and a tokenizer object is not iterable.
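You can see this for yourself with a quick check (a minimal sketch, assuming both pickle files sit in the current directory and NLTK is installed):

import pickle

with open('english.pickle', 'rb') as f:
    obj = pickle.load(f)

# The unpickled object is a full tokenizer, not a dict:
print(type(obj))  # <class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>

# dict.update() tries to iterate a non-mapping argument, hence the error:
{}.update(obj)  # TypeError: 'PunktSentenceTokenizer' object is not iterable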
The only viable option would be to study the internals of the PunktSentenceTokenizer, find out what needs to be merged, and decide whether there is even any meaningful sense of merging two models.
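If you do want to try that, here is a rough sketch under some loud assumptions: it reaches into the private _params attribute and unions only the set-valued fields visible in your pickle dumps (abbrev_types, collocations, sent_starters), it deliberately skips the orthographic-context frequency counts, and it may break on other NLTK versions:

import pickle

# Load both trained tokenizers.
with open('english.pickle', 'rb') as f:
    base = pickle.load(f)
with open('learnt.pickle', 'rb') as f:
    custom = pickle.load(f)

# Union the set-valued Punkt parameters from the custom model into the base.
# NOTE: relies on private internals; ortho_context is not merged.
for attr in ('abbrev_types', 'collocations', 'sent_starters'):
    getattr(base._params, attr).update(getattr(custom._params, attr))

with open('finaldict.pickle', 'wb') as out:
    pickle.dump(base, out)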
But for your (apparent) intended use, I recommend simply concatenating your custom training corpus with a large corpus of normally-punctuated English (e.g., the gutenberg corpus or any other collection of plain-text files), and training a single sentence detection model on the combined data.
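A minimal sketch of that approach (my_corpus.txt is a placeholder for your own training text, and you may need nltk.download('gutenberg') first):

import pickle
from nltk.corpus import gutenberg
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Read your custom training text (placeholder filename).
with open('my_corpus.txt') as f:
    custom_text = f.read()

# Combine it with a large corpus of normally-punctuated English.
train_text = gutenberg.raw() + '\n\n' + custom_text

# Passing training text to the constructor trains the model in one step.
tokenizer = PunktSentenceTokenizer(train_text)

with open('finaldict.pickle', 'wb') as out:
    pickle.dump(tokenizer, out)

print(tokenizer.tokenize("Dr. Smith arrived at 5 p.m. He was late."))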