python, topic-modeling, graphlab

Saving Graphlab LDA model turns topics into gibberish?


Ok, this is just plain wacky. I think the problem may have been introduced by a recent GraphLab update (I've never seen the issue before, but I'm not sure). Anyway, check this out:

import graphlab as gl

corpus = gl.SArray('path/to/corpus_data')
lda_model = gl.topic_model.create(dataset=corpus, num_topics=10,
                                  num_iterations=50, alpha=1.0, beta=0.1)
lda_model.get_topics(num_words=3).print_rows(30)

+-------+---------------+------------------+
| topic |      word     |      score       |
+-------+---------------+------------------+
|   0   |     Music     | 0.0195325651638  |
|   0   |      Love     | 0.0120906781994  |
|   0   |  Photography  | 0.00936914065591 |
|   1   |     Recipe    | 0.0205673829742  |
|   1   |      Food     | 0.0202932111556  |
|   1   |     Sugar     | 0.0162560126511  |
|   2   |    Business   | 0.0223993672813  |
|   2   |    Science    | 0.0164027313084  |
|   2   |   Education   | 0.0139221301443  |
|   3   |    Science    | 0.0134658216431  |
|   3   |   Video_game  | 0.0113924173881  |
|   3   |      NASA     | 0.0112188654905  |
|   4   | United_States | 0.0127908290673  |
|   4   |   Automobile  | 0.00888669047383 |
|   4   |   Australia   | 0.00854809547772 |
|   5   |    Disease    | 0.00704245203928 |
|   5   |     Earth     | 0.00693360028027 |
|   5   |    Species    | 0.00648700544757 |
|   6   |    Religion   | 0.0142311765509  |
|   6   |      God      | 0.0139990904439  |
|   6   |     Human     | 0.00765681454222 |
|   7   |     Google    | 0.0198547267697  |
|   7   |    Internet   | 0.0191105480317  |
|   7   |    Computer   | 0.0179914269911  |
|   8   |      Art      | 0.0378733245262  |
|   8   |     Design    | 0.0223646138082  |
|   8   |     Artist    | 0.0142755732766  |
|   9   |      Film     | 0.0205971724156  |
|   9   |     Earth     | 0.0125386246077  |
|   9   |   Television  | 0.0102082224947  |
+-------+---------------+------------------+

Ok, even without knowing anything about my corpus, these topics are at least kinda comprehensible, insofar as the top terms per topic are more or less related.

But now if I simply save and reload the model, the topics completely change (into nonsense, as far as I can tell):

lda_model.save('test')
lda_model = gl.load_model('test')
lda_model.get_topics(num_words=3).print_rows(30)

+-------+-----------------------+-------------------+
| topic |          word         |       score       |
+-------+-----------------------+-------------------+
|   0   |      Cleanliness      |  0.00468171463384 |
|   0   |      Chicken_soup     |  0.00326753275774 |
|   0   | The_Language_Instinct |  0.00314506174959 |
|   1   |      Equalization     |  0.0015724652078  |
|   1   |    Financial_crisis   |  0.00132675410371 |
|   1   |    Tulsa,_Oklahoma    |  0.00118899041288 |
|   2   |        Batoidea       |  0.00142300468887 |
|   2   |       Abbottabad      |  0.0013474225953  |
|   2   |   Migration_humaine   |  0.00124284781396 |
|   3   |     Gewürztraminer    |  0.00147470845039 |
|   3   |         Indore        |  0.00107223358321 |
|   3   |     White_wedding     |  0.00104791136102 |
|   4   |        Bregenz        |  0.00130871351963 |
|   4   |       Carl_Jung       | 0.000879345016186 |
|   4   |           ภ           | 0.000855001542873 |
|   5   |        18e_eeuw       | 0.000950866105797 |
|   5   |      Vesuvianite      | 0.000832367570269 |
|   5   |      Gary_Kirsten     | 0.000806410748201 |
|   6   |  Sunday_Bloody_Sunday | 0.000828552346797 |
|   6   |  Linear_cryptanalysis | 0.000681188343324 |
|   6   |     Clothing_sizes    |  0.00066708652481 |
|   7   |          Mile         | 0.000759081990574 |
|   7   |  Pinwheel_calculator  | 0.000721971708181 |
|   7   |       Third_Age       | 0.000623010955132 |
|   8   |   Tennessee_Williams  | 0.000597449568381 |
|   8   |         Levite        | 0.000551338743949 |
|   8   |   Time_Out_(company)  | 0.000536667117994 |
|   9   |     David_Deutsch     | 0.000543813843275 |
|   9   | Honing_(metalworking) |  0.00044496051774 |
|   9   |   Clearing_(finance)  | 0.000431699705779 |
+-------+-----------------------+-------------------+

Any idea what could possibly be happening here? save should just pickle the model, so I don't see where the weirdness could creep in, but somehow the topic distributions are getting totally changed around in some non-obvious way. I've verified this on two different machines (Linux and Mac), with similar weird results.
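As a stopgap, I've been persisting the topic table itself, since that's the readable output that's getting mangled. A minimal sketch, assuming get_topics returns an ordinary SFrame ('topics_snapshot' is just a placeholder path):

import graphlab as gl

# Stopgap: save the interpretable topic table separately, since the
# model itself doesn't survive a save/load round trip.
topics = lda_model.get_topics(num_words=3)  # SFrame: topic, word, score
topics.save('topics_snapshot')              # placeholder path

# Later, reload the table (not the model):
snapshot = gl.load_sframe('topics_snapshot')

That obviously doesn't rescue the model itself, but it at least keeps the readable topics around.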

EDIT

Downgrading GraphLab Create from 1.7.1 to 1.6.1 seems to resolve the issue, but that's not a real solution. I don't see anything obvious in the 1.7.1 release notes to explain what happened, and I'd like this to work in 1.7.1 if possible...
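For anyone comparing environments, a quick way to confirm which build is actually loaded (this assumes graphlab exposes its version string as gl.version, which is my recollection):

import graphlab as gl

# Confirm which build is loaded; the bug shows up for me in 1.7.1
# but not in 1.6.1.
print(gl.version)  # e.g. '1.7.1'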


Solution

  • This is a bug in GraphLab Create 1.7.1. It has been fixed in GraphLab Create 1.8. After upgrading, a quick round-trip check like the sketch below confirms that topics survive save/load.
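A minimal verification sketch, reusing the corpus and parameters from the question ('test' is just a placeholder path):

import graphlab as gl

corpus = gl.SArray('path/to/corpus_data')
model = gl.topic_model.create(dataset=corpus, num_topics=10,
                              num_iterations=50, alpha=1.0, beta=0.1)

# Topics should now be identical before and after save/load.
before = list(model.get_topics(num_words=3)['word'])
model.save('test')
after = list(gl.load_model('test').get_topics(num_words=3)['word'])
assert before == after, "topics changed across save/load"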