OK, this is just plain wacky. I think the problem may have been introduced by a recent GraphLab update, because I've never seen the issue before, but I'm not sure. Anyway, check this out:
import graphlab as gl
corpus = gl.SArray('path/to/corpus_data')
lda_model = gl.topic_model.create(dataset=corpus, num_topics=10, num_iterations=50, alpha=1.0, beta=0.1)
lda_model.get_topics(num_words=3).print_rows(30)
+-------+---------------+------------------+
| topic | word | score |
+-------+---------------+------------------+
| 0 | Music | 0.0195325651638 |
| 0 | Love | 0.0120906781994 |
| 0 | Photography | 0.00936914065591 |
| 1 | Recipe | 0.0205673829742 |
| 1 | Food | 0.0202932111556 |
| 1 | Sugar | 0.0162560126511 |
| 2 | Business | 0.0223993672813 |
| 2 | Science | 0.0164027313084 |
| 2 | Education | 0.0139221301443 |
| 3 | Science | 0.0134658216431 |
| 3 | Video_game | 0.0113924173881 |
| 3 | NASA | 0.0112188654905 |
| 4 | United_States | 0.0127908290673 |
| 4 | Automobile | 0.00888669047383 |
| 4 | Australia | 0.00854809547772 |
| 5 | Disease | 0.00704245203928 |
| 5 | Earth | 0.00693360028027 |
| 5 | Species | 0.00648700544757 |
| 6 | Religion | 0.0142311765509 |
| 6 | God | 0.0139990904439 |
| 6 | Human | 0.00765681454222 |
| 7 | Google | 0.0198547267697 |
| 7 | Internet | 0.0191105480317 |
| 7 | Computer | 0.0179914269911 |
| 8 | Art | 0.0378733245262 |
| 8 | Design | 0.0223646138082 |
| 8 | Artist | 0.0142755732766 |
| 9 | Film | 0.0205971724156 |
| 9 | Earth | 0.0125386246077 |
| 9 | Television | 0.0102082224947 |
+-------+---------------+------------------+
Ok, even without knowing anything about my corpus, these topics are at least kinda comprehensible, insofar as the top terms per topic are more or less related.
But now, if I simply save and reload the model, the topics completely change (to nonsense, as far as I can tell):
lda_model.save('test')
lda_model = gl.load_model('test')
lda_model.get_topics(num_words=3).print_rows(30)
+-------+-----------------------+-------------------+
| topic | word | score |
+-------+-----------------------+-------------------+
| 0 | Cleanliness | 0.00468171463384 |
| 0 | Chicken_soup | 0.00326753275774 |
| 0 | The_Language_Instinct | 0.00314506174959 |
| 1 | Equalization | 0.0015724652078 |
| 1 | Financial_crisis | 0.00132675410371 |
| 1 | Tulsa,_Oklahoma | 0.00118899041288 |
| 2 | Batoidea | 0.00142300468887 |
| 2 | Abbottabad | 0.0013474225953 |
| 2 | Migration_humaine | 0.00124284781396 |
| 3 | Gewürztraminer | 0.00147470845039 |
| 3 | Indore | 0.00107223358321 |
| 3 | White_wedding | 0.00104791136102 |
| 4 | Bregenz | 0.00130871351963 |
| 4 | Carl_Jung | 0.000879345016186 |
| 4 | ภ | 0.000855001542873 |
| 5 | 18e_eeuw | 0.000950866105797 |
| 5 | Vesuvianite | 0.000832367570269 |
| 5 | Gary_Kirsten | 0.000806410748201 |
| 6 | Sunday_Bloody_Sunday | 0.000828552346797 |
| 6 | Linear_cryptanalysis | 0.000681188343324 |
| 6 | Clothing_sizes | 0.00066708652481 |
| 7 | Mile | 0.000759081990574 |
| 7 | Pinwheel_calculator | 0.000721971708181 |
| 7 | Third_Age | 0.000623010955132 |
| 8 | Tennessee_Williams | 0.000597449568381 |
| 8 | Levite | 0.000551338743949 |
| 8 | Time_Out_(company) | 0.000536667117994 |
| 9 | David_Deutsch | 0.000543813843275 |
| 9 | Honing_(metalworking) | 0.00044496051774 |
| 9 | Clearing_(finance) | 0.000431699705779 |
+-------+-----------------------+-------------------+
Any idea what could possibly be happening here? save should just pickle the model, so I don't see where the weirdness is coming from, yet somehow the topic distributions are getting completely changed around in some non-obvious way. I've verified this on two different machines (Linux and Mac), with similar weird results.
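For what it's worth, here's the minimal check I've been using to confirm the topics genuinely change across the round trip (top_pairs, before, and after are just my own diagnostic names, not anything from the GraphLab API; only get_topics, save, and load_model come from the library):

def top_pairs(model, num_words=3):
    # get_topics() returns an SFrame with 'topic', 'word', and 'score' columns
    return set((row['topic'], row['word']) for row in model.get_topics(num_words=num_words))

before = top_pairs(lda_model)
lda_model.save('test')
reloaded = gl.load_model('test')
after = top_pairs(reloaded)

# On 1.7.1 almost no (topic, word) pairs survive the round trip;
# on 1.6.1 the two sets appear to match.
print('%d of %d top (topic, word) pairs survive save/load' % (len(before & after), len(before)))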
EDIT
Downgrading GraphLab from 1.7.1 to 1.6.1 seems to resolve this issue, but that's not a real solution. I don't see anything obvious in the 1.7.1 release notes to explain what happened, and I'd like this to work in 1.7.1 if possible...
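In the meantime, my stopgap (assuming you only need the topic/word/score table rather than the full model object) is to persist the get_topics() output as a plain SFrame, which round-trips fine even when the model itself doesn't:

topics_table = lda_model.get_topics(num_words=50)  # keep more words per topic than the 3 shown above
topics_table.save('lda_topics_sframe')             # saved in SFrame binary format
# ... later, or on another machine:
topics_table = gl.SFrame('lda_topics_sframe')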
This is a bug in GraphLab Create 1.7.1. It has now been fixed in GraphLab Create 1.8.