Tags: python, machine-learning, nlp, lda, topic-modeling

How do I get the topic of each training document (or predict the topic of new data) with a topic model trained using OCTIS?


I've trained an LDA model for topic modelling using OCTIS, but I don't know how to see the predicted topic for each input document, or how to apply the trained model to new data.

This is the code and the output of the trained model:

Input:

# Custom dataset
from octis.dataset.dataset import Dataset
from octis.models.LDA import LDA  # LDA model class used below

dataset = Dataset()
dataset.load_custom_dataset_from_folder("/content/new")

# Create the model
model = LDA(num_topics=5, alpha=0.1)

# Train the model
output = model.train_model(dataset)

Output:

>> output
{'topic-word-matrix': array([[0.00030817, 0.00646953, 0.00338882, ..., 0.00030812, 0.00030813,
         0.00030812],
        [0.00041419, 0.00248425, 0.0004141 , ..., 0.0004141 , 0.0004141 ,
         0.0004141 ],
        [0.0002584 , 0.00025837, 0.00025836, ..., 0.00025836, 0.00025836,
         0.00025836],
        [0.00044957, 0.0004495 , 0.00044949, ..., 0.00269659, 0.00269643,
         0.00269655],
        [0.00238244, 0.0003972 , 0.00039719, ..., 0.00039719, 0.00039719,
         0.00039719]], dtype=float32),
 'topics': [['vaksin',
   'sertifikat',
   'aplikasi',
   'pertama',
   'alhamdulillah',
   'pedulilindungi',
   'ada',
   'padahal',
   'terimakasih',
   'nik'],
  ['aplikasi',
   'vaksin',
   'sertifikat',
   'guna',
   'data',
   'udh',
   'bikin',
   'pake',
   'login',
   'nik'],
  ['aplikasi',
   'vaksin',
   'sertifikat',
   'ada',
   'tanggal',
   'tgl',
   'di',
   'belum',
   'coba',
   'sangat'],
  ['covid',
   'vaksin',
   'jadi',
   'hp',
   'ada',
   'aplikasi',
   'mau',
   'kalau',
   'sakit',
   'salah'],
  ['aplikasi',
   'bahasa',
   'lahir',
   'bisa',
   'nx',
   'sertifikat',
   'pakai',
   'di',
   'vaksin',
   'indonesia']],
 'topic-document-matrix': array([[0.06667246, 0.98769081, 0.00190489, 0.00165357, 0.98805857,
         0.00392223, 0.00210558, 0.00219824, 0.0029861 , 0.00170955,
         0.00215115, 0.00160036, 0.00210585, 0.00210572, 0.0030779 ,
         0.00289892, 0.00289916, 0.00307764, 0.00317504, 0.00307748,
         0.00183547, 0.10921329, 0.00160071, 0.98933005, 0.00219851,
         0.49730667, 0.98768848, 0.00194217, 0.00194207, 0.99120653,
         0.00160038, 0.00363727, 0.23678468, 0.98545253, 0.00168113,
         0.0016811 , 0.99349433, 0.00229977, 0.00339057, 0.98769081,
         0.00190489, 0.00165355, 0.98805857, 0.00392223, 0.00210558,
         0.00219824, 0.00298613, 0.00170955, 0.00215115, 0.00160037],
        [0.06667251, 0.00307732, 0.00190493, 0.99338686, 0.00298535,
         0.00392244, 0.00210581, 0.00219835, 0.00298549, 0.00170954,
         0.00215096, 0.00160031, 0.00210574, 0.00210569, 0.00307775,
         0.00289909, 0.0028996 , 0.00307763, 0.0031751 , 0.98769003,
         0.00183529, 0.88495934, 0.00160057, 0.00266775, 0.0021984 ,
         0.00224775, 0.00307769, 0.99223095, 0.99223143, 0.00219834,
         0.64113957, 0.98545176, 0.00219818, 0.00363676, 0.00168091,
         0.00168108, 0.00162644, 0.00229943, 0.00339064, 0.00307732,
         0.00190493, 0.99338692, 0.00298535, 0.00392238, 0.00210581,
         0.00219834, 0.00298549, 0.00170954, 0.00215096, 0.00160031],
        [0.06667244, 0.00307741, 0.00190496, 0.0016533 , 0.00298539,
         0.98431122, 0.99157733, 0.99120724, 0.9880569 , 0.00170964,
         0.00215104, 0.99359852, 0.002106  , 0.99157727, 0.00307766,
         0.00289921, 0.9884032 , 0.0681117 , 0.98729986, 0.00307752,
         0.00183526, 0.0019427 , 0.0016006 , 0.00266733, 0.06295873,
         0.49595022, 0.00307819, 0.00194241, 0.00194218, 0.00219866,
         0.35405946, 0.00363706, 0.75662065, 0.00363694, 0.00168095,
         0.99327528, 0.00162655, 0.00229944, 0.00339072, 0.00307741,
         0.00190496, 0.0016533 , 0.00298541, 0.98431122, 0.99157733,
         0.99120724, 0.9880569 , 0.00170964, 0.00215105, 0.99359864],
        [0.06667253, 0.00307718, 0.00190488, 0.00165313, 0.00298533,
         0.00392208, 0.00210552, 0.00219805, 0.00298561, 0.99316174,
         0.00215088, 0.00160023, 0.00210566, 0.00210558, 0.98768842,
         0.9884038 , 0.00289898, 0.00307778, 0.00317501, 0.00307729,
         0.0018353 , 0.00194239, 0.99359751, 0.00266747, 0.93044597,
         0.00224771, 0.0030776 , 0.00194221, 0.00194201, 0.00219822,
         0.00160026, 0.0036369 , 0.00219826, 0.00363692, 0.99327594,
         0.00168124, 0.00162628, 0.00229925, 0.98643756, 0.00307718,
         0.00190488, 0.00165313, 0.00298533, 0.00392208, 0.00210552,
         0.00219805, 0.00298561, 0.99316174, 0.00215088, 0.00160023],
        [0.73331004, 0.00307727, 0.99238032, 0.00165311, 0.00298531,
         0.00392205, 0.00210574, 0.00219817, 0.00298594, 0.00170954,
         0.99139601, 0.00160052, 0.99157673, 0.00210574, 0.00307827,
         0.00289894, 0.00289911, 0.92265522, 0.00317498, 0.00307761,
         0.99265867, 0.00194224, 0.00160059, 0.00266738, 0.00219837,
         0.00224767, 0.00307806, 0.00194227, 0.0019423 , 0.00219831,
         0.00160033, 0.00363697, 0.00219824, 0.00363684, 0.00168108,
         0.00168125, 0.00162644, 0.99080211, 0.00339047, 0.00307727,
         0.99238032, 0.00165311, 0.00298531, 0.00392205, 0.00210574,
         0.00219817, 0.00298594, 0.00170954, 0.99139601, 0.00160043]]),
 'test-topic-document-matrix': array([], dtype=float64)}

My goal is to at least know the topic of each input document (it would be great if I could also predict topics for new data with the trained model!).

  • I used trial data, so the results are not great yet; my main objective is to understand how to do topic modelling with OCTIS.
  • Number of topics = 5 (if this helps).
  • The OCTIS framework is new; it was published last year (2021).

Solution

  • Topics in the training data

    Each topic the model has found is represented by its top 10 words. These can be found in output['topics']. So your first topic is represented by the words: ['vaksin', 'sertifikat', 'aplikasi', 'pertama', 'alhamdulillah', 'pedulilindungi', 'ada', 'padahal', 'terimakasih', 'nik'].

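    To know which topics are found in which document, look at output['topic-document-matrix']. It has one row per topic and one column per document (here 5 rows of 50 values each), so the i-th column is the topic distribution of the i-th training document. Example: the first column is roughly [0.067, 0.067, 0.067, 0.067, 0.733], so the first document mostly consists of topic 5 (the last row). The value 0.98769081 is in the first row, second column, so the second document mostly consists of topic 1.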
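To read off the dominant topic of every training document at once, you can take the argmax over the topic axis. The snippet below is a minimal sketch that uses only NumPy and the first four columns of the matrix printed in the question; with the real output you would pass output['topic-document-matrix'] instead of the hand-copied array. Topic indices are 0-based here, so index 4 is the fifth entry of output['topics'].

```python
import numpy as np

# First four columns of output['topic-document-matrix'] from the question:
# rows are topics, columns are training documents.
topic_document_matrix = np.array([
    [0.06667246, 0.98769081, 0.00190489, 0.00165357],
    [0.06667251, 0.00307732, 0.00190493, 0.99338686],
    [0.06667244, 0.00307741, 0.00190496, 0.0016533],
    [0.06667253, 0.00307718, 0.00190488, 0.00165313],
    [0.73331004, 0.00307727, 0.99238032, 0.00165311],
])

# argmax over axis 0 (the topic axis) gives one 0-based topic index per document.
dominant_topic = topic_document_matrix.argmax(axis=0)
# The matching weight shows how strongly that topic dominates the document.
dominant_weight = topic_document_matrix.max(axis=0)

for doc_id, (topic_id, weight) in enumerate(zip(dominant_topic, dominant_weight)):
    print(f"document {doc_id}: topic {topic_id} (weight {weight:.3f})")
```

A low maximum weight (such as 0.733 for the first document) suggests the document is spread over several topics, so it can be worth inspecting the whole column rather than only the argmax.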

  • Prediction on new documents

    Unfortunately, this is not possible using OCTIS. OCTIS is exclusively a package for optimizing and comparing topic models. You can define a test set to see how a model performs on unseen data (the empty 'test-topic-document-matrix' entry in your output appears to be meant for that), but OCTIS is not suitable for developing production topic models. If that is your goal, take a look at gensim, the package that OCTIS uses behind the scenes.