
Wait. BoW and Contextual Embeddings have different sizes


Working with the OCTIS package, I am running a CTM topic model on the BBC (default) dataset.

from octis.dataset.dataset import Dataset
from octis.models.CTM import CTM

datasets = ['BBC_news', '20NewsGroup', 'DBLP', 'M10']
num_topics = list(range(5, 101, 5))  # 5, 10, ..., 100 topics

ALGORITHM = CTM

def create_topic_dict(algorithm):
    run_dict = dict()
    for data in datasets:
        data_dict = dict()
        for top in num_topics:
            # fetch the dataset and train one model per topic count
            dataset = Dataset()
            dataset.fetch_dataset(data)
            model = algorithm(num_topics=top)
            trained_model = model.train_model(dataset)
            data_dict[top] = trained_model
        run_dict[data] = data_dict
    return run_dict

topic_dict = dict()
for run in range(10):  # actual run count not shown here; 10 is a placeholder
    print(f"Run: {run}")
    topic_dict[run] = create_topic_dict(ALGORITHM)

As shown above, I call this function several times to make my results more robust. However, the exception below is raised already during the first call:

Run: 0

Batches:   0%|          | 0/16 [00:00<?, ?it/s]
Batches:   6%|▋         | 1/16 [00:37<09:21, 37.42s/it]
Batches:  12%|█▎        | 2/16 [01:13<08:36, 36.87s/it]
Batches:  19%|█▉        | 3/16 [01:50<07:56, 36.66s/it]
Batches:  25%|██▌       | 4/16 [02:26<07:19, 36.65s/it]
Batches:  31%|███▏      | 5/16 [03:03<06:43, 36.65s/it]
Batches:  38%|███▊      | 6/16 [03:40<06:05, 36.59s/it]
Batches:  44%|████▍     | 7/16 [04:16<05:28, 36.55s/it]
Batches:  50%|█████     | 8/16 [04:52<04:51, 36.44s/it]
Batches:  56%|█████▋    | 9/16 [05:26<04:09, 35.65s/it]
Batches:  62%|██████▎   | 10/16 [05:53<03:17, 32.94s/it]
Batches:  69%|██████▉   | 11/16 [06:19<02:33, 30.71s/it]
Batches:  75%|███████▌  | 12/16 [06:41<01:52, 28.12s/it]
Batches:  81%|████████▏ | 13/16 [07:02<01:18, 26.09s/it]
Batches:  88%|████████▊ | 14/16 [07:21<00:47, 23.87s/it]
Batches:  94%|█████████▍| 15/16 [07:37<00:21, 21.57s/it]
Batches: 100%|██████████| 16/16 [07:44<00:00, 17.16s/it]
Batches: 100%|██████████| 16/16 [07:44<00:00, 29.04s/it]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]
Batches:  25%|██▌       | 1/4 [00:36<01:50, 36.80s/it]
Batches:  50%|█████     | 2/4 [01:14<01:14, 37.15s/it]
Batches:  75%|███████▌  | 3/4 [01:41<00:32, 32.63s/it]
Batches: 100%|██████████| 4/4 [01:46<00:00, 21.82s/it]
Batches: 100%|██████████| 4/4 [01:46<00:00, 26.67s/it]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]
Batches:  25%|██▌       | 1/4 [00:34<01:42, 34.31s/it]
Batches:  50%|█████     | 2/4 [01:08<01:08, 34.46s/it]
Batches:  75%|███████▌  | 3/4 [01:35<00:31, 31.02s/it]
Batches: 100%|██████████| 4/4 [01:40<00:00, 20.53s/it]
Batches: 100%|██████████| 4/4 [01:40<00:00, 25.07s/it]
Traceback (most recent call last):
  File "/hpc/uu_ics_ads/erijcken/ACL/CTM_ACL_HPC.py", line 38, in <module>
    topic_dict[run] = create_topic_dict(ALGORITHM)
  File "/hpc/uu_ics_ads/erijcken/ACL/CTM_ACL_HPC.py", line 28, in create_topic_dict
    trained_model = model.train_model(dataset)
  File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/CTM.py", line 95, in train_model
    x_train, x_test, x_valid, input_size = self.preprocess(
  File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/CTM.py", line 177, in preprocess
    train_data = dataset.CTMDataset(x_train.toarray(), b_train, idx2token)
  File "/home/uu_ics_ads/erijcken/.local/lib/python3.8/site-packages/octis/models/contextualized_topic_models/datasets/dataset.py", line 17, in __init__
    raise Exception("Wait! BoW and Contextual Embeddings have different sizes! "
Exception: Wait! BoW and Contextual Embeddings have different sizes! You might want to check if the BoW preparation method has removed some documents. 

Why do I get this exception, and what can I do to resolve it? As far as I can tell, I have taken all the steps needed to run the model.


Solution

  • I'm one of the developers of OCTIS.

    Short answer: if I understood your problem correctly, you can fix this issue by making the "bert_path" parameter of CTM dataset-specific, e.g. CTM(bert_path="path/to/store/the/files/" + data).

    Longer answer: I think the problem is that CTM generates the contextualized document representations and stores them in files with a default name. If those files already exist, CTM reuses them without generating new representations, even if the dataset has changed in the meantime. It then raises this exception because it combines the bag-of-words representation of one dataset with the contextualized representations of another, and the two have different dimensions. Naming the files after the dataset ensures that the model retrieves the matching representations, as shown in the sketch below.
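    For example, applied to the training function from the question, the only change is the model construction (the path prefix is just an illustration; any writable, dataset-specific location works):

        def create_topic_dict(algorithm):
            run_dict = dict()
            for data in datasets:
                data_dict = dict()
                for top in num_topics:
                    dataset = Dataset()
                    dataset.fetch_dataset(data)
                    # store the contextualized representations per dataset, so
                    # cached files of one dataset are never reused for another
                    model = algorithm(num_topics=top,
                                      bert_path="path/to/store/the/files/" + data)
                    trained_model = model.train_model(dataset)
                    data_dict[top] = trained_model
                run_dict[data] = data_dict
            return run_dict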

    If you run into other issues, please open a GitHub issue in the repo; I only found this question by chance.
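    As an alternative, if you prefer to keep a single path, you could delete the cached representation files before switching datasets, so that CTM regenerates them from scratch. A rough sketch, assuming the cached files end up under the location you pass as "bert_path":

        import glob
        import os

        # remove previously cached CTM representation files;
        # "path/to/store/the/files/" stands for your bert_path location
        for cached in glob.glob("path/to/store/the/files/" + "*"):
            os.remove(cached)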