Tags: python, nlp, multiprocessing, gensim

Gensim ensemblelda multiprocessing: index -1 is out of bounds for axis 0 with size 0


I'm using the gensim library for topic modelling, more precisely the Ensemble LDA method. My code is fairly standard (I follow the documentation); the main part is:

    model = models.EnsembleLda(corpus=corpus,
                               id2word=id2word,
                               num_topics=ntopics,
                               passes=2,
                               iterations=200,
                               num_models=ncores,
                               topic_model_class=models.LdaModel,
                               ensemble_workers=nworkers,
                               distance_workers=ncores)

(full code at https://github.com/erwanm/gensim-temporary/blob/main/gensim-topics.py)

With my data I sometimes get the error below. However, it often runs correctly on a subset of the data, so I don't know whether the problem is related to my data.

Process Process-52:
Traceback (most recent call last):
  File "/home/moreaue/anaconda3/envs/twarc2/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/moreaue/anaconda3/envs/twarc2/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/moreaue/anaconda3/envs/twarc2/lib/python3.10/site-packages/gensim/models/ensemblelda.py", line 534, in _asymmetric_distance_matrix_worker
    distance_chunk = _calculate_asymmetric_distance_matrix_chunk(
  File "/home/moreaue/anaconda3/envs/twarc2/lib/python3.10/site-packages/gensim/models/ensemblelda.py", line 491, in _calculate_asymmetric_distance_matrix_chunk
    mask = masking_method(ttd1, masking_threshold)
  File "/home/moreaue/anaconda3/envs/twarc2/lib/python3.10/site-packages/gensim/models/ensemblelda.py", line 265, in mass_masking
    smallest_valid = sorted_a[largest_mass][-1]
IndexError: index -1 is out of bounds for axis 0 with size 0

The error seems related to multiprocessing, since EnsembleLda runs a number of worker processes (each running one instance of LDA).

What can cause this error? Any advice on how I can fix it?


Solution

  • From reading the source code, the mass_masking function simply returns a boolean array (binary mask) over a. It sorts a in descending order, takes the largest elements whose cumulative sum stays below the threshold (default 0.95), and marks as True every element of a that is at least as large as the smallest of those elements.

    The line sorted_a[largest_mass][-1] causes the problem because nothing in mass_masking checks that the largest element of a (that is, of ttd1, the array passed in as a) is below the threshold. Whenever the largest value of ttd1 is equal to or above the threshold, every entry of largest_mass is False, so sorted_a[largest_mass] is an empty array, and indexing it with [-1] raises the IndexError. The minimal examples below showcase this issue.

    ttd1 is created in the _calculate_asymmetric_distance_matrix_chunk() function as it enumerates over ttda1. For each ttd1 it calls masking_method(ttd1, masking_threshold), which invokes mass_masking, the default masking method set when the EnsembleLda class is initialised. I suspect this bug shows up mostly during multiprocessing because the _calculate_asymmetric_distance_matrix_multiproc() function starts each Process with _asymmetric_distance_matrix_worker() as its target. That worker slices entire_ttda to create ttda1, which it passes to the chunk function, and again nothing checks that the slice contains no element equal to or larger than the threshold.

    Here is a minimal working example of how mass_masking should function:

    import numpy as np
    
    
    def mass_masking(a, threshold=None):
        """Original masking method. Returns a new binary mask."""
        if threshold is None:
            threshold = 0.95
    
        sorted_a = np.sort(a)[::-1]  # [0.5   0.449 0.2   0.1  ]
        # sorted_a.cumsum() outputs [0.5   0.949 1.149 1.249] then threshold is taken
        largest_mass = sorted_a.cumsum() < threshold  # [ True  True False False]
        # sorted_a[largest_mass] outputs [0.5   0.449] then last element is taken
        smallest_valid = sorted_a[largest_mass][-1]  # 0.449
        return a >= smallest_valid # [ True False False  True]
    
    
    test_arr = np.array([0.5, 0.2, 0.1, 0.449])
    print(mass_masking(test_arr))
    # Output: [ True False False  True]
    

    When a's largest element is greater than or equal to 0.95, an IndexError is raised:

    import numpy as np
    
    
    def mass_masking(a, threshold=None):
        """Original masking method. Returns a new binary mask."""
        if threshold is None:
            threshold = 0.95
    
        sorted_a = np.sort(a)[::-1]  # [0.96  0.5   0.449 0.2  ]
        # sorted_a.cumsum() outputs [0.96  1.46  1.909 2.109] then threshold is taken
        largest_mass = sorted_a.cumsum() < threshold  # [False False False False]
        # sorted_a[largest_mass] outputs [] then last element is taken
        smallest_valid = sorted_a[largest_mass][-1]  # IndexError
        return a >= smallest_valid # function fails to return binary mask
    
    
    test_arr = np.array([0.5, 0.2, 0.96, 0.449])
    print(mass_masking(test_arr))
    # Output: IndexError: index -1 is out of bounds for axis 0 with size 0
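
    For comparison, here is a minimal sketch of one way the function could guard against the empty selection. This is not gensim's code: the name mass_masking_guarded and the fallback rule (keep only the largest value when no prefix of the sorted values stays below the threshold) are assumptions made for illustration, and the sketch assumes a is non-empty.

```python
import numpy as np


def mass_masking_guarded(a, threshold=0.95):
    """Hypothetical guarded variant of mass_masking (not part of gensim)."""
    a = np.asarray(a, dtype=float)
    sorted_a = np.sort(a)[::-1]
    largest_mass = sorted_a.cumsum() < threshold
    if not largest_mass.any():
        # No prefix of the sorted values stays below the threshold,
        # so fall back to keeping only the largest value.
        smallest_valid = sorted_a[0]
    else:
        smallest_valid = sorted_a[largest_mass][-1]
    return a >= smallest_valid


print(mass_masking_guarded(np.array([0.5, 0.2, 0.1, 0.449])))
# Output: [ True False False  True]
print(mass_masking_guarded(np.array([0.5, 0.2, 0.96, 0.449])))
# Output: [False False  True False]
```

    The first call behaves exactly like the original function; the second returns a mask that selects only the dominant element instead of raising.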
    

    Knowing this, and assuming I haven't misinterpreted how the calculate_asymmetric_distance_matrix family of functions works during multiprocessing, you can either check yourself that no values are greater than or equal to the default threshold, or set masking_threshold higher when initialising EnsembleLda.
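
    If you can get hold of the topic-term distributions as a 2-D NumPy array (one row per topic), a check along the following lines would show which rows could trigger the error. This is only a sketch: the helper name rows_exceeding_threshold and the example matrix are made up, and whether the array is accessible on your model at the point of failure is an assumption you would need to verify.

```python
import numpy as np


def rows_exceeding_threshold(ttda, threshold=0.95):
    """Return indices of rows whose largest value is >= threshold (hypothetical helper)."""
    ttda = np.asarray(ttda, dtype=float)
    return np.flatnonzero(ttda.max(axis=1) >= threshold)


# Made-up topic-term matrix: one row per topic, one column per term.
example_ttda = np.array([
    [0.50, 0.20, 0.10, 0.20],
    [0.97, 0.01, 0.01, 0.01],  # a single term carries almost all the mass
    [0.25, 0.25, 0.25, 0.25],
])
print(rows_exceeding_threshold(example_ttda))
# Output: [1]
```

    Any row reported here has a largest value at or above the threshold, which is exactly the case in which sorted_a[largest_mass] comes back empty.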