Tags: python, arrays, concatenation, word2vec, word-embedding

ValueError: need at least one array to concatenate in Top2Vec Error


docs = ['Consumer discretionary, healthcare and technology are preferred China equity sectors.', 'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus should reinforce the Chinese consumption theme.', 'The healthcare sector should be a key beneficiary of the coronavirus outbreak, on the back of increased demand for healthcare services and drugs.', 'The technology sector should benefit from increased demand for cloud services and hardware demand as China continues to recover from the coronavirus outbreak.', 'China consumer discretionary sector is preferred. In our assessment, the sector is likely to outperform the MSCI China Index in the coming 6-12 months.']

model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')

While running the above command, I'm getting an error whose message doesn't make the root cause obvious. What could be causing it?

Error:

2021-01-19 05:17:08,541 - top2vec - INFO - Pre-processing documents for training
2021-01-19 05:17:08,562 - top2vec - INFO - Downloading universal-sentence-encoder model
2021-01-19 05:17:13,250 - top2vec - INFO - Creating joint document/word embedding
WARNING:tensorflow:5 out of the last 6 calls to <function recreate_function..restored_function_body at 0x7f8c4ce57d90> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
2021-01-19 05:17:13,548 - top2vec - INFO - Creating lower dimension embedding of documents
2021-01-19 05:17:15,809 - top2vec - INFO - Finding dense areas of documents
2021-01-19 05:17:15,823 - top2vec - INFO - Finding topics

ValueError                                Traceback (most recent call last)
in ()
----> 1 model = Top2Vec(docs, embedding_model = 'universal-sentence-encoder')

2 frames
<__array_function__ internals> in vstack(*args, **kwargs)

/usr/local/lib/python3.6/dist-packages/numpy/core/shape_base.py in vstack(tup)
    281     if not isinstance(arrs, list):
    282         arrs = [arrs]
--> 283     return _nx.concatenate(arrs, 0)
    284
    285

<__array_function__ internals> in concatenate(*args, **kwargs)

ValueError: need at least one array to concatenate
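The last frame points at the root cause: with such a small corpus, Top2Vec's clustering step finds no dense topic clusters, so it ends up calling `np.vstack` on an empty list of topic vectors. A minimal reproduction of just the NumPy failure (independent of Top2Vec):

```python
import numpy as np

# np.vstack delegates to np.concatenate; an empty list of arrays
# raises the same ValueError seen in the Top2Vec traceback.
try:
    np.vstack([])
except ValueError as e:
    print(e)  # need at least one array to concatenate
```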


Solution

  • You need more docs and more unique words for it to find at least 2 topics. As an example, I just multiplied your list by 10 and it works:

    from top2vec import Top2Vec
    
    docs = ['Consumer discretionary, healthcare and technology are preferred China equity  sectors.',
    'Consumer discretionary remains attractive, supported by China’s policy to revitalize domestic consumption. Prospects of further monetary and fiscal stimulus  should reinforce the Chinese consumption theme.',
    'The healthcare sector should be a key beneficiary of the coronavirus outbreak,  on the back of increased demand for healthcare services and drugs.',
    'The technology sector should benefit from increased demand for cloud services  and hardware demand as China continues to recover from the coronavirus  outbreak.',
    'China consumer discretionary sector is preferred. In our assessment, the sector  is likely to outperform the MSCI China Index in the coming 6-12 months.']
    
    docs = docs*10 
    model = Top2Vec(docs, embedding_model='universal-sentence-encoder')
    print(model)
    

    <top2vec.Top2Vec.Top2Vec object at 0x13eef6210>

    I had a few (30) long docs of up to 130,000 characters, so I just split them into smaller docs of 5,000 characters each:

    
    skip_n = 5000  # chunk length in characters
    docs_split = []
    for doc in docs:
        # iterate over the actual document length rather than a hardcoded
        # 130000, so shorter documents don't append empty chunks
        for i in range(0, len(doc), skip_n):
            docs_split.append(doc[i:i+skip_n])
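The same chunking idea can be packaged as a small helper. This is just a sketch (the function name `chunk_doc` is illustrative, not part of Top2Vec's API):

```python
def chunk_doc(doc, chunk_size=5000):
    """Split a document into consecutive chunks of at most chunk_size characters."""
    return [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]

# Example: a 12-character "document" split into chunks of 5 characters.
print(chunk_doc("abcdefghijkl", chunk_size=5))  # ['abcde', 'fghij', 'kl']
```

Note that slicing never raises on the final partial chunk, and an empty document simply yields an empty list, so no empty strings are fed to Top2Vec.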