Tags: python, gensim, word2vec

Gensim word2vec model outputs 1000 dimension ndarray but the maximum number of ndarray dimensions is 32 - how?


I'm trying to use this 1000 dimension wikipedia word2vec model to analyze some documents.

Using introspection I found out that the vector representation of a word is a 1000-dimension numpy.ndarray. However, whenever I try to create an ndarray to find the nearest words, I get a value error:

ValueError: maximum supported dimension for an ndarray is 32, found 1000

and from what I can tell from looking around online, 32 is indeed the maximum supported number of dimensions for an ndarray - so what gives? How is gensim able to output a 1000-dimension ndarray?

Here is some example code:

import numpy as np

doc = [model[word] for word in text if word in model.vocab]
out = []
n = len(doc[0])
print(n)
print(len(model["hello"]))
print(type(doc[0]))
for i in range(n):
    sum = 0
    for d in doc:
        sum += d[i]
    out.append(sum/n)
out = np.ndarray(out)

which outputs:

1000
1000
<class 'numpy.ndarray'>
ValueError: maximum supported dimension for an ndarray is 32, found 1000

The goal here would be to compute the average vector of all words in the corpus in a format that can be used to find nearby words in the model so any alternative suggestions to that effect are welcome.


Solution

  • You're calling numpy's ndarray() constructor with a list of 1000 numbers – your hand-calculated averages for each of the 1000 dimensions.

    The ndarray() function expects its argument to be the shape of the array to construct, so it's trying to create a new array of shape (out[0], out[1], ..., out[999]) – and then every individual value inside that array would be addressed with a 1000-int set of coordinates. And, indeed, numpy arrays can have at most 32 independent dimensions.

    But even if you reduced the list you're supplying to ndarray() to just 32 numbers, you'd still have a problem, because your 32 numbers are floating-point values, and ndarray() expects integral counts for the shape. (You'd get a TypeError.)
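    To see both of those behaviors directly, here's a quick standalone sketch (the shape-tuple example is mine, not from the question):

```python
import numpy as np

# ndarray() treats its first argument as the *shape* of the new array:
a = np.ndarray((2, 3))   # an uninitialized 2x3 array of floats
print(a.shape)           # (2, 3)

# Passing floats as a shape fails, since shape entries must be integers:
try:
    np.ndarray([1.5, 2.5])
except TypeError as e:
    print(type(e).__name__)  # TypeError
```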

    Along the approach you're trying to take – which isn't quite optimal, as we'll get to below – you really want to create a single vector of 1000 floating-point dimensions. That is, 1000 cell-like values – not out[0] * out[1] * ... * out[999] separate cell-like values.

    So a crude fix along the lines of your initial approach could be replacing your last line with either:

    result = np.ndarray(len(out))
    for i in range(len(out)):
        result[i] = out[i]
    

    But there are many ways to incrementally make this more efficient, compact, and idiomatic – a number of which I'll mention below, even though the best approach, at bottom, makes most of these interim steps unnecessary.

    For one, instead of that assignment-loop in my code just above, you could use numpy's slice-assignment:

    result = np.ndarray(len(out))
    result[:] = out  # same result as the previous 3-line loop
    

    But in fact, numpy's array() function can create the necessary numpy-native ndarray directly from a given list, so instead of using ndarray() at all, you could just use array():

    result = np.array(out)  # same result as the previous 2 lines
    

    But further, numpy's many functions for natively working with arrays (and array-like lists) already include things to do averages-of-many-vectors in a single step (where even the looping is hidden inside very-efficient compiled code or CPU bulk-vector operations). For example, there's a mean() function that can average lists of numbers, or multi-dimensional arrays of numbers, or aligned sets of vectors, and so forth.

    This allows faster, clearer, one-liner approaches that can replace your entire original code with something like:

    # get a list of available word-vectors
    doc = [model[word] for word in text if word in model.vocab]
    # average all those vectors
    out = np.mean(doc, axis=0)
    

    (Without the axis argument, it would average all the individual dimension-values, in all slots, into just one single final number.)
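    To illustrate that axis=0 point with a toy example (two 3-dimensional vectors with made-up numbers, standing in for the word-vectors):

```python
import numpy as np

vecs = [np.array([1.0, 2.0, 3.0]),
        np.array([3.0, 4.0, 5.0])]

# axis=0 averages across vectors, slot by slot – one average per dimension:
print(np.mean(vecs, axis=0))  # [2. 3. 4.]

# No axis: every value in every slot is pooled into one grand average:
print(np.mean(vecs))          # 3.0
```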