python tensorflow deep-learning neural-network glove

tf.matmul(X, weight) vs tf.matmul(X, tf.transpose(weight)) in TensorFlow


In a standard ANN, fully connected layers use the following formula: tf.matmul(X, weight) + bias. That is clear to me, since matrix multiplication is what connects the input to the hidden layer.

But in the GloVe implementation (https://nlp.stanford.edu/projects/glove/) the following formula is used to multiply the embeddings: tf.matmul(W, tf.transpose(U)). What confuses me is the tf.transpose(U) part. Why do we use tf.matmul(W, tf.transpose(U)) instead of tf.matmul(W, U)?


Solution

  • It has to do with the choice of column vs row orientation for the vectors.

    Note that weight is the second parameter here:

    tf.matmul(X, weight)
    

    But the first parameter, W, here:

    tf.matmul(W, tf.transpose(U))
    

    So what you are seeing is a practical application of the following matrix transpose identity:

    (AB)^T = B^T A^T
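
    The transpose identity, (AB)^T = B^T A^T, can be checked numerically. The following is only a minimal sketch in plain Python: the helpers `transpose` and `matmul` are hypothetical stand-ins for tf.transpose and tf.matmul, written with nested lists so the snippet runs without TensorFlow, and the matrices A and B are made-up values.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


# Made-up example values, chosen as integers so equality is exact.
A = [[1, 2, 3],
     [4, 5, 6]]      # 2x3
B = [[7, 8],
     [9, 10],
     [11, 12]]       # 3x2

left = transpose(matmul(A, B))               # (AB)^T
right = matmul(transpose(B), transpose(A))   # B^T A^T

print(left == right)  # True
```

    Note that the order of the factors flips on the right-hand side, which is why reversing a multiplication goes hand in hand with transposing its operands.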


    To bring it back to your example, let's assume 10 inputs and 20 outputs.

    The first approach uses row vectors. A single input X would be a 1x10 matrix, called a row vector because it has a single row. To match, the weight matrix needs to be 10x20 to produce an output of size 20.

    But in the second approach the multiplication is reversed. That is a hint that everything is using column vectors. If the multiplication is reversed, then everything gets a transpose. So this example is using column vectors, so named because they have a single column.

    That's why the transpose is there. The way the GloVe authors have done their notation, with the multiplication reversed, the weight matrix W must already be transposed to 20x10 instead of 10x20. And they must be expecting a 20x1 column vector for the output.

    So if the input vector U is naturally a 1x10 row vector, it also has to be transposed, to a 10x1 column vector, to fit in with everything else.
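
    To make the shapes concrete, here is a rough sketch of both conventions for the 10-input, 20-output example. It uses plain Python lists rather than TensorFlow so that it runs anywhere; `transpose`, `matmul`, and `shape` are hypothetical helpers standing in for the corresponding tensor operations, and the numeric values are arbitrary.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


def shape(m):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(m), len(m[0]))


n_in, n_out = 10, 20

# Row-vector convention: input on the left, weights on the right.
X = [list(range(n_in))]                                        # 1x10
weight = [[i + j for j in range(n_out)] for i in range(n_in)]  # 10x20
row_out = matmul(X, weight)                                    # 1x20

# Column-vector convention: weights on the left, input on the right.
W = transpose(weight)              # 20x10, the weights already transposed
U = X                              # the same input, still a 1x10 row vector
col_out = matmul(W, transpose(U))  # 20x1

print(shape(row_out))                 # (1, 20)
print(shape(col_out))                 # (20, 1)
print(transpose(col_out) == row_out)  # True
```

    Both conventions produce the same 20 numbers; only the layout differs, as a row in the first case and as a column in the second.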


    Basically, you should pick either row vectors or column vectors and use that choice consistently; the order of the multiplications and the transposition of the weights are then determined for you.

    Personally I think that column vectors, as used by GloVe, are awkward and unnatural compared to row vectors. It's better to have the multiplication ordering follow the data flow ordering.