neural-network nlp word2vec autoencoder

Can I train Word2vec using a Stacked Autoencoder with non-linearities?


Every time I read about Word2vec, the embedding is obtained with a very simple Autoencoder: just one hidden layer, a linear activation in the hidden layer, and a softmax at the output layer.
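
For concreteness, this is roughly the shallow setup I mean, sketched in PyTorch (the class name and layer sizes are just illustrative, not taken from any particular implementation):

```python
import torch.nn as nn

class ShallowWord2vec(nn.Module):
    """One linear hidden layer + softmax output, as in the usual Word2vec picture."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        # input -> hidden: a plain linear projection (embedding lookup), no non-linearity
        self.embed = nn.Embedding(vocab_size, embedding_dim)
        # hidden -> output: scores over the vocabulary
        self.out = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_word_ids):
        hidden = self.embed(center_word_ids)   # linear hidden layer
        return self.out(hidden)                # softmax is applied inside the loss

model = ShallowWord2vec(vocab_size=10_000, embedding_dim=300)
loss_fn = nn.CrossEntropyLoss()  # log-softmax + NLL over the predicted context word
```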

My question is: why can't I train a Word2vec model using a stacked Autoencoder, with several hidden layers and fancier activation functions? (The softmax at the output layer would be kept, of course.)

I have never found any explanation of this, so any hint is welcome.


Solution

  • Word vectors are nothing but the hidden states of a neural network that is trying to get good at something.

    To answer your question: of course you can (see the sketch after this list).

    And if you are going to do that, why not use fancier networks/encoders as well, such as a BiLSTM or a Transformer?

    This is what the people who created things like ELMo and BERT did (though their networks were a lot fancier).
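
    As a concrete illustration of "stacking", here is the same skip-gram-style setup with a deeper, non-linear encoder in front of the softmax. The layer sizes, activation choices, and names are my own assumptions, not a reference implementation:

    ```python
    import torch
    import torch.nn as nn

    class StackedWord2vec(nn.Module):
        """Same objective as plain Word2vec, but with extra non-linear hidden layers."""
        def __init__(self, vocab_size, embedding_dim, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_dim)
            # the "stacked" part: several hidden layers with non-linear activations
            self.encoder = nn.Sequential(
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, embedding_dim),
                nn.Tanh(),
            )
            self.out = nn.Linear(embedding_dim, vocab_size)  # softmax kept at the output

        def forward(self, center_word_ids):
            h = self.encoder(self.embed(center_word_ids))
            return self.out(h)  # feed these logits to CrossEntropyLoss as usual

        def word_vectors(self):
            # the "word vectors" are whatever hidden state you choose to read off,
            # e.g. the encoder output for every word id in the vocabulary
            with torch.no_grad():
                ids = torch.arange(self.embed.num_embeddings)
                return self.encoder(self.embed(ids))
    ```

    Training would proceed exactly as with the shallow version (predict context words with a cross-entropy loss); only the encoder in the middle gets deeper, and you decide which hidden state to keep as the embedding.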