Every time I read about Word2vec, the embedding is obtained with a very simple autoencoder: just one hidden layer, a linear activation for the hidden layer, and a softmax for the output layer.
My question is: why can't I train a Word2vec model using a stacked autoencoder, with several hidden layers and fancier activation functions? (The softmax at the output would be kept, of course.)
I have never found an explanation of this, so any hint is welcome.
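For concreteness, this is roughly the setup I mean, as a minimal sketch (PyTorch and the layer sizes are just my assumptions, not any reference implementation):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300  # illustrative sizes

class ShallowWord2vec(nn.Module):
    """One linear hidden layer; its weights are the word vectors."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # linear "hidden" layer
        self.out = nn.Linear(embed_dim, vocab_size)       # scores fed to the softmax

    def forward(self, center_ids):
        return self.out(self.embed(center_ids))           # logits over context words

model = ShallowWord2vec()
loss_fn = nn.CrossEntropyLoss()  # applies the output softmax internally
```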
Word vectors are nothing but the hidden states of a neural network that is trying to get good at some task.
To answer your question: of course you can. And if you are going to do it, why not use fancier networks/encoders as well, such as a BiLSTM or a Transformer?
This is what the people who created things like ELMo and BERT did (though their networks were a lot fancier).
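As a rough sketch of the idea (assuming PyTorch; the names and sizes are illustrative, not anyone's actual implementation), you could swap the single linear hidden layer for a deeper non-linear encoder while keeping the softmax over the vocabulary:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 300, 512  # assumed sizes

class DeepContextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Stacked non-linear layers instead of a single linear projection;
        # a BiLSTM or Transformer encoder could be dropped in here instead.
        self.encoder = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim), nn.ReLU(),
        )
        self.out = nn.Linear(embed_dim, vocab_size)  # softmax output kept

    def forward(self, center_ids):
        h = self.encoder(self.embed(center_ids))  # the "word vector" is now this hidden state
        return self.out(h)                        # logits over context words

# Usage: predict a context word from a centre word, skip-gram style.
model = DeepContextEncoder()
logits = model(torch.tensor([1, 2]))                        # two centre-word ids
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 4]))  # their context-word ids
```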