Every time I read about Word2vec, the embedding is obtained with a very simple autoencoder: just one hidden layer, a linear activation for the hidden layer, and a softmax for the output layer.
My question is: why can't I train a Word2vec model using a stacked autoencoder, with several hidden layers and fancier activation functions? (The softmax at the output would be kept, of course.)
I have never found an explanation of this, so any hint is welcome.
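For concreteness, this is roughly the setup I mean, as a minimal sketch (PyTorch and the layer sizes are just my assumptions, not any reference implementation):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300  # illustrative sizes

class ShallowWord2vec(nn.Module):
    """One linear hidden layer; its weights are the word vectors."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # linear "hidden" layer
        self.out = nn.Linear(embed_dim, vocab_size)       # scores fed to the softmax

    def forward(self, center_ids):
        return self.out(self.embed(center_ids))           # logits over context words

model = ShallowWord2vec()
loss_fn = nn.CrossEntropyLoss()  # applies the output softmax internally
```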
Word vectors are nothing but the hidden states of a neural network that is trying to get good at some task.
To answer your question: of course you can. And if you are going to do it, why not use fancier networks/encoders as well, such as a BiLSTM or a Transformer?
This is what the people who created things like ELMo and BERT did (though their networks were a lot fancier).
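As a rough sketch of the idea (assuming PyTorch; the names and sizes are illustrative, not anyone's actual implementation), you could swap the single linear hidden layer for a deeper non-linear encoder while keeping the softmax over the vocabulary:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 300, 512  # assumed sizes

class DeepContextEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Stacked non-linear layers instead of a single linear projection;
        # a BiLSTM or Transformer encoder could be dropped in here instead.
        self.encoder = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim), nn.ReLU(),
        )
        self.out = nn.Linear(embed_dim, vocab_size)  # softmax output kept

    def forward(self, center_ids):
        h = self.encoder(self.embed(center_ids))  # the "word vector" is now this hidden state
        return self.out(h)                        # logits over context words

# Usage: predict a context word from a centre word, skip-gram style.
model = DeepContextEncoder()
logits = model(torch.tensor([1, 2]))                        # two centre-word ids
loss = nn.CrossEntropyLoss()(logits, torch.tensor([3, 4]))  # their context-word ids
```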