I am in a case that I don't have any pre-trained words embedding for my domain (Vietnamese food reviews). so I got a though of embedding from the general and specific corpus.
And the point here is can I use the dataset of training, test and validating (did preprocess) as a source for creating my own word embeddings. If don't, hope you can give your experience.
Based on my intuition, and some experiments a wide corpus appears to be better, but I'd like to know if there's relevant research or other relevant results.
can I use the dataset of training, test and validating (did preprocess) as a source for creating my own word embeddings
Sure, embeddings are not your features for your machine learning model. They are the "computational representation" of your data. In short, they are made of words represented in a vector space. With embeddings, your data is less sparse. Using word embeddings could be considered part of the pre-processing step of NLP.
Usually (I mean, using the most used technique, word2vec), the representation of a word in the vector space is defined by its surroundings (the words that it commonly goes along with).
Therefore, to create embeddings, the larger the corpus, the better, since it can better place a word vector in the vector space (and hence compare it to other similar words).