I've got a dataset of positive and negative content. So let's assume it's a spam project.
I need to build a model that can categorize the content as pos/neg. So I am doing a supervised learning task, because I've got a labeled dataset. Therefore, the best choice must be an SVC model.
So far so good.
Now the complicated part comes.
I want to solve the same task using a Keras LSTM model. So my question:
Is it still supervised, or is it unsupervised because I am using word embeddings for this task? According to this post, word embeddings are used for unsupervised tasks: https://www.quora.com/Is-deep-learning-supervised-unsupervised-or-something-else
There it says:
Deep learning can be Unsupervised : Word embedding, image encoding into lower or higher dimensional etc.
So, is it now unsupervised or supervised (because my dataset is labeled)?
And is deep learning another technique alongside unsupervised and supervised learning, or how are these topics related? Does deep learning use supervised and unsupervised techniques? Or does one have to choose between deep learning, unsupervised learning, and supervised learning?
It's so confusing! Please help, especially with the LSTM task. I need to know whether it's supervised (because of the labeled dataset) or unsupervised (because of the use of word embeddings).
Thanks in advance guys!
A quick word of encouragement: I recall feeling precisely the same way, insanely frustrated, when I started learning this field. It really does get easier!
Word embeddings are created by unsupervised learning. However, you can use a trained embedding layer within a supervised project, like you're doing. In other words, your project is one of supervised learning, and one of the layers is using weights that were acquired by an unsupervised training technique.
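To make that concrete, here's a minimal sketch of that supervised setup in Keras. The vocabulary size, layer sizes, and the `x_train`/`y_train` names are placeholders of my own, and it assumes your texts have already been tokenized into padded integer sequences:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Hypothetical sizes -- substitute your own tokenized, padded data.
vocab_size = 10000   # number of distinct tokens in your vocabulary
max_len = 100        # length of each padded sequence

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=64),  # token id -> vector
    LSTM(32),                                        # reads the sequence of vectors
    Dense(1, activation="sigmoid"),                  # probability of "positive"
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# x_train: integer sequences, shape (n_samples, max_len)
# y_train: your labels (0 = negative, 1 = positive) -- supplying these
#          is exactly what makes this supervised learning.
# model.fit(x_train, y_train, epochs=5, validation_split=0.1)
```

The fact that you hand the model your labels during `fit` is what makes this supervised, regardless of how the embedding weights came to be.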
It may be helpful to further understand embedding layers, how they're made, and what they can do for supervised learning. I'll try to explain in a non-technical way, so you can get a feel for the concept before learning the particulars and pedantic details.
Suppose you begin with a giant corpus. You count the frequency of occurrence of every word and use it to rank each one relative to the others (or use some other formula, whatever). This is a method of text "tokenization." The point is to turn words into numbers. Obviously this is important, since we're about to do math with them, but it creates a bit of a pinch: the numerical relationships don't necessarily carry any information about the relationships between the meanings of the words.

To ameliorate this, you can train a little network like so: take chunks from your corpus and create skipgrams, and teach the network that, after the application of weights and a measure of cosine similarity, the output should be 1 if the words appear near each other (or meet some other criterion), or 0 (or perhaps -1, if you prefer) when they do not. Over the course of the corpus, words that tend to be used together will move together, and likewise the inverse. The objective is to create a kind of map (or a simulacrum, if you will) of the relative meanings of the tokens (which are words); said another way, the objective is to create an n-dimensional representation of the words' relative meanings. Then, after training, the embeddings can be saved for use in projects like yours. Your embedding layer will then look up each token in the saved embeddings and grab its output, which is that word's vector representation in the embedding space; its coordinates in our theoretical map.

This is considered "unsupervised" because you don't have to explicitly supply the ground truth for comparison; in this case, it's generated procedurally from the training sample (i.e. skipgrams generated from whatever the input was). Another example would be when the expected output is identical to the input (as in autoencoders), which is unsupervised for the same reason: you don't have to supply an expected output; if you supply an input, it automatically comes with its expected output.
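If you want to see what that pair-generation step looks like in practice, here's a rough sketch using the tf.keras preprocessing utilities; the toy corpus and the window size are just placeholders I made up:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import skipgrams

# A toy "corpus" standing in for your giant one.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the lazy dog sleeps all day",
]

# Tokenization: rank words by frequency and map each one to an integer.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
sequences = tokenizer.texts_to_sequences(corpus)
vocab_size = len(tokenizer.word_index) + 1

# Skipgrams: (target, context) pairs labeled 1 if the words co-occur
# within the window, 0 for randomly drawn negative samples.
pairs, labels = skipgrams(sequences[0], vocabulary_size=vocab_size, window_size=2)
print(pairs[:3], labels[:3])
```

Notice that the 1/0 labels are produced automatically from the text itself, which is exactly why nobody calls this step supervised.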
If all of that is confusing, then just pause and consider your own thoughts: if I ask you for a word that means the same thing as "big" in the phrase "a big pizza," you consult your understanding of the meaning of "big" as pertains to the indicated phrase, and draw something as close to it as possible: perhaps the word "large." Embeddings are a way of making a map where "big" and "large" are positioned very close together along most axes (i.e. in most dimensions).
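To put a number on "very close together": one common measure of closeness in an embedding space is cosine similarity. The vectors below are made-up, four-dimensional stand-ins purely for illustration (real embeddings are learned and typically have 50-300 dimensions):

```python
import numpy as np

# Hypothetical embedding vectors, invented for this example only.
big   = np.array([0.8, 0.1, 0.6, 0.2])
large = np.array([0.7, 0.2, 0.5, 0.3])
pizza = np.array([0.1, 0.9, 0.0, 0.7])

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(big, large))  # close to 1: near each other on the map
print(cosine_similarity(big, pizza))  # much lower: far apart
```

The similarity for "big" vs. "large" lands near 1, while "big" vs. "pizza" lands much lower; that's the numeric version of the map analogy above.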
So, then, when you load some pre-trained embeddings, you're just loading some weights into one of your layers. Sometimes people initialize layers with zeros, other times people use random values drawn from a normal (Gaussian) distribution, and sometimes people use specific values (e.g. loading a saved network, or loading embeddings); it's all the same. If you go on to perform supervised training, then you're doing precisely that: performing supervised training. Following the embedding layer, the information you're working with is no longer arbitrary tokens, but rather relative meanings. And if that isn't just neat, I don't know what is! I find it's helpful to consider what your data represents as it passes through the network.
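In Keras terms, "loading some weights into one of your layers" can look like the sketch below. The matrix here is random filler and the sizes are assumptions of mine; in a real project each row would be the saved vector for the corresponding token (e.g. from GloVe or a word2vec model you trained):

```python
import numpy as np
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

vocab_size = 10000   # hypothetical: size of your tokenizer's vocabulary
embedding_dim = 100  # must match the dimensionality of the saved vectors

# Placeholder matrix -- in practice, fill each row with the saved vector
# for that token.
embedding_matrix = np.random.normal(size=(vocab_size, embedding_dim))

embedding_layer = Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),  # load the pre-trained weights
    trainable=False,  # freeze them, or set True to fine-tune during supervised training
)
```

With `trainable=False` the relative meanings stay exactly as the unsupervised step left them; with `trainable=True`, your supervised training is allowed to nudge them toward whatever helps the pos/neg objective.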