Say I have some text that I want to classify into three groups: food, sports, science.
If I have the sentence "I dont like to eat mushrooms",
we can use word embeddings (say 100 dimensions) to create a 6x100
matrix for this particular sentence.
Usually when training a neural network, our data is a 2D array with dimensions n_obs x m_features.
If I want to train a neural network on word-embedded sentences (I'm using PyTorch), then the input is 3D: n_obs x (m_words x k_embedding_dims),
e.g.
# Say the word embedding has 3 dimensions
I = [1, 2, 3]
dont = [4, 5, 6]
eat = [7, 8, 9]
mushrooms = [10, 11, 12]

# First observation: "I dont eat mushrooms" becomes a 4x3 matrix
sentence = [I, dont, eat, mushrooms]
When we have N > 2 dimensions, is the best approach to do some kind of pooling (e.g. taking the mean over the words), or can we use the full 2D features per observation as input?
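For illustration, this is what I understand by mean pooling (a made-up numpy sketch): it collapses the 4x3 sentence matrix into a single fixed-size vector per observation.

import numpy as np

# Average the word vectors of one sentence along the word axis
sentence = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])  # 4 words x 3 dims
pooled = sentence.mean(axis=0)  # -> array([5.5, 6.5, 7.5]), one fixed-size feature vector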
Technically the input will be 1D, but that doesn't matter.
The internal architecture of your neural network will take care of recognizing the different words. For example, you could use a convolution with a stride equal to the embedding size.
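As a rough illustration of that idea (a minimal sketch, assuming a 3-dimensional embedding and a 4-word sentence; the layer sizes are made up), a Conv1d whose kernel size and stride both equal the embedding size looks at exactly one word per output position:

import torch
import torch.nn as nn

embedding_dim = 3

# One flattened sentence, shape (batch, words * embedding_dim)
flat = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]], dtype=torch.float32)

# Kernel size and stride equal to the embedding size, so each output position
# covers exactly one word's embedding
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=embedding_dim, stride=embedding_dim)

out = conv(flat.unsqueeze(1))   # add a channel dimension -> input shape (1, 1, 12)
print(out.shape)                # torch.Size([1, 8, 4]): one feature vector per word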
You can flatten the 2D input into 1D and it will work fine; this is how you'd normally do it with word embeddings.
import numpy as np

I = [1, 2, 3]
dont = [4, 5, 6]
eat = [7, 8, 9]
mushrooms = [10, 11, 12]

# Flatten the 4x3 sentence matrix into a single 12-element vector
input = np.array([I, dont, eat, mushrooms]).flatten()
The inputs of a neural network always have to be the same size, but sentences are not, so you will probably have to limit sentences to a maximum length (in words) and add padding words to the end of the shorter ones:
I = [1, 2, 3]
Am = [4, 5, 6]
short = [7, 8, 9]
paddingword = [1, 1, 1]

# "I Am short" padded up to the maximum length (here 4 words)
input = np.array([I, Am, short, paddingword]).flatten()
Also, you might want to look at doc2vec from gensim, which is an easy way to make embeddings for whole texts; those are then easy to use as features for a text classification problem.
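For reference, a minimal Doc2Vec sketch could look roughly like this (the corpus, tags and parameter values are made up; check the gensim documentation for details):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny made-up corpus: one tagged document per sentence
corpus = [
    TaggedDocument(words=["i", "dont", "eat", "mushrooms"], tags=[0]),
    TaggedDocument(words=["the", "team", "won", "the", "game"], tags=[1]),
]

# Train a small Doc2Vec model; vector_size is the fixed length of each text embedding
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a fixed-size vector for a new sentence and use it as classifier features
vec = model.infer_vector(["i", "like", "science"])
print(vec.shape)  # (50,)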