Say I have some text that I want to classify into three groups: food, sports, science.
If I have the sentence "I dont like to eat mushrooms",
we can use word embeddings (say 100 dimensions) to create a 6x100
matrix for this particular sentence.
Usually when training a neural network, our data is a 2D array with dimensions n_obs x m_features.
If I want to train a neural network on word-embedded sentences (I'm using PyTorch), then the input is 3D: n_obs x (m_words x k_embedding_dims),
e.g.
# Say the word embedding has 3 dimensions
I = [1, 2, 3]
dont = [4, 5, 6]
eat = [7, 8, 9]
mushrooms = [10, 11, 12]

# First observation: "I dont eat mushrooms" becomes a 4x3 matrix
sentence = [I, dont, eat, mushrooms]
When we have N > 2 dimensions, is the best approach to do some kind of pooling (e.g. taking the mean over the words), or can we use the full 2D features per observation as input?
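For illustration, this is what I understand by mean pooling (a made-up numpy sketch): it collapses the 4x3 sentence matrix into a single fixed-size vector per observation.

import numpy as np

# Average the word vectors of one sentence along the word axis
sentence = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])  # 4 words x 3 dims
pooled = sentence.mean(axis=0)  # -> array([5.5, 6.5, 7.5]), one fixed-size feature vector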
Technically the input will be 1D, but that doesn't matter.
The internal architecture of your neural network will take care of recognizing the different words. For example, you could use a convolution with a stride equal to the embedding size.
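As a rough illustration of that idea (a minimal sketch, assuming a 3-dimensional embedding and a 4-word sentence; the layer sizes are made up), a Conv1d whose kernel size and stride both equal the embedding size looks at exactly one word per output position:

import torch
import torch.nn as nn

embedding_dim = 3

# One flattened sentence, shape (batch, words * embedding_dim)
flat = torch.tensor([[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]], dtype=torch.float32)

# Kernel size and stride equal to the embedding size, so each output position
# covers exactly one word's embedding
conv = nn.Conv1d(in_channels=1, out_channels=8, kernel_size=embedding_dim, stride=embedding_dim)

out = conv(flat.unsqueeze(1))   # add a channel dimension -> input shape (1, 1, 12)
print(out.shape)                # torch.Size([1, 8, 4]): one feature vector per word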
You can flatten the 2D input into 1D and it will work fine; this is how you'd normally do it with word embeddings.
import numpy as np

I = [1, 2, 3]
dont = [4, 5, 6]
eat = [7, 8, 9]
mushrooms = [10, 11, 12]

# Flatten the 4x3 sentence matrix into a single 12-element vector
input = np.array([I, dont, eat, mushrooms]).flatten()
The inputs of a neural network always have to be the same size, but sentences are not, so you will probably have to limit sentences to a maximum length (in words) and add padding words to the end of the shorter ones:
I = [1, 2, 3]
Am = [4, 5, 6]
short = [7, 8, 9]
paddingword = [1, 1, 1]

# "I Am short" padded up to the maximum length (here 4 words)
input = np.array([I, Am, short, paddingword]).flatten()
Also, you might want to look at doc2vec from gensim, which is an easy way to make embeddings for whole texts; those are then easy to use as features for a text classification problem.
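For reference, a minimal Doc2Vec sketch could look roughly like this (the corpus, tags and parameter values are made up; check the gensim documentation for details):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tiny made-up corpus: one tagged document per sentence
corpus = [
    TaggedDocument(words=["i", "dont", "eat", "mushrooms"], tags=[0]),
    TaggedDocument(words=["the", "team", "won", "the", "game"], tags=[1]),
]

# Train a small Doc2Vec model; vector_size is the fixed length of each text embedding
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Infer a fixed-size vector for a new sentence and use it as classifier features
vec = model.infer_vector(["i", "like", "science"])
print(vec.shape)  # (50,)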