keras, deep-learning, classification, multilabel-classification, transformer-model

Multilabel classification of a sequence, how to do it?


I am quite new to deep learning, especially Keras. I have a simple classification problem and I don't know how to solve it. What I don't understand is the general process of classification: converting the input data into tensors, preparing the labels, and so on.

Let's say we have three classes, 1, 2, 3.

Each sequence consists of these classes and needs to be classified as one of them. For example, the dataset could be

  • Sequence 1, 1, 1, 2 is labeled 2
  • Sequence 2, 1, 3, 3 is labeled 1
  • Sequence 3, 1, 2, 1 is labeled 3

and so on.

This means the input dataset will be

[[1, 1, 1, 2],
 [2, 1, 3, 3],
 [3, 1, 2, 1]]

and the label will be

[[2],
 [1],
 [3]]

Now one thing that I do understand is one-hot encoding the classes. Because we have three classes, every 1 will be converted into [1, 0, 0], 2 into [0, 1, 0], and 3 into [0, 0, 1]. Converting the example above gives a dataset of shape 3 x 4 x 3 and labels of shape 3 x 1 x 3.
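That one-hot conversion can be sketched with plain numpy, by indexing the identity matrix (this is just one way to do it; Keras also offers a utility for the same thing):

```python
import numpy as np

# Classes 1..3 map to one-hot positions 0..2, so subtract 1 before indexing
data = np.array([[1, 1, 1, 2],
                 [2, 1, 3, 3],
                 [3, 1, 2, 1]])
labels = np.array([2, 1, 3])

x = np.eye(3)[data - 1]     # one-hot sequences, shape (3, 4, 3)
y = np.eye(3)[labels - 1]   # one-hot labels, shape (3, 3)
```

Note that the labels come out here with shape (3, 3) rather than (3, 1, 3); the singleton dimension is usually squeezed away before training.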

Another thing that I understand is that the last layer should be a softmax layer. This way, when test data (e.g. [1, 2, 3, 1]) comes in, it will be softmaxed and the probabilities of this sequence belonging to class 1, 2, or 3 will be calculated.
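The softmax step can be illustrated with a few lines of plain numpy (the logit values here are made up, not from any trained model):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0])  # hypothetical raw outputs of the last layer
probs = softmax(logits)              # three probabilities summing to 1
```

Each output entry is a probability for one class, and the predicted class is simply the index of the largest one.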

Am I right? If so, can you give me an explanation/example of the process of classifying these sequences?

Thank you in advance.


Solution

  • Here are a few clarifications that you seem to be asking about.

    • If your input data has the shape (4,), then your input tensor will have the shape (batch_size, 4).
    • Softmax is the correct activation for your prediction (last) layer given your desired output, because you have a classification problem with multiple classes. This will yield output of shape (batch_size, 3): the probabilities of each potential classification, summing to one across all classes. For example, if the true class is class 0, then a single prediction might look something like [0.9714, 0.0113, 0.0173].
    • Batch size isn't hard-coded into the network, hence it is represented in model.summary() as None; e.g. the network's last-layer output shape would be written (None, 3).
    • A softmax prediction layer is normally paired with the categorical_crossentropy loss function (or sparse_categorical_crossentropy if your labels are integer-encoded rather than one-hot).
    • The architecture of a network remains up to you, but you'll at least need a way in and a way out. In Keras (as you've tagged), there are a few ways to do this. Here are some examples:

    Example with Keras Sequential

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import InputLayer, Dense

    model = Sequential()
    model.add(InputLayer(input_shape=(4,)))   # sequence of length four
    model.add(Dense(3, activation='softmax')) # three possible classes
    

    Example with Keras Functional

    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Input, Dense

    input_tensor = Input(shape=(4,))
    x = Dense(3, activation='softmax')(input_tensor)
    model = Model(input_tensor, x)
    

    Example declaring the input shape in the first layer instead of a separate InputLayer (Sequential):

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense

    model = Sequential()
    model.add(Dense(666, activation='relu', input_shape=(4,))) # hidden width is arbitrary
    model.add(Dense(3, activation='softmax'))
    

    Hope that helps!
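Putting the pieces together, here is a minimal end-to-end sketch for the example data, feeding the one-hot sequences of shape (4, 3) through the Functional API. The hidden width (32) and epoch count are arbitrary choices for illustration:

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Flatten, Dense

# One-hot encode the example data (classes 1..3 -> positions 0..2)
data = np.array([[1, 1, 1, 2], [2, 1, 3, 3], [3, 1, 2, 1]])
labels = np.array([2, 1, 3])
x = np.eye(3)[data - 1]     # shape (3, 4, 3)
y = np.eye(3)[labels - 1]   # shape (3, 3)

inputs = Input(shape=(4, 3))                 # sequence of 4 one-hot classes
h = Flatten()(inputs)                        # (4, 3) -> (12,)
h = Dense(32, activation='relu')(h)          # arbitrary hidden width
outputs = Dense(3, activation='softmax')(h)  # probabilities over the 3 classes
model = Model(inputs, outputs)

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x, y, epochs=50, verbose=0)

probs = model.predict(x, verbose=0)          # shape (3, 3), each row sums to ~1
```

With only three training examples this will of course just memorize the data, but it shows the full flow: encode, build, compile with categorical_crossentropy, fit, and read the class probabilities off the softmax output.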