python-3.x tensorflow keras neural-network sparse-matrix

How to handle correctly sparse features to avoid poor performance of classification neural network?

I'm trying to understand how sparse neural networks work. I have a very sparse data of about 40k rows for two classes. The dataset looks like this:

    RA0 RA1 RA2 RA3 RA4 RA5 RA6 RA7 RA8 RA9 RB0 RB1 RB2 RB3 RB4 RB5 RB6 RB7 RB8 RB9
50  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
51  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
52  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
53  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
54  0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
55  1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
58  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
59  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
60  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
61  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
62  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
63  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

As you can see, some rows have only 0's on it. The columns with name RA are the features of a class 0 and the columns with name RB are the features of class 1, so the same dataset with the actual labels looks like this:

    RA0 RA1 RA2 RA3 RA4 RA5 RA6 RA7 RA8 RA9 ... RB1 RB2 RB3 RB4 RB5 RB6 RB7 RB8 RB9 label
50  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
51  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
52  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
53  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
54  0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
55  1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
57  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
58  1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
59  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
60  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0

I did a simple neural network model using Keras, but the model isn't learning and accuracy rarely goes beyond 52% on train dataset. I tried two variations of the same model:

Variation 1:

def build_nn(n_features,lr = 0.001):
    _input = Input(shape = (n_features,),name = 'input',sparse = True)
    x = Dense(12,kernel_initializer = 'he_uniform',activation = 'relu')(_input)
    x = Dropout(0.5)(x)
    x = Dense(8,kernel_initializer = 'he_uniform',activation = 'relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(2,kernel_initializer = 'he_uniform',activation = 'softmax')(x)
    nn = Model(inputs = [_input],outputs = [x])
    nn.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(lr = lr),metrics=['accuracy'])
    return nn

Variation 2:

def build_nn(feature_layer,lr = 0.001):
    feature_inputs = {}
    for feature in feature_layer:
        feature_inputs[feature.key] = Input(shape = (1,),name = feature.key)
    feature_layer = tf.keras.layers.DenseFeatures(feature_layer)
    feature_inputs_n = feature_layer(feature_inputs)
    x = Dense(12,kernel_initializer = 'he_uniform',activation = 'relu')(feature_inputs_n)
    x = Dropout(0.5)(x)
    x = Dense(8,kernel_initializer = 'he_uniform',activation = 'relu')(x)
    x = Dropout(0.5)(x)
    x = Dense(2,kernel_initializer = 'he_uniform',activation = 'softmax')(x)
    nn = Model(inputs = [v for v in feature_inputs.values()],outputs = [x])
    nn.compile(loss='sparse_categorical_crossentropy',optimizer=Adam(lr = lr),metrics=['accuracy'])
    return nn

The motivation behind doing the variation 2 is because the features are sparse and I thought that this could have an impact on the model's performance, so I followed this tensorflow guide.

Also, the labels are converted to a categorical label using to_categorical function, provided by the keras api:

y_train2 = to_categorical(y_train)
y_test2 = to_categorical(y_test)

My questions are:

Is my model wrong (especially the variation 2) or if I'm doing the wrong representation of the sparse features and how this features should be handled?
The RA and RB are the features of two different classes and since there are rows full of 0, should I add a third class representing an unknown class or remove the rows that contains only 0?
Since RA and RB map two different classes, should I do two separate model, one for columns RA and class 0 and the other for columns RB and class 1?

I'm also posting an image of the train/test model's accuracy:

I can also provide any other part of the code if needed.

EDIT: I didn't put this part because I felt it doesn't has a relation to what I was asking, but it seems I was wrong.

Each feature is an individual branch from a sklearn decision tree. The class that the decision tree looks for is an up or down for the next candle in a trading enviroment (a candle is a price aggregation of an instrument in time that has an open, low, high and close price). Then, the idea is to grab those branches, that are valuated in the price time series, and evaluate if the condition is met, so if the branch is active the value is 1.

For example, branch RA0 at index 55 is active, so the value is 1. The labels are calculated as np.sign(close - open). So, the idea is that by using multiple branches the classification of the label can be improved, by having a neural network that can see if which branch is active and which one has more weight in order to make a classification.

Solution

The use of sparse_categorical_crossentropy is wrong here; the sparsity in sparse_categorical_crossentropy refers to the label representation, and not to the features. Since you are using one-hot encoded labels:

y_train2 = to_categorical(y_train)
y_test2 = to_categorical(y_test)

and a final layer of 2 nodes with activation = 'softmax' (which I take it to mean that you have only 2 classes), you should switch to loss='categorical_crossentropy' irrespectively of the sparsity in your features.

Other general remarks:

Remove dropout, which should never be used by default. Dropout is used to help against overfitting if such a thing is detected; used uncritically (even worse, with such high values), it is well-known to prevent training altogether (i.e. something very similar to what you report here).
Remove kernel_initializer = 'he_uniform' from all layers, thus leaving the default glorot_uniform one (useful hint: default values are there for a reason, and it is not advisable to play with them unless you have a specific reason to do so and you know exactly what you are doing).