machine-learning neural-network deep-learning mxnet

mxnet training not progressing


Thanks in advance for any help.

I am having some issues getting an mxnet model to converge to anything: it seems stuck close to its initial weights.

Below is a minimal reproducible example (I have struggled to get many such models working today). I have tried this approach with a range of epoch counts (up to 100) and a range of learning rates (0.001 to 10), and cannot get anything sensible out of it.

import mxnet as mx
import numpy as np

inputs = np.expand_dims(np.random.uniform(size=10000), axis=1)
labels = np.sin(inputs)

data_iter = mx.io.NDArrayIter(data=inputs, label=labels, data_name='data', label_name='label', batch_size=50)

data = mx.sym.Variable('data')
label = mx.sym.Variable('label')

fc1 = mx.sym.FullyConnected(data=data, num_hidden=128)
ac1 = mx.sym.Activation(data=fc1, act_type='relu')

fc2 = mx.sym.FullyConnected(data=ac1, num_hidden=64)
ac2 = mx.sym.Activation(data=fc2, act_type='relu')

fc3 = mx.sym.FullyConnected(data=ac2, num_hidden=16)
ac3 = mx.sym.Activation(data=fc3, act_type='relu')

output = mx.sym.FullyConnected(data=ac3, num_hidden=1)
loss = mx.symbol.MakeLoss(mx.symbol.square(output - label), name="loss")

model = mx.module.Module(symbol=loss, data_names=('data',), label_names=('label',))

import logging
logging.getLogger().setLevel(logging.DEBUG)
model.fit(data_iter,
          optimizer='sgd',
          optimizer_params={'learning_rate':0.1},
          eval_metric='mse',
          num_epoch=5)

gives rise to:

INFO:root:Epoch[0] Train-mse=0.221155
INFO:root:Epoch[0] Time cost=0.173
INFO:root:Epoch[1] Train-mse=0.225179
INFO:root:Epoch[1] Time cost=0.176
INFO:root:Epoch[2] Train-mse=0.225179
INFO:root:Epoch[2] Time cost=0.179
INFO:root:Epoch[3] Train-mse=0.225179
INFO:root:Epoch[3] Time cost=0.176
INFO:root:Epoch[4] Train-mse=0.225179
INFO:root:Epoch[4] Time cost=0.183

where it's clear the training isn't really progressing.
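
One possible reason the log looks frozen, offered as an assumption rather than a confirmed diagnosis: with `MakeLoss`, the symbol's forward output is the per-sample loss itself, so `eval_metric='mse'` may be comparing loss values against the labels instead of predictions against the labels. The NumPy-only sketch below imitates that situation with a hypothetical model that always predicts the mean label. The names `constant_pred`, `true_mse`, and `reported_mse` are illustrative and not part of MXNet.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.uniform(size=(10000, 1))
labels = np.sin(inputs)

# Hypothetical model that ignores its input and predicts the mean label.
constant_pred = np.full_like(labels, labels.mean())

# What a MakeLoss-style symbol would output: the per-sample squared error.
per_sample_loss = np.square(constant_pred - labels)

# The error one actually cares about: predictions vs. labels.
true_mse = float(np.mean(per_sample_loss))

# What an MSE metric computes if it is handed the loss values as "predictions".
reported_mse = float(np.mean(np.square(per_sample_loss - labels)))

print(f"true MSE:     {true_mse:.4f}")
print(f"reported MSE: {reported_mse:.4f}")
```

If the two numbers differ markedly, the `Train-mse` values in the log above may not be a reliable measure of fit, and inspecting the raw loss or the predictions directly is likely to be more informative.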


Solution

  • I took your code, updated it a bit, and was able to make it converge; the code is pasted below.

    Updates I made: I reduced the network to two fully connected hidden layers with 128 units each, replaced the custom loss with the built-in LinearRegressionOutput, added momentum, lowered the learning rate from 0.1 to 0.005, and ran more epochs (50 instead of 5).

    Hope this helps!

    import mxnet as mx
    import numpy as np
    
    inputs = np.expand_dims(np.random.uniform(size=10000), axis=1)
    labels = np.sin(inputs)
    
    data_iter = mx.io.NDArrayIter(data=inputs, label=labels, data_name='data', label_name='label', batch_size=50)
    
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('label')
    
    fc1 = mx.sym.FullyConnected(data=data, num_hidden=128)
    ac1 = mx.sym.Activation(data=fc1, act_type='relu')
    
    fc2 = mx.sym.FullyConnected(data=ac1, num_hidden=128)
    ac2 = mx.sym.Activation(data=fc2, act_type='relu')
    
    output = mx.sym.FullyConnected(data=ac2, num_hidden=1)
    #loss = mx.symbol.MakeLoss(mx.symbol.square(output - label), name="loss")
    loss = mx.sym.LinearRegressionOutput(data=output, label=label, name="loss")
    
    model = mx.module.Module(symbol=loss, data_names=('data',), label_names=('label',))
    
    import logging
    logging.getLogger().setLevel(logging.DEBUG)
    model.fit(data_iter,
              optimizer='sgd',
              optimizer_params={'learning_rate':0.005, 'momentum': 0.9},
              eval_metric='mse',
              num_epoch=50)
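
To illustrate what the `momentum` setting changes, here is a minimal pure-Python sketch of the classic heavy-ball update on a toy quadratic, using the same `learning_rate=0.005` and `momentum=0.9` as above. The function `run_sgd` and the toy objective are hypothetical, chosen for illustration only, and the update rule is the textbook form rather than a copy of MXNet's optimizer code.

```python
def run_sgd(learning_rate, momentum, steps=200, w0=1.0):
    """Minimise f(w) = 0.5 * w**2 with SGD, optionally with momentum."""
    w = w0
    velocity = 0.0
    for _ in range(steps):
        grad = w  # derivative of 0.5 * w**2
        # Heavy-ball update: keep a decaying running sum of past steps.
        velocity = momentum * velocity - learning_rate * grad
        w += velocity
    return w

w_plain = run_sgd(learning_rate=0.005, momentum=0.0)
w_momentum = run_sgd(learning_rate=0.005, momentum=0.9)

print(f"plain SGD:    w = {w_plain:.6f}")
print(f"momentum SGD: w = {w_momentum:.6f}")
```

On this toy problem the momentum run ends much closer to the minimum at w = 0 after the same number of steps, which is the usual motivation for pairing a small learning rate with momentum.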
    

    Results:

    INFO:root:Epoch[0] Train-mse=0.076923
    INFO:root:Epoch[0] Time cost=0.148
    INFO:root:Epoch[1] Train-mse=0.061155
    INFO:root:Epoch[1] Time cost=0.178
    INFO:root:Epoch[2] Train-mse=0.061154
    INFO:root:Epoch[2] Time cost=0.168
    INFO:root:Epoch[3] Train-mse=0.061153
    INFO:root:Epoch[3] Time cost=0.151
    INFO:root:Epoch[4] Train-mse=0.061151
    INFO:root:Epoch[4] Time cost=0.182
    INFO:root:Epoch[5] Train-mse=0.061150
    INFO:root:Epoch[5] Time cost=0.186
    INFO:root:Epoch[6] Train-mse=0.061149
    INFO:root:Epoch[6] Time cost=0.197
    INFO:root:Epoch[7] Train-mse=0.061147
    INFO:root:Epoch[7] Time cost=0.174
    INFO:root:Epoch[8] Train-mse=0.061145
    INFO:root:Epoch[8] Time cost=0.148
    INFO:root:Epoch[9] Train-mse=0.061142
    INFO:root:Epoch[9] Time cost=0.150
    INFO:root:Epoch[10] Train-mse=0.061140
    INFO:root:Epoch[10] Time cost=0.145
    INFO:root:Epoch[11] Train-mse=0.061136
    INFO:root:Epoch[11] Time cost=0.135
    INFO:root:Epoch[12] Train-mse=0.061133
    INFO:root:Epoch[12] Time cost=0.136
    INFO:root:Epoch[13] Train-mse=0.061128
    INFO:root:Epoch[13] Time cost=0.137
    INFO:root:Epoch[14] Train-mse=0.061122
    INFO:root:Epoch[14] Time cost=0.146
    INFO:root:Epoch[15] Train-mse=0.061116
    INFO:root:Epoch[15] Time cost=0.135
    INFO:root:Epoch[16] Train-mse=0.061108
    INFO:root:Epoch[16] Time cost=0.152
    INFO:root:Epoch[17] Train-mse=0.061098
    INFO:root:Epoch[17] Time cost=0.179
    INFO:root:Epoch[18] Train-mse=0.061086
    INFO:root:Epoch[18] Time cost=0.160
    INFO:root:Epoch[19] Train-mse=0.061069
    INFO:root:Epoch[19] Time cost=0.151
    INFO:root:Epoch[20] Train-mse=0.061050
    INFO:root:Epoch[20] Time cost=0.145
    INFO:root:Epoch[21] Train-mse=0.061024
    INFO:root:Epoch[21] Time cost=0.164
    INFO:root:Epoch[22] Train-mse=0.060990
    INFO:root:Epoch[22] Time cost=0.151
    INFO:root:Epoch[23] Train-mse=0.060944
    INFO:root:Epoch[23] Time cost=0.141
    INFO:root:Epoch[24] Train-mse=0.060881
    INFO:root:Epoch[24] Time cost=0.136
    INFO:root:Epoch[25] Train-mse=0.060790
    INFO:root:Epoch[25] Time cost=0.124
    INFO:root:Epoch[26] Train-mse=0.060658
    INFO:root:Epoch[26] Time cost=0.151
    INFO:root:Epoch[27] Train-mse=0.060455
    INFO:root:Epoch[27] Time cost=0.166
    INFO:root:Epoch[28] Train-mse=0.060131
    INFO:root:Epoch[28] Time cost=0.148
    INFO:root:Epoch[29] Train-mse=0.059582
    INFO:root:Epoch[29] Time cost=0.219
    INFO:root:Epoch[30] Train-mse=0.058581
    INFO:root:Epoch[30] Time cost=0.160
    INFO:root:Epoch[31] Train-mse=0.056593
    INFO:root:Epoch[31] Time cost=0.178
    INFO:root:Epoch[32] Train-mse=0.052252
    INFO:root:Epoch[32] Time cost=0.184
    INFO:root:Epoch[33] Train-mse=0.042274
    INFO:root:Epoch[33] Time cost=0.168
    INFO:root:Epoch[34] Train-mse=0.023321
    INFO:root:Epoch[34] Time cost=0.162
    INFO:root:Epoch[35] Train-mse=0.005860
    INFO:root:Epoch[35] Time cost=0.161
    INFO:root:Epoch[36] Train-mse=0.000848
    INFO:root:Epoch[36] Time cost=0.164
    INFO:root:Epoch[37] Train-mse=0.000319
    INFO:root:Epoch[37] Time cost=0.176
    INFO:root:Epoch[38] Train-mse=0.000221
    INFO:root:Epoch[38] Time cost=0.148
    INFO:root:Epoch[39] Train-mse=0.000163
    INFO:root:Epoch[39] Time cost=0.199
    INFO:root:Epoch[40] Train-mse=0.000123
    INFO:root:Epoch[40] Time cost=0.141
    INFO:root:Epoch[41] Train-mse=0.000096
    INFO:root:Epoch[41] Time cost=0.133
    INFO:root:Epoch[42] Train-mse=0.000078
    INFO:root:Epoch[42] Time cost=0.144
    INFO:root:Epoch[43] Train-mse=0.000065
    INFO:root:Epoch[43] Time cost=0.174
    INFO:root:Epoch[44] Train-mse=0.000056
    INFO:root:Epoch[44] Time cost=0.208
    INFO:root:Epoch[45] Train-mse=0.000050
    INFO:root:Epoch[45] Time cost=0.152
    INFO:root:Epoch[46] Train-mse=0.000045
    INFO:root:Epoch[46] Time cost=0.154
    INFO:root:Epoch[47] Train-mse=0.000041
    INFO:root:Epoch[47] Time cost=0.151
    INFO:root:Epoch[48] Train-mse=0.000039
    INFO:root:Epoch[48] Time cost=0.177
    INFO:root:Epoch[49] Train-mse=0.000036
    INFO:root:Epoch[49] Time cost=0.135
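
A way to read the long plateau near 0.0611 in the log above: it sits close to the variance of the labels, which is the MSE of a model that always outputs the mean of sin(x) on [0, 1]. The sketch below checks that number with NumPy only. It is a back-of-the-envelope interpretation, not something the original answer states, and `baseline_mse` is an illustrative name.

```python
import math
import numpy as np

# Closed form for x ~ Uniform(0, 1):
#   E[sin x]   = 1 - cos(1)
#   E[sin^2 x] = 1/2 - sin(2)/4
mean_label = 1.0 - math.cos(1.0)
mean_square = 0.5 - math.sin(2.0) / 4.0
analytic_variance = mean_square - mean_label ** 2

# Empirical check on data generated the same way as in the post.
rng = np.random.default_rng(0)
labels = np.sin(rng.uniform(size=(10000, 1)))
baseline_mse = float(np.mean(np.square(labels - labels.mean())))

print(f"analytic variance of labels: {analytic_variance:.4f}")
print(f"MSE of mean-only predictor:  {baseline_mse:.4f}")
```

Both values come out around 0.061, so until roughly epoch 30 the network appears to be doing little better than predicting a constant; the sharp drop afterwards is where it starts to fit the curve.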