Tags: neural-network, backpropagation

Momentum in neural networks

Should the momentum factor relate to both the dataset instance and the individual weight, or just the weight? E.g.:

def get_momentum(instance, weight):
    # Pseudocode: returns the momentum factor for this (instance, weight) pair.
    return float

instance1 = ...  # 1 x n input vector
instance2 = ...  # 1 x n input vector
weights   = ...  # 1 x n weight vector

# Option 1: momentum depends on both the instance and the weight
get_momentum( instance1, weights[0] ) # eg returns 0.1
get_momentum( instance2, weights[0] ) # eg returns 0.3 <-- same weight, different momentum

# Option 2: momentum depends on the weight only
get_momentum( instance1, weights[0] ) # eg returns 0.1
get_momentum( instance2, weights[0] ) # eg returns 0.1

The second alternative has lower memory complexity. However, I believe it would also make the learning algorithm more likely to get stuck in local optima than the first alternative, since Option 1 should produce a stronger momentum "pull".
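
To make the two options concrete, here is a minimal runnable sketch of the bookkeeping each one implies. The update rule (delta = -lr * grad + mu * prev_delta) and all names and sizes are illustrative assumptions, not taken from any particular implementation:

import numpy as np

# Illustrative sketch: classic momentum update, delta = -lr * grad + mu * prev_delta.
# All sizes and names here are assumptions for the sake of the example.
n_instances, n_weights = 4, 3
lr, mu = 0.1, 0.9

# Option 1: one stored delta per (instance, weight) pair -> O(instances * weights)
prev_delta_per_instance = np.zeros((n_instances, n_weights))

# Option 2: one stored delta per weight, shared by all instances -> O(weights)
prev_delta_shared = np.zeros(n_weights)

def step_option1(weights, grad, i):
    # Momentum is tracked separately for each training instance i.
    delta = -lr * grad + mu * prev_delta_per_instance[i]
    prev_delta_per_instance[i] = delta
    return weights + delta

def step_option2(weights, grad):
    # Momentum is shared across all training instances.
    delta = -lr * grad + mu * prev_delta_shared
    prev_delta_shared[:] = delta
    return weights + delta

For reference, the classic (Polyak) momentum formulation keeps exactly one velocity per weight, which corresponds to Option 2.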


Solution

  • Tested

    I've done some testing of my hypothesis. The two approaches appear to perform almost the same, but there is a small, consistent improvement when using the first alternative.

    Memory complexity of the momentum data structure (a rough size comparison follows the list):

    • Approach 1: O( instances * weights )
    • Approach 2: O( weights )
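
    To put those two complexities in perspective, here is a back-of-the-envelope size comparison. The instance and weight counts are invented purely for illustration:

    # Hypothetical sizes, purely for illustration.
    n_instances = 10_000
    n_weights = 1_000_000
    bytes_per_float = 8

    approach1 = n_instances * n_weights * bytes_per_float  # 8e10 bytes, ~80 GB
    approach2 = n_weights * bytes_per_float                # 8e6 bytes,  ~8 MB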

    Result:

    Each round starts from a predefined set of initial weights, and both versions were trained from the same sets.

    $ pypy backprop.py # First approach
    Round: 1/10     Required epochs: 40995
    Round: 2/10     Required epochs: 40997
    Round: 3/10     Required epochs: 40996
    Round: 4/10     Required epochs: 40997
    Round: 5/10     Required epochs: 40997
    Round: 6/10     Required epochs: 40997
    Round: 7/10     Required epochs: 40999
    Round: 8/10     Required epochs: 40996
    Round: 9/10     Required epochs: 40996
    Round: 10/10    Required epochs: 40997
    
    $ pypy backprop.py # Second approach
    Round: 1/10     Required epochs: 41070
    Round: 2/10     Required epochs: 41072
    Round: 3/10     Required epochs: 41069
    Round: 4/10     Required epochs: 41069
    Round: 5/10     Required epochs: 41070
    Round: 6/10     Required epochs: 41071
    Round: 7/10     Required epochs: 41072
    Round: 8/10     Required epochs: 41069
    Round: 9/10     Required epochs: 41070
    Round: 10/10    Required epochs: 41071
    

    As the tests show, the second approach (the one with lower memory complexity) consistently requires a few more epochs of training, roughly 70 out of about 41,000 (~0.2%), before reaching the required precision.

    Conclusion

    The minor reduction in training epochs is probably not worth the increased memory complexity of the first approach.