Tags: neural-network, lstm, backpropagation

Why is scaling data so important for a neural network (LSTM)?


I am writing my master's thesis on how to apply LSTM neural networks to time series. In my experiments, I found that scaling the data can have a great impact on the result. For example, when I use a tanh activation function and the value range is between -1 and 1, the model seems to converge faster, and the validation error also does not jump dramatically after each epoch.

Does anyone know of a mathematical explanation for this? Or are there any papers that already explain this situation?
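For reference, the scaling I describe looks roughly like this (a minimal sketch assuming scikit-learn's MinMaxScaler; the series and the train/test split sizes are placeholders):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Placeholder univariate series, shaped (n_samples, 1) as scikit-learn expects
series = np.random.rand(1000, 1) * 500.0

# Fit the scaler on the training split only, to avoid leaking test statistics
train, test = series[:800], series[800:]
scaler = MinMaxScaler(feature_range=(-1, 1))  # matches tanh's output range
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)

# After prediction, map the model's tanh-range outputs back to original units:
# preds_original = scaler.inverse_transform(preds_scaled)
```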


Solution

  • Your question reminds me of a picture used in our class, but you can find a similar one here at 3:02.

    [Image: gradient-descent paths on loss-surface contours, before scaling (left, elongated contours, zigzag path) and after scaling (right, round contours, direct path)]

    In the picture above you can see that the path on the left is much longer than the one on the right; scaling turns the left picture into the right one. The reason is that unscaled inputs give the loss surface elongated, elliptical contours, so gradient descent zigzags; scaling makes the contours rounder, so the descent takes a far more direct path and remains stable at a larger learning rate.
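    To make this concrete, here is a small sketch (not from the class material above; the diagonal quadratic f(w) = 0.5 * sum(h_i * w_i^2) is a stand-in for a real loss surface) counting gradient-descent steps when the curvatures differ wildly, as with unscaled inputs, versus when they match, as after scaling:

    ```python
    import numpy as np

    def steps_to_converge(hessian_diag, tol=1e-6, max_steps=100000):
        """Gradient descent on f(w) = 0.5 * sum(h_i * w_i**2), whose contours
        are ellipses with axis ratios set by the curvatures h_i."""
        h = np.asarray(hessian_diag, dtype=float)
        lr = 1.0 / h.max()  # step size is limited by the steepest direction
        w = np.ones_like(h)
        for step in range(1, max_steps + 1):
            grad = h * w
            if np.linalg.norm(grad) < tol:
                return step
            w = w - lr * grad
        return max_steps

    # Unscaled inputs -> very different curvatures -> elongated contours (left)
    print(steps_to_converge([1.0, 100.0]))  # on the order of a thousand steps
    # Scaled inputs -> similar curvatures -> round contours (right)
    print(steps_to_converge([1.0, 1.0]))    # converges almost immediately
    ```

    The slow case is slow because the steepest direction forces a tiny learning rate, which then crawls along the shallow direction; equalizing the input scales removes that mismatch.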