Tags: tensorflow, tensorflow2.0, loss-function

How do I handle a custom loss function with (1/(1-exp(-x))-1/x) in it?


I am working on a deep learning model with a ragged tensor where the custom loss function is related to:

f(x)+f(x+50)

and f(x)=1/(1-exp(-x))-1/x when x!=0, f(x)=0.5 when x=0.

f(x) ranges between 0 and 1, and it is continuous and differentiable for all x. Below is the graph of f(x):

[graph of f(x)]

I first tried to implement this function as tf.where(tf.abs(x)<0.1, 0.5+x/12, 1/(1-tf.exp(-x))-1/x), since the gradient of f at x=0 is 1/12. But the problem was that the loss became nan after some epochs of fitting, like below:

Epoch: 0    train_loss: 0.072233    val_loss: 0.052703
Epoch: 10   train_loss: 0.008087    val_loss: 0.041443
Epoch: 20   train_loss: 0.005942    val_loss: 0.029767
Epoch: 30   train_loss: 0.005200    val_loss: 0.026407
Epoch: 40   train_loss: nan val_loss: nan
Epoch: 50   train_loss: nan val_loss: nan
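
For reference, a simplified sketch of how I am using it (the real model works on a ragged tensor; taking x = y_pred - y_true here is just for illustration):

import tensorflow as tf

def f_naive(x):
    # f as described above: linear approximation 0.5 + x/12 near zero
    # (the slope of f at 0 is 1/12), algebraic form everywhere else.
    return tf.where(tf.abs(x) < 0.1,
                    0.5 + x / 12.0,
                    1.0 / (1.0 - tf.exp(-x)) - 1.0 / x)

def custom_loss(y_true, y_pred):
    # Simplified: the loss is built from f(x) + f(x + 50).
    x = y_pred - y_true
    return tf.reduce_mean(f_naive(x) + f_naive(x + 50.0))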

I tried the following to solve this problem, but all of them failed.

  1. I made the code calculate f(x) separately when x<-10 and x>10 as well:
tf.where(tf.abs(x)<0.1, 0.5+x/12,
         tf.where(x<-10., -1/x,
                  tf.where(x>10., 1-1/x, 1/(1-tf.exp(-x))-1/x)))

but it gave the same result.

  2. Lowering the learning rate and changing the optimizer gave the same result; the loss started becoming nan at a similar training loss to the one above.

  3. I set the default float to float64 with tf.keras.backend.set_floatx('float64'). This let the model train further, but it started to give the same result at a lower training loss:

Epoch: 0    train_loss: 0.043096    val_loss: 0.050407
Epoch: 10   train_loss: 0.006179    val_loss: 0.034259
Epoch: 20   train_loss: 0.005841    val_loss: 0.034110
...
Epoch: 210  train_loss: 0.003594    val_loss: 0.026524
Epoch: 220  train_loss: nan val_loss: nan
Epoch: 230  train_loss: nan val_loss: nan

  4. Replacing f(x) with the sigmoid function solved the problem, but I really want to use f(x) because it is meaningful to what I am doing.

I guess some inf/inf, 0/0, or inf-inf occurred while calculating the gradient, but I am not expert enough to get a more detailed clue. I would be really grateful if you know how to solve this, or what I need to look at to solve the problem.


Solution

  • This is more a problem of catastrophic numerical cancellation than anything else: you can't just compute something in IEEE 754 arithmetic using the algebraic form and expect it to work for very small or large numbers.

    Your definition of f(x) is:

    f(x)=1/(1-exp(-x))-1/x when x!=0, f(x)=0.5 when x=0.
    

    Many languages provide a function expm1 which computes exp(x)-1 to full machine precision (dating back to the hardware implementation in the x87 numeric coprocessor, and possibly before that). That may be enough to solve your immediate problem of division by zero for small values of x (<1e-7 for floats, <2e-16 for doubles).

    TensorFlow has such a function, tf.math.expm1, and its purpose of maintaining precision is explained in the documentation.
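
    A quick check of the difference for small x (float32):

    import tensorflow as tf

    x = tf.constant([1e-8, 1e-10, 1e-12], dtype=tf.float32)

    # 1 - exp(-x) collapses to 0 once exp(-x) rounds to exactly 1.0,
    # which later turns 1/(1 - exp(-x)) into a division by zero.
    naive = 1.0 - tf.exp(-x)
    # expm1 keeps the leading digits, so -expm1(-x) stays close to x.
    accurate = -tf.math.expm1(-x)

    print(naive.numpy())     # [0. 0. 0.]
    print(accurate.numpy())  # approximately [1e-08 1e-10 1e-12]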

    But you can probably do slightly better by multiplying through and computing f(x) over a common denominator:

    f(x) = (x - (1 - exp(-x))) / (x*(1 - exp(-x)))    x != 0
    

    evaluated as

    f(x) = -(x + expm1(-x))/(x*expm1(-x))
    

    It will still lose accuracy and return zero for some very small values of x, but it should no longer generate NaNs when it reaches its precision limit.
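
    As a quick sanity check of that claim (float32; the exact numbers will vary, but the pattern holds):

    import tensorflow as tf

    x = tf.constant([1e-6, 1e-9, 1e-12], dtype=tf.float32)
    em1 = tf.math.expm1(-x)

    naive  = 1.0 / (1.0 - tf.exp(-x)) - 1.0 / x   # original algebraic form
    stable = -(x + em1) / (x * em1)               # rearranged expm1 form

    print(naive.numpy())   # badly wrong at 1e-6, inf once 1 - exp(-x) hits 0
    print(stable.numpy())  # roughly 0.5 at 1e-6, then falls to 0, but stays finite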

    If you really need it to work continuously for any x no matter how small, the fixup is to replace the numerator with its first term that doesn't actually cancel, x^2/2, when the input x is tiny (expanding: x + expm1(-x) = x^2/2 - x^3/6 + ...):

     f(x) = -x^2/(2*x*expm1(-x))   for |x| << 1e-16
    

    evaluated as

     f(x) = -x/(2*expm1(-x))     x != 0
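
    Putting the pieces together, here is one way this could look in TensorFlow. This is only a sketch: the switch-over threshold and the "safe" substitutions that keep both tf.where branches finite are my choices, not part of the formulas above.

    import numpy as np
    import tensorflow as tf

    def f(x):
        # Switch over roughly where 1 - exp(-x) stops being representable:
        # machine epsilon of the dtype (~1e-7 float32, ~2e-16 float64).
        eps = np.finfo(x.dtype.as_numpy_dtype).eps
        tiny = tf.abs(x) < eps
        zero = tf.equal(x, 0.0)

        # Rearranged form -(x + expm1(-x)) / (x*expm1(-x)); tiny inputs are
        # replaced by 1.0 so this branch stays finite there (those positions
        # are overwritten below).
        x_f = tf.where(tiny, tf.ones_like(x), x)
        em1_f = tf.math.expm1(-x_f)
        full = -(x_f + em1_f) / (x_f * em1_f)

        # Leading-term form -x / (2*expm1(-x)) for tiny x; exact zeros are
        # replaced by 1.0 to avoid 0/0 and then mapped to f(0) = 0.5.
        x_s = tf.where(zero, tf.ones_like(x), x)
        small = -x_s / (2.0 * tf.math.expm1(-x_s))
        small = tf.where(zero, 0.5 * tf.ones_like(x), small)

        # Caveat: expm1(-x) itself overflows for very negative x (around
        # x < -88 in float32, -709 in float64); if that range can occur,
        # handle it separately, e.g. with the piecewise split from the
        # question.
        return tf.where(tiny, small, full)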
    

    This sort of problem is common in numerical calculations where two nearly equal quantities get subtracted. There are various classical rearrangement tricks to circumvent catastrophic cancellation.