neural-network, loss, generative-adversarial-network, objective-function

What is the ideal value of the loss function for a GAN?


The GAN originally proposed by I. J. Goodfellow uses the following loss functions:

D_loss = - log[D(X)] - log[1 - D(G(Z))]

G_loss = - log[D(G(Z))]

So, the discriminator tries to minimize D_loss and the generator tries to minimize G_loss, where X and Z are the training input and the noise input respectively. D(.) and G(.) are the maps computed by the discriminator and generator neural networks respectively.
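
For concreteness, the two losses can be written as small Python functions (a minimal sketch; the names d_loss, g_loss, d_real and d_fake are purely illustrative, with d_real standing for D(X) and d_fake for D(G(Z))):

import math

def d_loss(d_real, d_fake):
    # Discriminator loss: -log D(X) - log(1 - D(G(Z)))
    return -math.log(d_real) - math.log(1.0 - d_fake)

def g_loss(d_fake):
    # Generator loss: -log D(G(Z))
    return -math.log(d_fake)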

As the original paper says, when the GAN is trained for enough steps it reaches a point where neither the generator nor the discriminator can improve, and D(Y) is 0.5 everywhere, where Y is any input to the discriminator. In this case, when the GAN has been trained to this point,

D_loss = - log(0.5) - log(1 - 0.5) = 0.693 + 0.693 = 1.386

G_loss = - log(0.5) = 0.693
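
As a quick sanity check of this arithmetic in plain Python (using natural logarithms, as above):

import math

# Discriminator output at the claimed equilibrium: 0.5 for real and fake inputs
d_real = d_fake = 0.5

print(-math.log(d_real) - math.log(1.0 - d_fake))  # 1.3862... = 2 * ln(2)
print(-math.log(d_fake))                           # 0.6931... = ln(2)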

So, why can we not use the D_loss and G_loss values as a metric for evaluating a GAN?

If the two losses deviate from these ideal values, then surely the GAN needs more training or a better-designed architecture. Theorem 1 in the original paper shows that these are the optimal values of D_loss and G_loss, so why can't they be used as an evaluation metric?


Solution

  • I think this question belongs on Cross-Validated, but anyway:

    I struggled with this for quite some time, and wondered why the question wasn't asked. What follows is where I'm currently at. Not sure if it'll help you, but it is some of my intuition.

    G and D losses are good indicators of failure cases...
    Of course, if the G loss is a really big number and the D loss is zero, then nothing good is happening in your GAN.

    ... but not good indicators of performance.
    I've trained a bunch of GANs and have almost never seen the "0.5/0.5 case" except on very simple examples. Most of the time, you're happy when the outputs D(x) and D(G(z)) (and therefore the losses) are more or less stable. So don't take these values as a gold standard.
    A key intuition I was missing was the simultaneity of G and D training. At the beginning, sure, G is really bad at generating stuff, but D is also really bad at discriminating. As time passes, G gets better, but D also gets better. So after many epochs, we can assume that D is really good at discriminating between fake and real. Therefore, even if G "fools" D only 5% of the time (i.e. D(x)=0.95 and D(G(z))=0.05), it can mean that G is actually pretty good, because it sometimes fools a really good discriminator (see the quick calculation after this answer).
    As you know, for the moment there are no reliable metrics of image quality besides looking at the images, but I've found that for my use cases, G could produce great images while fooling D only a few percent of the time.
    A corollary to this simultaneous training is what happens at the beginning of training: you can have D(X)=0.5 and D(G(Z))=0.5, and still have G produce almost random images: it's just that D is not yet good enough to tell them apart from real images.

    I see it's been a couple months since you've posted this question. If you've gained intuition in the meantime, I'd be happy to hear it !
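
To put rough numbers on the answer's 0.95/0.05 example (plain Python with natural logarithms; the d_loss and g_loss helpers below are purely illustrative, not from the paper):

import math

def d_loss(d_real, d_fake):
    # Discriminator loss: -log D(X) - log(1 - D(G(Z)))
    return -math.log(d_real) - math.log(1.0 - d_fake)

def g_loss(d_fake):
    # Generator loss: -log D(G(Z))
    return -math.log(d_fake)

# Late in training: a strong D that G fools only ~5% of the time
print(d_loss(0.95, 0.05))   # ~0.10, far below the "ideal" 1.386
print(g_loss(0.05))         # ~3.00, far above the "ideal" 0.693

# Failure case from the answer: D loss near zero and G loss huge
print(d_loss(0.999, 0.001)) # ~0.002
print(g_loss(0.001))        # ~6.9

# Early in training: a weak D facing a weak G
print(d_loss(0.5, 0.5))     # ~1.386, looks "ideal" even though G is still bad
print(g_loss(0.5))          # ~0.693

So the losses can sit far from the "ideal" values late in training when G is actually good, and right at the "ideal" values early on when G is still bad, which is the answer's point about why the loss values alone are unreliable as a quality metric.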