Generative and discriminative models are said to learn the joint distribution P(x,y) and the conditional distribution P(y|x), respectively. But at a fundamental level I fail to convince myself of what it means to say that a probability distribution has been learnt.
It means that your model is either functioning as an estimator for the distribution from which your training samples were drawn, or is utilizing that estimator to perform some other prediction.
To give a trivial example, consider a set of observations {x[1], ..., x[N]}, and suppose you want to fit a single Gaussian to it. The maximum-likelihood parameters of that Gaussian are simply the sample mean and variance of the data:
Mean = 1/N * (x[1] + ... + x[N])
Variance = 1/N * ((x[1] - Mean)^2 + ... + (x[N] - Mean)^2)

(Dividing by N gives the maximum-likelihood estimate; the more familiar 1/(N-1) factor gives the unbiased estimate of the variance instead.)
Now you have a model capable of generating new samples from (an estimate of) the distribution your training sample was drawn from.
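For concreteness, here is a minimal sketch of that in Python/NumPy. The data values are made up purely for illustration; the point is that the two fitted numbers fully define the model, which you can then sample from:

```python
import numpy as np

# Toy 1-D observations standing in for {x[1], ..., x[N]} (illustrative values only)
x = np.array([1.2, 0.7, 2.3, 1.8, 0.9, 1.5])

# Maximum-likelihood parameters of a single Gaussian
mu = x.mean()                      # (1/N) * sum(x_i)
sigma2 = ((x - mu) ** 2).mean()    # (1/N) * sum((x_i - mu)^2)

# "Generative" use of the fitted model: draw new samples from N(mu, sigma2)
rng = np.random.default_rng(0)
new_samples = rng.normal(loc=mu, scale=np.sqrt(sigma2), size=5)
print(mu, sigma2, new_samples)
```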
Going a little more sophisticated, you could consider something like a Gaussian mixture model. This similarly infers the best-fitting parameters of a model given your data, except that this time the model is a weighted combination of several Gaussians. As a result, given some test data, you can probabilistically assign a class to each test point, based on the relative contribution of each Gaussian component to the probability density at that point (see the sketch below). This, of course, rests on the fundamental assumption of machine learning: that your training and test data are drawn from the same distribution (something you ought to check).
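As a sketch of what that looks like in practice, here is one way to do it with scikit-learn's GaussianMixture (my choice of library, not something specified above; the data are again made up). The mixture is fitted by maximum likelihood, and predict_proba returns, for each test point, the relative contribution of each component to the density there:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D training data drawn from two clusters (made up for illustration)
rng = np.random.default_rng(0)
X_train = np.concatenate([rng.normal(-2.0, 0.5, 100),
                          rng.normal(3.0, 1.0, 100)]).reshape(-1, 1)

# Fit a 2-component Gaussian mixture by maximum likelihood (EM under the hood)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_train)

# Posterior responsibilities for new points: each row sums to 1 across components
X_test = np.array([[-1.5], [0.5], [2.8]])
print(gmm.predict_proba(X_test))  # probabilistic class assignment
print(gmm.predict(X_test))        # hard assignment to the most likely component
```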