Suppose I have a training set of (x, y) pairs, where x is the input example and y is the corresponding target, taking a value in {1, ..., k} (k is the number of classes).
When calculating the likelihood of the training set, should it be calculated for the whole training set (all of the examples), that is:
L = P(y | x) = p(y1 | x1) * p(y2 | x2) * ...
Or is the likelihood computed for a specific training example (x, y)?
I'm asking because I saw these lecture notes (page 2), where the author seems to calculate L_i, i.e. the likelihood for each training example, separately.
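To make the two options concrete, here is a tiny numeric sketch (the per-example probabilities are made up, standing in for whatever the model would output):

```python
import math

# Hypothetical per-example probabilities p(y_i | x_i) for three examples.
per_example_probs = [0.9, 0.8, 0.95]

# Likelihood of the whole training set: the product over all examples.
L = math.prod(per_example_probs)

# Per-example log-likelihoods (the L_i view); their sum equals log L.
log_L = sum(math.log(p) for p in per_example_probs)

print(abs(math.log(L) - log_L) < 1e-12)  # True: the two views agree
```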
The likelihood function describes the probability of generating a set of training data given some parameters, and it can be used to find the parameters that generate the training data with maximum probability. You can create the likelihood function for a subset of the training data, but that wouldn't represent the likelihood of the whole data set. What you can do, however (and what is apparently done silently in the lecture notes), is assume that your data are independent and identically distributed (iid). The independence assumption lets you split the joint probability into a product of per-example factors, i.e. p(x|theta) = p(x1|theta) * p(x2|theta) * ..., and the identical-distribution assumption lets you use the same function with the same parameters theta for each factor, e.g. a normal distribution. Taking the logarithm then turns the product into a sum: log p(x|theta) = log p(x1|theta) + log p(x2|theta) + ... . That function can be maximized by setting its derivative to zero; the resulting maximizer is the theta that generates your x with maximum probability, i.e. your maximum likelihood estimate.
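As a sketch of these steps, here is the iid log-likelihood for a small hypothetical normal sample with known sigma (the data and the helper names `normal_log_pdf` and `log_likelihood` are illustrative, not from the notes); for this model, setting the derivative with respect to mu to zero yields the sample mean as the maximum likelihood estimate:

```python
import math

# Hypothetical iid sample, assumed drawn from N(mu, 1).
data = [2.1, 1.9, 2.4, 2.0, 1.6]

def normal_log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def log_likelihood(data, mu, sigma=1.0):
    # Under the iid assumption, the joint log-likelihood is the
    # sum of per-example log-densities (log turns the product into a sum).
    return sum(normal_log_pdf(x, mu, sigma) for x in data)

# For a normal with known sigma, d/dmu log_likelihood = 0 gives the
# sample mean as the maximum likelihood estimate of mu.
mle_mu = sum(data) / len(data)

# The log-likelihood at the MLE dominates the value at any other mu.
print(log_likelihood(data, mle_mu) >= log_likelihood(data, mle_mu + 0.5))  # True
```

Because the log-likelihood is concave in mu here, any other choice of mu gives a strictly smaller value, which is what "maximum likelihood" means in practice.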