I know that if X denotes a text, p(X) denotes the probability of that text under a language model, and most often we use maximum likelihood estimation to estimate the language model. But in many cases I find a parameter $\theta$ used to represent a language model, and I don't understand the meaning of this $\theta$. For example, for a document d in a collection, what purpose does $\theta$ serve in $p(d|\theta)$?
Does $\theta$ represent a maximum likelihood estimator or a language model?
Can someone please explain the difference between a language model and $\theta$ in depth?
Thanks in advance!
$\theta$ is conventional machine learning notation denoting (strictly speaking) a set of parameter values, more commonly known as the parameter vector.
The notation $P(Y|X;\theta)$ is read as: the y-values (e.g. MNIST digit labels) are predicted from the x-values (e.g. input images of MNIST digits) with the help of a model trained on annotated (X, Y) pairs. This model is parameterized by $\theta$. Obviously, if the training algorithm changes, so will the parameter vector $\theta$.
The structure of a parameter vector is usually interpreted from the model it is associated with; for multi-layered neural networks, for example, $\theta$ consists of real-valued weight vectors, initially assigned at random and then updated by gradient descent at each iteration.
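As a minimal sketch of that last point (using toy linear regression rather than a full neural network, purely for illustration), $\theta$ is literally just a vector of numbers that gradient descent nudges toward values that fit the training data:

```python
import numpy as np

# theta is a real-valued parameter vector: randomly initialised,
# then updated by gradient descent at each iteration.
rng = np.random.default_rng(0)

# Toy training data generated from y = 2*x + 1 plus a little noise
X = rng.uniform(-1, 1, size=(100, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.normal(size=100)

# Append a bias column so theta = [slope, intercept]
Xb = np.hstack([X, np.ones((100, 1))])
theta = rng.normal(size=2)                   # random initialisation

lr = 0.1
for _ in range(500):
    pred = Xb @ theta                        # model output, parameterised by theta
    grad = 2 * Xb.T @ (pred - y) / len(y)    # gradient of mean squared error
    theta -= lr * grad                       # gradient-descent update

# theta should now be close to [2, 1], the parameters of the generating process
```

A different training set (or a different training algorithm) would leave you with a different $\theta$, even though the model structure is unchanged.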
For word-generation-based language models, the parameters are the probabilities of a word $v$ following a word $u$, meaning that each element is an entry in a hash table of the form (u, v) --> count(u, v)/count(u).
These probabilities are learned from a training collection $C$ of documents, as a result of which they essentially become a function of the training set. For a different collection, these probability values will be different.
Hence, the usual convention is to write $P(w_n|w_{n-1};\theta)$, which basically indicates that these word-succession probabilities are parameterized by $\theta$.
A similar argument applies to document-level language models in information retrieval, where the parameters essentially indicate the probabilities of sampling terms from documents.
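A minimal sketch of that retrieval setting (toy documents and query, unsmoothed unigram models, for illustration only): each document d gets its own $\theta_d$, the MLE term-sampling probabilities, and $p(q|\theta_d)$ scores the query.

```python
from collections import Counter

# Two hypothetical documents
docs = {
    "d1": "information retrieval with language models",
    "d2": "neural models for image classification",
}

def estimate_theta(doc_text):
    # theta_d: P(w | theta_d) = count(w in d) / |d|, the MLE term-sampling probabilities
    words = doc_text.split()
    return {w: n / len(words) for w, n in Counter(words).items()}

def query_likelihood(query, theta):
    # p(q | theta_d): product of term-sampling probabilities
    # (no smoothing here, so any unseen query term gives probability 0)
    p = 1.0
    for w in query.split():
        p *= theta.get(w, 0.0)
    return p

thetas = {d: estimate_theta(text) for d, text in docs.items()}
scores = {d: query_likelihood("language models", th) for d, th in thetas.items()}
# d1 contains both query terms, so it outscores d2 for this query
```

In practice these document models are smoothed (e.g. interpolated with a collection-wide model) so that a single unseen term does not zero out the whole score, but the role of $\theta_d$ is the same.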