Tags: nlp, stanford-nlp, information-retrieval, n-gram, language-model

What does 'theta' mean in a language model?


I know that if X denotes a text, p(X) denotes the language model of the text, and that most often we use maximum likelihood estimation to estimate the language model. But in many cases I find a parameter $\theta$ used to represent a language model, and I don't understand what this $\theta$ means. For example, for a document d in a collection, what purpose does $\theta$ serve in $p(d|\theta)$?

Does $\theta$ represent a maximum likelihood estimator or a language model?

Can someone please explain the difference between a language model and $\theta$ in depth?

Thanks in advance!


Solution

  • \theta is conventional machine-learning notation for (strictly speaking) a set of parameter values, more commonly known as the parameter vector.

    The notation P(Y|X;\theta) reads as: the y-values (e.g. MNIST digit labels) are predicted from the x-values (e.g. input images of MNIST digits) with the help of a model trained on annotated (X,Y) pairs. This model is parameterized by \theta. Obviously, if the training algorithm changes, so will the parameter vector \theta.

    How a parameter vector is structured is usually interpreted in terms of the model it is associated with; for multi-layered neural networks, for example, it is a collection of real-valued weights, initially assigned at random and then updated by gradient descent at each iteration.

    For word-generation (n-gram) language models, the parameters are the probabilities of a word v following a word u, meaning that each element is an entry in a hash-table of the form (u, v) --> count(u,v)/count(u); a toy sketch is given after this answer. These probabilities are learned from a training collection C of documents, as a result of which they essentially become a function of the training set. For a different collection, the probability values will be different.

    Hence the usual convention is to write P(w_n | w_{n-1}; \theta), which indicates that these word-succession probabilities are parameterized by \theta.

    A similar argument applies to document-level language models in information retrieval, where the parameters essentially indicate the probabilities of sampling terms from documents (see the second sketch below).
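
To make the notation concrete, here is a minimal Python sketch of a bigram language model in which \theta is literally the hash-table of maximum-likelihood estimates count(u,v)/count(u) described above. The function name `train_bigram_lm` and the toy corpus are illustrative assumptions, not part of any particular library.

```python
from collections import defaultdict

def train_bigram_lm(corpus):
    """Estimate theta for a bigram LM by maximum likelihood:
    theta[(u, v)] = count(u, v) / count(u).
    `corpus` is a list of tokenised sentences (lists of words)."""
    history_counts = defaultdict(int)   # count(u) as a history word
    bigram_counts = defaultdict(int)    # count(u, v)
    for sentence in corpus:
        for u, v in zip(sentence, sentence[1:]):
            history_counts[u] += 1
            bigram_counts[(u, v)] += 1
    # theta is just a hash-table mapping (u, v) to a conditional probability
    return {(u, v): c / history_counts[u] for (u, v), c in bigram_counts.items()}

# Toy training collection: retrain on a different collection and theta changes.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
theta = train_bigram_lm(corpus)
print(theta[("the", "cat")])  # 1.0 -- "cat" always follows "the" in this corpus
print(theta[("cat", "sat")])  # 0.5
```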
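And for the p(d|\theta) form from the original question: in the query-likelihood view used in information retrieval, \theta_d is a unigram language model estimated from a document d, and a query (or any other text) is scored by its probability under \theta_d. Below is a hedged sketch under those assumptions, with made-up function names and no smoothing (real systems smooth precisely because an unseen term would otherwise zero out the product).

```python
from collections import Counter

def document_lm(document_tokens):
    """theta_d for a unigram document language model:
    theta_d[w] = count(w in d) / |d| (maximum likelihood, no smoothing)."""
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query_tokens, theta_d):
    """P(query | theta_d) under the unigram model: product of term probabilities."""
    p = 1.0
    for w in query_tokens:
        p *= theta_d.get(w, 0.0)  # unseen term -> 0 without smoothing
    return p

doc = "information retrieval is retrieval of information".split()
theta_d = document_lm(doc)
print(query_likelihood(["information", "retrieval"], theta_d))  # (2/6) * (2/6) ≈ 0.111
```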