Tags: nlp, matching, n-gram, scoring, lexicon

Applied NLP: how to score a document against a lexicon of multi-word terms?


This is probably a fairly basic NLP question, but I have the following task at hand: I have a collection of text documents that I need to score against an (English) lexicon of terms that can be 1, 2, 3, ... up to N words long. N is bounded by some "reasonable" number, but the distribution of terms in the lexicon across the lengths n = 1, ..., N may be fairly uniform. This lexicon can, for example, contain a list of devices of a certain type, and I want to see whether a given document is likely about any of these devices. So I would want to score a document higher if it has one or more occurrences of any of the lexicon entries.

What is a standard NLP technique to do the scoring while accounting for various forms of the words that may appear in the lexicon? What sort of preprocessing would be required for both the input documents and the lexicon to be able to perform the scoring? What sort of open-source tools exist for both the preprocessing and the scoring?


Solution

  • I studied LSI and topic modeling almost a year ago, so what I say should be taken as merely a pointer to give you a general idea of where to look.

    There are many different ways to do this with varying degrees of success. This is a hard problem in the realm of information retrieval. You can search for topic modeling to learn about the different options and the state of the art.

    You definitely need some preprocessing and normalization if the words could appear in different forms. How about NLTK and one of its stemmers:

    >>> from nltk.stem.lancaster import LancasterStemmer
    >>> st = LancasterStemmer()
    >>> st.stem('applied')
    'apply'
    >>> st.stem('applies')
    'apply'
    
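
    Extending that to the lexicon itself, here is a minimal sketch (the lexicon entries and the document below are invented for illustration) that stems both the lexicon terms and the document, then counts contiguous n-gram matches as a crude baseline score:

    import re
    from nltk.stem.lancaster import LancasterStemmer

    st = LancasterStemmer()

    def normalize(text):
        # crude tokenization plus stemming; real code might use nltk.word_tokenize
        return [st.stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]

    # hypothetical lexicon of multi-word terms and a document to score
    lexicon = ["hard disk", "solid state drive", "printer"]
    document = "The hard disks were swapped out for solid state drives."

    doc_tokens = normalize(document)
    score = 0
    for term in lexicon:
        term_tokens = normalize(term)
        n = len(term_tokens)
        # count contiguous n-gram matches of the stemmed term in the stemmed document
        hits = sum(1 for i in range(len(doc_tokens) - n + 1)
                   if doc_tokens[i:i + n] == term_tokens)
        score += hits
    print(score)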

    You have a lexicon of terms that I am going to call terms and also a bunch of documents. I am going to explore a very basic technique to rank documents with regard to the terms. There are far more sophisticated methods you can read about, but I think this might be enough if you are not looking for something too rigorous.

    This is called a vector space IR model. Terms and documents are both converted to vectors in a k-dimensional space. For that we have to construct a term-by-document matrix. This is a sample matrix in which the numbers represent frequencies of the terms in documents:

    [Figure: a sample term-by-document matrix with 3 terms and 4 documents; each cell holds the raw frequency of a term in a document.]

    So far we have a 3x4 matrix, in which each document can be expressed as a 3-dimensional vector (one column per document). But as the number of terms increases, these vectors become too large and increasingly sparse. Also, there are many words, such as "I" or "and", that occur in most documents without adding much semantic content, so you might want to disregard these kinds of words. To deal with the size and sparseness, you can use a mathematical technique called SVD (singular value decomposition) that scales the matrix down while preserving most of the information it contains.
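
    As a rough sketch of those two steps (using scikit-learn, which is just one library that implements them; the toy documents are made up), you can build the count matrix with common stop words dropped and then shrink it with a truncated SVD:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    # toy documents, for illustration only
    docs = ["the disk drive failed again",
            "please replace the hard disk",
            "the printer is out of toner",
            "solid state drives are very fast"]

    # raw term counts; stop_words="english" drops words like "the" and "is"
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)   # here documents are rows and terms are columns
    print(X.shape)

    # SVD compresses the large, sparse matrix to k dimensions
    # while preserving most of the information it contains
    svd = TruncatedSVD(n_components=2)
    X_reduced = svd.fit_transform(X)
    print(X_reduced.shape)               # (4 documents, 2 dimensions)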

    Also, the numbers we used in the chart above were raw counts. Another technique would be to use Boolean values: 1 for the presence and 0 for the absence of a term in a document. But these assume that all words carry equal semantic weight. In reality, rarer words carry more weight than common ones. So a good way to adjust the initial matrix is to use a weighting scheme like tf-idf to assign a relative weight to each term. Once we have applied SVD to our weighted term-by-document matrix, we can construct the k-dimensional query vectors, which are simply arrays of term weights. If our query contains multiple instances of the same term, the product of the frequency and the term weight is used.
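
    As a quick illustration of that weighting (again a sketch with scikit-learn's TfidfVectorizer; the toy documents are invented), a term that occurs in only one document ends up with a larger weight than a term that occurs in all of them:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["disk drive failure",
            "disk capacity",
            "disk cache",
            "disk controller"]

    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)

    # "disk" occurs in every document while "drive" occurs in only one,
    # so "drive" gets the larger tf-idf weight in the first document
    weights = dict(zip(tfidf.get_feature_names_out(), X.toarray()[0]))
    print(weights["disk"], weights["drive"])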

    What we need to do from there is fairly straightforward: we compare the query vector with the document vectors by computing their cosine similarity, and that similarity is the basis for ranking the documents relative to the query.
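
    Putting the pieces together, here is a minimal end-to-end sketch (scikit-learn again; the lexicon and documents are invented for illustration): weight the documents with tf-idf, optionally compress with SVD, project the lexicon into the same space as a query, and rank the documents by cosine similarity:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    # toy documents and a lexicon of device terms, made up for illustration
    docs = ["the hard disk drive was replaced yesterday",
            "install more memory to speed up the machine",
            "solid state drives have no moving parts",
            "the meeting was moved to next week"]
    lexicon = ["hard disk drive", "solid state drive"]

    # tf-idf weighted matrix; ngram_range lets multi-word lexicon terms count as units
    tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
    X = tfidf.fit_transform(docs)

    # optional LSA step: compress the sparse matrix to k dimensions
    svd = TruncatedSVD(n_components=3)
    X_k = svd.fit_transform(X)

    # treat the whole lexicon as a single query and map it into the same space
    query_vec = svd.transform(tfidf.transform([" ".join(lexicon)]))

    # rank the documents by cosine similarity to the query
    scores = cosine_similarity(query_vec, X_k).ravel()
    for score, doc in sorted(zip(scores, docs), reverse=True):
        print(round(score, 3), doc)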