nlp · information-retrieval

Natural language query preprocessing


I am trying to implement a natural language query preprocessing module which would, given a query formulated in natural language, extract the keywords from that query and submit them to an Information Retrieval (IR) system.

At first, I thought about using some training set to compute tf-idf values of terms and using these values to estimate the importance of single words. But on second thought, this does not make sense in this scenario - I only have a training collection and don't have access to index the IR data. Would it be reasonable to use only the idf value for such estimation? Or maybe another weighting approach?

Could you suggest how to tackle this problem? The articles about NLP that I usually read address both training and test data sets. But what if I only have the query and training data?


Solution

  • tf-idf (it's not capitalized, fyi) is a good choice of weight. Your intuition is correct here. However, you don't compute tf-idf on your training set alone. Why? You need to understand what tf and idf actually mean:

    tf (term frequency) is a statistic that measures how often a term appears in the document being evaluated. The simplest variant is a boolean value, i.e. 1 if the term appears in the document and 0 otherwise.

    idf (inverse document frequency), on the other hand, measures how rare a term is across the collection: the less likely a term is to appear in a random document, the higher its idf. It's most often calculated as log(N / number of documents containing the term), where N is the total number of documents.

    Now, tf is calculated for each of the documents your IR system will be indexing over (if you don't have access to do this, then you have a much bigger and insurmountable problem, since an IR system without a source of truth is an oxymoron). Ideally, idf is calculated over your entire data set (i.e. all the documents you are indexing), but if this is prohibitively expensive, then you can randomly sample your population to create a smaller data set, or use a training set such as the Brown corpus.
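    To make this concrete, here is a minimal sketch of the idf-only approach for your situation, where you have a training corpus but no access to the IR index. The `docs` list is a stand-in for your training collection, and the smoothing in `idf` is one common variant (many exist):

    ```python
    import math
    import re

    # Stand-in training corpus; replace with your own collection.
    docs = [
        "the cat sat on the mat",
        "information retrieval systems index documents",
        "the quick brown fox",
        "natural language processing of queries",
    ]

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    # Document frequency: in how many documents each term appears.
    df = {}
    for doc in docs:
        for term in set(tokenize(doc)):
            df[term] = df.get(term, 0) + 1

    N = len(docs)

    def idf(term):
        # Add-one smoothing so terms unseen in the training set get the
        # highest weight instead of a division-by-zero error.
        return math.log((N + 1) / (df.get(term, 0) + 1))

    def extract_keywords(query, k=3):
        """Return the k query terms with the highest idf
        (rarest in the corpus = most informative)."""
        terms = set(tokenize(query))
        return sorted(terms, key=idf, reverse=True)[:k]

    print(extract_keywords("the natural language of the fox"))
    ```

    Frequent function words like "the" end up with a low idf and drop out of the keyword list, which is exactly the pruning you want before submitting the query to the IR system.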