
Information Retrieval: How to calculate tf-idf for multiple search terms?


I have a corpus of the following 4 documents:

<1> This is the first document.
<2> And this is the second document.
<3> The third document is longer than the first and second one.
<4> This is the last document.

Using the search queue "first OR last", how am I supposed to calculate the tf-idf?

Currently I'm using this:

tf(x, D) = raw frequency of term x in document D / raw frequency of most occurring term in D

idf(x) = log(1 + total number of documents / number of documents containing x)

So for this queue I get:
<1> = (1 / 1) * log(1 + 4/3)
<3> = (1 / 2) * log(1 + 4/3)
<4> = (1 / 1) * log(1 + 4/3)

Is this correct? How do you do this properly? Do I calculate the value for each search term separately and then add them, or multiply them?


Solution

  • Assuming that you mean "search query" when you say "search queue", and that your query is built with the logical operator OR, you can increment the frequency whenever either of the terms is encountered, so that the whole query is counted like a single term. This is actually what you have done above.

    As you said in your post, another approach is to compute the tf-idf vector of each term separately and then sum those vectors. Multiplying, however, is not the option you are looking for: a document that lacks either term would get a product of zero, which behaves like AND rather than OR.

    Either way, you end up constructing a single abstract term out of the multiple query terms.
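
The two approaches above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (`score_or_as_one_term`, `score_sum_of_terms`) are made up for this example, tokenization is a simple lowercase word split, and `log` is taken to be the natural logarithm, since the question does not say which base it uses. The tf and idf formulas are the ones given in the question.

```python
import math
import re
from collections import Counter

# The four documents from the question, keyed by their number.
DOCS = {
    1: "This is the first document.",
    2: "And this is the second document.",
    3: "The third document is longer than the first and second one.",
    4: "This is the last document.",
}


def tokenize(text):
    """Assumed tokenizer: lowercase the text and keep alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


# Term counts per document, computed once.
COUNTS = {doc_id: Counter(tokenize(text)) for doc_id, text in DOCS.items()}


def tf(raw, counts):
    """Raw frequency divided by the frequency of the most frequent term in the document."""
    return raw / max(counts.values())


def idf(doc_freq, n_docs):
    """log(1 + N / df) as in the question (natural log is an assumption)."""
    return math.log(1 + n_docs / doc_freq)


def score_or_as_one_term(terms):
    """Approach 1: treat 'a OR b' as one abstract term.

    An occurrence of any query term counts as an occurrence of the
    abstract term, and a document contains it if it contains any of them.
    """
    raw = {d: sum(c[t] for t in terms) for d, c in COUNTS.items()}
    matching = {d: r for d, r in raw.items() if r > 0}
    if not matching:
        return {}
    weight = idf(len(matching), len(COUNTS))
    return {d: tf(r, COUNTS[d]) * weight for d, r in matching.items()}


def score_sum_of_terms(terms):
    """Approach 2: compute tf-idf per term, then add the values per document."""
    scores = {}
    for t in terms:
        doc_freq = sum(1 for c in COUNTS.values() if c[t] > 0)
        if doc_freq == 0:
            continue  # the term occurs nowhere, so it contributes nothing
        weight = idf(doc_freq, len(COUNTS))
        for d, c in COUNTS.items():
            if c[t] > 0:
                scores[d] = scores.get(d, 0.0) + tf(c[t], c) * weight
    return scores


if __name__ == "__main__":
    query = ["first", "last"]
    print("OR as one term:", score_or_as_one_term(query))
    print("Sum of terms:  ", score_sum_of_terms(query))
```

Under these assumptions the first function reproduces the values from the question for documents 1, 3 and 4. The second one ranks document 4 above document 1, because "last" occurs in fewer documents than "first" and therefore gets a higher idf when the terms are weighted separately.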