
watson retrieve-and-rank - manual ranking


I am trying to build a ranker for a demonstration. I did the "automatic training" and got OK results (they could be better). I am now trying to move to manual training, but I am confused about the meaning of the parameters in the Bluemix online documentation: https://www.ibm.com/watson/developercloud/doc/retrieve-rank/training_data.shtml#manual

Could someone please explain the following Bluemix sample data?

query_id, feature1, feature2, feature3,...,ground_truth
question_id_1, 0.0, 3.4, -900,...,0
question_id_1, 0.5, -70, 0,...,1
question_id_1, 0.0, -100, 20,...,3
...

What is query_id (what does it represent)? What are feature1 and feature2 (what do they represent)? What is question_id_1 (what does it represent)? And how are those scores calculated (the 0.0, 3.4, -900)?

I understood that the ground_truth value must go from 0 to 4 (0 meaning not relevant at all, 4 meaning a perfect match). Is that correct?

Kind regards, Xavier


Solution

  • The training data is meant to train a learning-to-rank (L2R) algorithm. The L2R approach is to first take a list of candidate answers (e.g. documents in a search result page) that were generated in response to a query (aka question) and represent each query-answer pair as a set of features. Each feature hopefully captures some representation of how well that particular candidate answer matches the query. Each line in the training data represents the feature values belonging to one of these query-answer pairs.

    Because the training data contains feature vectors from lots of different queries (and corresponding search results), the first column uses a query id to tie together different candidate answers that were generated in response to a single query.

    As you said, the last column simply captures whether a human annotator believed that the answer was actually relevant to the question. The 0-4 scale is not mandatory: 0 always represents irrelevant, but beyond that you can use whatever scale makes sense for your use case (people often just use a binary 0-1 scale when there is limited data, since this reduces complexity).

    The Python script made available on the documentation page that you referenced goes through the process of generating candidate answers and the corresponding feature vectors, given a file containing different queries. You may wish to step through the code in that script to get a better idea of how you might create your training data.
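To make the row layout described above concrete, here is a minimal Python sketch. It is not the script from the documentation page, and it does not call the service. It only shows how query-answer pairs could be written as `query_id, feature1, ..., ground_truth` rows and then grouped back by query id. The names `candidates`, `write_training_csv` and `group_by_query` are hypothetical, the `question_id_2` rows are made-up values, and fixing the number of features at three is an assumption (the sample in the question elides the remaining features). The exact file format the service accepts should be checked against its documentation.

```python
import csv
import io
from collections import defaultdict

# Hypothetical (query_id, feature_vector, ground_truth) triples.
# The question_id_1 rows reuse the numbers from the sample in the question;
# the question_id_2 rows are invented purely for illustration.
candidates = [
    ("question_id_1", [0.0, 3.4, -900.0], 0),
    ("question_id_1", [0.5, -70.0, 0.0], 1),
    ("question_id_1", [0.0, -100.0, 20.0], 3),
    ("question_id_2", [1.2, 0.3, 5.0], 0),
    ("question_id_2", [2.5, 0.9, 7.0], 1),
]


def write_training_csv(rows, num_features):
    """Serialise rows as: query_id, feature1..featureN, ground_truth."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    feature_names = [f"feature{i}" for i in range(1, num_features + 1)]
    writer.writerow(["query_id", *feature_names, "ground_truth"])
    for query_id, features, label in rows:
        if len(features) != num_features:
            raise ValueError(f"{query_id}: expected {num_features} features")
        if label < 0:
            # 0 means irrelevant; larger values mean more relevant.
            raise ValueError(f"{query_id}: ground_truth must be >= 0")
        writer.writerow([query_id, *features, label])
    return buf.getvalue()


def group_by_query(csv_text):
    """Return {query_id: [(feature_vector, ground_truth), ...]}."""
    groups = defaultdict(list)
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    for query_id, *features, label in reader:
        groups[query_id].append(([float(x) for x in features], int(label)))
    return dict(groups)


csv_text = write_training_csv(candidates, num_features=3)
groups = group_by_query(csv_text)

for query_id, answers in groups.items():
    # Each query id ties together all candidate answers for that query.
    relevant = sum(1 for _, label in answers if label > 0)
    print(f"{query_id}: {len(answers)} candidates, {relevant} judged relevant")
```

Grouping by the first column is what lets a learning-to-rank trainer compare candidate answers only against others generated for the same query. The feature values themselves would come from whatever scoring the retrieval step produces, which this sketch does not attempt to reproduce.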