From what I understand, you can use either the documents parameter or the file parameter to tell OpenAI which labels to search over. I get the expected results using the documents parameter, but unsatisfactory results using the file parameter. I would expect them to be the same.
When performing a search using the documents parameter..
import openai

response = dict(openai.Engine('davinci').search(
    query='sitcom',
    #file=file_id,
    max_rerank=5,
    documents=["white house", "school", "seinfeld"],
    return_metadata=False))
..I get expected results.. document 2 ("seinfeld") wins the search for "sitcom" with a score of 771.
{'object': 'list', 'data': [<OpenAIObject search_result at 0xb5e8ef48> JSON: {
"document": 0,
"object": "search_result",
"score": 147.98
}, <OpenAIObject search_result at 0xb5ebd148> JSON: {
"document": 1,
"object": "search_result",
"score": 211.021
}, <OpenAIObject search_result at 0xb5ebd030> JSON: {
"document": 2,
"object": "search_result",
"score": 771.348
}], 'model': 'davinci:2020-05-03'}
Now trying with the file parameter, I create a temp.jsonl file with contents..
{"text": "white house", "metadata": "metadata here"}
{"text": "school", "metadata": "metadata here"}
{"text": "seinfeld", "metadata": "metadata here"}
I then upload the file to the OpenAI server with..
res = openai.File.create(file=open('temp.jsonl'), purpose="search")
where..
file_id = res['id']
I wait until the file is processed by the server then..
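The waiting step could be done with a small polling loop rather than by hand. The sketch below is generic: wait_until_processed is a hypothetical helper, and the status strings "processed" and "error" are assumptions about what the server reports, so check them against the actual file object before relying on this.

```python
import time

def wait_until_processed(fetch_status, timeout=60.0, interval=2.0):
    """Poll fetch_status() until it returns "processed", or give up.

    Returns True once the status is "processed", and False if the status
    is "error" or the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == "processed":
            return True
        if status == "error" or time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Hypothetical usage with the client from the question (assumption, not run here):
# wait_until_processed(lambda: openai.File.retrieve(file_id)["status"])
```

Passing the status lookup in as a callable keeps the loop independent of the client library, so it can be tried out with a stub before pointing it at the real API.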
response = dict(openai.Engine('davinci').search(
    query='sitcom',
    file=file_id,
    max_rerank=5,
    #documents=["white house", "school", "seinfeld"],
    return_metadata=False))
But I get the following message when I perform search..
No similar documents were found in file with ID 'file-LzHkASUxbDjTAWBhHxHpIOf4'.Please upload more documents or adjust your query.
I only get results when my query exactly matches a label..
response = dict(openai.Engine('davinci').search(
    query='seinfeld',
    file=file_id,
    max_rerank=5,
    #documents=["white house", "school", "seinfeld"],
    return_metadata=False))
{'object': 'list', 'data': [<OpenAIObject search_result at 0xb5e74f48> JSON: {
"document": 0,
"object": "search_result",
"score": 668.846,
"text": "seinfeld"
}], 'model': 'davinci:2020-05-03'}
What am I doing wrong? Shouldn't the results be the same whether I use the documents parameter or the file parameter?
Rereading the docs, it seems that when using the file parameter instead of the documents parameter, the server first performs a basic "keyword" search with the provided query to narrow down the results, and only then reranks those results with a semantic search using the same query.
This is disappointing.
Just to provide a working example, take a file with these contents..
{"text": "stairway to the basement", "metadata": "metadata here"}
{"text": "school", "metadata": "metadata here"}
{"text": "stairway to heaven", "metadata": "metadata here"}
Now using the query "led zeppelin's most famous song stairway" the server will narrow down the results to document 0 and document 2 finding matches for the "stairway" token. It will then perform a semantic search and score both of them. Document 2 ("stairway to heaven") will have the highest relevancy score.
Using the query "stairway to the underground floor" will give document 0 ("stairway to the basement") the highest relevancy score.
This is disappointing because the query has to be useful for both a keyword search AND the semantic search.
In my original post, the keyword search was not providing any results because the query was only designed for a semantic search. When using the documents parameter, only a semantic search is performed, which is why it worked in that case.
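To make the narrowing step concrete, here is a toy sketch of a keyword pre-filter. It is not OpenAI's implementation, which I have not seen; it assumes the filter simply keeps documents that share at least one word with the query, and the function names tokens and keyword_candidates are made up for illustration.

```python
import re

def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_candidates(query, documents):
    """Return indices of documents sharing at least one token with the query."""
    query_tokens = tokens(query)
    return [i for i, doc in enumerate(documents) if query_tokens & tokens(doc)]

labels = ["white house", "school", "seinfeld"]
songs = ["stairway to the basement", "school", "stairway to heaven"]

print(keyword_candidates("sitcom", labels))    # [] : nothing left to rerank
print(keyword_candidates("seinfeld", labels))  # [2]
print(keyword_candidates("led zeppelin's most famous song stairway", songs))  # [0, 2]
```

Under that assumption, "sitcom" shares no word with any label, so the semantic reranker never gets to see "seinfeld", which matches the "No similar documents were found" message. A real keyword search may well treat common words such as "to" differently from this toy version.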