
OpenAI semantic search not working with the file parameter


From what I understand, you can use either the documents parameter or the file parameter to tell OpenAI which labels to search over. I get the expected results with the documents parameter, but unsatisfactory results with the file parameter. I would expect them to be the same.

When performing a search using the documents parameter..

import openai

response = dict(openai.Engine('davinci').search(
    query='sitcom',
    #file=file_id,
    max_rerank=5,
    documents=["white house", "school", "seinfeld"],
    return_metadata=False))

..I get the expected results: "seinfeld" (document 2) wins the search with a score of 771.

{'object': 'list', 'data': [<OpenAIObject search_result at 0xb5e8ef48> JSON: {
  "document": 0,
  "object": "search_result",
  "score": 147.98
}, <OpenAIObject search_result at 0xb5ebd148> JSON: {
  "document": 1,
  "object": "search_result",
  "score": 211.021
}, <OpenAIObject search_result at 0xb5ebd030> JSON: {
  "document": 2,
  "object": "search_result",
  "score": 771.348
}], 'model': 'davinci:2020-05-03'}

Now, trying the file parameter, I create a temp.jsonl file with these contents..

{"text": "white house", "metadata": "metadata here"}
{"text": "school", "metadata": "metadata here"}
{"text": "seinfeld", "metadata": "metadata here"}
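
A file in this shape can also be generated rather than typed by hand. The following is a minimal sketch using only the standard library; the helper name `write_jsonl` is hypothetical, and the metadata value is the same placeholder string used above.

```python
import json

# The three labels from the question.
labels = ["white house", "school", "seinfeld"]

def write_jsonl(path, texts, metadata="metadata here"):
    """Write one JSON object per line, each with "text" and "metadata" keys."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text, "metadata": metadata}) + "\n")

write_jsonl("temp.jsonl", labels)
```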

I then upload the file to the OpenAI server with..

res = openai.File.create(file=open('temp.jsonl'), purpose="search")

where..

file_id = res['id']
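
The wait for server-side processing can be automated with a small polling loop. This is only a sketch: `wait_until_processed` is a hypothetical helper that takes any zero-argument callable returning a status string, and the status values "processed" and "error" are assumptions about what the legacy file object reports.

```python
import time

def wait_until_processed(fetch_status, timeout=60.0, interval=2.0):
    """Poll fetch_status() until it returns "processed" or the timeout expires.

    fetch_status is any zero-argument callable returning a status string.
    Returns True if processing finished in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status == "processed":
            return True
        if status == "error":  # assumed failure status
            raise RuntimeError("file processing failed")
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)

# Assumed usage with the legacy client from this post (not verified here):
# wait_until_processed(lambda: openai.File.retrieve(file_id)["status"])
```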

I wait until the file is processed by the server, then..

response = dict(openai.Engine('davinci').search(
    query='sitcom',
    file=file_id,
    max_rerank=5,
    #documents=["white house", "school", "seinfeld"],
    return_metadata=False))

But I get the following message when I perform the search..

No similar documents were found in file with ID 'file-LzHkASUxbDjTAWBhHxHpIOf4'.Please upload more documents or adjust your query.

I only get results when my query exactly matches a label..

response = dict(openai.Engine('davinci').search(
    query='seinfeld',
    file=file_id,
    max_rerank=5,
    #documents=["white house", "school", "seinfeld"],
    return_metadata=False))

{'object': 'list', 'data': [<OpenAIObject search_result at 0xb5e74f48> JSON: {
  "document": 0,
  "object": "search_result",
  "score": 668.846,
  "text": "seinfeld"
}], 'model': 'davinci:2020-05-03'}

What am I doing wrong? Shouldn't the results be the same whether I use the documents parameter or the file parameter?


Solution

  • Rereading the docs, it seems that when you use the file parameter instead of the documents parameter, the server first performs a basic keyword search with the provided query to narrow down the candidates, and only then reranks those candidates with a semantic search using the same query.

    This is disappointing.

    Just to provide a working example..

    {"text": "stairway to the basement", "metadata": "metadata here"}
    {"text": "school", "metadata": "metadata here"}
    {"text": "stairway to heaven", "metadata": "metadata here"}
    

    Now, using the query "led zeppelin's most famous song stairway", the server narrows the candidates down to document 0 and document 2 by matching the "stairway" token. It then performs a semantic search and scores both of them. Document 2 ("stairway to heaven") gets the highest relevance score.

    Using the query "stairway to the underground floor" instead gives document 0 ("stairway to the basement") the highest relevance score.

    This is disappointing because the query has to work for both the keyword search AND the semantic search.

    In my original post, the keyword search returned no results because the query was designed only for a semantic search. With the documents parameter, only a semantic search is performed, which is why it worked in that case.
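
The keyword stage can be illustrated locally. The sketch below is not the server's actual algorithm, which is not spelled out here; it assumes the simplest possible keyword filter (a document survives if it shares at least one word token with the query), and the function names are hypothetical. It reproduces the behavior described above: "sitcom" filters out every label before any semantic scoring can happen.

```python
import re

def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_prefilter(query, documents):
    """Return indices of documents sharing at least one token with the query."""
    q = tokens(query)
    return [i for i, doc in enumerate(documents) if q & tokens(doc)]

labels = ["white house", "school", "seinfeld"]
songs = ["stairway to the basement", "school", "stairway to heaven"]

print(keyword_prefilter("sitcom", labels))    # [] : nothing left to rerank
print(keyword_prefilter("seinfeld", labels))  # [2]
print(keyword_prefilter("led zeppelin's most famous song stairway", songs))  # [0, 2]
```

Only the documents that survive this stage would be passed on to the semantic reranking step, which is why a purely semantic query such as "sitcom" returns nothing with the file parameter.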