Tags: deep-learning, embedding, word-embedding, vector-database

How to find closest embedding vectors?


I have 100K known embeddings, i.e.

[emb_1, emb_2, ..., emb_100000]

Each of these embeddings is a GPT-3 sentence embedding with dimension 2048.

My task is: given a new embedding (embedding_new), find the 10 closest embeddings from the above 100K embeddings.

The way I am approaching this problem is brute force.

Every time a query asks for the closest embeddings, I compare embedding_new with [emb_1, emb_2, ..., emb_100000] and compute a similarity score for each.

Then I quicksort the similarity scores to get the top 10 closest embeddings.
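For reference, a minimal numpy sketch of this brute-force step (cosine similarity is an assumption; np.argpartition picks the top 10 in O(n), which is cheaper than the full quicksort mentioned above):

    import numpy as np

    # random placeholders for the 100K known embeddings and the query
    embeddings = np.random.random((100_000, 2048)).astype('float32')
    embedding_new = np.random.random(2048).astype('float32')

    # L2-normalize so a dot product equals cosine similarity
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = embedding_new / np.linalg.norm(embedding_new)
    scores = normalized @ query               # similarity against all 100K vectors

    # argpartition finds the 10 largest without sorting everything;
    # only those 10 are then ordered by score
    top10 = np.argpartition(scores, -10)[-10:]
    top10 = top10[np.argsort(scores[top10])[::-1]]
    print(top10, scores[top10])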

Alternatively, I have also thought about using Faiss.
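A minimal sketch of what that could look like with Faiss (assumptions: an exact flat index using inner product on L2-normalized vectors, which is equivalent to cosine similarity):

    import numpy as np
    import faiss

    d = 2048                                                # embedding dimension
    xb = np.random.random((100_000, d)).astype('float32')   # the known embeddings
    faiss.normalize_L2(xb)              # inner product == cosine after this

    index = faiss.IndexFlatIP(d)        # exact (brute-force) inner-product index
    index.add(xb)

    xq = np.random.random((1, d)).astype('float32')         # embedding_new
    faiss.normalize_L2(xq)
    scores, ids = index.search(xq, 10)  # top 10 closest embeddings
    print(ids[0], scores[0])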

Is there a better way to achieve this?


Solution

  • I found a solution using Vector Database Lite (VDBLITE)

    VDBLITE here: https://pypi.org/project/vdblite/

    import numpy as np
    import vdblite
    from time import time
    from uuid import uuid4
    from pprint import pprint as pp
    
    
    if __name__ == '__main__':
        vdb = vdblite.Vdb()
        dimension = 12    # dimensions of each vector
        n = 200           # number of vectors
        np.random.seed(1)
        # random float32 vectors stand in for real embeddings
        db_vectors = np.random.random((n, dimension)).astype('float32')
        print(db_vectors[0])
        # store each vector together with a timestamp and a unique ID
        for vector in db_vectors:
            info = {'vector': vector, 'time': time(), 'uuid': str(uuid4())}
            vdb.add(info)
        vdb.details()
        # query with a stored vector; its nearest neighbor should be itself
        results = vdb.search(db_vectors[10])
        pp(results)
    

    Looks like it uses FAISS behind the scenes.
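
    If the dataset grows well beyond 100K vectors, FAISS can also be used directly with an approximate index instead of exact brute force. A sketch using an IVF index (the nlist and nprobe values here are illustrative, not tuned):

        import numpy as np
        import faiss

        d = 2048
        xb = np.random.random((100_000, d)).astype('float32')

        nlist = 100                          # number of clusters (illustrative)
        quantizer = faiss.IndexFlatL2(d)     # coarse quantizer over the clusters
        index = faiss.IndexIVFFlat(quantizer, d, nlist)
        index.train(xb)                      # learn cluster centroids from the data
        index.add(xb)

        index.nprobe = 10                    # clusters scanned per query (illustrative)
        dists, ids = index.search(xb[:1], 10)   # approximate top 10
        print(ids[0])

    At 100K vectors an exact flat index is usually fast enough; approximate indexes like IVF trade a little recall for speed and matter more in the millions.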