python vector-database milvus

Retrieve All Entries in Milvus Vector Database Just for Viewing?


I’m new to database management, and in my project, I need to use a vector database to store vector data. I chose Milvus for this purpose. I plan to implement a delete function so users can remove entries, but users might forget the names or IDs they want to delete. To address this, I’m also developing a "list all" function that allows users to view all entries in the database.

Here’s the structure of my database (as a dictionary):

DB = {
    "id": id,
    "vector": vector,
    "file_name": name,
}

Currently, I’m loading all names like this:

client = MilvusClient(r'/my.db')
output = client.query('milvus', filter="id >= 0", output_fields=["file_name"])

This method technically works, but it seems inefficient—loading everything each time a user wants to view the entries doesn’t feel scalable. I’m concerned that, as the dataset grows, this approach might lead to performance issues or even server crashes.

So, my questions are:

  1. Is this approach logical and scalable?
  2. Is there a better way to retrieve all file_name entries from Milvus without loading everything?

Any insights on efficient ways to handle this in Milvus would be greatly appreciated.


Solution

  • If you are asking whether there is another way to list all the ids in a Milvus collection, see "Milvus 2 - get list of ids in a collection":

    Milvus provides a query() method to fetch entities from a collection. Assume there is a collection named aaa with a field named id, and that all id values are greater than or equal to 0.

    collection = Collection("aaa")
    result = collection.query(expr="id >= 0")
    print(result)
    

    The result is a list that contains every id in the collection.
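For reference, query() returns a list of dictionaries, one per entity, keyed by field name. The sketch below shows one way to pull the values out. It uses made-up rows in place of a real query result, so the ids and file names are purely illustrative.

```python
# Stand-in for the list returned by something like
#   collection.query(expr="id >= 0", output_fields=["file_name"])
# These rows are invented for illustration; a real result comes from Milvus.
result = [
    {"id": 0, "file_name": "report.pdf"},
    {"id": 1, "file_name": "notes.txt"},
    {"id": 2, "file_name": "photo.png"},
]

# Collect just the ids, e.g. to show them in a "list all" view.
ids = [row["id"] for row in result]

# Map each id to its file name so a user can pick what to delete.
names_by_id = {row["id"]: row["file_name"] for row in result}

print(ids)
print(names_by_id[1])
```

Requesting only the scalar fields you need (here file_name) keeps the vectors out of the response, which is usually the largest part of each row.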

    import random
    from pymilvus import (
        connections,
        utility,
        FieldSchema, CollectionSchema, DataType,
        Collection,
    )
    from random import choice
    from string import ascii_uppercase
    
    
    
    print("start connecting to Milvus")
    connections.connect("default", host="localhost", port="19530")
    
    collection_name = "aaa"
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)
    
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
        FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),
        FieldSchema(name="file_name", dtype=DataType.VARCHAR, max_length=100)  # VARCHAR takes max_length, not dim
    ]
    
    schema = CollectionSchema(fields, "aaa")
    
    print("Create collection", collection_name)
    collection = Collection(collection_name, schema)
    
    print("Start inserting entities")
    num_entities = 10000
    for k in range(50):
        print('No.', k)
        entities = [
            # [i for i in range(num_entities)], # duplicate id, the query will get 10000 ids
            [i + num_entities*k for i in range(num_entities)],  # unique id, the query will get 500000 ids
            [[random.random() for _ in range(128)] for _ in range(num_entities)],
            [''.join(choice(ascii_uppercase) for _ in range(100)) for _ in range(num_entities)],  # one string per entity
        ]
        insert_result = collection.insert(entities)
    
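    collection.flush()  # make the inserted rows visible to num_entities
    print(f"Number of entities: {collection.num_entities}")
    # recent Milvus 2.x releases require an index on the vector field before load()
    collection.create_index("vector", {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}})
    print("Start loading")
    collection.load()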
    
    result = collection.query(expr="id >= 0", output_fields=["id", "vector", "file_name"])
    print("query result count:", len(result))
    

    But if you are asking whether there is a way to retrieve the embeddings themselves from a Milvus collection, see "Retrieve data from Milvus collection":

    In Milvus, search() performs ANN search and query() retrieves data. Because Milvus is optimized for ANN search, it loads index data into memory while the original embedding data stays on disk. Retrieving embeddings is therefore a heavy operation and is not fast. The following script is a simple example of how to use query():

    import random
    
    from pymilvus import (
        connections,
        FieldSchema, CollectionSchema, DataType,
        Collection,
        utility,
    )
    
    _HOST = '127.0.0.1'
    _PORT = '19530'
    
    if __name__ == '__main__':
        connections.connect(host=_HOST, port=_PORT)
    
        collection_name = "demo"
        if utility.has_collection(collection_name):
            utility.drop_collection(collection_name)
    
        # create a collection with these fields: id, tag and vector
        # (field names match the expressions and output below)
        dim = 8
        field1 = FieldSchema(name="id_field", dtype=DataType.INT64, is_primary=True)
        field2 = FieldSchema(name="tag_field", dtype=DataType.VARCHAR, max_length=64)
        field3 = FieldSchema(name="vector_field", dtype=DataType.FLOAT_VECTOR, dim=dim)
        schema = CollectionSchema(fields=[field1, field2, field3])
        collection = Collection(name=collection_name, schema=schema)
        print("collection created")
    
        # each vector field must have an index
        index_param = {
            "index_type": "HNSW",
            "params": {"M": 48, "efConstruction": 500},
            "metric_type": "L2"}
        collection.create_index("vector_field", index_param)
    
        # insert 1000 rows, each row has an id, a tag and a vector
        count = 1000
        data = [
            [i for i in range(count)],
            [f"tag_{i%100}" for i in range(count)],
            [[random.random() for _ in range(dim)] for _ in range(count)],
        ]
        collection.insert(data)
        print(f"insert {count} rows")
    
        # must load the collection before any search or query operations
        collection.load()
    
        # method to retrieve vectors from the collection by filter expression
        def retrieve(expr: str):
            print("===============================================")
            result = collection.query(expr=expr, output_fields=["id_field", "tag_field", "vector_field"])
            print("query result with expression:", expr)
            for hit in result:
                print(f"id: {hit['id_field']}, tag: {hit['tag_field']}, vector: {hit['vector_field']}")
    
        # get items whose id = 10 or 50
        retrieve("id_field in [10, 50]")
    
        # get items whose id <= 3
        retrieve("id_field <= 3")
    
        # get items whose tag = "tag_25"
        retrieve("tag_field in [\"tag_25\"]")
    
        # drop the collection
        collection.drop()
    
    

    Output of the script:

    collection created
    insert 1000 rows
    ===============================================
    query result with expression: id_field in [10, 50]
    id: 10, tag: tag_10, vector: [0.053770524, 0.83849007, 0.04007046, 0.16028273, 0.2640955, 0.5588169, 0.93378043, 0.031373363]
    id: 50, tag: tag_50, vector: [0.082208894, 0.09554817, 0.8288978, 0.984166, 0.0028912988, 0.18656737, 0.26864904, 0.20859942]
    ===============================================
    query result with expression: id_field <= 3
    id: 0, tag: tag_0, vector: [0.60005647, 0.5609647, 0.36438486, 0.10851263, 0.65043026, 0.82504696, 0.8862855, 0.79214275]
    id: 1, tag: tag_1, vector: [0.3711398, 0.0068489416, 0.004352187, 0.36848867, 0.9881858, 0.9160333, 0.5137728, 0.16045558]
    id: 2, tag: tag_2, vector: [0.10995998, 0.24792045, 0.75946856, 0.6824144, 0.5848432, 0.10871549, 0.81346315, 0.5030568]
    id: 3, tag: tag_3, vector: [0.38349515, 0.9714319, 0.81812894, 0.387037, 0.8180231, 0.030460497, 0.411488, 0.5743198]
    ===============================================
    query result with expression: tag_field in ["tag_25"]
    id: 25, tag: tag_25, vector: [0.8417967, 0.07186894, 0.64750504, 0.5146622, 0.68041337, 0.80861133, 0.6490419, 0.013803678]
    id: 125, tag: tag_25, vector: [0.41458654, 0.13030894, 0.21482174, 0.062191084, 0.86997706, 0.4915581, 0.0478688, 0.59728557]
    id: 225, tag: tag_25, vector: [0.4143869, 0.26847556, 0.14965168, 0.9563254, 0.7308634, 0.5715891, 0.37524575, 0.19693129]
    id: 325, tag: tag_25, vector: [0.07538631, 0.2896633, 0.8130047, 0.9486398, 0.35597774, 0.41200536, 0.76178575, 0.63848394]
    id: 425, tag: tag_25, vector: [0.3203018, 0.8246632, 0.28427872, 0.3969012, 0.94882655, 0.7670139, 0.43087512, 0.36356103]
    id: 525, tag: tag_25, vector: [0.52027494, 0.2197635, 0.14136001, 0.081981435, 0.10024931, 0.40981093, 0.92328817, 0.32509744]
    id: 625, tag: tag_25, vector: [0.2729753, 0.85121, 0.028014379, 0.32854447, 0.5946417, 0.2831049, 0.6444559, 0.57294136]
    id: 725, tag: tag_25, vector: [0.98359156, 0.90887356, 0.26763296, 0.33788496, 0.9277225, 0.4743232, 0.5850919, 0.5116082]
    id: 825, tag: tag_25, vector: [0.90271956, 0.31777886, 0.8150854, 0.37264413, 0.756029, 0.75934476, 0.07602229, 0.21065433]
    id: 925, tag: tag_25, vector: [0.009773289, 0.352051, 0.8339834, 0.4277803, 0.53999937, 0.2620487, 0.4906858, 0.77002776]
    
    Process finished with exit code 0
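The `in [...]` expressions passed to retrieve() are plain strings, so for a delete or lookup feature they can be assembled from whatever the user selected. Below is a minimal sketch of such a helper. The name in_filter is hypothetical, and using json.dumps to render the list is my own choice, not something the Milvus API prescribes. How Milvus handles values that themselves contain quotes or backslashes is not covered here and should be checked against your version.

```python
import json


def in_filter(field: str, values: list) -> str:
    """Build a `field in [...]` filter string (hypothetical helper)."""
    if not values:
        raise ValueError("values must not be empty")
    # json.dumps renders ints bare and strings in double quotes, which is
    # the same literal style as the expressions used in the script above.
    return f"{field} in {json.dumps(list(values), ensure_ascii=False)}"


print(in_filter("id_field", [10, 50]))
print(in_filter("tag_field", ["tag_25"]))
```

The two printed strings have the same form as the first and third expressions in the script, so they could be passed to retrieve() unchanged.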
    
    

    Update:

    You can pass the output_fields parameter to specify which fields are returned:

    result = collection.query(expr="id >= 0", output_fields=["id", "vector", "file_name"])
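On the scalability concern in the question: rather than pulling every row in one call, a "list all" view can fetch one page at a time. The sketch below shows the paging loop on its own. iter_pages and fake_fetch are hypothetical names, and the in-memory rows stand in for a real collection so the loop can be run without a Milvus server.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 1000,
) -> Iterator[List[dict]]:
    """Yield successive pages until the source runs out of rows."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break
        yield page
        # A short page means there is nothing left to fetch.
        if len(page) < page_size:
            break
        offset += page_size


# Stand-in data source: 25 invented rows instead of a Milvus collection.
fake_rows = [{"id": i, "file_name": f"file_{i}.txt"} for i in range(25)]


def fake_fetch(offset: int, limit: int) -> List[dict]:
    return fake_rows[offset:offset + limit]


pages = list(iter_pages(fake_fetch, page_size=10))
print([len(p) for p in pages])  # [10, 10, 5]
```

With pymilvus, fetch_page could wrap a call such as collection.query(expr="id >= 0", output_fields=["file_name"], offset=offset, limit=limit). Milvus documents an upper bound on offset plus limit, so for very large collections it is worth checking whether your pymilvus version offers a query iterator, which is meant for walking a whole collection in batches.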