I’m new to database management, and in my project, I need to use a vector database to store vector data. I chose Milvus for this purpose. I plan to implement a delete function so users can remove entries, but users might forget the names or IDs they want to delete. To address this, I’m also developing a "list all" function that allows users to view all entries in the database.
Here’s the structure of my database (as a dictionary):
DB = {
"id": id,
"vector": vector,
"file_name": name,
}
Currently, I’m loading all names like this:
client = MilvusClient(r'/my.db')
output = client.query('milvus', filter="id >= 0", output_fields=["file_name"])
This method technically works, but it seems inefficient—loading everything each time a user wants to view the entries doesn’t feel scalable. I’m concerned that, as the dataset grows, this approach might lead to performance issues or even server crashes.
So, my questions are:
Any insights on efficient ways to handle this in Milvus would be greatly appreciated.
If you are looking for an alternative way to list all the ids in a collection, see: Milvus 2 - get list of ids in a collection
Milvus has a query() method to fetch entities from a collection.
Assume there is a collection named aaa with a field named id, and that all id values are greater than or equal to 0.
collection = Collection("aaa")
result = collection.query(expr="id >= 0")
print(result)
The result is a list that contains every id in the collection.
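For a "list all" view, one option is to fetch a page at a time rather than the full result. Below is a minimal sketch, assuming the asker's schema (id, file_name) and that the installed pymilvus accepts the offset and limit keyword arguments on query(). The helper name list_page and its parameters are hypothetical, not part of pymilvus.

```python
from typing import Any, Dict, List


def list_page(collection: Any, page: int, page_size: int = 100) -> List[Dict[str, Any]]:
    """Return one page of rows instead of the whole collection.

    `collection` is assumed to behave like a loaded pymilvus Collection.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return collection.query(
        expr="id >= 0",
        output_fields=["id", "file_name"],  # leave the vector field out
        offset=page * page_size,            # rows to skip
        limit=page_size,                    # rows to return
    )
```

Calling list_page(collection, 0) would return the first 100 rows, list_page(collection, 1) the next 100, and so on. Milvus documents an upper bound on offset plus limit, so this approach is best suited to browsing the first few thousand rows; it also assumes the row order is stable between calls.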
import random
from random import choice
from string import ascii_uppercase

from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)

print("start connecting to Milvus")
connections.connect("default", host="localhost", port="19530")

collection_name = "aaa"
if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=128),
    # VARCHAR fields take max_length, not dim
    FieldSchema(name="file_name", dtype=DataType.VARCHAR, max_length=100),
]
schema = CollectionSchema(fields, "aaa")

print("Create collection", collection_name)
collection = Collection(collection_name, schema)

print("Start inserting entities")
num_entities = 10000
for k in range(50):
    print('No.', k)
    entities = [
        # [i for i in range(num_entities)],  # duplicate id, the query will get 10000 ids
        [i + num_entities * k for i in range(num_entities)],  # unique id, the query will get 500000 ids
        [[random.random() for _ in range(128)] for _ in range(num_entities)],
        # one random 100-character string per row
        [''.join(choice(ascii_uppercase) for _ in range(100)) for _ in range(num_entities)],
    ]
    insert_result = collection.insert(entities)

# flush so that num_entities reflects the inserted rows
collection.flush()
print(f"Number of entities: {collection.num_entities}")

# recent Milvus versions require an index on the vector field before load()
collection.create_index("vector", {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}})

print("Start loading")
collection.load()

result = collection.query(expr="id >= 0", output_fields=["id", "vector", "file_name"])
print("query result count:", len(result))
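Pulling 500000 rows in one query() call keeps the whole result in client memory. As an alternative, here is a sketch that reads the same data in batches, assuming a pymilvus version that provides Collection.query_iterator (2.3 or later, as far as I know). The helper name iter_batches is hypothetical.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(collection: Any, batch_size: int = 1000) -> Iterator[List[Dict[str, Any]]]:
    """Yield query results batch by batch instead of all at once.

    `collection` is assumed to behave like a loaded pymilvus Collection
    whose query_iterator() returns an object with next() and close().
    """
    iterator = collection.query_iterator(
        batch_size=batch_size,
        expr="id >= 0",
        output_fields=["id", "file_name"],  # skip the vector field
    )
    try:
        while True:
            batch = iterator.next()
            if not batch:  # an empty batch signals the end
                break
            yield batch
    finally:
        iterator.close()  # release the iterator even on early exit
```

A caller could then loop with "for batch in iter_batches(collection):" and display or process each batch before the next one is fetched.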
But if you are looking for a way to retrieve the embeddings from a Milvus collection, see: Retrieve data from Milvus collection
In Milvus, search() performs ANN search, and query() retrieves data.
Since Milvus is optimized for ANN search, it loads index data into memory, but the original embedding data stays on disk. So retrieving embeddings is a heavy operation and not fast.
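Because of that cost, one approach is to list only scalar fields and fetch embeddings for specific ids when they are actually needed. Here is a minimal sketch, assuming the asker's schema (an INT64 id field and a vector field). The helper name fetch_vectors is hypothetical.

```python
from typing import Any, Dict, List, Sequence


def fetch_vectors(collection: Any, ids: Sequence[int]) -> Dict[int, List[float]]:
    """Fetch embeddings only for the given ids.

    `collection` is assumed to behave like a loaded pymilvus Collection.
    """
    if not ids:
        return {}
    # Builds an expression such as: id in [10, 50]
    expr = f"id in {[int(i) for i in ids]}"
    rows = collection.query(expr=expr, output_fields=["id", "vector"])
    return {row["id"]: row["vector"] for row in rows}
```

The listing view can then request just id and file_name, and call fetch_vectors only for the handful of entries a user opens.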
The following script is a simple example of how to use query():
import random

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

_HOST = '127.0.0.1'
_PORT = '19530'

if __name__ == '__main__':
    connections.connect(host=_HOST, port=_PORT)

    collection_name = "demo"
    if utility.has_collection(collection_name):
        utility.drop_collection(collection_name)

    # create a collection with these fields: id, tag and vector
    dim = 8
    field1 = FieldSchema(name="id_field", dtype=DataType.INT64, is_primary=True)
    field2 = FieldSchema(name="tag_field", dtype=DataType.VARCHAR, max_length=64)
    field3 = FieldSchema(name="vector_field", dtype=DataType.FLOAT_VECTOR, dim=dim)
    schema = CollectionSchema(fields=[field1, field2, field3])
    collection = Collection(name=collection_name, schema=schema)
    print("collection created")

    # each vector field must have an index
    index_param = {
        "index_type": "HNSW",
        "params": {"M": 48, "efConstruction": 500},
        "metric_type": "L2"}
    collection.create_index("vector_field", index_param)

    # insert 1000 rows, each row has an id, a tag and a vector
    count = 1000
    data = [
        [i for i in range(count)],
        [f"tag_{i%100}" for i in range(count)],
        [[random.random() for _ in range(dim)] for _ in range(count)],
    ]
    collection.insert(data)
    print(f"insert {count} rows")

    # must load the collection before any search or query operations
    collection.load()

    # method to retrieve vectors from the collection by filter expression
    def retrieve(expr: str):
        print("===============================================")
        result = collection.query(expr=expr, output_fields=["id_field", "tag_field", "vector_field"])
        print("query result with expression:", expr)
        for hit in result:
            print(f"id: {hit['id_field']}, tag: {hit['tag_field']}, vector: {hit['vector_field']}")

    # get items whose id = 10 or 50
    retrieve("id_field in [10, 50]")

    # get items whose id <= 3
    retrieve("id_field <= 3")

    # get items whose tag = "tag_25"
    retrieve("tag_field in [\"tag_25\"]")

    # drop the collection
    collection.drop()
Output of the script:
collection created
insert 1000 rows
===============================================
query result with expression: id_field in [10, 50]
id: 10, tag: tag_10, vector: [0.053770524, 0.83849007, 0.04007046, 0.16028273, 0.2640955, 0.5588169, 0.93378043, 0.031373363]
id: 50, tag: tag_50, vector: [0.082208894, 0.09554817, 0.8288978, 0.984166, 0.0028912988, 0.18656737, 0.26864904, 0.20859942]
===============================================
query result with expression: id_field <= 3
id: 0, tag: tag_0, vector: [0.60005647, 0.5609647, 0.36438486, 0.10851263, 0.65043026, 0.82504696, 0.8862855, 0.79214275]
id: 1, tag: tag_1, vector: [0.3711398, 0.0068489416, 0.004352187, 0.36848867, 0.9881858, 0.9160333, 0.5137728, 0.16045558]
id: 2, tag: tag_2, vector: [0.10995998, 0.24792045, 0.75946856, 0.6824144, 0.5848432, 0.10871549, 0.81346315, 0.5030568]
id: 3, tag: tag_3, vector: [0.38349515, 0.9714319, 0.81812894, 0.387037, 0.8180231, 0.030460497, 0.411488, 0.5743198]
===============================================
query result with expression: tag_field in ["tag_25"]
id: 25, tag: tag_25, vector: [0.8417967, 0.07186894, 0.64750504, 0.5146622, 0.68041337, 0.80861133, 0.6490419, 0.013803678]
id: 125, tag: tag_25, vector: [0.41458654, 0.13030894, 0.21482174, 0.062191084, 0.86997706, 0.4915581, 0.0478688, 0.59728557]
id: 225, tag: tag_25, vector: [0.4143869, 0.26847556, 0.14965168, 0.9563254, 0.7308634, 0.5715891, 0.37524575, 0.19693129]
id: 325, tag: tag_25, vector: [0.07538631, 0.2896633, 0.8130047, 0.9486398, 0.35597774, 0.41200536, 0.76178575, 0.63848394]
id: 425, tag: tag_25, vector: [0.3203018, 0.8246632, 0.28427872, 0.3969012, 0.94882655, 0.7670139, 0.43087512, 0.36356103]
id: 525, tag: tag_25, vector: [0.52027494, 0.2197635, 0.14136001, 0.081981435, 0.10024931, 0.40981093, 0.92328817, 0.32509744]
id: 625, tag: tag_25, vector: [0.2729753, 0.85121, 0.028014379, 0.32854447, 0.5946417, 0.2831049, 0.6444559, 0.57294136]
id: 725, tag: tag_25, vector: [0.98359156, 0.90887356, 0.26763296, 0.33788496, 0.9277225, 0.4743232, 0.5850919, 0.5116082]
id: 825, tag: tag_25, vector: [0.90271956, 0.31777886, 0.8150854, 0.37264413, 0.756029, 0.75934476, 0.07602229, 0.21065433]
id: 925, tag: tag_25, vector: [0.009773289, 0.352051, 0.8339834, 0.4277803, 0.53999937, 0.2620487, 0.4906858, 0.77002776]
Process finished with exit code 0
Update:
You can pass an extra parameter to specify the output fields:
result = collection.query(expr="id >= 0", output_fields=["id", "vector", "file_name"])