
Query a large list of metadata in Weaviate


I have 100,000 images. Each image has 500 ORB vectors and a unique tag.

My general issue is: when I insert a new image (i.e. 500 new vectors), how can I know whether the image's tag is already in the database?

What I do is attach a "tag" metadata property to each vector. I can retrieve the inserted tags with

    result = client.query.get('orb_vector', ['tag'])\
        .with_limit(200)\
        .do()

This returns roughly 200 tags out of the 100,000 existing ones.

According to the documentation, this approach does not scale.

How should I do this?

Context:

  • My database is not very dynamic; apart from the initial big insertion (100,000+ images), there will be only a few insertions each day. So I'm okay with a request that takes 5 minutes and with keeping the result in memory in a non-dynamic way. A plain Python list is okay.

  • Clarification: each image has one tag, but 500 vectors. So each tag is present 500 times in the database.

  • I'm using Python.

What I can do:

Writing the list of tags to a JSON file, MongoDB, or another store, and reading/updating it each time I insert new images. I would prefer to avoid this solution, since keeping the Weaviate database and the JSON file synchronized would be a nightmare.


Solution

  • Have you considered creating a separate class for the tags and using query filters?

    For example, define a schema for a class named Tag where:

    1. it has a property called "name" to store the tag's name, e.g. outdoors, indoors, etc.

    2. it has a property called "images" to store the cross references to the images that are tagged with "outdoors".

    Then, when you want to insert an image with the tag "car", for example, you apply a WHERE filter on the Tag class where the name is Equal to "car".

    If the result is empty, then that tag does not exist.
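
The approach above could be sketched roughly as follows, using the same v3-style Python client calls as in the question. This is only an illustration under several assumptions: the class name `Tag`, the target class `Image` for the cross-references, and the helper names `tag_name_filter` and `tag_exists` are hypothetical; `name` is assumed to be a `text` property (older Weaviate versions with a `string` property would use `valueString` instead of `valueText`); and the `"tokenization": "field"` setting assumes a Weaviate version that supports it, so that `Equal` compares the whole tag rather than individual words.

```python
from typing import Any, Dict

# Hypothetical schema for a separate Tag class (all names are illustrative).
TAG_CLASS: Dict[str, Any] = {
    "class": "Tag",
    "properties": [
        # Assumption: "field" tokenization so Equal matches the whole tag.
        {"name": "name", "dataType": ["text"], "tokenization": "field"},
        # Cross-references to a hypothetical "Image" class.
        {"name": "images", "dataType": ["Image"]},
    ],
}


def tag_name_filter(tag: str) -> Dict[str, Any]:
    """Build a WHERE filter matching Tag objects whose name equals `tag`."""
    return {"path": ["name"], "operator": "Equal", "valueText": tag}


def tag_exists(client: Any, tag: str) -> bool:
    """Return True if at least one Tag object named `tag` is stored.

    `client` is expected to behave like a v3 weaviate.Client.
    """
    result = (
        client.query.get("Tag", ["name"])
        .with_where(tag_name_filter(tag))
        .with_limit(1)  # one hit is enough to prove existence
        .do()
    )
    if "errors" in result:
        # Do not silently treat a failed query as "tag not found".
        raise RuntimeError(f"Weaviate query failed: {result['errors']}")
    return bool(result["data"]["Get"]["Tag"])
```

With a real client, the class would be created once (for example with `client.schema.create_class(TAG_CLASS)`, after the referenced image class exists), and `tag_exists(client, "car")` would be called before each insertion. Because the query asks for at most one object, its cost does not depend on listing all existing tags.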