python database math redis vectorization

How can I prioritize dimensions in a Redis vector similarity search?


I am currently using Redis as a vector database and was able to get a similarity search working with 3 dimensions (latitude, longitude, and timestamp). The search works, but I would like to weight certain dimensions differently. Specifically, I would like the similarity search to prioritize the timestamp dimension.

How would I go about this? Redis does not seem to have any built-in feature that does this.

I turn each set of lat, long, and time coordinates into bytes that can be put into the vector database with the following code. Note that vector_dict stores all the sets of lat, long, and timestamp:

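p = client.pipeline(transaction=False)
for index in data:
    # create hash key
    key = keys[index]

    # create hash values
    item_metadata = dict(data[index])  # copy all metadata
    item_key_vector = np.array(vector_dict[index]).astype(np.float32).tobytes()  # convert vector to bytes
    # assumption: the vector bytes are stored under the field name used for the index
    item_metadata[vector_field_name] = item_key_vector
    p.hset(key, mapping=item_metadata)  # add item to redis using hash key and metadata
p.execute()  # send the queued commands to redis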

I then create the HNSW index that the similarity search runs against:

from redis.commands.search.field import VectorField

def create_hnsw_index(redis_conn, vector_field_name, number_of_vectors, vector_dimensions=3, distance_metric='L2', M=100, EF=100):
    redis_conn.ft().create_index([
        VectorField(vector_field_name, "HNSW", {"TYPE": "FLOAT32", "DIM": vector_dimensions, "DISTANCE_METRIC": distance_metric, "INITIAL_CAP": number_of_vectors, "M": M, "EF_CONSTRUCTION": EF})
    ])

I talked with others and they said it is a math problem that deals with vector normalization. I'm unsure how to get started with this in code, though, and would like some guidance.


Solution

  • You can re-weight the vector so that differences along certain dimensions count for more than others. You're using the L2 distance metric, which is the Euclidean distance (the Pythagorean theorem extended to three dimensions):

    dist = sqrt((x1-x2)**2 + (y1-y2)**2 + (z1-z2)**2)
    

    Imagine you multiplied every Y value, in both your query and your database, by 10. That would also multiply the difference between Y values by a factor of 10.

    The new distance function would effectively be this:

    dist = sqrt((x1-x2)**2 + (10*(y1-y2))**2 + (z1-z2)**2)
    
    dist = sqrt((x1-x2)**2 + 100*(y1-y2)**2 + (z1-z2)**2)
    

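    ...which makes the squared difference in Y count 100 times more than in the other dimensions. Put another way, a given difference in Y now moves the distance as much as a difference 10 times larger in X or Z.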

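    So if you want the dimension at index 2 (the timestamp, given the lat, long, timestamp order) to matter more, you could do this:

    # use a float dtype so that non-integer weights also work in place
    item_key_vector = np.array(vector_dict[index], dtype=np.float64)
    item_key_vector[2] *= 10  # index 2 is the timestamp
    item_key_vector_bytes = item_key_vector.astype(np.float32).tobytes()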
    

    The specific amount to multiply by depends on how much you want the timestamp to matter. Remember that you need to multiply the same dimension of your query vector by the same amount; otherwise the query and the stored vectors end up on different scales.
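
To make the re-weighting above concrete, here is a minimal sketch that applies one set of weights to both a stored vector and a query vector, then checks that the plain L2 distance between the scaled vectors equals the weighted formula. The names `WEIGHTS` and `to_weighted_bytes` and the sample coordinates are made up for illustration; they are not part of Redis or of the question's code.

```python
import numpy as np

# Hypothetical per-dimension weights for (latitude, longitude, timestamp).
WEIGHTS = np.array([1.0, 1.0, 10.0])

def to_weighted_bytes(vector, weights=WEIGHTS):
    """Scale each dimension by its weight, then serialize as FLOAT32 bytes."""
    scaled = np.asarray(vector, dtype=np.float64) * weights
    return scaled.astype(np.float32).tobytes()

# Made-up sample points, only for illustration.
stored = [40.0, -74.0, 100.0]
query = [41.0, -73.0, 103.0]

# Use the same helper for the stored vector and for the query vector.
stored_w = np.frombuffer(to_weighted_bytes(stored), dtype=np.float32)
query_w = np.frombuffer(to_weighted_bytes(query), dtype=np.float32)

# Plain L2 distance between the scaled vectors ...
scaled_dist = float(np.linalg.norm(stored_w - query_w))

# ... should match the weighted formula computed on the unscaled values.
diff = np.array(stored) - np.array(query)
weighted_dist = float(np.sqrt(np.sum((WEIGHTS * diff) ** 2)))

print(scaled_dist, weighted_dist)
```

Routing every vector through a single helper, both when writing to the database and when building the query, is one way to keep the two from drifting onto different scales. If the weights change later, the stored vectors would presumably need to be re-written with the new weights as well.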