python redis

Vectors in Redis Search index are corrupted even though index searches work correctly


I have a Redis cache that uses Redis Search, with an HNSW index on a 512-element vector of float32 values.

It is defined like this:

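from redis.commands.search.field import VectorField
# The module path depends on the redis-py version:
# indexDefinition in older releases, index_definition in newer ones.
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

schema = (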
    VectorField(
        "vector",
        "HNSW",
        {
            "TYPE": "FLOAT32",
            "DIM": 512,
            "DISTANCE_METRIC": "IP",
            "EF_RUNTIME": 400,
            "EPSILON": 0.4
        },
        as_name="vector"
    ),
)

definition = IndexDefinition(prefix=[REDIS_PREFIX], index_type=IndexType.HASH)
res = client.ft(REDIS_INDEX_NAME).create_index(
    fields=schema, definition=definition
)

I can insert numpy float32 vectors into this index by writing the result of vector.tobytes() directly into the "vector" field of each hash. I can then accurately query those same vectors using a vector similarity search.

Despite this working correctly, when I read these vectors out of the cache using client.hget(key, "vector") I get results with a varying number of bytes. Every vector is definitely 512 elements (2048 bytes) when I insert it, but sometimes it comes back as a number of bytes that isn't even a multiple of 4! I can't decode it back into a numpy vector at that point.
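A minimal sketch of how the stored blobs could be validated before decoding. The helper names `encode_vector` and `decode_vector` are hypothetical (they are not from the original code), and the sketch assumes the client was created without `decode_responses=True`, so that `hget` returns raw bytes.

```python
import numpy as np

DIM = 512  # matches "DIM" in the index schema
EXPECTED_BYTES = DIM * np.dtype(np.float32).itemsize  # 512 * 4 = 2048


def encode_vector(vec):
    """Serialize a vector to the raw float32 bytes stored in the hash field."""
    arr = np.asarray(vec, dtype=np.float32)
    if arr.shape != (DIM,):
        raise ValueError(f"expected shape ({DIM},), got {arr.shape}")
    return arr.tobytes()


def decode_vector(raw):
    """Return the stored vector, or None if the blob has the wrong length."""
    if raw is None or len(raw) != EXPECTED_BYTES:
        return None
    return np.frombuffer(raw, dtype=np.float32)
```

Calling `decode_vector(client.hget(key, "vector"))` then returns `None` for a corrupted record instead of raising inside numpy, which makes it easy to count or list the affected keys.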

I can't tell if this is a bug, or if I'm doing something wrong. Either way, something clearly isn't right.

Edit: I've discovered that the records that are corrupted aren't actually in the index (if I'm interpreting this right).

I check whether or not a record is in the index by running

client.ft(REDIS_INDEX_NAME).execute_command("FT.SEARCH", REDIS_INDEX_NAME, "*", "INKEYS", "1", key)

This returns nothing when the record is not in the index. I'm now questioning whether or not I somehow wrote a number of corrupted records to this database with an old piece of code that has since been fixed. This might be the explanation.
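The same check can be wrapped in a small helper. This is only a sketch: `is_indexed` is a hypothetical name, and it assumes the raw RESP2 reply shape for FT.SEARCH, where the first element is the total number of matching documents. NOCONTENT is added so that no field values (including the vector) are returned.

```python
def is_indexed(client, index_name, key):
    """Return True if `key` is currently a document in the search index.

    Assumes `client` is a redis-py client speaking RESP2, so the raw
    FT.SEARCH reply is a list whose first element is the match count.
    """
    reply = client.execute_command(
        "FT.SEARCH", index_name, "*",
        "NOCONTENT",          # only the count and key names are needed
        "INKEYS", "1", key,   # restrict the search to this single key
    )
    return reply[0] > 0
```

For example, `is_indexed(client, REDIS_INDEX_NAME, key)` can be run over the keys whose vector field has an unexpected length to confirm that they are missing from the index.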

Edit 2: The corrupted records are distributed evenly throughout the database by insertion time, so this isn't an issue of some old code that was buggy and has since been fixed.


Solution

  • I've discovered the issue. I've been using this vector index for de-duplication purposes (by checking the index for records with a high cosine similarity to new records before adding them). During this process, I sometimes update other fields on the records.

    In cases where a near duplicate is discovered, I'll update the non-vector fields on the records and write them back. The problem is that I read the entire record when I perform the duplicate check. The client tries to decode the raw byte representation of the vector as a string, and in about 50% of cases, that vector can't be decoded as a string. Rather than raising an error, it returns a corrupt vector.

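    Because I wasn't carefully pruning the fields returned from the search before writing them back, I was writing the corrupted vector back over the original. Presumably a vector with the wrong byte length can no longer be indexed, which would explain why the corrupted records were missing from the index.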

    I definitely deserve some of the blame for this, but the fact that Redis Search would fail to decode a field and then return it corrupted (without any error message) seems like a bug to me, especially since this was the result of a vector search on that field. It would have been easy for the client to automatically determine that the field should not be decoded.
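A minimal sketch of how this kind of corruption can arise, and of one defensive fix. It assumes the client decodes binary fields as UTF-8 and silently drops bytes that are not valid UTF-8. That behavior is not confirmed here, but it matches the symptom of blobs coming back shorter and no longer a multiple of 4 bytes. Both `lossy_roundtrip` and `writable_fields` are hypothetical helper names.

```python
import struct


def lossy_roundtrip(raw: bytes) -> bytes:
    """Mimic a client that decodes a binary field as UTF-8, dropping invalid bytes."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


def writable_fields(fields: dict, vector_field: str = "vector") -> dict:
    """Return a copy of `fields` without the vector field, safe to write back."""
    return {k: v for k, v in fields.items() if k != vector_field}


# Some float32 values happen to be valid UTF-8 and survive; others lose bytes.
for value in (0.5, 1.0, -1.0):
    raw = struct.pack("<f", value)  # little-endian float32, as numpy produces on x86
    print(value, len(raw), "->", len(lossy_roundtrip(raw)))
# 0.5 4 -> 4
# 1.0 4 -> 3
# -1.0 4 -> 2
```

With a helper like this, the write-back becomes `client.hset(key, mapping=writable_fields(updated_fields))`, so the stored vector bytes are never replaced by a decoded copy. Restricting the search with `Query.return_fields(...)` so that the vector is never returned in the first place is another option.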