I have a Redis cache using Redis Search and an HNSW index over 512-element vectors of float32 values.
It is defined like this:
from redis.commands.search.field import VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType

schema = (
    VectorField(
        "vector",
        "HNSW",
        {
            "TYPE": "FLOAT32",
            "DIM": 512,
            "DISTANCE_METRIC": "IP",
            "EF_RUNTIME": 400,
            "EPSILON": 0.4,
        },
        as_name="vector",
    ),
)
definition = IndexDefinition(prefix=[REDIS_PREFIX], index_type=IndexType.HASH)
res = client.ft(REDIS_INDEX_NAME).create_index(
    fields=schema, definition=definition
)
I can insert numpy float32 vectors into this index by writing the result of vector.tobytes() directly into the hash field. I can then accurately query those same vectors using a vector similarity search, roughly as sketched below.
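For reference, this is approximately the write and query path (a sketch; the key suffix and KNN parameters are illustrative, not my exact values):

import numpy as np
from redis.commands.search.query import Query

# Write: store the raw float32 bytes in the indexed hash field.
vec = np.random.rand(512).astype(np.float32)
client.hset(f"{REDIS_PREFIX}some-id", mapping={"vector": vec.tobytes()})

# Query: KNN search against the HNSW index, passing the query vector as a blob.
q = (
    Query("*=>[KNN 10 @vector $blob AS score]")
    .sort_by("score")
    .dialect(2)
)
res = client.ft(REDIS_INDEX_NAME).search(q, query_params={"blob": vec.tobytes()})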
Despite this working correctly, when I read these vectors out of the cache using client.hget(key, "vector")
I get back a variable number of bytes. All of these vectors are definitely 512 elements (2048 bytes) when I insert them, but sometimes they come back as a number of bytes that isn't even a multiple of 4! I can't decode them back into a numpy vector at that point.
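A minimal check, assuming key holds one of the affected records:

import numpy as np

raw = client.hget(key, "vector")
print(len(raw))  # expected 512 * 4 = 2048, but varies from key to key
# This raises ValueError when the length isn't a multiple of 4:
vec = np.frombuffer(raw, dtype=np.float32)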
I can't tell if this is a bug, or if I'm doing something wrong. Either way, something clearly isn't right.
Edit: I've discovered that the records that are corrupted aren't actually in the index (if I'm interpreting this right).
I check whether a record is in the index by running
client.ft(REDIS_INDEX_NAME).execute_command("FT.SEARCH", REDIS_INDEX_NAME, "*", "INKEYS", "1", key)
This returns nothing when the record is not in the index. I'm now wondering whether I wrote a batch of corrupted records to this database with an old piece of code that has since been fixed; that might be the explanation.
Edit 2: The corrupted records are distributed evenly throughout the database by insertion time, so this isn't an issue of some old code that was buggy and has since been fixed.
I've discovered the issue. I've been using this vector index for de-duplication purposes (by checking the index for records with a high cosine similarity to new records before adding them). During this process, I sometimes update other fields on the records.
In cases where a near duplicate is discovered, I'll update the non-vector fields on the record and write it back. The problem is that I read the entire record when I perform the duplicate check. The redis-py client tries to decode the raw byte representation of the vector as a UTF-8 string, and in about 50% of cases those bytes aren't valid UTF-8. Rather than raising an error, it returns a corrupt vector.
Because I wasn't carefully pruning the fields returned from the search before writing the record back, I was writing that corrupted vector back into the index.
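Put together, the failure path looks roughly like this (a sketch: it assumes a client created with decode_responses=True and a non-strict encoding_errors setting, which is what matches the silent corruption I saw; the field name is hypothetical):

import redis

r = redis.Redis(decode_responses=True, encoding_errors="replace")

record = r.hgetall(key)           # "vector" comes back decoded as a str; any byte
                                  # that isn't valid UTF-8 is silently replaced
record["some_field"] = "updated"  # the change I actually wanted to make
r.hset(key, mapping=record)       # ...but this re-encodes the mangled str and
                                  # overwrites the original float32 bytes

Vectors whose raw bytes happen to be valid UTF-8 survive the str round-trip unchanged, which would explain why only about half of the records were corrupted.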
I definitely deserve some of the blame for this, but the fact that the client would fail to decode a field and then return it corrupted (without any error) seems like a bug to me, especially since that same field is the target of a vector search. It would have been easy for the client to determine automatically that the field should not be decoded.
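For anyone hitting the same thing, two ways to avoid the round-trip (field names here are hypothetical):

import redis

# Option 1: write back only the fields you actually changed, never the vector.
client.hset(key, mapping={"some_field": "updated"})

# Option 2: use a separate non-decoding client for anything that touches raw bytes.
raw_client = redis.Redis(decode_responses=False)
raw = raw_client.hget(key, "vector")  # real bytes, safe for np.frombuffer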