I have a Redis instance in AWS ElastiCache with keyspace notifications enabled (Kx).
My problem is that sometimes the expired event is not triggered when a key expires.
I am monitoring the key's TTL with redis-cli and can literally watch it expire, yet no event is published (i.e. the expired key never appears under psubscribe __keyspace@*__:*). To make it weirder, if I set the exact same key again with a new expire time, the event is triggered. It almost seems random whether the event fires the first time, but on the second try it always fires.
I have tried different TTLs between 1 and 60 seconds, with no difference.
I really don't know what I am missing here.
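For reference, this is roughly the setup (a minimal sketch using the redis-py client; the endpoint name is a placeholder, not our real one):

```python
# Minimal reproduction sketch, assuming the redis-py client.
# The host below is a placeholder for the ElastiCache primary endpoint.
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)

p = r.pubsub()
# With "Kx" enabled, an expiration is published on __keyspace@<db>__:<key>
# with the payload "expired".
p.psubscribe("__keyspace@*__:*")

# Set a key with a short TTL, then wait for its expired event.
r.set("probe:expire-test", "1", ex=10)

for msg in p.listen():
    if msg["type"] == "pmessage" and msg["data"] == b"expired":
        print("expired event on", msg["channel"])
        break
```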
The node type is cache.t3.medium. Here is the output of the INFO command:
# Server
redis_version:5.0.6
redis_git_sha1:0
redis_git_dirty:0
redis_build_id:0
redis_mode:standalone
os:Amazon ElastiCache
arch_bits:64
multiplexing_api:epoll
atomicvar_api:atomic-builtin
gcc_version:0.0.0
process_id:1
run_id:29bd78297b28d1e4cd37a3bf51129f87eb6a9bd2
tcp_port:6379
uptime_in_seconds:734256
uptime_in_days:8
hz:10
configured_hz:10
lru_clock:15501077
executable:-
config_file:-
# Clients
connected_clients:50
client_recent_max_input_buffer:2
client_recent_max_output_buffer:0
blocked_clients:0
# Memory
used_memory:1383823976
used_memory_human:1.29G
used_memory_rss:1392582656
used_memory_rss_human:1.30G
used_memory_peak:2493376432
used_memory_peak_human:2.32G
used_memory_peak_perc:55.50%
used_memory_overhead:30088284
used_memory_startup:4263632
used_memory_dataset:1353735692
used_memory_dataset_perc:98.13%
allocator_allocated:1384415096
allocator_active:1393147904
allocator_resident:1419923456
used_memory_lua:45056
used_memory_lua_human:44.00K
used_memory_scripts:512
used_memory_scripts_human:512B
number_of_cached_scripts:2
maxmemory:2488396677
maxmemory_human:2.32G
maxmemory_policy:volatile-lru
allocator_frag_ratio:1.01
allocator_frag_bytes:8732808
allocator_rss_ratio:1.02
allocator_rss_bytes:26775552
rss_overhead_ratio:0.98
rss_overhead_bytes:-27340800
mem_fragmentation_ratio:1.01
mem_fragmentation_bytes:8759592
mem_not_counted_for_evict:0
mem_replication_backlog:0
mem_clients_slaves:0
mem_clients_normal:922044
mem_aof_buffer:0
mem_allocator:jemalloc-5.1.0
active_defrag_running:0
lazyfree_pending_objects:0
# Persistence
loading:0
rdb_changes_since_last_save:5525184
rdb_bgsave_in_progress:0
rdb_last_save_time:1625379557
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:-1
rdb_current_bgsave_time_sec:-1
rdb_last_cow_size:0
aof_enabled:0
aof_rewrite_in_progress:0
aof_rewrite_scheduled:0
aof_last_rewrite_time_sec:-1
aof_current_rewrite_time_sec:-1
aof_last_bgrewrite_status:ok
aof_last_write_status:ok
aof_last_cow_size:0
# Stats
total_connections_received:76780
total_commands_processed:9953299
instantaneous_ops_per_sec:20
total_net_input_bytes:3135151221
total_net_output_bytes:9973695547
instantaneous_input_kbps:2.58
instantaneous_output_kbps:12.09
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
expired_keys:1811958
expired_stale_perc:1.86
expired_time_cap_reached_count:0
evicted_keys:107069
keyspace_hits:1694960
keyspace_misses:532376
pubsub_channels:0
pubsub_patterns:4
latest_fork_usec:0
migrate_cached_sockets:0
slave_expires_tracked_keys:0
active_defrag_hits:0
active_defrag_misses:0
active_defrag_key_hits:0
active_defrag_key_misses:0
# CPU
used_cpu_sys:519.375822
used_cpu_user:2115.832520
used_cpu_sys_children:0.000000
used_cpu_user_children:0.000000
# SSL
ssl_enabled:no
ssl_connections_to_previous_certificate:0
ssl_connections_to_current_certificate:0
ssl_current_certificate_not_before_date:(null)
ssl_current_certificate_not_after_date:(null)
ssl_current_certificate_serial:0
# Cluster
cluster_enabled:0
# Keyspace
db0:keys=325773,expires=125691,avg_ttl=32887832
Thanks
UPDATE:
Executing a keys * command that would have returned the keys with missing expired events triggers those missing events. We still had no idea why this behaviour occurred.
After a long time we concluded that this happens when the number of keys with a TTL in Redis is high and both the variance and the average of their TTLs are high as well. Our distribution was mainly comprised of keys expiring far in the future (sometimes even months away), plus a small portion of keys (< 5%) with a TTL of less than one minute.
This behaviour can then be explained, in our situation, by how Redis expires keys. Quoting the documentation:
Specifically this is what Redis does 10 times per second:
- Test 20 random keys from the set of keys with an associated expire.
- Delete all the keys found expired.
- If more than 25% of keys were expired, start again from step 1.
This is a trivial probabilistic algorithm, basically the assumption is that our sample is representative of the whole key space, and we continue to expire until the percentage of keys that are likely to be expired is under 25%.
The latter assumption does not hold for our TTL distribution, so we concluded that Redis was not the right tool for our use case.
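To make the mismatch concrete, here is a back-of-the-envelope sketch (our own interpretation, using the expires count from the Keyspace section above and the 10 Hz / 20-key loop quoted above) of how long a single expired key can wait before the active sweep samples it:

```python
# Rough estimate, not Redis code. Assumes the sweep never loops, because a
# random 20-key sample almost never contains >25% expired keys when fewer
# than 5% of keys are short-lived, so it stays at 10 * 20 sampled keys/second.
keys_with_ttl = 125_691        # "expires" count from the INFO Keyspace section
samples_per_second = 10 * 20   # 10 cycles per second, 20 random keys each

p_sampled_per_second = samples_per_second / keys_with_ttl
expected_wait_minutes = keys_with_ttl / samples_per_second / 60
print(f"chance of being sampled per second: {p_sampled_per_second:.3%}")
print(f"expected wait for active expiration: ~{expected_wait_minutes:.0f} minutes")
```

So an already-expired short-TTL key can sit there for roughly ten minutes on average before the sweep finds it and publishes the expired event, unless something accesses it first.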
Also, in this documentation they explain that keys can also expire when a client tries to access them, which explains why forcing an access to all keys (keys *) triggers their expiration.
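For anyone hitting the same thing, a hedged workaround sketch (again assuming redis-py; scan_iter and exists are that client's methods) that forces the same access-path expiration without the blocking cost of keys * on a large keyspace:

```python
# Walk the keyspace incrementally with SCAN and read each key; the access
# path lazily deletes expired keys, which in turn publishes the missing
# "expired" notifications. Unlike KEYS *, SCAN does not block the server.
import redis

r = redis.Redis(host="my-elasticache-endpoint", port=6379)

for key in r.scan_iter(count=500):
    r.exists(key)  # any read touches the key and triggers lazy expiration
```

Running something like this periodically papers over the problem, but it does not change the underlying distribution issue, which is why we ultimately moved away from Redis for this use case.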