
Troubleshooting and fixing Cassandra OOM issue


Although there are multiple threads regarding the OOM issue, I would like to clarify certain things. We are running a 36-node Cassandra 3.11.6 cluster in Kubernetes, with 32 GiB allocated to each container.

The container is getting OOM-killed (note: not a Java heap OutOfMemoryError, but the Linux cgroup OOM killer) because it reaches the 32 GiB memory limit of its cgroup.

Stats and configs

limits:
  ephemeral-storage: 2Gi
  memory: 32Gi
requests:
  cpu: 7
  ephemeral-storage: 2Gi
  memory: 32Gi

Cgroup memory limit: 34359738368 bytes -> 32 GiB

The JVM heap sizes auto-calculated by Cassandra: -Xms19660M -Xmx19660M -Xmn4096M
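
For reference, the cgroup limit above can be read from inside the container; this is a quick check, assuming cgroup v1 (under cgroup v2 the file is /sys/fs/cgroup/memory.max):

cat /sys/fs/cgroup/memory/memory.limit_in_bytes
34359738368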

Grafana Screenshot

Cassandra Yaml --> https://pastebin.com/ZZLTc1cM

JVM Options --> https://pastebin.com/tjzZRZvU

nodetool info output from a node that is already consuming 98% of its memory:

nodetool info
ID                     : 59c53bdb-4f61-42f5-a42c-936ea232e12d
Gossip active          : true
Thrift active          : true
Native Transport active: true
Load                   : 179.71 GiB
Generation No          : 1643635507
Uptime (seconds)       : 9134829
Heap Memory (MB)       : 5984.30 / 19250.44
Off Heap Memory (MB)   : 1653.33
Data Center            : datacenter1
Rack                   : rack1
Exceptions             : 5
Key Cache              : entries 138180, size 99.99 MiB, capacity 100 MiB, 9666222 hits, 10281941 requests, 0.940 recent hit rate, 14400 save period in seconds
Row Cache              : entries 10561, size 101.76 MiB, capacity 1000 MiB, 12752 hits, 88528 requests, 0.144 recent hit rate, 900 save period in seconds
Counter Cache          : entries 714, size 80.95 KiB, capacity 50 MiB, 21662 hits, 21688 requests, 0.999 recent hit rate, 7200 save period in seconds
Chunk Cache            : entries 15498, size 968.62 MiB, capacity 1.97 GiB, 283904392 misses, 34456091078 requests, 0.992 recent hit rate, 467.960 microseconds miss latency
Percent Repaired       : 8.28107989669628E-8%
Token                  : (invoke with -T/--tokens to see all 256 tokens)

What had been done

  1. We made sure there is no memory leak in the Cassandra process itself, since we run custom trigger code. GC log analysis shows that we occupy roughly 14 GiB of total JVM space.

Questions

Although we know Cassandra does occupy off-heap space (bloom filters, memtables, etc.):

  1. The Grafana screenshot shows the node occupying 98% of 32 GiB. The JVM heap is 19.2 GiB (-Xmx19660M) and the off-heap space reported in the nodetool info output is 1653.33 MB (~1.6 GiB), so heap + off-heap comes to roughly 21 GiB. Where is the remaining ~11 GiB, and how can we account exactly for what is occupying it? (nodetool tablestats and nodetool cfstats output are not shared for compliance reasons.)

Our production cluster requires a lot of approvals, so enabling remote JConsole access is difficult. Are there any other ways to account for this memory usage? (An OS-level approach we could try is sketched below, after the questions.)

  2. Once we have accounted for the memory usage, what are the next steps to fix this and avoid the OOM kill?
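
One OS-level way we could try to do this accounting without JConsole, assuming shell access to the pod (the process name and cgroup v1 paths below are assumptions), is to compare the cgroup's rss/cache split with the largest mappings of the Cassandra process:

# Assumption: Cassandra's main class is CassandraDaemon in this container
CASS_PID=$(pgrep -f CassandraDaemon)

# cgroup v1 view: "rss" is anonymous memory (heap and native allocations),
# while "cache" includes page cache from mmapped files such as SSTables
grep -E '^(rss|cache|mapped_file) ' /sys/fs/cgroup/memory/memory.stat

# Per-mapping view: the 20 largest mappings by resident size (RSS, in kB)
pmap -x "$CASS_PID" | sort -n -k3 | tail -20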

Solution

  • There's a good chance that the SSTables are getting mapped into memory (cached with mmap()). If this is the case, it wouldn't be immediate; memory usage would grow over time as SSTables are read and then cached. I've written about this issue in https://dba.stackexchange.com/q/340515/255786.

    There's an issue with a not-so-well-known configuration property called "disk access mode". When it's not set in cassandra.yaml, it defaults to mmap, which means that all SSTables get mmapped into memory. If so, you'll see an entry in system.log on startup that looks like:

    INFO  [main] 2019-05-02 12:33:21,572  DatabaseDescriptor.java:350 - \
      DiskAccessMode 'auto' determined to be mmap, indexAccessMode is mmap
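
    A quick way to confirm that this is what's eating the remaining memory is to sum the resident size of the mmapped *-Data.db mappings of the Cassandra process. A sketch (the process name here is an assumption):

    # Sum the resident size (kB) of all memory-mapped *-Data.db SSTable files
    pmap -x "$(pgrep -f CassandraDaemon)" | awk '/Data.db/ { kb += $3 } END { print kb " kB resident in mmapped Data.db files" }'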
    

    The solution is to configure the disk access mode so that only the SSTable index files (not the *-Data.db components) are mmapped, by setting:

    disk_access_mode: mmap_index_only
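
    After restarting the node, the change can be verified by looking for the DiskAccessMode entry in the startup log again; a quick check, assuming the default log location:

    grep -i DiskAccessMode /var/log/cassandra/system.log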
    

    For more information, see the link I posted above. Cheers!


    UPDATE 2024: This answer previously linked to DataStax Community article #6947, which has been retired in favour of the Stack Exchange Network.