Some background: we are building a machine learning farm on top of an Ignite cluster. Part of the use case is generating training data sets, which are gigantic matrices (up to billions of rows by thousands of columns, in theory), one row per data entry in the Ignite cache.
We are using SqlQuery to fetch the records matching a predicate locally on each node, then we iterate over the records, generate vectors, and write them to external storage for further consumption. Each node exports its data independently, so with 32 nodes we end up with 32 exported data sets.
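For context, our per-node export loop looks roughly like this. This is a stripped-down sketch: DataEntry, the cache name, and the vectorization step are illustrative stand-ins, not our actual code.

```java
import javax.cache.Cache;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.SqlQuery;
import org.apache.ignite.cache.query.annotations.QuerySqlField;

public class LocalExport {
    /** Illustrative value class; real entries have thousands of fields. */
    public static class DataEntry {
        @QuerySqlField(index = true)
        public double someField;
    }

    /** Hypothetical vectorization step. */
    static double[] toVector(DataEntry e) {
        return new double[] { e.someField };
    }

    static void exportLocalRows(Ignite ignite) {
        IgniteCache<Long, DataEntry> cache = ignite.cache("entries");

        // Local query: evaluate the predicate against this node's data only.
        SqlQuery<Long, DataEntry> qry = new SqlQuery<>(DataEntry.class, "someField > ?");
        qry.setArgs(0.5);
        qry.setLocal(true);

        try (QueryCursor<Cache.Entry<Long, DataEntry>> cur = cache.query(qry)) {
            for (Cache.Entry<Long, DataEntry> e : cur) {
                double[] vector = toVector(e.getValue());
                // Write the vector to external storage here;
                // each node produces its own exported data set.
            }
        }
    }
}
```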
The problem: small data set generations worked OK, but larger ones (queries expected to return 10M+ rows per node) basically kill the entire cluster, blowing up the nodes with OOMEs and GC hell. We looked at the "Performance and Debugging" section of the Ignite docs (https://apacheignite-sql.readme.io/docs/performance-and-debugging#result-set-lazy-load) and tried the lazy result set and page size settings. Nope.
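Concretely, the settings we tried look like this (SQL text and values illustrative). Per the docs, lazy should stream rows instead of materializing the result set, and a smaller page size should bound memory per batch:

```java
import org.apache.ignite.cache.query.SqlFieldsQuery;

SqlFieldsQuery qry = new SqlFieldsQuery(
    "SELECT id, payload FROM DataEntry WHERE someField > ?");
qry.setArgs(0.5);
qry.setLocal(true);    // run against this node's data only
qry.setLazy(true);     // supposed to avoid materializing the full result set
qry.setPageSize(512);  // supposed to fetch rows in small batches
```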
Our investigation (profiling, memory dumps, debugger, etc.) suggested that the result sets of the queries are loaded into memory completely before our code gets to read the first row, even though we are using a QueryCursor and iterating over it. Query#pageSize and SqlFieldsQuery#setLazy do not seem to have any effect on that whatsoever. Digging deeper, it turned out that H2 (which Ignite indexing uses) does not support server-side cursors in result sets at all, and can only be configured to buffer the records to disk (even with SSDs, this is a performance non-starter). Reading through the Ignite source code suggested that Query#pageSize and SqlFieldsQuery#setLazy are only used in distributed queries / scan queries, and even there Ignite still fully reads the result sets into memory on the nodes.
Sigh. These are the remediations we thought of:

1. Set h2.maxMemoryRows to some small value. Seems like a silly solution (memory-to-memory in the same JVM through a buffer on disk? really?).
2. Abandon SqlQuery / SqlFieldsQuery, bypass H2, and use ScanQuery instead. This is a ton of work (we would have to parse the predicates and compile them into IgniteBiPredicate), plus scan queries are full table scans, therefore no indexes / optimizations, which sort of makes the whole ordeal pointless.

The question: is our assessment correct, and is query result streaming simply not a thing for local queries? Are there any more sensible workarounds?
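To illustrate option 2, this is roughly what the ScanQuery fallback would look like. The predicate here is a hand-written stand-in for what we would have to compile from the SQL WHERE clause; DataEntry and the export step are illustrative names:

```java
import javax.cache.Cache;

import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.QueryCursor;
import org.apache.ignite.cache.query.ScanQuery;
import org.apache.ignite.lang.IgniteBiPredicate;

// Hand-compiled equivalent of "someField > 0.5" -- this translation
// step is exactly the "ton of work" part for arbitrary predicates.
IgniteBiPredicate<Long, DataEntry> filter = (key, val) -> val.someField > 0.5;

ScanQuery<Long, DataEntry> scan = new ScanQuery<>(filter);
scan.setLocal(true);    // only this node's data
scan.setPageSize(512);  // scan queries do page through results

try (QueryCursor<Cache.Entry<Long, DataEntry>> cur = cache.query(scan)) {
    for (Cache.Entry<Long, DataEntry> e : cur) {
        // vectorize and write to external storage, as before;
        // note: full partition scan, no index is consulted
    }
}
```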
Would appreciate some insight from Ignite comrades if they read this. Thanks in advance!
UPDATE: in case it matters, the cluster is 8 nodes as follows:
Architecture: x86_64
CPU(s): 32
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
CPU MHz: 2345.904
BogoMIPS: 4600.05
Hypervisor vendor: Xen
Virtualization type: full
RAM: 240 GB
These are designated EC2 instances with ephemeral volumes for the Ignite /data mount.
UPDATE 2: updating this after seemingly solving the issue. Note that the question was about local queries. It looks like using SqlFieldsQuery with the lazy flag helped for local queries. A local SqlQuery seems to ignore the lazy flag.
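For anyone hitting the same wall, this is roughly the shape that worked for us: a local SqlFieldsQuery with lazy enabled, iterated through the cursor (SQL text and names are illustrative). The key difference from our original approach is SqlFieldsQuery instead of SqlQuery:

```java
import java.util.List;

import org.apache.ignite.cache.query.FieldsQueryCursor;
import org.apache.ignite.cache.query.SqlFieldsQuery;

SqlFieldsQuery qry = new SqlFieldsQuery(
    "SELECT id, payload FROM DataEntry WHERE someField > ?");
qry.setArgs(0.5);
qry.setLocal(true);
qry.setLazy(true); // honored for SqlFieldsQuery; a local SqlQuery ignores it

try (FieldsQueryCursor<List<?>> cur = cache.query(qry)) {
    for (List<?> row : cur) {
        // rows arrive incrementally now, instead of the whole
        // result set being materialized in heap up front
    }
}
```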