
Simple SELECT(*) queries very slow in Apache Ignite


I'm prototyping Apache Ignite for use in a new product. I need to store a very wide matrix in memory and access various random rows and columns. The matrix has 50,000 columns and potentially millions of rows, and it will essentially be read-only, so write speed is not a concern.

We never need to order the results. The client application will request specific rows by primary key, usually only certain columns, and occasionally the entire row(s).

I loaded the data into an Apache Ignite SQL table that looks like this:

CREATE TABLE myMatrix (
    name CHAR(20) PRIMARY KEY,
    col1 INT,
    col2 INT,
    col3 INT,
    ...
    col50000 INT
);

I've also indexed the 'name' column:

CREATE INDEX name_index ON myMatrix(name)

However, when I try to select just one row, it takes over 10 seconds to return the result. We need much, much faster response times: the client will expect a set of rows (possibly hundreds or thousands) and columns (up to all of them) in less than a second.

import time

from pyignite import Client

client = Client()

# Connect to the local Ignite node over the default thin-client port.
client.connect('127.0.0.1', 10800)

# Time a single-row lookup by primary key.
now = time.time()
result = client.sql("SELECT * FROM full_test_table WHERE name='F1S4_160106_058_G01'")
print('Got Result in')
print(time.time() - now)

Why is this so painfully slow? I was hoping that holding the entire table in memory would give faster results. Is it just the width of the table that's causing issues?

The database is running on an r4.8xlarge instance, with 32 cores and 244GB of memory.


Solution

  • I have tried running an example very similar to yours, and it turns out that Ignite's SQL parser (which is based on the H2 SQL parser) has quadratic complexity in the number of returned columns. That makes 50,000 columns in an Ignite table completely infeasible, as is anything much larger than a few dozen columns, for that matter. I will try to file an issue against the H2 bug tracker. (A rough timing sketch that illustrates the scaling is included at the end of this answer.)

    Previous answer:

    Do you have persistence enabled, or is everything in RAM? I just can't see why it would take so long. Maybe there's an index inlining problem - have you tried CREATE INDEX name_index ON myMatrix(name) INLINE_SIZE 25?

    I second that having 50,000 columns is not optimal. It would be better to store each row as an array - see the key-value sketch below.
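
    To make the parser scaling concrete, here is a rough sketch of how one might measure it: it creates progressively wider tables and times a single-row SELECT * against each. The wide_* table names and the column counts are made up for illustration, and it assumes the same pyignite thin client used in the question.

    import time

    from pyignite import Client

    client = Client()
    client.connect('127.0.0.1', 10800)

    # Create hypothetical tables of increasing width and time SELECT * on each.
    for ncols in (100, 1000, 5000):
        cols = ', '.join(f'col{i} INT' for i in range(ncols))
        client.sql(f'CREATE TABLE wide_{ncols} (name CHAR(20) PRIMARY KEY, {cols})')
        client.sql(f"INSERT INTO wide_{ncols} (name, col0) VALUES ('row1', 1)")

        start = time.time()
        # Materialise the cursor so the full response is actually fetched.
        rows = list(client.sql(f'SELECT * FROM wide_{ncols}'))
        print(f'{ncols} columns: {time.time() - start:.2f} s')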
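
    And here is a minimal sketch of the array-based alternative, using Ignite's key-value API through pyignite instead of a wide SQL table. The cache name 'matrix_rows' and the row contents are hypothetical; the point is that each row becomes a single value, so a lookup never involves the SQL parser or 50,000 column definitions.

    from pyignite import Client

    client = Client()
    client.connect('127.0.0.1', 10800)

    # Hypothetical cache: one entry per matrix row, keyed by the row name,
    # with all column values stored together as one list of ints.
    cache = client.get_or_create_cache('matrix_rows')

    cache.put('F1S4_160106_058_G01', [0] * 50000)

    # Reading a row is a plain key lookup; picking out columns is client-side slicing.
    row = cache.get('F1S4_160106_058_G01')
    cols_of_interest = row[100:200]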