Tags: python, sql, ignite, in-memory-database, key-value-store

Apache Ignite inserts extremely slow


I'm attempting to load a large matrix into an Apache Ignite master node running in AWS. The EC2 instance has 128GB of memory and 512GB of disk space.

The matrix is a CSV with 50,000 columns and 15,000 rows.

Loading is extremely slow: the first 150 inserts, batched together, take over 30 minutes to complete. I am using the Python thin client:

import pandas as pd
from pyignite import Client

client = Client()
client.connect('127.0.0.1', 10800)  # default thin-client port

print('deleting records...')
client.sql('DELETE FROM full_test_table')

df = pd.read_csv('exon.csv')

# Column names, skipping the first (key) column
col = list(df)
col = col[1:]

names = ', '.join('"' + item + '"' for item in col)
names = 'name, ' + names

for index, row in df.iterrows():
    print('inserting for {0}'.format(str(row[0])))

    # Use the row index as the key; SQL string literals need single quotes
    row[0] = "'{0}'".format(index)

    values = ', '.join(str(item) for item in row)
    sql = 'INSERT INTO full_test_table ({0}) VALUES({1})'.format(names, values)
    client.sql(sql)

I would like to use Python to load the data, as I'm more familiar with it than Java. This seems unreasonably slow to me - even PostgreSQL can handle these inserts in seconds. What's the issue?

I've tried the COPY command from CSV as well - that doesn't seem to work any faster.


Solution

  • As of Ignite 2.7, the Python thin client, like the other thin clients, uses one of the server nodes as a proxy - usually the one you set in the connection string. The proxy receives all requests from the client, directs them to the other servers if needed, and sends result sets back to the client. So the proxy might be a bottleneck in your case, as might overall network throughput. Check that the proxy server isn't maxing out its CPUs and doesn't have garbage-collection or memory-utilization issues. Starting with Ignite 2.8, the proxy will no longer be needed.

    In any case, the fastest way to preload data into Ignite is the IgniteDataStreamer API. It isn't available for Python yet, but a Java application is pretty straightforward - you can use this example as a reference, putting your records into the streamer through the key-value APIs (see the data-streamer sketch below).

    If you'd like to keep using SQL INSERTs, use either the JDBC or ODBC driver together with the SET STREAMING command (see the JDBC sketch below).
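
    For illustration, here is a rough Java sketch of the data-streamer pattern. The cache name ("matrixCache"), the key and value types, and the row loop are all placeholders, and it assumes a client node that can discover your existing cluster; it is not tied to the SQL table from the question.

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class StreamerLoad {
    public static void main(String[] args) {
        // Join the existing cluster as a client node
        // (discovery settings are assumed to match your AWS deployment).
        Ignition.setClientMode(true);

        try (Ignite ignite = Ignition.start()) {
            // "matrixCache" is a placeholder cache name for this sketch.
            ignite.getOrCreateCache("matrixCache");

            try (IgniteDataStreamer<String, double[]> streamer =
                     ignite.dataStreamer("matrixCache")) {
                streamer.allowOverwrite(true);

                // In a real loader you would iterate over the CSV here;
                // each row becomes one key-value pair.
                for (int i = 0; i < 15_000; i++) {
                    double[] row = new double[50_000];  // placeholder row data
                    streamer.addData("row-" + i, row);
                }
            }  // close() flushes anything still buffered
        }
    }
}

    The streamer buffers entries per server node and sends them in parallel batches, which is what makes it so much faster than one-row-at-a-time SQL inserts through the thin client.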
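
    And a similarly hedged sketch of the JDBC route, assuming the Ignite JDBC thin driver is on the classpath and the table already exists; the column list is cut down to a single placeholder column ("col1") just to keep the example short.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class JdbcStreamingLoad {
    public static void main(String[] args) throws Exception {
        // Ignite JDBC thin driver; 10800 is the default thin-client port.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800")) {

            // Switch the connection into streaming mode so the INSERTs are
            // batched and fed to the cluster through the data streamer.
            try (Statement st = conn.createStatement()) {
                st.execute("SET STREAMING ON");
            }

            // Only two columns here to keep the sketch short; your real
            // statement would list every column of full_test_table.
            try (PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO full_test_table (name, \"col1\") VALUES (?, ?)")) {
                for (int i = 0; i < 15_000; i++) {
                    ps.setString(1, "row-" + i);
                    ps.setDouble(2, 0.0);  // placeholder value
                    ps.executeUpdate();
                }
            }

            // Turning streaming off flushes any buffered rows.
            try (Statement st = conn.createStatement()) {
                st.execute("SET STREAMING OFF");
            }
        }
    }
}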