Tags: mysql, python-3.x, sql-insert, database-deadlocks

SQL Deadlock with Python Data Insert


I'm currently trying to build a database interface with Python to store stock data. This data comes as a list of tuples, with each element consisting of "date, tickersymbol, open, high, low, close, volume". The date represents a UNIX timestamp and has to be unique in combination with the ticker symbol in the database. Below is an example of a typically processed output (company_stock):

 [(1489780560, 'NYSE:F', 12.5, 12.505, 12.49, 12.495, 567726),
 (1489780620, 'NYSE:F', 12.495, 12.5, 12.48, 12.48, 832487),
 (1489780680, 'NYSE:F', 12.485, 12.49, 12.47, 12.475, 649818),
 (1489780740, 'NYSE:F', 12.475, 12.48, 12.47, 12.47, 700579),
 (1489780800, 'NYSE:F', 12.47, 12.48, 12.47, 12.48, 567798)]

I'm using the pymysql package to insert this list into a local MySQL database (version 5.5). While the code runs through and the values get inserted, the database will crash - or rather stop - after reaching about 250k rows. Below is the export part of the stock data processing function, which gets called about once every 20 seconds and inserts about 400 values.

import pymysql
from datetime import datetime

# SQL Export
def tosql(company_stock, ticker, interval, amount_period, period):
    conn = pymysql.connect(host="localhost", user="root",
                           passwd="pw", db="db", charset="utf8",
                           autocommit=True,
                           cursorclass=pymysql.cursors.DictCursor)
    cur = conn.cursor()
    try:
        # Bulk insert into the temp import table
        query = "INSERT INTO stockdata_import "
        query += "(date, tickersymbol, open, high, low, close, volume) "
        query += "VALUES (%s, %s, %s, %s, %s, %s, %s)"
        cur.executemany(query, company_stock)

        # Duplicate check with temp table, then copy into existing storage
        query = "INSERT INTO stockdata (date, tickersymbol, open, high, low, close, volume) "
        query += "SELECT i.date, i.tickersymbol, i.open, i.high, i.low, "
        query += "i.close, i.volume FROM stockdata_import i "
        query += "WHERE NOT EXISTS(SELECT dv.date, dv.tickersymbol FROM "
        query += "stockdata dv WHERE dv.date = i.date "
        query += "AND dv.tickersymbol = i.tickersymbol)"
        cur.execute(query)

        print(": ".join([datetime.now().strftime("%d.%m.%Y %H:%M:%S"),
                         "Data stored in Vault. Ticker", str(ticker),
                         "Interval", str(interval),
                         "Last", str(amount_period), str(period)]))
    finally:
        # Clear temp import table and close connection
        cur.execute("DELETE FROM stockdata_import")
        cur.close()
        conn.close()

I suspect that the check for already existing values takes longer and longer as the database grows and eventually breaks down due to the locking of the tables (?) while checking for uniqueness of the date/ticker combination. Since I expect this database to grow rather fast (about 1 million rows per week), it seems that a different solution is required to ensure that there is only one date/ticker pair. This is the SQL CREATE statement for the import table (the real table with which it gets compared looks the same):

CREATE TABLE stockdata_import (id_stock_imp BIGINT(12) NOT NULL AUTO_INCREMENT,
             date INT(10),
             tickersymbol VARCHAR(16),
             open FLOAT(12,4),
             high FLOAT(12,4),
             low FLOAT(12,4),
             close FLOAT(12,4),
             volume INT(12),
             crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
             PRIMARY KEY(id_stock_imp));

I have already looked into setting a unique constraint for the date/tickersymbol pair and handling the resulting exceptions in Python, but my research so far suggests that this would be even slower, plus I am not even sure whether it would work with the bulk insert of the pymysql cursor function executemany(query, data). A rough sketch of what I had in mind is shown below.
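This is roughly the exception-handling variant I was considering - only a sketch, assuming a unique constraint on (date, tickersymbol) exists; since executemany() would abort on the first duplicate key error, it falls back to row-by-row inserts, which is where I expect the slowdown:

import pymysql

def tosql_with_constraint(conn, company_stock):
    # Assumes a UNIQUE constraint on (date, tickersymbol) in stockdata
    query = ("INSERT INTO stockdata "
             "(date, tickersymbol, open, high, low, close, volume) "
             "VALUES (%s, %s, %s, %s, %s, %s, %s)")
    skipped = 0
    with conn.cursor() as cur:
        for row in company_stock:
            try:
                cur.execute(query, row)
            except pymysql.err.IntegrityError:
                # Duplicate date/ticker pair: skip this row and continue
                skipped += 1
    return skipped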

Context information:

  • The SQL export shown above is the final part of a python script handling the stock data response. This script, in turn, gets called by another script which is timed by a crontab to run at a specific time each day.
  • Once the crontab starts the control script, this will call the subscript about 500 times with a sleep time of about 20-25 seconds between each run.
  • The error which I see in the logs is: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Questions:

  1. How can I optimize the query or alter the storage table to ensure uniqueness for a given date/ticker combination?
  2. Is this even the problem or do I fail to see some other problem here?
  3. Any further advice is also welcome.

Solution

  • If you would like to ensure uniqueness of your data, just add a unique index on the relevant date and tickersymbol fields. A unique index prevents duplicate values from being inserted, so there is no need to check for the existence of data before the insertion (see the sketches after this list).

    Since you do not want to insert duplicate data, use INSERT IGNORE instead of a plain INSERT to suppress duplicate-key errors. Based on the number of affected rows, you can still detect and log duplicate insertions.
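For example, the index could be added like this (the index name uq_date_ticker is just an illustrative choice, applied to the stockdata table from the question):

ALTER TABLE stockdata
    ADD UNIQUE INDEX uq_date_ticker (date, tickersymbol);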
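And a minimal sketch of how the export step could then look with that index in place - INSERT IGNORE lets the rows go straight into stockdata, so the temp table, the NOT EXISTS check and the DELETE cleanup all become unnecessary (tosql_ignore is an illustrative name, not from the question):

import pymysql

def tosql_ignore(conn, company_stock):
    # IGNORE turns duplicate-key errors on the unique index into warnings,
    # so duplicate rows are silently skipped
    query = ("INSERT IGNORE INTO stockdata "
             "(date, tickersymbol, open, high, low, close, volume) "
             "VALUES (%s, %s, %s, %s, %s, %s, %s)")
    with conn.cursor() as cur:
        affected = cur.executemany(query, company_stock)
    # Skipped duplicates do not count as affected rows
    return len(company_stock) - affected

One caveat: INSERT IGNORE downgrades all insert errors to warnings, not only duplicate-key violations, so it is worth inspecting SHOW WARNINGS during development.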