I'm currently trying to build a database interface with python to store stock data. This data is in the form of a tuple list with each element consisting of "date, open, high, low, close, volume. date
represents a UNIX timestamp and has to be unique in combination with the ticker symbol in the database. Below is an example of a typically processed output (company_stock
):
[(1489780560, 'NYSE:F', 12.5, 12.505, 12.49, 12.495, 567726),
(1489780620, 'NYSE:F', 12.495, 12.5, 12.48, 12.48, 832487),
(1489780680, 'NYSE:F', 12.485, 12.49, 12.47, 12.475, 649818),
(1489780740, 'NYSE:F', 12.475, 12.48, 12.47, 12.47, 700579),
(1489780800, 'NYSE:F', 12.47, 12.48, 12.47, 12.48, 567798)]
I'm using the pymysql
package to insert this list into a local MySQL database (Version 5.5). While the code runs through and the values get inserted, the database will crash - or rather stop - after reaching about ~250k rows. Since the relevant This is the export part of the stock data processing function which gets called about once every 20 seconds and inserts about 400 values.
# SQL Export
def tosql(company_stock, ticker, interval, amount_period, period):
try:
conn = pymysql.connect(host = "localhost", user = "root",
passwd = "pw", db = "db", charset = "utf8",
autocommit = True,
cursorclass = pymysql.cursors.DictCursor)
cur = conn.cursor()
# To temp table
query = "INSERT INTO stockdata_import "
query += "(date, tickersymbol, open, high, low, close, volume)"
query += "VALUES (%s, %s, %s, %s, %s, %s, %s)"
cur.executemany(query, company_stock)
# Duplicate Check with temp table and existing database storage
query = "INSERT INTO stockdata (date, tickersymbol, open, high, low, close, volume) "
query += "SELECT i.date, i.tickersymbol, i.open, i.high, i.low, "
query += "i.close, i.volume FROM stockdata_import i "
query += "WHERE NOT EXISTS(SELECT dv.date, dv.tickersymbol FROM "
query += "stockdata dv WHERE dv.date = i.date "
query += "AND dv.tickersymbol = i.tickersymbol)"
cur.execute(query)
print(": ".join([datetime.now().strftime("%d.%m.%Y %H:%M:%S"),
"Data stored in Vault. Ticker", str(ticker),
"Interval", str(interval),
"Last", str(amount_period), str(period)]))
finally:
# Clear temp import table and close connection
query = "DELETE from stockdata_import"
cur.execute(query)
cur.close()
conn.close()
I suspect that the check for already existent values takes too long as the database grows and eventually breaks down due to the lock of the tables (?) while checking for uniqueness of the date/ticker combination. Since I expect this database to grow rather fast (about 1 million rows per week) it seems that a different solution is required to ensure that there is only one date/ticker pair. This is the SQL CREATE
statement for the import table (the real table with which it gets compared looks the same):
CREATE TABLE stockdata_import (id_stock_imp BIGINT(12) NOT NULL AUTO_INCREMENT,
date INT(10),
tickersymbol VARCHAR(16),
open FLOAT(12,4),
high FLOAT(12,4),
low FLOAT(12,4),
close FLOAT(12,4),
volume INT(12),
crawled_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY(id_stock_imp));
I have already looked into setting a constraint for the date
/tickersymbol
pair and to handle upcoming exceptions in python, but my research so far suggested that this would be even slower plus I am not even sure if this will work with the bulk insert of the pymysql
cursor function executemany(query, data)
.
Context information:
Questions:
date
/ticker
combination?If you would like to ensure uniqueness of your data, then just add a unique index on the relevant date
and ticker
fields. Unique index prevents duplicate values from being inserted, therefore there is no need to check for the existence of data before the insertion.
Since you do not want to insert duplicate data, just use insert ignore instead of plain insert to supress duplicate insert errors. Based on the mumber of affected rows, you can still detect and log duplicate insertions.