Search code examples
pythonmysqllarge-data

Speed up inserting large datasets from txt file to mySQL using python


background: I have 500 formatted *.txt files that I need to insert into a mysql database. Currently I have a python script to read the files line by line and insert into mySQL database.

Problem: the files are quite big (~100M per txt file), I tested the script and it takes too long to insert just one file to database.

How can I speed up the process by modifying the scripts?

code:

for file in os.listdir(INPUTFILEPATH):
    ## index += 1
    ## print "processing %s out of %s files " % (index, totalfiles)
    inputfilename = INPUTFILEPATH + "/" + file 
    open_file = open(inputfilename, 'r')
    contents = open_file.readlines()
    totalLines = len(contents)
    ## index2 = 0 
    for i in range(totalLines):
        ## index2 +=1
        ## print "processing %s out of %s lines " % (index2, totalLines)
        lineString = contents[i]
        lineString = lineString.rstrip('\n')
        values = lineString.split('\t')
        if ( len(re.findall(r'[0123456789_\'\.]',values[0])) > 0 ):  
            continue 
        message = """INSERT INTO %s(word,year,count,volume)VALUES('%s','%s','%s','%s')"""% ('1gram', values[0],values[1],values[2],values[3]) 
        cursor.execute(message)
        db.commit()

cursor.close()
db.close() 

Solution

  • Two options to consider:

    1) the easiest is to include multiple rows of values on one insert. This is way, way faster than doing multiple indserts.

    Insetad of doing INSERT INTO tbl ( cols ) VALUES ( vals ), do something like INSERT INTO tbl ( cols ) VALUES ( vals ), ( vals ), ( vals )

    The amount of rows you can insert at once depends on the maximum packet size of the mysql server, but you can probably do 100, 1000, maybe 10000 rows safely and it should give you a performance increase of an order of magnitude or more.

    See http://dev.mysql.com/doc/refman/5.5/en/insert-speed.html

    2) LOAD DATA INFILE is a bit different, requires more work and has its own requirements, but is very very fast.