Search code examples
pythonsqlmysqlpythonanywhere

Extremely slow INSERT query when database not local


I'm adding data to a MySQL 8 database on PythonAnywhere. I'm doing this via a python 3.10 script run in VSCode using an SSH Tunnel. When I add the data it takes approximately 50 seconds to add 275 rows, data in json format is 750kb so total likely a bit less than that. No error messages.

When I use the same script to add the identical data to a database on my local machine it takes 0.2 seconds. Playing with batch sized between commits reduced the total time from 75 seconds to 50 seconds, but that's the best I've got. It's for a web scraper, when the full thing is running it could have a few hundred to a few thousand rows to add/remove, so it's not tenable to leave it like this. With performance this poor I must be doing something wrong, but I can't see what it is. The timer is wrapped around the insert query, it doesn't include the ssh connection time etc.

What's going on here?

The table columns are as follows, so far now indexes being used:

"listingID int PRIMARY KEY AUTO_INCREMENT, 
types VARCHAR(11), 
town VARCHAR(255), 
postcode CHAR(5), 
price INT UNSIGNED, 
agent VARCHAR(50), 
ref VARCHAR(30), 
bedrooms SMALLINT UNSIGNED, 
rooms SMALLINT UNSIGNED, 
plot MEDIUMINT UNSIGNED, 
size MEDIUMINT UNSIGNED, 
link_url VARCHAR(1024), 
description VARCHAR(14000), 
photos TEXT, 
photos_hosted TEXT, 
gps POINT, 
id VARCHAR(80), 
types_original VARCHAR(30)"

The functions adding the data are below. As I said it completes in 0.2 seconds locally so I don't think I'm doing anything too bad there, even if it's not perfect. listings is a list of dictionaries. The weird use of dictionaries for the inserts is because I was using named placeholders with mysql.connector, but I had to switch to MySQLdb due to another problem with accessing the database, and it doesn't permit named placeholders.

def insert_data_to_table(cursor, table_name, columns_list, values_dict, gps_string):
    # Creates a string of csv %s placeholders for the values to be inserted
    placeholders = ", ".join(f"%s" for _ in values_dict.keys())

    # Creates a string of columns to be inserted
    columns = ", ".join(columns_list)

    # If GPS data is provided, insert specific string into query. Else build query without GPS sub string.
    if gps_string:
        insert_query = f"INSERT INTO {table_name} (gps, {columns}) VALUES ({gps_string}, {placeholders})"
    else:
        insert_query = f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"

    cursor.execute(insert_query, tuple(values_dict.values()))

def add_listings(cursor, listings):
    print("Adding data...")
    counter = 0
    for listing in listings:
    # Committing every 50 rows rather than after each row reduces time by approx 33%, minimal differences 
     between 20, 50, 100 etc.
        if counter > 50:
            print("Commit", counter)
            db.commit()
            counter = 0
        columns_list = []

        values_dict = {key: None for key in listing if key != "gps"}

        if listing.get("gps") is None:
            gps_string = None
        elif isinstance(listing.get("gps"), list):
            gps_string = f"ST_GeomFromText('POINT({round(listing['gps'][0], 6)} {round(listing['gps'][1], 6)})', 4326)"

        for key in values_dict:
            if isinstance(listing.get(key), list):
                values_dict[key] = ":;:".join([str(x) for x in listing[key]])
            else:
                values_dict[key] = listing.get(key)
            columns_list.append(key)

        insert_data_to_table(cursor, table_name, columns_list, values_dict, gps_string)
        counter += 1
    db.commit()

Solution

  • The main issue turned out to be my inefficient code. By rewriting it to use executemany it reduced the time taken from 50s to 5s.

    Also the database is hosted in USA and I'm in France, which probably doesn't help.