Search code examples
sqldjangospring-bootmariadbbatch-processing

Is the batch processing speed of Python slower than that of Java?


I am working on migrating a function implemented in Java Spring Boot to Python Django.

I encountered an issue while migrating a function that retrieves a list of keywords through a select query and updates these keywords using an update query.

Specifically, updating 450 rows takes around 0.1 seconds in Java, but it takes 2 seconds in Python.

Below are the methods I used to measure the time and the Java and Python code.

Measurement method

// java

    long startTime = System.nanoTime();
    // Execute the update
    long endTime = System.nanoTime();
    double duration = (endTime - startTime) / 1_000_000_000.0;;
    String formattedDuration = String.format("%.2f seconds", duration);
    log.info("processing_time: {}", formattedDuration);

// python

    start_time = time.perf_counter()
    // Execute the update
    end_time = time.perf_counter()
    processing_time = end_time - start_time
    processing_time_formatted = f"{processing_time:.2f}"
    logger.info(f"processing_time: {processing_time_formatted} seconds")

Logic

Java

// application.yml

spring:
  datasource:
    driver-class-name: org.mariadb.jdbc.Driver
    url: jdbc:mariadb:...
    username: ...
    password: ...
  jpa:
    hibernate:
      ddl-auto: none
    properties:
      hibernate:
        show_sql: true
        use_sql_comments: true
        format_sql: true
        dialect : org.hibernate.dialect.MariaDBDialect
    open-in-view: false
  sql:
    init:
      mode: never


// service

    public void updateUserIntrKwd(String indvNum) {
        // 2s
        List<IntrKwdDto> kwdList = intrKwdMapper.selectIntrKwdList(indvNum);
        // 0.001s
        DatesAndMaxFrequency datesAndMaxFrequency = getDatesAndMaxFrequency(kwdList);
        // 0.1s
        intrKwdMapper.updateUserIntrKwd(kwdList, datesAndMaxFrequency);
    }


// Mapper
  <update id="updateUserIntrKwd" parameterType="map">
    <foreach collection="kwdList" item="item" separator=";">
...

Python

// settings.py

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "NAME": "...",
        "USER": "...",
        "PASSWORD": "...",
        "HOST": "...",
        "PORT": 3306,
    },


// service

def update_user_interest_keywords(indv_num: str) -> int:
    // 2s
    user_kwd_list = select_interest_keywords_list(indv_num)
    // 0.001s
    dates_and_max_frequency = get_dates_and_max_frequency(user_kwd_list)
    // 2s
    updated_rows = 
            execute_user_interest_keywords_query(user_kwd_list, dates_and_max_frequency)

    return updated_rows


// query

def execute_user_interest_keywords_query(userKwdList: list[IntrKwd], datesAndMaxFrequency: DatesAndMaxFrequency):
    base_query = """
        query...
    """

    def get_query_params(item):
        return (
            item.last_aper_yrmo,
        ...
        )

    params = [get_query_params(item) for item in userKwdList]

    try:
        with connections['...'].cursor() as cursor:

            cursor.executemany(base_query, params)
            updated_rows = cursor.rowcount
    ...
    return updated_rows

They share the same database, and I have structured the Python code to be as similar to the Java code as possible.

Notably, the processing speed of the SELECT query is the same. The difference in performance only occurs with the UPDATE query.

I believe the only difference is the environment in which the servers are running.

Based on this thought, I have tried the following methods:

  1. Switching the database library
    I attempted to switch the database library from 'pymysql' to 'mysqlclient', following the advice that 'mysqlclient' might perform better.

    Result: No significant change.

  2. Changing options in 'settings.py'

    'OPTIONS': {
        'init_command': "SET sql_mode='STRICT_TRANS_TABLES'",
        'autocommit': True,
        'connect_timeout': 10,
        'charset': 'utf8mb4',
        'use_unicode': True,
    },
    'CONN_MAX_AGE': 60,
    

    I added the above options to the 'DATABASES'

    Result: No significant change.

  3. Changing .executemany() to .execute()
    Based on the advice that .executemany() might only perform well with INSERT and REPLACE statements, I modified the logic to use .execute() instead.

    Result: No significant change.

  4. Installing the 'mariadb' library
    The database in use is MariaDB version 10.11.1. Based on the advice that installing the 'mariadb' library might improve performance, I added the 'mariadb' library.

    Result: No significant change.

A coding newbie desperately needs your help! Why on earth is the UPDATE query taking over ten times longer, even though the logic is exactly the same?


Solution

  • To long for a comment:

    You're obviously using MariaDB Connector/Java, but not MariaDB Connector/Python, so it's like comparing apples with pears.

    All official MariaDB Connectors support INSERT,UPDATE,DELETE in batch (bulk) mode, this feature is not supported by other Python drivers like pymysql, mysqlclient or MySQL Connector/Python. Depending on the kind of connection (remote or not) bulk mode is up to 15 times faster.

    Using multiple execute() with autocommit will slow down performance - instead start a transaction, execute all statements and commit at the end.

    Unfortunately Django misses native MariaDB support, my PR which I submitted some years ago was closed as "won't fix". A better alternative for MariaDB might be SQLAlchemy, which has it's own dialect for MariaDB Connector/Python and supports native bulk operations.