Search code examples
mysqllarge-datawindow-functionsmysql-8.0

mysql 8 window function wrong results


I have a problem with window functions in MySQL8 - they give incorrect results when applied to large tables (either: large number of rows or large number of columns).

Example: Table: bureau.csv (1.7 mln rows) from https://www.kaggle.com/c/home-credit-default-risk/data

I run 3 simple queries changing only the number of rows of the table to be used and number of columns to be outputed. You can clearly see that the combination of large amount of rows and many outputed columns gives incorrect results of "count(*) over()" - the last column.

high number of rows, low number of columns - result: OK;

high number of rows, high number of columns - result: INCORRECT;

low number of rows, hight number of columns - result: OK;

Ubuntu 16.04, 32GB of RAM

Many thanks for Your help! :)

Witold

my.cnf:
[mysqld]

innodb_buffer_pool_size = 26G
default_authentication_plugin=mysql_native_password
thread_cache_size = 50
innodb-flush-method=O_DIRECT
local_infile=ON
innodb_thread_concurrency=2
internal_tmp_mem_storage_engine=MEMORY
join_buffer_size=1G
temptable_max_ram=4G
tmp_table_size=4G
max_heap_table_size=4G
mysqlx_connect_timeout=99999
mysqlx_read_timeout=99999
mysqlx_write_timeout=99999
net_read_timeout=99999
net_write_timeout=99999
regexp_time_limit=99999
mysqlx_port_open_timeout=99999
windowing-use-high-precision=OFF
sort_buffer_size=4G

The code to run:

select a.*
from
(
select #b.sk_id_curr,
       b.*,
       count(*) over(partition by b.sk_id_curr) as counter
from (select * from bureau limit 10000000) b
) a
order by a.sk_id_curr desc
limit 100
;

edit: exlain and indexes picture

And one more thing I noticed: in the "INCORRECT" picture all columns are incorrect (not only the last one) - compare to pictures with the "GOOD" result" (look at sk_id_curr).

As requested by Wilson Hauck: A) complete (not edited) my.cnf-ini Text results of: B) SHOW GLOBAL STATUS; C) SHOW GLOBAL VARIABLES; D) SHOW ENGINE INNODB STATUS; SHOW CREATE TABLE bureau; part1 part2

A sample 1000 rows on DB Fiddle: https://www.db-fiddle.com/f/fzXsN6vFzidhanxeUjWkiB/0

The way I imported data to mysql:

First I replaced "blank spaces" in the csv with "NULL" in python:

import pandas as pd
bureau = pd.read_csv('../input/bureau.csv')
bureau.to_csv('../input/bureau2.csv',index=False,na_rep="NULL",header=True)

Second I used the code in mysql:

LOAD DATA LOCAL INFILE '../input/bureau2.csv' INTO TABLE bureau
FIELDS TERMINATED BY ',' ENCLOSED BY '"' 
LINES TERMINATED BY '\n'
IGNORE 1 LINES
;

Solution

  • This version gave expected results for 8.0.11 for me (notice there are no longer any text blobs). You also need to disable --big_tables if you have it set for this to work.

    CREATE TABLE `bureau` (
     `SK_ID_CURR` int(11) DEFAULT NULL,
     `SK_ID_BUREAU` int(11) DEFAULT NULL,
     `CREDIT_ACTIVE` varchar(20),
     `CREDIT_CURRENCY` varchar(20),
     `DAYS_CREDIT` int(11) DEFAULT NULL,
     `CREDIT_DAY_OVERDUE` int(11) DEFAULT NULL,
     `DAYS_CREDIT_ENDDATE` varchar(20),
     `DAYS_ENDDATE_FACT` varchar(20),
     `AMT_CREDIT_MAX_OVERDUE` varchar(20),
     `CNT_CREDIT_PROLONG` int(11) DEFAULT NULL,
     `AMT_CREDIT_SUM` double DEFAULT NULL,
     `AMT_CREDIT_SUM_DEBT` varchar(20),
     `AMT_CREDIT_SUM_LIMIT` varchar(20),
     `AMT_CREDIT_SUM_OVERDUE` double DEFAULT NULL,
     `CREDIT_TYPE` varchar(20),
     `DAYS_CREDIT_UPDATE` int(11) DEFAULT NULL,
     `AMT_ANNUITY` varchar(20),
     KEY `bureau` (`SK_ID_CURR`,`SK_ID_BUREAU`),
     KEY `bureau_i2` (`SK_ID_BUREAU`)
     ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
    
    set session sql_mode='';
    load data infile '/export/home/tmp/dag/git/mysql/bureau.csv' into table bureau columns terminated by "," ignore 1 lines;
    
    
    select a.*
    from
    (
    select #b.sk_id_curr,
           b.*,
           count(*) over(partition by b.sk_id_curr) as counter
    from (select * from bureau) b
    ) a
    order by a.sk_id_curr desc
    limit 100
    ;