Tags: mysql, performance, large-data

Randomize timestamp column in large MySQL table


I have a test database table with ~100M rows, generated by cloning an original 3k rows multiple times. Let's say this table describes events that have timestamps. Because of the cloning we now have ~10M events per day, which is far from any real case, so I'd like to randomize the date column and scatter the records over several days. Here is the procedure I've come up with:

DROP PROCEDURE IF EXISTS `randomizedates`;
DELIMITER //
CREATE PROCEDURE `randomizedates`(IN `daterange` INT)
BEGIN
  DECLARE id INT UNSIGNED;
  DECLARE buf TIMESTAMP;
  DECLARE done INT DEFAULT FALSE;
  DECLARE cur1 CURSOR FOR SELECT event_id FROM events;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

  OPEN cur1;
  the_loop: LOOP
    FETCH cur1 INTO id;
    IF done THEN
      LEAVE the_loop;
    END IF;
    SET buf = (SELECT NOW() - INTERVAL FLOOR(RAND() * daterange) DAY);
    UPDATE events SET starttime = buf WHERE event_id = id;
  END LOOP the_loop;
  CLOSE cur1;
END //
DELIMITER ;

On the 3k table it executes in ~6 seconds, so assuming linear complexity it would take ~50 hours on the 100M table. Is there a way to speed it up? Or is my procedure incorrect altogether?


Solution

  • Just do:

    set @daterange = 7;
    update `events`
    set starttime = NOW() - INTERVAL FLOOR(RAND() * @daterange) DAY;
    

    Databases are not good at fetching and processing single rows in a loop, the way we do in procedural languages (iterators, for-each loops, arrays, etc.). They are best at, and optimized for, processing SQL, which is essentially a declarative language: you declare what you want to get, without specifying how to do it, in contrast to procedural languages, where you spell out the steps the program must take.

    Remember - row by row = slow by slow.
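    A side note, in case a single statement over ~100M rows produces too large a transaction for your server: the same set-based update can be applied in primary-key chunks. This is only a sketch, and it assumes `event_id` is a reasonably dense auto-increment primary key:

        -- Sketch: chunked set-based update (assumes event_id is a dense
        -- auto-increment primary key; pick a chunk size your server handles well).
        set @daterange = 7;
        set @chunk = 1000000;
        set @start = 1;
        -- Re-run this statement (or loop it inside a procedure), advancing
        -- @start by @chunk each time, until @start exceeds MAX(event_id):
        update `events`
        set starttime = NOW() - INTERVAL FLOOR(RAND() * @daterange) DAY
        where event_id between @start and @start + @chunk - 1;

    Each chunk commits separately, so locks and undo log stay small, at the cost of running the statement several times.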

    Look at a simple example that simulates your table and compares your procedure to a single UPDATE:

    drop table `events`;
    create table `events` as 
    select * from information_schema.tables
    where 1=0; 
    
    alter table `events` add column event_id int primary key auto_increment first;
    
    alter table `events` change column create_time starttime timestamp;
    
    insert into `events`
    select null, t.*
    from information_schema.tables t
    cross join (
      select 1 from information_schema.tables
      limit 100
    ) xx;
    
    mysql> select count(*) from `events`;
    +----------+
    | count(*) |
    +----------+
    |    17200 |
    +----------+
    

    We created a table with 17,200 rows. Now we call the procedure:

    mysql> call `randomizedates`(7);
    Query OK, 0 rows affected (34.26 sec)
    

    and the update command:

    mysql>     set @daterange = 7;
    Query OK, 0 rows affected (0.00 sec)
    
    mysql>     update `events`
        ->     set starttime = NOW() - INTERVAL FLOOR(RAND() * @daterange) DAY;
    Query OK, 17200 rows affected (0.23 sec)
    Rows matched: 17200  Changed: 17200  Warnings: 0
    

    As you can see: 34.26 seconds vs. 0.23 seconds, roughly 150 times (about 14,800%) faster. A huge difference!
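
    One more detail: `FLOOR(RAND() * @daterange) DAY` shifts rows by whole days only, so every cloned event keeps its original time of day. If you also want to spread events within each day (an assumption about the desired distribution, not something the question strictly asks for), randomize at second granularity instead:

        -- Sketch: scatter rows uniformly over the whole @daterange window,
        -- down to the second (86400 seconds per day).
        set @daterange = 7;
        update `events`
        set starttime = NOW() - INTERVAL FLOOR(RAND() * @daterange * 86400) SECOND;

    It is still a single set-based statement, so it performs the same way as the day-granularity version.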