Search code examples
mysqlsplitpartitionvarcharchunks

Re-split varchar column values in MySQL based on chunk size


I have this table where long strings are saved by splitting in the VALUE column and grouped by GROUPING_ID:

CREATE TABLE input_table
(
    CHUNK_ID     BIGINT        NOT NULL
        PRIMARY KEY,
    VALUE        VARCHAR(5000) NULL,
    GROUPING_ID BIGINT         NULL
);

I want to resize the VALUE column from VARCHAR(5000)to VARCHAR(4000).

Values are grouped by GROUPING_ID, so a string with 24123 chars initially will be split into 5 rows, 4 of 5000 chars and 1 of 4123 chars. To reduce the column size, I need to concatenate the grouped strings and then re-split by 4000, so a string with 24123 chars will be saved on 7 rows, 6 of 4000 chars and 1 of 123 chars.

I want to do this in MySQL without using functions.


Solution

  • After some research I managed to concatenate the values and then with the help of ChatGPT I managed to split the concatenated string into smaller parts.

    This solution will keep the right sorting of the chunks, and will allow configuring smaller or bigger chunk size, so you can use it to reduce or to increase column size.

    To re-split the VALUE column by 4000 this can be used:

    SET @chunk_size = 4000;
    
    SET group_concat_max_len = @chunk_size * 50 * 2; -- chunk_size x 50 unions below x 2 bytes
    
    START TRANSACTION;
    
    DELETE
    FROM input_table
    WHERE GROUPING_ID IS NULL;
    
    CREATE TEMPORARY TABLE temp_all_concat_input
    SELECT GROUP_CONCAT(VALUE SEPARATOR '') AS CONCAT_VALUE, GROUPING_ID AS TEMP_GROUPING_ID
    FROM input_table
    GROUP BY GROUPING_ID;
    
    CREATE TEMPORARY TABLE temp_re_split_input AS
    SELECT SUBSTRING(CONCAT_VALUE, (@chunk_size * (seq_numbers.SEQ - 1)) + 1, @chunk_size) AS CHUNK,
           concat_input.TEMP_GROUPING_ID,
           seq_numbers.SEQ
    FROM (SELECT 1 AS SEQ UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
          SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL
          SELECT 9 UNION ALL SELECT 10 UNION ALL SELECT 11 UNION ALL SELECT 12 UNION ALL
          SELECT 13 UNION ALL SELECT 14 UNION ALL SELECT 15 UNION ALL SELECT 16 UNION ALL
          SELECT 17 UNION ALL SELECT 18 UNION ALL SELECT 19 UNION ALL SELECT 20 UNION ALL
          SELECT 21 UNION ALL SELECT 22 UNION ALL SELECT 23 UNION ALL SELECT 24 UNION ALL
          SELECT 25 UNION ALL SELECT 26 UNION ALL SELECT 27 UNION ALL SELECT 28 UNION ALL
          SELECT 29 UNION ALL SELECT 30 UNION ALL SELECT 31 UNION ALL SELECT 32 UNION ALL
          SELECT 33 UNION ALL SELECT 34 UNION ALL SELECT 35 UNION ALL SELECT 36 UNION ALL
          SELECT 37 UNION ALL SELECT 38 UNION ALL SELECT 39 UNION ALL SELECT 40 UNION ALL
          SELECT 41 UNION ALL SELECT 42 UNION ALL SELECT 43 UNION ALL SELECT 44 UNION ALL
          SELECT 45 UNION ALL SELECT 46 UNION ALL SELECT 47 UNION ALL SELECT 48 UNION ALL
          SELECT 49 UNION ALL SELECT 50) AS seq_numbers
             CROSS JOIN (SELECT CONCAT_VALUE, TEMP_GROUPING_ID FROM temp_all_concat_input) AS concat_input
    WHERE (@chunk_size * (seq_numbers.SEQ - 1)) + 1 <= CHAR_LENGTH(concat_input.CONCAT_VALUE)
    ORDER BY concat_input.TEMP_GROUPING_ID, seq_numbers.SEQ;
    
    TRUNCATE TABLE input_table;
    
    INSERT INTO input_table(CHUNK_ID, VALUE, GROUPING_ID)
    SELECT ROW_NUMBER() OVER (), CHUNK, TEMP_GROUPING_ID
    FROM temp_re_split_input;
    
    DROP TABLE temp_re_split_input;
    DROP TABLE temp_all_concat_input;
    
    -- note: table sequence generator should not be used, this is just an example that you need to update sequence if not using auto increment
    UPDATE table_sequence_generator
    SET ID_VAL = (SELECT COALESCE(MAX(CHUNK_ID) + 1, 1) FROM input_table)
    WHERE ID_NAME = 'InputTableSequence';
    
    COMMIT;
    
    SET group_concat_max_len = 1024;
    

    To increase the chunk size to 6000 you can just change the @chunk_size variable: SET @chunk_size = 6000;

    Notes:

    • the max text length split by this method will be chunk_size x 50, in case of 4000 chunk size it will be 200k chars
    • length of 4000 chars was set for Oracle compatibility, however this code is specific for MySQL (I used MySQL 8.0)
    • as stated above, this method should keep the right ordering of the chunks
    • @chunk_size is configurable
    • this can be used to reduce or increase chunk sizes
    • it will delete from input_table all the values where GROUPING_ID is null
    • it will generate different CHUNK_ID values
    • the last part is just an example how the table sequence can be updated, that part and the re-insert can be changed based on needs
    • CHAR_LENGTH is used instead of LENGTH function to support 2 bytes special characters too
    • no new data should be inserted in the table while running this script
    • in my example VARCHAR was used to store long strings, this was an external requirement, but type for VALUE column can be replaced by TEXT, MEDIUMTEXT or LONGTEXT and you will not face this problem with chunks, neither the input_table will not be necessary, column can be used directly in the original table