Search code examples
sqlpostgresqlrandompermutationchunks

Is there a way to shuffle rows in a table into distinctive fixed size chunks using SQL only?


I have a very big table (~300 million rows) with the following structure: my_table(id, group, chunk, new_id), where chunk and new_id are set to NULL.

I want to set the rows of each group to a random chunk with distinct new_id in the chunk. Each chunk should be of fixed size of 100.

For example if group A has 1278 rows, they should go into 13 chunks (0-12), 12 chunks with 100 rows s.t. new_id are in range (0-99) and another single chunk with 78 rows s.t. new_id are in range (0-77).

The organization into chunks and within the chunks should be a random permutation where each row in A is assigned with a unique (chunk, new_id) tuple.

I'm successfully doing it using pandas but it takes hours, mostly due to memory and bandwidth limitations.

Is it possible to execute using only a SQL query? I'm using postgres 9.6.


Solution

  • You could do this with row_number():

    select id, group, rn / 100 chunk, rn % 100 new_id
    from (select t.*, row_number() over(order by random()) - 1 rn from mytable t) t
    

    The inner query assigns a random integer number to each record (starting at 0). The outer query does arithmetic to compute the chunk and new id.

    If you want an update query:

    update mytable t set chunk = x.rn / 3, new_id = x.rn % 3
    from (select id, row_number() over(order by random()) - 1 rn from mytable t) x
    where x.id = t.id
    

    Demo on DB Fiddle for a dataset of 20 records with chunks of 3 records .

    Before:

    id | grp | chunk | new_id
    -: | --: | ----: | -----:
     1 |   1 |  nullnull
     2 |   2 |  nullnull
     3 |   3 |  nullnull
     4 |   4 |  nullnull
     5 |   5 |  nullnull
     6 |   6 |  nullnull
     7 |   7 |  nullnull
     8 |   8 |  nullnull
     9 |   9 |  nullnull
    10 |  10 |  nullnull
    11 |  11 |  nullnull
    12 |  12 |  nullnull
    13 |  13 |  nullnull
    14 |  14 |  nullnull
    15 |  15 |  nullnull
    16 |  16 |  nullnull
    17 |  17 |  nullnull
    18 |  18 |  nullnull
    19 |  19 |  nullnull
    20 |  20 |  nullnull
    

    After:

    id | grp | chunk | new_id
    -: | --: | ----: | -----:
    19 |  19 |     0 |      0
    11 |  11 |     0 |      1
    20 |  20 |     0 |      2
    12 |  12 |     1 |      0
    14 |  14 |     1 |      1
    17 |  17 |     1 |      2
     3 |   3 |     2 |      0
     8 |   8 |     2 |      1
     5 |   5 |     2 |      2
    13 |  13 |     3 |      0
    10 |  10 |     3 |      1
     2 |   2 |     3 |      2
    16 |  16 |     4 |      0
    18 |  18 |     4 |      1
     6 |   6 |     4 |      2
     1 |   1 |     5 |      0
    15 |  15 |     5 |      1
     7 |   7 |     5 |      2
     4 |   4 |     6 |      0
     9 |   9 |     6 |      1