Search code examples
sqlduckdb

Creating random pairs from a column


The problem

I'm trying to create random pairs from a column using DuckDB.

I have a column of protein accession numbers which looks like this:

┌──────────────┐
│ protein_upkb │
│   varchar    │
├──────────────┤
│ G1XNZ0       │
│ G1XP19       │
│ G1XP66       │
│ G1XP70       │
│ G1XPL1       │
│ G1XPQ7       │
│ G1XQ23       │
│ G1XQ44       │
│ G1XQ89       │
│ G1XQH2       │
├──────────────┤
│   10 rows    │
└──────────────┘

I'm trying to create random pairs of these protein ids such that they look like this:

┌────────────┬────────────┐
│     p1     │     p2     │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ G1XNZ0     │ G1XP19     │
│ G1XP19     │ G1XP66     │
│ G1XP66     │ G1XP70     │
│ G1XP70     │ G1XPL1     │
│ G1XPL1     │ G1XPQ7     │
│ G1XPQ7     │ G1XQ23     │
│ G1XQ23     │ G1XQ44     │
│ G1XQ44     │ G1XQ89     │
│ G1XQ89     │ G1XQH2     │
│ G1XQH2     │ G1XNZ0     │
├────────────┴────────────┤
│ 10 rows       2 columns │
└─────────────────────────┘

N.B: This is just an example, I've thousands of IDs in the table in question.

Some things I've tried

Subqueries

I began by scrambling the order of the proteins by assigning some random number to each row and sorting by it.

CREATE VIEW proteins AS
  SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet');

SELECT * FROM proteins ORDER BY x DESC LIMIT 10;

Which results in

┌──────────────┬────────────────────┐
│ protein_upkb │         x          │
│   varchar    │       double       │
├──────────────┼────────────────────┤
│ A0A1H6HM63   │ 0.9999986232724041 │
│ A0A1G6CK58   │ 0.9999978158157319 │
│ A0A2C5XBA3   │ 0.9999923389405012 │
│ A0A1H9T955   │ 0.9999855090864003 │
│ Q05Q16       │ 0.9999655580613762 │
│ R5PE70       │  0.999956940067932 │
│ R5GUN0       │ 0.9999453630298376 │
│ A0A0L0UJ42   │ 0.9999357375781983 │
│ W6ZJY1       │ 0.9999311361461878 │
│ F6D0F2       │ 0.9999301459174603 │
├──────────────┴────────────────────┤
│ 10 rows                 2 columns │
└───────────────────────────────────┘

I then tried to create random pairs using subqueries. One column would be sorted by x descending, the other by x ascending.

Confusingly (to me), this only creates one random pair rather than the 255,622 I both expected and need.

cursor = duckdb.sql("""
SELECT
(SELECT protein_upkb FROM proteins ORDER BY x DESC) as p1,
(SELECT protein_upkb FROM proteins ORDER BY x ASC) as p2,
LIMIT 10;
""").show()
┌─────────┬─────────┐
│   p1    │   p2    │
│ varchar │ varchar │
├─────────┼─────────┤
│ Q28RH7  │ D8LJ06  │
└─────────┴─────────┘

SELECTing FROM two VIEWs

I figured that I can create two VIEWs, proteins1 and proteins2. I can then independently randomly sort them using random() as I've done before.

Finally, I can create pairs by selecting the protein_upkb column from each table.

Once more, I'm a bit surprised by the outcome.

p2 is a sequence of random proteins, while p1 is just one of the proteins.

cursor = duckdb.sql("""
CREATE VIEW proteins1 AS
  SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet') 
  ORDER BY x ASC;

CREATE VIEW proteins2 AS
  SELECT protein_upkb, random() as x FROM read_parquet('mini_proteins.parquet') 
  ORDER BY x ASC;

SELECT ps1.protein_upkb as p1, ps2.protein_upkb as p2,
FROM proteins1 as ps1, proteins2 as ps2
LIMIT 10;
""").show()
┌────────────┬────────────┐
│     p1     │     p2     │
│  varchar   │  varchar   │
├────────────┼────────────┤
│ A0A394DPL7 │ A0A1I3L166 │
│ A0A394DPL7 │ A0A0Q3WJP1 │
│ A0A394DPL7 │ A0A093SP34 │
│ A0A394DPL7 │ A0A127EQY9 │
│ A0A394DPL7 │ K6UP11     │
│ A0A394DPL7 │ A0A1I6M9F9 │
│ A0A394DPL7 │ A0A0Q3SWF8 │
│ A0A394DPL7 │ A0A069RD68 │
│ A0A394DPL7 │ S9ZHA8     │
│ A0A394DPL7 │ Q5P5L0     │
├────────────┴────────────┤
│ 10 rows       2 columns │
└─────────────────────────┘

Notebook

You can test this out in this Colab notebook.


Solution

  • There are multiple ways of doing it. The exact way would depend on your additional requirements.

    This one is very simple, get random sample twice, concatenate them together. You can get same protein twice, but you can exclude it with additional where clause though.

    duckdb.sql("""
      with cte as (
        select protein_upkb from proteins using sample(10)
      )
      select *
      from cte as c1
        positional join cte as c2
    """)
    
    ┌──────────────┬──────────────┐
    │ protein_upkb │ protein_upkb │
    │   varchar    │   varchar    │
    ├──────────────┼──────────────┤
    │ A0A0F6TCU1   │ A0A4C1ULV9   │
    │ D4YJT4       │ A0A3Q3FTK5   │
    │ A0A319DTU8   │ C6LIN2       │
    │ A0A1Q3D9X9   │ A0A1B3BCY8   │
    │ M5F4R3       │ M1NUZ3       │
    │ A0A553PJQ2   │ A0A165P0W9   │
    │ G7M9F2       │ A0A182JZX3   │
    │ A0A0Q1CIG2   │ G3HMK9       │
    │ C7YU85       │ A0A3Q2E7T6   │
    │ A0A199VI77   │ A0A0R1JQR6   │
    ├──────────────┴──────────────┤
    │ 10 rows           2 columns │
    └─────────────────────────────┘
    

    Or assign row_number at random, and then pivot in a way that even an odd rows create pairs:

    duckdb.sql("""
    with cte1 as (
      select ps1.protein_upkb, row_number() over(order by random()) as rn
      from proteins as ps1
    ), cte2 as (
      select
        protein_upkb,
        rn % 2 as col,
        rn // 2 as r
      from cte1
    )
    pivot cte2
    on col
    using any_value(protein_upkb)
    limit 10
    """)
    
    ───────┬────────────┬────────────┐
    │   r   │     0      │     1      │
    │ int64 │  varchar   │  varchar   │
    ├───────┼────────────┼────────────┤
    │ 66322 │ A0A1N7AVA0 │ A0A175R4H7 │
    │ 66325 │ K9FKM7     │ D8QP02     │
    │ 66327 │ A0A1I5KRT3 │ W0V524     │
    │ 66328 │ A0A4U2YU79 │ A0A452RP46 │
    │ 66334 │ A8RCK1     │ A0A165U1L8 │
    │ 66335 │ A0A3Q3QVI9 │ C7MCJ1     │
    │ 66336 │ Q3SLR9     │ A0A3B4B0Q2 │
    │ 66338 │ A0A1W1XBB2 │ A0A0B7J5C1 │
    │ 66339 │ A0A1I4KH70 │ A0A3S4SEU1 │
    │ 66340 │ A0A1W0D573 │ Q4ZR49     │
    ├───────┴────────────┴────────────┤
    │ 10 rows               3 columns │
    └─────────────────────────────────┘