Search code examples
sqlpostgresqlsampling

How to select all rows from n random groups in postgres


This question pertains to formulating a query in PostgreSQL Suppose I have the following table:

record   entity   docid   sentencid   
1   A   123   1231   
1   A   123   1232
1   A   100   1001
1   C   100   1002
1   B   121   1212
1   B   123   1234
2   B   102   1021
2   D   111   1111
2   D   102   1022
2   E   101   1011   
3   C   111   1115
3   C   111   1113
3   C   111   1114

Is there a PostgresSQL query that I can use to select all rows for n (or less) random groups of entities for each record in this table? Lets say n is 2. So the query should select all rows for record 3 and all rows for any 2 random entity groups of record 1 and 2. The final result should be ordered by accession, entity, docid, sentenceid.

Here is an example result with n=2:

record   entity   docid   sentencid   
1   A   100   1001
1   A   123   1231   
1   A   123   1232
1   B   121   1212
1   B   123   1234
2   D   102   1022
2   D   111   1111
2   E   101   1011   
3   C   111   1113
3   C   111   1114
3   C   111   1115

assuming that the entities A and B were randomly selected from the set of entities (A,B,C) for record 1 and the entities D and E were randomly selected from the set of entities (B,D,E) for record 2. The n entities should be selected randomly for each record.

I have searched extensively for answers to this question but did not find any working query. Thank you for looking into this!


Solution

  • You can use row_number with a random() order to randomly select n entities per record group. Then join this to your main table

    select * from Table1 t1
    join (
        select * from (
            select record, entity,
                row_number() over (partition by record order by random()) rn
            from Table1
            group by record, entity
        ) t where rn <= 2
    ) t2 on t1.record = t2.record and t1.entity = t2.entity
    

    Demo