Search code examples
sqlpostgresqlrandomprobabilitywindow-functions

Random row selection with weighted filters in SQL/PostgreSQL


I have a questions table and I need to get X questions to prepare a test. The questions need to be filtered according to multiple criteria (subject, institution, area, etc.), each with different weights.

The filters weight are dynamically setted and normalized outside the query. Ex.:

  1. Subject 1 — 0.4
  2. Subject 2 — 0.1
  3. Subject 3 — 0.5
  4. Institution 1 — 0.2
  5. Institution 2 — 0.04
  6. Institution 3 — 0.76
  7. Area 1 — 1

Some other points:

  • Today, I have 10 different filters (subject, institution, area, etc.), but the user can select in a multiple and mixed way (ex.: 10 subjects, 5 institutions, 30 areas, etc.), like in the sample above.
  • The questions table have ~500k rows;
  • The filters are N — N with the questions;
  • After the filtering, I want to limit the returned rows;
  • If some filter can't offer any more questions, the other ones must be considered (remember: I want to prepare a test -- if I have questions left, they must be used)
  • I’m very concerned with the performance of this query.

To illustrate, if I didn’t want to weight the filters, I would do something like that:

SELECT
    *
FROM
    public.questions q
    INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
    INNER JOIN public.subjects s ON s.id = sq.subject_id
    INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
    INNER JOIN public.institutions i ON i.id = iq.institution_id
    INNER JOIN public.areas_questions aq ON aq.question_id = q.id
    INNER JOIN public.areas a ON a.id = aq.area_id
WHERE
    s.id IN :subjects
    AND a.id IN :areas
    AND i.id IN :institutions
ORDER BY
    random() limit 200

Desired output:

Question — Subject — Institution — Area

I thought in something along the lines:

  1. Create a CTE with the questions returned by the filter; must consider that the same question can be returned by more than one filter — do I need to evaluate each filter apart and UNION ALL then to solve this? Must assign, too, from what filter the question came from;
  2. Create another CTE with weights and the respective filter associated;
  3. JOIN the CTE’s, but at this point the questions must be grouped and the weights SUMmed;
  4. Apply a Window Function and return the results, limitted to X rows (LIMIT X).

How would you write such query / solve this problem?


Solution

  • Ok. Managed to solve it. Basically, used the strategy already outlined in the question and a little help from here -- I had already seen this post before, but I was (and still am) trying to solve in a more elegant way -- something like this but for multiple rows --, not needing to create the "bounds" by hand.

    Let's try step-by-step:

    Since the filters, with the weights, come from outside the schema, let's create a CTE:

    WITH filters (type, id, weight) AS (
        SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
        UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
        UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
        UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
        UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
    )
    

    Now, let's filter the rows, ignoring the weight (for now), so later we don't need to work with the whole table:

    WITH filtered_questions AS (
        SELECT
            q.id,
            s.id subject_id,
            a.id area_id,
            i.id institution_id
        FROM
            public.questions q
            INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
            INNER JOIN public.subjects s ON s.id = sq.subject_id
            INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
            INNER JOIN public.institutions i ON i.id = iq.institution_id
            INNER JOIN public.areas_questions aq ON aq.question_id = q.id
            INNER JOIN public.areas a ON a.id = aq.area_id
        WHERE
            subject_id IN (SELECT id from filters where type = 'subject')
            and institution_id IN (SELECT id from filters where type = 'institution')
            and area_id IN (SELECT id from filters where type = 'area')
    )
    

    The same question can be selected by multiple filters, increasing the chance of it being selected. We must update the weights to solve this.

    WITH filtered_questions_weights_sum AS (
        SELECT
            q.id,
            SUM(filters.weight) weight_sum
        FROM filtered_questions q
        INNER JOIN filters
        ON (filters.type = 'subject' AND q.subject_id IN(filters.id))
        OR (filters.type = 'area' AND q.area_id IN(filters.id))
        OR (filters.type = 'institution' AND q.institution_id IN(filters.id))
        GROUP BY q.id
    )
    

    Generating the bounds, like exposed here.

    WITH cumulative_prob AS (
        SELECT
            id,
            SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
        FROM filtered_questions_weights_sum
    ),
    cumulative_bounds AS (
        SELECT
            id,
            COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
            cum_prob AS upper_cum_bound
        FROM cumulative_prob
    )
    

    Generating the random series. Had to re-normalize (random() * (SELECT SUM(weight_sum)) because the weights were updated in a previous step. 10 is the number of rows that we want to return.

    WITH random_series AS (
        SELECT generate_series (1,10),random() * (SELECT SUM(weight_sum) FROM filtered_questions_weights_sum) AS R
    )
    

    And finally:

    SELECT
          id, lower_cum_bound, upper_cum_bound, R
    FROM random_series
    JOIN cumulative_bounds
    ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')
    

    And we get the following distribution:

    id                                   lower_cum_bound upper_cum_bound r                   
    ------------------------------------ --------------- --------------- ------------------- 
    380f46e9-f373-4b89-a863-05f484e6b3b6 0               2.0             0.41090718149207534 
    42bcb088-fc19-4272-8c49-e77999edd01c 2.0             3.9             3.4483200465794654  
    46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9             5.9             5.159445870062337   
    46a97f1d-789f-46e7-9d3b-bd881a22a32e 3.9             5.9             5.524481557868421   
    972d0296-acc3-4b44-b67d-928049d5e9c2 5.9             7.8             6.842470594821498   
    bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6            13.5            12.207371663767844  
    bdcc26f7-ccaf-4f8f-9e0b-81b9a6d29cdb 11.6            13.5            12.674184153741226  
    c935e3de-f1b6-4399-b5eb-ed3a9194eb7b 15.5            17.5            17.16804686235264   
    e5061aeb-53b7-4247-8404-87508c5ac723 21.4            23.4            22.622627633158118  
    f8c37700-0c3a-457e-8882-7c65269482ea 25.4            27.3            26.841821723571048  
    

    Putting it all together:

    WITH filters (type, id, weight) AS (
            SELECT 'subject', '148232e0-dece-40d9-81e0-0fa675f040e5'::uuid, 0.5
            UNION SELECT 'subject', '854431bb-18ee-4efb-803f-185757d25235'::uuid, 0.4
            UNION SELECT 'area', 'e12863fb-afb7-45cf-9198-f9f58ebc80cf'::uuid, 1
            UNION SELECT 'institution', '7f56c89f-705e-45c7-98fb-fee470550edf'::uuid, 0.5
            UNION SELECT 'institution', '0066257b-b2e3-4ee8-8075-517a2aa1379e'::uuid, 0.5
            )
        ,
        filtered_questions AS
        (
            SELECT
                q.id,
                SUM(filters.weight) weight_sum
            FROM
            public.questions q
            INNER JOIN public.subjects_questions sq ON q.id = sq.question_id
            INNER JOIN public.subjects s ON s.id = sq.subject_id
            INNER JOIN public.institutions_questions iq ON iq.question_id = q.id
            INNER JOIN public.institutions i ON i.id = iq.institution_id
            INNER JOIN public.activity_areas_questions aq ON aq.question_id = q.id
            INNER JOIN public.activity_areas a ON a.id = aq.activity_area_id
            INNER JOIN filters
                ON (filters.type = 'subject' AND s.id IN(filters.id))
                OR (filters.type = 'area' AND a.id IN(filters.id))
                OR (filters.type = 'institution' AND i.id IN(filters.id))
            WHERE
                s.id IN (SELECT id from filters where type = 'subject')
                and i.id IN (SELECT id from filters where type = 'institution')
                and a.id IN (SELECT id from filters where type = 'area')
            GROUP BY q.id
        )
        ,
        cumulative_prob AS (
            SELECT
                id,
                SUM(weight_sum) OVER (ORDER BY id) AS cum_prob
            FROM filtered_questions
        )
        ,
        cumulative_bounds AS (
            SELECT
                id,
                COALESCE( lag(cum_prob) OVER (ORDER BY cum_prob, id), 0 ) AS lower_cum_bound,
                cum_prob AS upper_cum_bound
            FROM cumulative_prob
        )
        ,
        random_series AS
        (
            SELECT generate_series (1,14),random() * (SELECT SUM(weight_sum) FROM filtered_questions) AS R
        )
    SELECT id, lower_cum_bound, upper_cum_bound, R
    FROM random_series
    JOIN cumulative_bounds
    ON R::NUMERIC <@ numrange(lower_cum_bound::NUMERIC, upper_cum_bound::NUMERIC, '(]')