Search code examples
sqlpostgresqlsql-order-bydistinctcrosstab

Unexpected effect of filtering on result from crosstab() query


I have a crosstab() query like the following:

SELECT *
FROM crosstab(
 'SELECT row_name, extra1, extra2..., another_table.category, value
  FROM   table t
  JOIN   another_table ON t.field_id = another_table.field_id
  WHERE  t.field = certain_value AND t.extra1 = val1
  ORDER  BY row_name ASC',
 'SELECT category_name FROM category_name WHERE field = certain_value'
) AS ct(row_name text, extra1 text, extra2 text, ...)

Simplified example, the actual query is really complex and contains important information. The above query returns N result rows after filtering with table.extra1 = val1.

When I change the query as follows:

SELECT *
FROM crosstab(
 'SELECT row_name, extra1, extra2..., another_table.category, value
  FROM   table t
  JOIN   another_table ON t.field_id = another_table.field_id
  WHERE  t.field = certain_value AND t.extra1 IN (val1, ...) --> more values
  ORDER  BY row_name ASC',
 'SELECT category_name FROM category_name WHERE field = certain_value'
) AS ct(row_name text, extra1 text, extra2 text, ...)
WHERE extra1 = val1; --> condition on the result

Added more possible values table.extra1 IN (val1, ...) and a final condition WHERE extra1 = val1. Now I get fewer rows than from the original one. To make it worse, if I add yet more values to IN (val1, ...), I get yet fewer rows. Why is that?


Solution

  • extra1, extra2, ... are "extra columns" in crosstab terminology.
    The manual for the tablefunc module explains the rules:

    It may also have one or more “extra” columns. The row_name column must be first. The category and value columns must be the last two columns, in that order. Any columns between row_name and category are treated as “extra”. The “extra” columns are expected to be the same for all rows with the same row_name value.

    And further down:

    The output row_name column, plus any “extra” columns, are copied from the first row of the group.

    Bold emphasis on key parts by me.

    You only sort by row_name:

    ORDER  BY row_name ASC
    

    Does not matter in the first example where you filter with:

    WHERE ... t.extra1 = 'val1'  -- single quotes by me
    

    All input row have extra1 = 'val1' anyway. But it matters in the second example where you filter with:

    WHERE ... t.extra1 IN('val1', ...) --> More values
    

    Now, the first bolded requirement above is violated for the extra column extra1. While the sort order of the first input query is non-deterministic, resulting values for the "extra" column extra1 are picked arbitrarily. The more possible values for extra1, the fewer rows will end up having 'val1': that's what you observed.

    You can still make it work: to report extra1 = 'val1' for every row_name that has at least one of those, change the ORDER BY to:

    ORDER  BY row_name, (extra1 <> 'val1')
    

    Sorts 'val1' on top. Explanation for that boolean expression (with links to more):

    Other "extra" columns are still chosen arbitrarily while the sort order is not deterministic.

    Crosstab basics: