Search code examples
sqlpostgresqllogicaggregate-functions

SQL: how can I exclude certain lines from an aggregated result?


In the query I built, the result shows something like below:

SELECT name
      ,ARRAY_AGG(fruits ORDER BY time ASC) AS all_fruits
FROM table_fruits
name all_fruits
Person A Apple, Banana, Apple, Apple, Apple, Apple
Person B Apple, Apple, Apple, Banana, Apple, Banana
Person C Banana, Banana, Apple, Banana, Apple, Apple

I want to add one more column which shows the count of apples. However, I do not want to count apples that are followed by bananas. Therefore, the additional column should look like below.

name all_fruits count_of_apple
Person A Apple, Banana, Apple, Apple, Apple, Apple 4
Person B Apple, Apple, Apple, Banana, Apple, Banana 2
Person C Banana, Banana, Apple, Banana, Apple, Apple 2

How would I do this in SQL? The source includes time for when the fruit was eaten.


Solution

  • You can check:

    • for each row you have in your parent table, what "fruits" value follows that row with the LEAD window function
    • if the row in check is the last one, it won't have a next value (it will be NULL), so the COALESCE function will replace this NULL value with the current "fruits" value
    • hence you can assign 1 to your new column when the current "fruits" value is "Apple" and your next value is not "Banana", inside a CASE statement
    SELECT *,
           CASE WHEN fruits = 'Apple'
                 AND COALESCE(LEAD(fruits) OVER(
                                  PARTITION BY name 
                                  ORDER     BY time), 
                              fruits)                  <> 'Banana'
                THEN 1 
           END AS apples_not_after_bananas
    FROM table_fruits
    

    After this step, you can use your own code and add

    • the GROUP BY clause you missed, to aggregate over the "name" field
    • the SUM aggregation function over the previously generated 1s when apples were not followed by bananas.
    WITH cte AS (
        SELECT *,
               CASE WHEN fruits = 'Apple'
                     AND COALESCE(LEAD(fruits) OVER(
                                      PARTITION BY name 
                                      ORDER     BY time), 
                                  fruits)                  <> 'Banana'
                    THEN 1 
               END AS apples_not_after_bananas
        FROM table_fruits
    )
    SELECT name,
           ARRAY_AGG(fruits ORDER BY time ASC) AS all_fruits,
           SUM(apples_not_after_bananas)       AS count_of_apple
    FROM cte
    GROUP BY name
    

    Check the demo here.


    Edit: the banana came more than 1 day later

    If you want to add this specific condition, or in general any conditions, you need to work inside the CASE statement, which currently has two conditions, one on the current fruit and one on the next fruit.

    Checking whether the banana came more than 1 day later just means to add something like this:

               CASE WHEN fruits = 'Apple'
                     AND COALESCE(LEAD(fruits) OVER(
                                      PARTITION BY name 
                                      ORDER     BY time), 
                                  fruits)                  <> 'Banana'
                   --AND <if difference between the current next time value is greater than 1 day>
                    THEN 1 
               END AS apples_not_after_bananas