Search code examples
sqlamazon-athenapresto

Selecting common factors in different rows


I have a database which looks like below. It lists different people's favourite colours when they were children or after they became adults. It groups people when they have common favourite colour. For example, ID number 1 and 2 are grouped together as they have red in common. In such case, the second row (ID number 2, group b, favourite colour blue -)also is grouped as b since it is the same ID/person as row 3.

id group favourite_colour from_when points
1 b red child 30
2 b blue child 50
2 b red adult 50
3 c white adult 22
4 c white adult 24

Based on this, I want to create such a query as follwing:

select
   group,
   from_when,
   array_agg(distinct id),
   array_agg(distinct favourite_colour)

from database_example

group by
   group,
   from_when

having sum(points) > 70

The objectives is to group the people who belong to the same group by 'from_when' column and filter by summed points more than 70.

With the given example, this should return:

group from_when id favourite_colour
b child 1,2 red,blue

I want to add one more column in the result which can show how ID 1 and 2 are grouped together because, if we only look at the current query result, it looks like they don't belong to the same group with their different favourite colours. However, in reality, ID 1 and 2 are grouped together as ID 1 (child) and ID 2 (adult) which currently is not shown in the result share the same favourite colour.

As an example, an additional column (connection) like below is the type of results I'd like to include.

group from_when id favourite_colour connection
b child 1,2 red,blue 1-red-child,2-red-adult

How could this be achieved?


Solution

  • You can do it using CTEs.

    First CTE is the query where we get data grouped by group and from_when, while second CTE is used to get data grouped by group and favourite_colour and respected same condition, then Using some joins we get the Expected output :

    with cte_from_when as (
      select group_, from_when,
            array_agg(distinct id) as ids,
            array_agg(distinct favourite_colour) as colors
      from mytable
      group by group_, from_when
      having sum(points) > 70
    ),
    cte_favourite_colour as (
      select group_, favourite_colour
      from mytable
      group by group_, favourite_colour
      having sum(points) > 70
    )
    select cw.group_, cw.from_when, min(ids) as ids, min(cw.colors) as favourite_colour, array_agg(t.id || '-' || cc.favourite_colour || '-' || t.from_when) as connection
    from cte_from_when cw
    inner join mytable t on t.group_ = cw.group_ and t.favourite_colour = ANY(colors)
    inner join cte_favourite_colour cc on cc.group_ = t.group_ and cc.favourite_colour = t.favourite_colour
    group by cw.group_, cw.from_when
    

    Tested on Postgresql, check demo here