Search code examples
elixirecto

How to get the top X results of a given category using Ecto?


Background

I have to queries that return a colossal amount of data on their own. I cannot use Repo.all as doing so would materialize these into memory, which would then quickly run out.

So I am trying to push as much as I can to the pSQL DB, and force the DB to do as much work as possible.

My issue starts with 2 queries.

This ones counts fruits and veggies and aggregates everything into a neat map.

all_counts =
      table_A
      |> join(:left, [item_A], item_B in table_B,
        on:
          item_A.home_id == item_B.home_id and
            item_A.path == item_B.path
      )
      |> select([unfiltered_item, filtered_item], %{
        path: item_A.path,
        item_fruits_count: coalesce(item_A.fruits, 0),
        item_veggies_count: coalesce(item_B.veggies, 0),
        dataset_id: item_A.home_id
      })
      |> subquery()

The second one, joins 2 tables as well (items and photos), with nothing fancy:

file_info =
      table_C
      |> join(:inner, [item], file in table_D,
        on:
          item.id == file.item_id and not file.deleted
      )
      |> select([item, file], %{
        item_id: item.id,
        home_id: item.home_id,
        path: item.path,
        photo_key: file.photo_key
      })
      |> subquery()

Problem

Now the problem is that I need to merge these 2 together. At first, one would think to do something like this:

result =
      all_counts
      |> join(:inner, [c], f in ^file_info, on: c.home_id == f.home_id and c.path == f.path)
      |> select([c, f], %{
        item_id: f.item_id,
        home_id: f.home_id,
        path: f.path,
        photo_key: f.photo_key,
        # ... you get the idea
      })
      |> Repo.all()

But this creates an issue, namely, the it will return so much data, the machines will run out of memory.

Approach

The approach I am using to solve this problem is to group items by home_id and path (since that is unique for each destination) and then return only a portion of the data I need, lets say, the top 3 items ordered by id.

Source: https://stackoverflow.com/a/16721423/1337392

Here is where my difficulties begin. I cannot use pSQL directly, I must use Ecto (for reasons beyond this post).

Normally I would use CTEs or row_number():

With ctes:

 WITH cte AS
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  )
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;

With row_number:

SELECT name, value, rn
FROM 
  ( SELECT name, value,
           ROW_NUMBER() OVER (PARTITION BY name
                              ORDER BY value DESC
                             )
             AS rn
    FROM t
  ) tmp 
WHERE rn <= 3
ORDER BY name, rn ; 

However, I am not familiar enough with Ecto to know how to use them.

With CTEs, I understand I should avoid them, as they serve no purpose in Ecto: https://hexdocs.pm/ecto/Ecto.Query.html#with_cte/3

With row_number() I would need to partition by both home_id and path (2 fields) instead of one: https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0

Question

How do I get the result, to return the top 3 results, grouped by home_id and path and ordered by item_id using Ecto?


Solution

  • Solution(s)

    From what I gathered, there are two decent possible solutions to this conundrum.

    row_number + over

    One of them is using row_number() with over() from Ecto: https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0

    Assuming I join both file_info with all_counts in a single table, I can then perform the query as mentioned in the previous SO post I mentioned in my question:

    file_info_with_counts
    |> select([fi], %{
            rn: over(row_number(), partition_by: [fi.home_id, fi.path], order_by: [asc: fi.item_id]),
            item_id: fi.item_id,
           # you get the idea ...
          })
    |> subquery()
    
    IO.inspect(file_info_with_counts |> where([c], c.rn <= 3) |> Repo.all()
    

    Which prints what I wanted.

    Source: https://elixirforum.com/t/how-to-get-the-top-x-results-of-a-given-category-using-ecto/53805/9?u=fl4m3ph03n1x

    Lateral inner joins

    However, as mentioned by some people in the community, this solution is rather old, and these days lateral joins seem to also cover this use case.

    tops = 
      from top in "file_info_with_counts", 
        where: top.home_id == parent_as(:parent).home_id,
        where: top.path == parent_as(:parent).path,
        order_by: [asc: top.item_id],
        limit: 3,
        select: %{id: top.id}
    
    from parent in "file_info_with_counts",
      as: :parent,
      group_by: [parent.home_id, parent.path],
      lateral_join: top in subquery(tops), 
      on: true,
      select: %{home_id: parent.home_id, path: parent.path, item_id: top.id}
    

    Source: https://elixirforum.com/t/how-to-get-the-top-x-results-of-a-given-category-using-ecto/53805/11?u=fl4m3ph03n1x

    This solution is not without merit, however, given my familiarity with row_number I opted for that solution instead.

    Unless there is a considerable performance difference between the two in favor of lateral joins, I will keep the previous solution.

    For more info, here is the original source where I got these answers: https://elixirforum.com/t/how-to-get-the-top-x-results-of-a-given-category-using-ecto/53805/8