I have to queries that return a colossal amount of data on their own. I cannot use Repo.all
as doing so would materialize these into memory, which would then quickly run out.
So I am trying to push as much as I can to the pSQL DB, and force the DB to do as much work as possible.
My issue starts with 2 queries.
This ones counts fruits and veggies and aggregates everything into a neat map.
all_counts =
table_A
|> join(:left, [item_A], item_B in table_B,
on:
item_A.home_id == item_B.home_id and
item_A.path == item_B.path
)
|> select([unfiltered_item, filtered_item], %{
path: item_A.path,
item_fruits_count: coalesce(item_A.fruits, 0),
item_veggies_count: coalesce(item_B.veggies, 0),
dataset_id: item_A.home_id
})
|> subquery()
The second one, joins 2 tables as well (items and photos), with nothing fancy:
file_info =
table_C
|> join(:inner, [item], file in table_D,
on:
item.id == file.item_id and not file.deleted
)
|> select([item, file], %{
item_id: item.id,
home_id: item.home_id,
path: item.path,
photo_key: file.photo_key
})
|> subquery()
Now the problem is that I need to merge these 2 together. At first, one would think to do something like this:
result =
all_counts
|> join(:inner, [c], f in ^file_info, on: c.home_id == f.home_id and c.path == f.path)
|> select([c, f], %{
item_id: f.item_id,
home_id: f.home_id,
path: f.path,
photo_key: f.photo_key,
# ... you get the idea
})
|> Repo.all()
But this creates an issue, namely, the it will return so much data, the machines will run out of memory.
The approach I am using to solve this problem is to group items by home_id
and path
(since that is unique for each destination) and then return only a portion of the data I need, lets say, the top 3 items ordered by id.
Source: https://stackoverflow.com/a/16721423/1337392
Here is where my difficulties begin. I cannot use pSQL directly, I must use Ecto (for reasons beyond this post).
Normally I would use CTEs or row_number()
:
With ctes:
WITH cte AS
( SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name
ORDER BY value DESC
)
AS rn
FROM t
)
SELECT name, value, rn
FROM cte
WHERE rn <= 3
ORDER BY name, rn ;
With row_number
:
SELECT name, value, rn
FROM
( SELECT name, value,
ROW_NUMBER() OVER (PARTITION BY name
ORDER BY value DESC
)
AS rn
FROM t
) tmp
WHERE rn <= 3
ORDER BY name, rn ;
However, I am not familiar enough with Ecto to know how to use them.
With CTEs, I understand I should avoid them, as they serve no purpose in Ecto: https://hexdocs.pm/ecto/Ecto.Query.html#with_cte/3
With row_number()
I would need to partition by both home_id
and path
(2 fields) instead of one:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0
How do I get the result
, to return the top 3 results, grouped by home_id
and path
and ordered by item_id
using Ecto?
From what I gathered, there are two decent possible solutions to this conundrum.
row_number
+ over
One of them is using row_number()
with over()
from Ecto:
https://hexdocs.pm/ecto/Ecto.Query.WindowAPI.html#row_number/0
Assuming I join both file_info
with all_counts
in a single table, I can then perform the query as mentioned in the previous SO post I mentioned in my question:
file_info_with_counts
|> select([fi], %{
rn: over(row_number(), partition_by: [fi.home_id, fi.path], order_by: [asc: fi.item_id]),
item_id: fi.item_id,
# you get the idea ...
})
|> subquery()
IO.inspect(file_info_with_counts |> where([c], c.rn <= 3) |> Repo.all()
Which prints what I wanted.
However, as mentioned by some people in the community, this solution is rather old, and these days lateral joins seem to also cover this use case.
tops =
from top in "file_info_with_counts",
where: top.home_id == parent_as(:parent).home_id,
where: top.path == parent_as(:parent).path,
order_by: [asc: top.item_id],
limit: 3,
select: %{id: top.id}
from parent in "file_info_with_counts",
as: :parent,
group_by: [parent.home_id, parent.path],
lateral_join: top in subquery(tops),
on: true,
select: %{home_id: parent.home_id, path: parent.path, item_id: top.id}
This solution is not without merit, however, given my familiarity with row_number
I opted for that solution instead.
Unless there is a considerable performance difference between the two in favor of lateral joins, I will keep the previous solution.
For more info, here is the original source where I got these answers: https://elixirforum.com/t/how-to-get-the-top-x-results-of-a-given-category-using-ecto/53805/8