arrays postgresql typeorm query-builder requestfiltering

PostgreSQL / TypeORM: search array in array column - return only the highest arrays' intersection

Let's say we have 2 edges in a graph. Each of them has many events observed on it, and each event has one or several tags associated with it:

Let's say the first edge had 7 events with these tags: ABC, ABC, AC, BC, A, A, B.

The second edge had 3 events: BC, BC, C.

We want the user to be able to search how many events occurred on every edge, by a set of given tags, which are neither mutually exclusive nor in a strict hierarchical relationship.

We represent this schema with 2 pre-aggregated tables:

Edges table:

+----+
| id |   
+----+
| 1  |
| 2  |  
+----+

EdgeStats table (related to the Edges table via edge_id):

+------+---------+-----------+---------------+
| id   | edge_id | tags      | metric_amount |
+------+---------+-----------+---------------+
| 1    | 1       | [A, B, C] | 7             |
| 2    | 1       | [A, B]    | 7             |
| 3    | 1       | [B, C]    | 5             |
| 4    | 1       | [A, C]    | 6             |
| 5    | 1       | [A]       | 5             |
| 6    | 1       | [B]       | 4             |
| 7    | 1       | [C]       | 4             |
| 8    | 1       | null      | 7             | //null represents aggregated stats for given edge, not important here.
| 9    | 2       | [B, C]    | 3             |
| 10   | 2       | [B]       | 2             |
| 11   | 2       | [C]       | 3             |
| 12   | 2       | null      | 3             |
+------+---------+-----------+---------------+

Note that when the table has tags [A, B], for example, the row represents the number of events that had at least one of these tags associated with them: A or B, or both.
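To make the OR semantics concrete, here is a minimal TypeScript sketch that recomputes one metric_amount from the raw events listed for edge 1. The `TaggedEvent` type and the `countEventsWithAnyTag` helper are hypothetical names used only for illustration; they are not part of the actual schema or pipeline.

```typescript
// Hypothetical raw-event shape: each event carries the tags observed on it.
type TaggedEvent = { edgeId: number; tags: string[] };

// Count the events on one edge that carry at least one of the given tags (OR semantics).
function countEventsWithAnyTag(
  events: TaggedEvent[],
  edgeId: number,
  tags: string[],
): number {
  const wanted = new Set(tags);
  return events.filter(
    (e) => e.edgeId === edgeId && e.tags.some((t) => wanted.has(t)),
  ).length;
}

// The events listed for edge 1 in the question, one string per event.
const edge1Events: TaggedEvent[] = ["ABC", "ABC", "AC", "BC", "A", "A", "B"].map(
  (s) => ({ edgeId: 1, tags: s.split("") }),
);

// Events tagged A or C (or both): this is the value stored in row 4.
const countAorC = countEventsWithAnyTag(edge1Events, 1, ["A", "C"]);
```

With the sample events, `countAorC` comes out as 6, which agrees with the metric_amount of the [A, C] row for edge 1.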

Because the user can filter by any combination of these tags, DataTeam populated EdgeStats table with all permutations of tags observed per given edge (edges are completely independent of each other; however, I am looking for a way to query all edges in one query).

I need to filter this table by the tags the user selected, let's say [A, C, D]. The problem is that we don't have tag D in the data. The expected result is:

+------+---------+-----------+---------------+
| id   | edge_id | tags      | metric_amount |
+------+---------+-----------+---------------+
| 4    | 1       | [A, C]    | 6             |
| 11   | 2       | [C]       | 3             |
+------+---------+-----------+---------------+

I.e., for each edge, the largest matching subset between what the user searched for and what we have in the tags column. Rows with id 5 and 7 were not returned because the information about them is already contained in row 4.

Why return [A, C] for an [A, C, D] search? Because there is no data on edge 1 with tag D, the metric amount for [A, C] equals the one for [A, C, D].

How do I write a query that returns this?

If you can just answer the question above, you can ignore what's below:

If I needed to filter by [A], [B], or [A, B], the problem would be trivial. I could just search for an exact array match:

  query.where("edge_stats.tags = :filter", {
    // tags are strings, so they are passed as a string array
    filter: ["A", "B"],
  });

However, the EdgeStats table doesn't contain every tag combination the user can search by (because there would be too many), so I need to find a more clever solution.

Here is a list of a few possible solutions, all imperfect:

1) try an exact match for all subsets of the user's search term: if the user searches by tags [A, C, D], first try querying for [A, C, D]; if there is no exact match, try [C, D], [A, D], [A, C], and voila, we got the match!

2) use the <@ ("is contained by") operator:

  .where("edge_stats.tags <@ :tags", {
    // keep rows whose tags are all among the searched tags
    tags: ["A", "C", "D"],
  })

This returns every row whose tags are a subset of [A, C, D], so rows 4, 5, 7 and 11. It would then be possible to filter out everything but the largest subset match in application code. But with this approach we couldn't use SUM and similar functions, and returning too many rows is not good practice.

3) an approach built on 2) and inspired by this answer:

      .where("edge_stats.tags <@ :tags", {
        tags: ["A", "C", "D"],
      })
      // DISTINCT ON keeps the first row per edge, so sort the longest tag array first
      .addOrderBy("edge.id")
      .addOrderBy("CARDINALITY(edge_stats.tags)", "DESC")
      .distinctOn(["edge.id"]);

For every edge, this finds all rows whose tags are a subset of [A, C, D] and keeps the largest match (the longest array), thanks to ordering by cardinality and selecting only one row per edge.

So the returned rows are indeed 4 and 11.

This approach is great, but when I use it as one filtering step in a much larger query, I need to add a bunch of groupBy statements, and it adds a bit more complexity than I would like.

I wonder if there is a simpler approach that just gets the largest match between the array in the table's column and the array in the query argument?
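As a sanity check of the DISTINCT ON approach above, here is a minimal in-memory TypeScript model of the same selection rule, run against the sample EdgeStats rows. It is only a sketch for reasoning about the expected result, not a replacement for the query; `EdgeStat`, `isContainedBy` and `bestMatchPerEdge` are hypothetical names.

```typescript
// Hypothetical in-memory shape of one EdgeStats row.
type EdgeStat = { id: number; edgeId: number; tags: string[] | null };

// True when every element of `candidate` also appears in `search`
// (the behaviour of PostgreSQL's `<@` for these sample arrays).
function isContainedBy(candidate: string[], search: string[]): boolean {
  const wanted = new Set(search);
  return candidate.every((t) => wanted.has(t));
}

// Per edge, keep the contained row with the most tags. This mirrors
// ORDER BY edge id, CARDINALITY(tags) DESC combined with DISTINCT ON (edge id).
function bestMatchPerEdge(rows: EdgeStat[], search: string[]): EdgeStat[] {
  const best = new Map<number, EdgeStat>();
  for (const row of rows) {
    // Rows with NULL tags never pass the `<@` filter in SQL, so skip them here too.
    if (row.tags === null || !isContainedBy(row.tags, search)) continue;
    const current = best.get(row.edgeId);
    const currentLength = current && current.tags ? current.tags.length : -1;
    if (row.tags.length > currentLength) best.set(row.edgeId, row);
  }
  return Array.from(best.values()).sort((a, b) => a.edgeId - b.edgeId);
}

// The sample EdgeStats rows from the question (metric_amount omitted).
const edgeStats: EdgeStat[] = [
  { id: 1, edgeId: 1, tags: ["A", "B", "C"] },
  { id: 2, edgeId: 1, tags: ["A", "B"] },
  { id: 3, edgeId: 1, tags: ["B", "C"] },
  { id: 4, edgeId: 1, tags: ["A", "C"] },
  { id: 5, edgeId: 1, tags: ["A"] },
  { id: 6, edgeId: 1, tags: ["B"] },
  { id: 7, edgeId: 1, tags: ["C"] },
  { id: 8, edgeId: 1, tags: null },
  { id: 9, edgeId: 2, tags: ["B", "C"] },
  { id: 10, edgeId: 2, tags: ["B"] },
  { id: 11, edgeId: 2, tags: ["C"] },
  { id: 12, edgeId: 2, tags: null },
];

const bestIds = bestMatchPerEdge(edgeStats, ["A", "C", "D"]).map((r) => r.id);
```

For the search [A, C, D], `bestIds` contains 4 and 11, the two rows expected in the question.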

Solution

Your approach #3 should be fine, especially if you have an index on CARDINALITY(edge_stats.tags). However, you wrote:

"DataTeam populated EdgeStats table with all permutations of tags observed per given edge"

If you're using a pre-aggregation approach instead of running your queries on the raw data, I would recommend also recording the "tags observed per given edge" in the Edges table.

That way, you can run:

SELECT s.edge_id, s.tags, s.metric_amount
FROM "EdgeStats" s
JOIN "Edges" e ON s.edge_id = e.id
WHERE s.tags = array_intersect(e.observed_tags, $1)

using the array_intersect function from here.
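To illustrate the idea outside the database, here is a minimal in-memory TypeScript sketch of the same lookup: intersect each edge's observed tags with the search, then look for an exact match. The names `StatRow`, `arrayIntersect`, `sameArray` and `exactMatchPerEdge` are hypothetical, and the `observedTags` map stands in for the suggested `observed_tags` column. One assumption is worth stating: PostgreSQL's `=` on arrays compares elements in order, so the stored tags and the intersection result would need a consistent ordering for the exact match to hit.

```typescript
// Hypothetical in-memory shape of one EdgeStats row.
type StatRow = { id: number; edgeId: number; tags: string[] | null };

// Elements of `a` that also occur in `b`, keeping the order of `a`.
function arrayIntersect(a: string[], b: string[]): string[] {
  const inB = new Set(b);
  return a.filter((t) => inB.has(t));
}

// Order-sensitive equality, like `=` on PostgreSQL arrays.
function sameArray(a: string[], b: string[]): boolean {
  return a.length === b.length && a.every((t, i) => t === b[i]);
}

// Keep the rows whose tags equal (observed tags of the edge) ∩ (search).
function exactMatchPerEdge(
  rows: StatRow[],
  observedTags: Map<number, string[]>,
  search: string[],
): StatRow[] {
  return rows.filter((row) => {
    const observed = observedTags.get(row.edgeId);
    if (row.tags === null || observed === undefined) return false;
    return sameArray(row.tags, arrayIntersect(observed, search));
  });
}

// Stand-in for the suggested Edges.observed_tags column.
const observedTags = new Map<number, string[]>([
  [1, ["A", "B", "C"]],
  [2, ["B", "C"]],
]);

// A subset of the sample EdgeStats rows from the question.
const statRows: StatRow[] = [
  { id: 1, edgeId: 1, tags: ["A", "B", "C"] },
  { id: 4, edgeId: 1, tags: ["A", "C"] },
  { id: 5, edgeId: 1, tags: ["A"] },
  { id: 8, edgeId: 1, tags: null },
  { id: 9, edgeId: 2, tags: ["B", "C"] },
  { id: 11, edgeId: 2, tags: ["C"] },
];

const matchedIds = exactMatchPerEdge(statRows, observedTags, ["A", "C", "D"]).map(
  (r) => r.id,
);
```

For the search [A, C, D], the intersection is [A, C] on edge 1 and [C] on edge 2, so `matchedIds` contains 4 and 11. Because this is an exact match on one array per edge, no ordering or DISTINCT ON step is involved.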