Tags: postgresql, sqlalchemy, query-performance, postgresql-performance, postgresql-10

Counting unread news in a large table


I have a pretty common (or so I think) database structure: there are news items (News(id, source_id)), and each news item has a source (Source(id, url)). Sources are aggregated into topics (Topic(id, title)) via TopicSource(source_id, topic_id). In addition, there are users (User(id, name)) who can mark news as read via NewsRead(news_id, user_id). Here is a diagram to clear things up: (db diagram)
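For reference, here is a minimal sketch of that schema in SQL. Only the table and column names come from the question; the types and constraints are assumptions:

-- Schema sketch; types and constraints are assumed, only the
-- table/column names come from the question above.
CREATE TABLE source       (id bigint PRIMARY KEY, url text);
CREATE TABLE topic        (id bigint PRIMARY KEY, title text);
CREATE TABLE "user"       (id bigint PRIMARY KEY, name text);
CREATE TABLE news         (id bigint PRIMARY KEY,
                           source_id bigint REFERENCES source (id));
CREATE TABLE topic_source (source_id bigint REFERENCES source (id),
                           topic_id  bigint REFERENCES topic (id),
                           PRIMARY KEY (source_id, topic_id));
CREATE TABLE news_read    (news_id bigint REFERENCES news (id),
                           user_id bigint REFERENCES "user" (id),
                           PRIMARY KEY (news_id, user_id));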

I want to count unread news per topic for a specific user. The problem is that the news table is big (10^6 - 10^7 rows). Fortunately, I don't need an exact count: it's OK to stop counting at a threshold and return that threshold as the count.

Following this answer, for a single topic I came up with the following query:

SELECT t.topic_id, count(1) as unread_count
FROM (
 SELECT 1, topic_id
 FROM news n
   JOIN topic_source t ON n.source_id = t.source_id
   -- join news_read to filter already read news
   LEFT JOIN news_read r
     ON (n.id = r.news_id AND r.user_id = 1)
 WHERE t.topic_id = 3 AND r.user_id IS NULL
 LIMIT 10 -- Threshold
) t GROUP BY t.topic_id;

(query plan 1). This query takes about 50 ms on the test DB, which is acceptable.
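A plan like this depends on suitable indexes. The question doesn't list any, so the following is only an assumption about what could support the index scans discussed below:

-- Assumed supporting indexes (not given in the question):
CREATE INDEX ON topic_source (topic_id, source_id);
CREATE INDEX ON news (source_id);
-- A (news_id, user_id) primary key on news_read already
-- covers the anti-join lookup against news_read.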

Now I want to select unread counts for multiple topics at once. I tried to select like this:

SELECT
  t.topic_id,
  (SELECT count(1)
   FROM (SELECT 1 FROM news n
          JOIN topic_source tt ON n.source_id = tt.source_id
          LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
          WHERE tt.topic_id = t.topic_id AND r.user_id IS NULL
          LIMIT 10 -- Threshold
        ) t) AS unread_count
FROM topic_source t WHERE t.topic_id IN (1, 2) GROUP BY t.topic_id;

(query plan 2). But for reasons unknown to me it takes about 1.5 s on the test data, while the sum of the individual queries should come to about 0.2-0.3 s.

I'm clearly missing something here. Is there a mistake in the second query? Is there a better (faster) way to select the count of unread news?

Additional info:

Table sizes:

News - 10^6 - 10^7
User - 10^3
Source - 10^4
Topic - 10^3
TopicSource - 10^5
NewsRead - 10^6

UPD: The query plans clearly show I messed up the second query. Any clues are appreciated.

UPD2: I tried this query with a lateral join, which is supposed to simply run the first (the fastest) query for each topic_id:

SELECT
  id,
  count(*)
FROM topic t
  LEFT JOIN LATERAL (
     SELECT ts.topic_id
     FROM news n
       LEFT JOIN news_read r
         ON (n.id = r.news_id AND r.user_id = 1)
       JOIN topic_source ts ON n.source_id = ts.source_id
     WHERE ts.topic_id = t.id AND r.user_id IS NULL
     LIMIT 10
) p ON TRUE
WHERE t.id IN (4, 10, 12, 16)
GROUP BY t.id;

(query plan 3). But it seems that the Pg planner has a different opinion on this: it runs very slow seq scans and hash joins instead of index scans and merge joins.
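As an aside, a standard way to check whether the planner is merely mispricing the index plan (as opposed to being unable to build it) is to disable seq scans for one session and re-run EXPLAIN. enable_seqscan is a real PostgreSQL setting, but applying it here is a diagnostic suggestion, not something from the original question:

-- Diagnosis only, not a production fix: make seq scans look
-- prohibitively expensive so the planner reveals its index plan.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT ...;  -- the lateral-join query above
ROLLBACK;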


Solution

  • After some benchmarking I finally settled on a simple UNION ALL query, and it's ten times faster than the lateral join on my data:

    SELECT
      p.topic_id,
      count(*)
    FROM (
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 4 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
           UNION ALL
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 10 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
           UNION ALL
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 12 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
           UNION ALL
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 16 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
         ) p
    GROUP BY p.topic_id;
    

    (execution plan)

    The intuition here is that by specifying the topic_ids explicitly, one provides the Pg planner with enough information to build an efficient plan.

    From the SQLAlchemy point of view it's pretty straightforward:

from sqlalchemy import and_, func

# db (the session) and the News, TopicSource, NewsRead models are
# defined elsewhere, as are the inputs, e.g.
# topic_ids = [4, 10, 12, 16]
# user_id = 1
topic_queries = []
for topic_id in topic_ids:
    topic_query = (
        db.session.query(News.id, TopicSource.topic_id)
        .join(TopicSource, TopicSource.source_id == News.source_id)
        # LEFT JOIN the NewsRead table to keep only unread news
        # (rows where NewsRead.user_id IS NULL)
        .outerjoin(NewsRead,
                   and_(NewsRead.news_id == News.id,
                        NewsRead.user_id == user_id))
        .filter(TopicSource.topic_id == topic_id,
                NewsRead.user_id.is_(None))
        .limit(100))
    topic_queries.append(topic_query)
# Unite the per-topic queries with UNION ALL
union_query = topic_queries[0].union_all(*topic_queries[1:])
# Group the union by topic_id and count unreads
counts = (union_query
          # Using `with_entities(func.count())` to avoid
          # a subquery.  See link below for info:
          # https://gist.github.com/hest/8798884
          .with_entities(TopicSource.topic_id.label('topic_id'),
                         func.count().label('unread_count'))
          .group_by(TopicSource.topic_id))
result = counts.all()
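Note that a topic with no unread news contributes no rows to the UNION, so it is simply absent from result, and because of the LIMIT a returned count of 100 really means "100 or more". Zero counts, if needed, have to be filled in on the application side.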