Tags: postgresql, sqlalchemy, query-performance, postgresql-performance, postgresql-10

Counting unread news in a large table


I have a pretty common (or so I think) database structure: there are news items (News(id, source_id)), and each news item has a source (Source(id, url)). Sources are aggregated into topics (Topic(id, title)) via TopicSource(source_id, topic_id). In addition, there are users (User(id, name)) who can mark news as read via NewsRead(news_id, user_id). Here is a diagram to clear things up: (db diagram)
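For reference, here is a minimal sketch of that schema in SQL. Only the table and column names come from the question; the types and constraints are assumptions:

-- Schema sketch; types and constraints are assumed, only the
-- table/column names come from the question above.
CREATE TABLE source       (id bigint PRIMARY KEY, url text);
CREATE TABLE topic        (id bigint PRIMARY KEY, title text);
CREATE TABLE "user"       (id bigint PRIMARY KEY, name text);
CREATE TABLE news         (id bigint PRIMARY KEY,
                           source_id bigint REFERENCES source (id));
CREATE TABLE topic_source (source_id bigint REFERENCES source (id),
                           topic_id  bigint REFERENCES topic (id),
                           PRIMARY KEY (source_id, topic_id));
CREATE TABLE news_read    (news_id bigint REFERENCES news (id),
                           user_id bigint REFERENCES "user" (id),
                           PRIMARY KEY (news_id, user_id));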

I want to count unread news per topic for a specific user. The problem is that the news table is big (10^6 - 10^7 rows). Fortunately, I don't need an exact count: it's OK to stop counting at a threshold and return that threshold as the count.

Following this answer, for a single topic I came up with the following query:

SELECT t.topic_id, count(1) as unread_count
FROM (
 SELECT 1, topic_id
 FROM news n
   JOIN topic_source t ON n.source_id = t.source_id
   -- join news_read to filter already read news
   LEFT JOIN news_read r
     ON (n.id = r.news_id AND r.user_id = 1)
 WHERE t.topic_id = 3 AND r.user_id IS NULL
 LIMIT 10 -- Threshold
) t GROUP BY t.topic_id;

(query plan 1). This query takes about 50 ms on the test DB, which is acceptable.
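A plan like this depends on suitable indexes. The question doesn't list any, so the following is only an assumption about what could support the index scans discussed below:

-- Assumed supporting indexes (not given in the question):
CREATE INDEX ON topic_source (topic_id, source_id);
CREATE INDEX ON news (source_id);
-- A (news_id, user_id) primary key on news_read already
-- covers the anti-join lookup against news_read.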

Now I want to select unread counts for multiple topics at once. I tried to select like this:

SELECT
  t.topic_id,
  (SELECT count(1)
   FROM (SELECT 1 FROM news n
          JOIN topic_source tt ON n.source_id = tt.source_id
          LEFT JOIN news_read r
            ON (n.id = r.news_id AND r.user_id = 1)
          WHERE tt.topic_id = t.topic_id AND r.user_id IS NULL
          LIMIT 10 -- Threshold
        ) t) AS unread_count
FROM topic_source t WHERE t.topic_id IN (1, 2) GROUP BY t.topic_id;

(query plan 2). But for reasons unknown to me it takes about 1.5 s on the test data, while the sum of the individual queries should come to about 0.2-0.3 s.

I'm clearly missing something here. Is there a mistake in the second query? Is there a better (faster) way to select the count of unread news?

Additional info:

Table sizes:

News - 10^6 - 10^7
User - 10^3
Source - 10^4
Topic - 10^3
TopicSource - 10^5
NewsRead - 10^6

UPD: The query plans clearly show I messed up the second query. Any clues are appreciated.

UPD2: I tried this query with a lateral join, which is supposed to simply run the first (the fastest) query for each topic_id:

SELECT
  id,
  count(*)
FROM topic t
  LEFT JOIN LATERAL (
     SELECT ts.topic_id
     FROM news n
       LEFT JOIN news_read r
         ON (n.id = r.news_id AND r.user_id = 1)
       JOIN topic_source ts ON n.source_id = ts.source_id
     WHERE ts.topic_id = t.id AND r.user_id IS NULL
     LIMIT 10
) p ON TRUE
WHERE t.id IN (4, 10, 12, 16)
GROUP BY t.id;

(query plan 3). But it seems that the Pg planner has a different opinion on this: it runs very slow seq scans and hash joins instead of index scans and merge joins.
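As an aside, a standard way to check whether the planner is merely mispricing the index plan (as opposed to being unable to build it) is to disable seq scans for one session and re-run EXPLAIN. enable_seqscan is a real PostgreSQL setting, but applying it here is a diagnostic suggestion, not something from the original question:

-- Diagnosis only, not a production fix: make seq scans look
-- prohibitively expensive so the planner reveals its index plan.
BEGIN;
SET LOCAL enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
  SELECT ...;  -- the lateral-join query above
ROLLBACK;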


Solution

  • After some benchmarking I finally settled on a simple UNION ALL query, and it's ten times faster than the lateral join on my data:

    SELECT
      p.topic_id,
      count(*)
    FROM (
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 4 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
           UNION ALL
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 10 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
           UNION ALL
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 12 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
           UNION ALL
           SELECT *
           FROM (
                  SELECT fs.topic_id
                  FROM news n
                    LEFT JOIN news_read r
                      ON (n.id = r.news_id AND r.user_id = 1)
                    JOIN topic_source fs ON n.source_id = fs.source_id
                  WHERE fs.topic_id = 16 AND r.user_id IS NULL
                  LIMIT 100
                ) t1
         ) p
    GROUP BY p.topic_id;
    

    (execution plan)

    The intuition here is that by specifying the topic_ids explicitly, one provides the Pg planner with enough information to build an efficient plan.

    From the SQLAlchemy point of view it's pretty straightforward:

from sqlalchemy import and_, func

# db (the session) and the News, TopicSource, NewsRead models are
# defined elsewhere, as are the inputs, e.g.
# topic_ids = [4, 10, 12, 16]
# user_id = 1
topic_queries = []
for topic_id in topic_ids:
    topic_query = (
        db.session.query(News.id, TopicSource.topic_id)
        .join(TopicSource, TopicSource.source_id == News.source_id)
        # LEFT JOIN the NewsRead table to keep only unread news
        # (rows where NewsRead.user_id IS NULL)
        .outerjoin(NewsRead,
                   and_(NewsRead.news_id == News.id,
                        NewsRead.user_id == user_id))
        .filter(TopicSource.topic_id == topic_id,
                NewsRead.user_id.is_(None))
        .limit(100))
    topic_queries.append(topic_query)
# Unite the per-topic queries with UNION ALL
union_query = topic_queries[0].union_all(*topic_queries[1:])
# Group the union by topic_id and count unreads
counts = (union_query
          # Using `with_entities(func.count())` to avoid
          # a subquery.  See link below for info:
          # https://gist.github.com/hest/8798884
          .with_entities(TopicSource.topic_id.label('topic_id'),
                         func.count().label('unread_count'))
          .group_by(TopicSource.topic_id))
result = counts.all()
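Note that a topic with no unread news contributes no rows to the UNION, so it is simply absent from result, and because of the LIMIT a returned count of 100 really means "100 or more". Zero counts, if needed, have to be filled in on the application side.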