Search code examples
sqlpostgresqlsqlalchemyleft-join

Why does adding an extra column and another LEFT OUTER JOIN from another table cause the value in my original query to change?


I'm building a Reddit clone that allows users to create posts, vote on other people's posts, and leave comments on other people's posts. When I query the database for a list of all of the posts, I would like my query to return data like the example table below.

Anticipated Table

post_id Number of Upvotes Number of Downvotes Number of Comments
1 2 0 5
2 1 0 0
3 0 1 0
4 0 1 4

Unfortunately, when I query the database, the result is unexpected.

Unexpected Table

post_id Number of Upvotes Number of Downvotes Number of Comments
1 10 0 5
2 1 0 0
3 0 1 0
4 0 4 4

The number of upvotes becomes equal to the number of comments on a post multiplied by the number of upvotes or downvotes. Below is the code.

SQL Alchemy Query and SQL Version

posts_query = db.query(
      models.Post.id,
      func.count(models.Vote.post_id).filter(models.Vote.upvote == True).label("num_upvotes"),
      func.count(models.Vote.post_id).filter(models.Vote.upvote == False).label("num_downvotes"),
      func.count(models.Comment.post_id).label('num_comments')
    ).join(
      models.Vote, models.Vote.post_id == models.Post.id, isouter=True
    ).join(
      models.Comment, models.Comment.post_id == models.Post.id, isouter=True
    ).group_by(
      models.Post.id
    ).all()
SELECT posts.id AS posts_id,
count(votes.post_id) FILTER (WHERE votes.upvote = true) AS num_upvotes,
count(votes.post_id) FILTER (WHERE votes.upvote = false) AS num_downvotes,
count(comments.post_id) AS num_comments 
FROM posts
LEFT OUTER JOIN votes ON votes.post_id = posts.id
LEFT OUTER JOIN comments ON comments.post_id = posts.id
GROUP BY posts.id

SQL Alchemy Models

class Post(Base):
  __tablename__ = "posts"

  id = Column(Integer, primary_key=True, nullable=False)
  title = Column(String, nullable=False)
  content = Column(String, nullable=False)
  published = Column(Boolean, server_default="True", nullable=False)
  created_at = Column(TIMESTAMP(timezone=True), 
    server_default=text('now()'), nullable=False)
  owner_id = Column(Integer, ForeignKey("users.id", 
    ondelete="CASCADE"), nullable=False)
  owner = relationship("User", backref="posts")


class User(Base): 
  __tablename__ = "users"

  id = Column(Integer, primary_key=True, nullable=False)
  username = Column(String, nullable=False, unique=True)
  email = Column(String, nullable=False, unique=True)
  password = Column(String, nullable=False)
  created_at = Column(TIMESTAMP(timezone=True), 
    server_default=text('now()'), nullable=False)


class Vote(Base):
  __tablename__ = "votes"

  user_id = Column(Integer, ForeignKey("users.id", 
    ondelete="CASCADE"), primary_key=True)
  user = relationship("User", backref="votes")
  post_id = Column(Integer, ForeignKey("posts.id", 
    ondelete="CASCADE"), primary_key=True)
  post = relationship("Post", backref="votes")
  upvote = Column(Boolean, nullable=False)

class Comment(Base):
  __tablename__ = "comments"

  id = Column(Integer, primary_key=True, nullable=False)
  owner_id = Column(Integer, ForeignKey("users.id", 
    ondelete="CASCADE"))
  owner = relationship("User", backref="comments")
  post_id = Column(Integer, ForeignKey("posts.id", 
    ondelete="CASCADE"))
  post = relationship("Post", backref="comments")
  content = Column(String, nullable=False)
  created_at = Column(TIMESTAMP(timezone=True), 
    server_default=text('now()'), nullable=False)

First Query

I've tried removing the second Left Join on the comments table and the query that attempts to count the number of comments. The following query makes the number of upvotes and downvotes act correctly.

posts_query = db.query(
      models.Post.id,
      func.count(models.Vote.post_id).filter(models.Vote.upvote == True).label("num_upvotes"),
      func.count(models.Vote.post_id).filter(models.Vote.upvote == False).label("num_downvotes"),
    ).join(
      models.Vote, models.Vote.post_id == models.Post.id, isouter=True
    ).group_by(
      models.Post.id
    ).all()
SELECT posts.id AS posts_id,
count(votes.post_id) FILTER (WHERE votes.upvote = true) AS num_upvotes,
count(votes.post_id) FILTER (WHERE votes.upvote = false) AS num_downvotes, 
FROM posts
LEFT OUTER JOIN votes ON votes.post_id = posts.id
GROUP BY posts.id
post_id Number of Upvotes Number of Downvotes
1 2 0
2 1 0
3 0 1
4 0 1

Second Query

I've also tried only removing the function that counts the number of comments while leaving the second LEFT OUTER JOIN in place.

posts_query = db.query(
      models.Post.id,
      func.count(models.Vote.post_id).filter(models.Vote.upvote == True).label("num_upvotes"),
      func.count(models.Vote.post_id).filter(models.Vote.upvote == False).label("num_downvotes")
    ).join(
      models.Vote, models.Vote.post_id == models.Post.id, isouter=True
    ).join(
      models.Comment, models.Comment.post_id == models.Post.id, isouter=True
    ).group_by(
      models.Post.id
    ).all()
SELECT posts.id AS posts_id,
count(votes.post_id) FILTER (WHERE votes.upvote = true) AS num_upvotes,
count(votes.post_id) FILTER (WHERE votes.upvote = false) AS num_downvotes 
FROM posts
LEFT OUTER JOIN votes ON votes.post_id = posts.id
LEFT OUTER JOIN comments ON comments.post_id = posts.id
GROUP BY posts.id
post_id Number of Upvotes Number of Downvotes
1 10 0
2 1 0
3 0 1
4 0 4

This leads me to believe the issue is with the second LEFT OUTER JOIN rather than the other counting function.

When I attempt to add the second LEFT OUTER JOIN to my query, the count that keeps track of the number of upvotes and downvotes returns an incorrect value. Why is my query giving me the Unexpected Table above and what I would need to change to return the Anticipated Table instead?


Solution

  • I don't know Alchemy, so here is my answer on SQL:

    Your mistake is that you cross join a post's votes with its comments. If you have two votes and three comments on a post, your joins create all six combinations (2 x 3 = 6). Then you count, but as you count in the cross joined result, you are counting votes and comments multifold.

    What you want to do instead is join vote counts and comment counts to the posts:

    SELECT 
      p.id AS posts_id,
      COALESCE(v.cnt_up, 0) AS num_upvotes,
      COALESCE(v.cnt_down, 0) AS num_downvotes,
      COALESCE(c.cnt, 0) AS num_comments 
    FROM posts p
    LEFT OUTER JOIN
    (
      SELECT
        post_id,
        COUNT(*) FILTER (WHERE upvote = true) AS cnt_up,
        COUNT(*) FILTER (WHERE upvote = false) AS cnt_down
      FROM votes
      GROUP BY post_id
    ) v ON v.post_id = p.id
    LEFT OUTER JOIN
    (
      SELECT post_id, COUNT(*) AS cnt
      FROM comments
      GROUP BY post_id
    ) c ON c.post_id = p.id
    ORDER BY p.id;
    

    Well, I probably shouldn't mention this, as it is considered bad practise, due to creating an unnecessarily large intermediate result and being prone to errors, when trying to add other aggregation results, but well, here it is: As long as the only aggregation function you use is COUNT, you can use your query with distinct ID counts:

    SELECT
      p.id AS posts_id,
      COUNT(DISTINCT v.id) FILTER (WHERE v.upvote = true) AS num_upvotes,
      COUNT(DISTINCT v.id) FILTER (WHERE v.upvote = false) AS num_downvotes,
      COUNT(DISTINCT c.id) AS num_comments 
    FROM posts p 
    LEFT OUTER JOIN votes v ON voves.post_id = p.id
    LEFT OUTER JOIN comments c ON c.post_id = p.id
    GROUP BY p.id
    ORDER BY p.id;