I have a table of Posts, and comments that are associated with a Post.
PostsTable
-----
Id
Title
Content
Created
CommentsTable
-------------
Id
PostId
Content
Created
I need to run analysis on the comments and posts for detection of specific words therein (and trend over time from the post or comment date):
PostCommentDetections
--------
Id
FoundWord
PostId
CommentId
I can use the above: if PostId is set, I know it's a word detection in a post title, and if CommentId is set, it's a comment detection.
However, this will need careful handling in my business logic (via C# and Postgres) to keep the data from getting corrupted. Although invalid, there would be nothing to stop both PostId and CommentId from being set at the same time, and I could also see cases of duplicate detection entries.
Before I start to code this (and the edge cases to prevent duplicates), is there any way the schema can be designed, with additional tables or other approaches/considerations, so that the integrity of the data with the above intent is inherently baked in?
A table design like that is not a big deal; you can use CHECK and UNIQUE constraints to maintain integrity.
The usual argument that PostCommentDetections be split into 2 tables, one for PostDetections and the other for CommentDetections, comes from the application developer, who intuitively understands and can work with simple required relationships with 1:N cardinality but tends to struggle with a table that has multiple optional FKs.
With a combined or dual-FK table like this, the schema itself doesn't give the developer a lot of clues as to the business requirement and how the fields should be treated; they might mistakenly assume that both PostId and CommentId are required. They will need to read the requirements document to gain a better understanding... or at least they should. Many a DBA has probably struggled with this and gone with the path of least resistance, that is, to NOT combine the tables.
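For reference, that split design would look something like this (a sketch only; the table and constraint shapes are illustrative, not something from your schema):

```sql
-- The split alternative: one detection table per parent type.
-- Each FK is required (NOT NULL), so no CHECK constraint is needed.
CREATE TABLE PostDetections (
    Id SERIAL PRIMARY KEY,
    PostId INT NOT NULL REFERENCES PostsTable (Id),
    FoundWord VARCHAR(50) NOT NULL,
    UNIQUE (PostId, FoundWord)       -- no duplicate detections per post
);

CREATE TABLE CommentDetections (
    Id SERIAL PRIMARY KEY,
    CommentId INT NOT NULL REFERENCES CommentsTable (Id),
    FoundWord VARCHAR(50) NOT NULL,
    UNIQUE (CommentId, FoundWord)    -- no duplicate detections per comment
);
```

The trade-off is that any query over all detections now needs a UNION over the two tables.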
Because the shape of the data and the reading and writing patterns for the Detections records will be unique from other tables in your schema, and due to the simplicity of the structure, a single table for PostCommentDetections makes enough sense to the DBA in me.
So, for a single PostCommentDetections table:
- Make both PostId and CommentId optional, that is to say they accept NULL
- Add a CHECK constraint to prevent PostId AND CommentId from containing a non-null value at the same time, whilst also requiring that one of them IS NOT NULL
- Add a UNIQUE constraint to prevent duplicate combinations of PostId, CommentId, FoundWord
This will maintain the simple table structure and prevent the developers from getting it wrong. In SQL Server that looks like this: (See fiddle: http://sqlfiddle.com/#!18/70be2)
CREATE TABLE PostsTable (
Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
Title VARCHAR(100) NOT NULL,
Content VARCHAR(2000) NULL,
Created DateTimeOffset NOT NULL DEFAULT(SysDateTimeOffset())
);
CREATE TABLE CommentsTable (
Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
PostId INT NOT NULL FOREIGN KEY REFERENCES PostsTable(Id),
Content VARCHAR(2000) NULL,
Created DateTimeOffset NOT NULL DEFAULT(SysDateTimeOffset())
);
CREATE TABLE PostCommentDetections (
Id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
PostId INT NULL FOREIGN KEY REFERENCES PostsTable(Id),
CommentId INT NULL FOREIGN KEY REFERENCES CommentsTable(Id),
FoundWord VARCHAR(50) NOT NULL,
CHECK((PostId IS NOT NULL AND CommentId IS NULL) OR (PostId IS NULL AND CommentId IS NOT NULL)),
CONSTRAINT PostCommentDetections_Word_CTX UNIQUE (PostId, CommentId, FoundWord)
);
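To illustrate how the constraints behave, here are some hypothetical inserts (sample data only; this assumes a post with Id 1 and a comment with Id 1 already exist):

```sql
-- OK: a detection against a post only
INSERT INTO PostCommentDetections (PostId, CommentId, FoundWord)
VALUES (1, NULL, 'spam');

-- Rejected by the CHECK constraint: both FKs set
INSERT INTO PostCommentDetections (PostId, CommentId, FoundWord)
VALUES (1, 1, 'spam');

-- Rejected by the CHECK constraint: neither FK set
INSERT INTO PostCommentDetections (PostId, CommentId, FoundWord)
VALUES (NULL, NULL, 'spam');

-- Rejected by the UNIQUE constraint: duplicate of the first row
INSERT INTO PostCommentDetections (PostId, CommentId, FoundWord)
VALUES (1, NULL, 'spam');
```

The last insert is caught because SQL Server treats NULLs as equal for the purposes of a unique constraint, so two rows differing only in a NULL column count as duplicates.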
The only major drawback to this design is that the intent is not obvious, but it's a simple enough concept, and these constraints will mean that only data valid for your use case makes it into the detections table.
Something to consider: is this a good design idea in the first place? Whilst you can make it work in the primary application database, we usually do the level of analytics you are describing in a separate Business Intelligence layer, either a data warehouse or some other OLAP or online indexing service.
So while you can do this directly in SQL, you might find there are better tools or services that can give you the same level of deep insight without having to roll the algorithm on your own. Focus on making sure you have the data at all; then you can process it in all sorts of different dimensions.
Since you mentioned Postgres, the equivalent schema there looks like this:
CREATE TABLE PostsTable (
Id SERIAL PRIMARY KEY NOT NULL,
Title VARCHAR(100) NOT NULL,
Content VARCHAR(2000) NULL,
Created timestamp NOT NULL DEFAULT(NOW())
);
CREATE TABLE CommentsTable (
Id SERIAL PRIMARY KEY NOT NULL,
PostId INT NOT NULL,
Content VARCHAR(2000) NULL,
Created timestamp NOT NULL DEFAULT(NOW()),
CONSTRAINT FK_Post FOREIGN KEY(PostId) REFERENCES PostsTable (Id)
);
CREATE TABLE PostCommentDetections (
Id SERIAL PRIMARY KEY NOT NULL,
PostId INT NULL,
CommentId INT NULL,
FoundWord VARCHAR(50) NOT NULL,
CONSTRAINT FK_Post FOREIGN KEY(PostId) REFERENCES PostsTable (Id),
CONSTRAINT FK_Comment FOREIGN KEY(CommentId) REFERENCES CommentsTable (Id),
CHECK ((PostId IS NOT NULL AND CommentId IS NULL) OR (PostId IS NULL AND CommentId IS NOT NULL)),
UNIQUE (PostId, CommentId, FoundWord)
);
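One Postgres-specific caveat worth flagging: unlike SQL Server, Postgres by default treats NULLs as distinct in a unique constraint, so two rows like (NULL, 5, 'spam') would both be accepted and the duplicate protection above is incomplete. Two ways to close that gap, sketched here (constraint and index names are illustrative):

```sql
-- Option 1 (PostgreSQL 15+): make NULLs compare equal in the constraint
-- (after dropping the plain UNIQUE constraint above)
ALTER TABLE PostCommentDetections
    ADD CONSTRAINT UQ_Detection
    UNIQUE NULLS NOT DISTINCT (PostId, CommentId, FoundWord);

-- Option 2 (any version): one partial unique index per parent type
CREATE UNIQUE INDEX UX_PostDetection
    ON PostCommentDetections (PostId, FoundWord)
    WHERE PostId IS NOT NULL;

CREATE UNIQUE INDEX UX_CommentDetection
    ON PostCommentDetections (CommentId, FoundWord)
    WHERE CommentId IS NOT NULL;
```

Either way the effective rule is the same: one row per (post, word) and one row per (comment, word).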