Search code examples
c#sqldatedatetimedatabase-design

What would be a good approach of storing inexact/imprecise dates?


In the application I'm working on, I need to store some metadata which includes dates that can be inexact. As an example, let's say that I want to store the release date of a product. Some examples of possible dates that could be in my data set:

  • 2023-01-10 12:00:00 - we know exactly when it was released (including time), e.g. they had a release event at this time
  • 2010-01-10 - don't know the time, but that was the exact date
  • 2000-01 - no exact date, but we know it was in January
  • 2000-Q1 - no exact date, but we know it was in the first quarter
  • 1995 - no exact date, we only know the year
  • 1990s - we don't know even the year, just that it was some time in the 90s

My goal is being able to provide a search filter that can handle all these different and inexact times. For context, I'm working in C#/.NET with PostgreSQL.

I'm thinking of storing the inexact date as a DateTime range (having a "Begin" and an "End"). Parsing basically means detecting the format (which can be done with a bit of Regex), and then generating the range based on that. Searching is a bit of a challenge, we basically need to calculate the intersection between this range and the queried range; I'm not sure if it can be done in SQL, but it can be done. One problem with this approach is converting this structure back to a string; I'm thinking of just storing the original string that I originally parsed.

Do you think this is a good approach? Do you have a better idea?


Solution

  • Rahatur's answer is a good one, just lacking in implementation details.

    To store an approximate date you'll need two pieces of information:

    • The approximation itself, as a DATE.
    • How precise that approximation is.

    In your question you mention a variety of options. Let's stuff them in an enumeration for documentation purposes (and use from your C# code) and add a couple more resolution options:

    enum EDatePrecision
    {
        Century = 1,
        Decade = 2,
        Year = 3,
        Quarter = 4,
        Month = 5,
        Week = 6,
        Day = 7
    }
    

    We'll store the precision as a TINYINT in the database.

    Next we need a way to determine the range of valid values covered by an approximation. This isn't too hard in SQL, so we'll define a couple of UDFs to do the work. In MSSQL that could be something like:

    CREATE FUNCTION dbo.ApproxDateLowerBound(@when DATE, @prec TINYINT)
    RETURNS DATE
    AS
    BEGIN
        RETURN CASE @prec
            WHEN 1 THEN DATEFROMPARTS((YEAR(@when) / 100) * 100, 1, 1)
            WHEN 2 THEN DATEFROMPARTS((YEAR(@when) / 10) * 10, 1, 1)
            WHEN 3 THEN DATEFROMPARTS(YEAR(@when), 1, 1)
            WHEN 4 THEN DATEFROMPARTS(YEAR(@when), CASE DATEPART(quarter, @when) WHEN 1 THEN 1 WHEN 2 THEN 4 WHEN 3 THEN 7 ELSE 10 END, 1)
            WHEN 5 THEN DATEFROMPARTS(YEAR(@when), MONTH(@when), 1)
            WHEN 6 THEN DATEADD(d, 1 - DATEPART(WEEKDAY, @when), @when)
            ELSE @when
        END;
    END;
    
    CREATE FUNCTION FUNCTION dbo.ApproxDateUpperBound(@when DATE, @prec TINYINT)
    RETURNS DATE
    AS
    BEGIN
        RETURN CASE @prec 
            WHEN 1 THEN DATEFROMPARTS((YEAR(@when) / 100) * 100 + 99, 12, 31)
            WHEN 2 THEN DATEFROMPARTS((YEAR(@when) / 10 ) * 10 + 9, 12, 31)
            WHEN 3 THEN DATEFROMPARTS(YEAR(@when), 12, 31)
            WHEN 4 THEN DATEADD(d, -1 , DATEADD(m, 3, DATEFROMPARTS(YEAR(@when), CASE DATEPART(quarter, @when) WHEN 1 THEN 1 WHEN 2 THEN 4 WHEN 3 THEN 7 ELSE 10 END, 1)))
            WHEN 5 THEN DATEADD(d, -1, DATEFROMPARTS(YEAR(@when), MONTH(@when) + 1, 1))
            WHEN 6 THEN DATEADD(d, 7 - DATEPART(WEEKDAY, @when), @when)
            ELSE @when
        END;
    END;
    

    Now you can get some defined boundaries for your approximations and use BETWEEN for comparison.


    A few notes...

    DATETIME Values

    Obviously the solution above is for dates only (notice all those places I used DATE instead of DATETIME). It can be extended to cover additional precision steps for hours, quarter hours, minutes, seconds... whatever you like. The logic should be fairly clear.

    Other Approximations

    The above works reasonably well for approximations like "some time in the 2020s" or "in the 21st century", but sometimes you mean "a month either side of" or "a few days around" instead. Just add these as additional options in the generating UDFs and all will be well. Widen the precision field size if you need to.

    Performance

    Computing the range every time is not necessarily ideal, performance-wise. You'd be better off storing that range somewhere, either in the table as extra fields or in a separate cache table keyed on the date and precision, then join that during your query:

    CREATE TABLE ApproximateDateRanges 
    (
        [Date] DATE NOT NULL,
        [Precision] TINYINT NOT NULL,
        RangeStart DATE NOT NULL,
        RangeEnd DATE NOT NULL
        
        PRIMARY KEY (Date, Precision)
    );
    
    CREATE TABLE Sample
    (
        ID INT IDENTITY PRIMARY KEY,
        ApproxDate DATE NOT NULL,
        ApproxPrec TINYINT NOT NULL
    );
    
    -- Some sample data
    
    INSERT INTO Sample(ApproxDate, ApproxPrec)
    VALUES
        ('2023-10-25', 4),
        ('2023-10-25', 5)
    
    -- Update ApproximateDateRanges after modifying rows in `Sample`
    MERGE INTO ApproximateDateRanges t
    USING (SELECT DISTINCT ApproxDate [Date], ApproxPrec [Precision] FROM Sample) s
    ON t.[Date] = s.[Date] AND t.[Precision] = s.[Precision]
    
    WHEN NOT MATCHED BY TARGET THEN 
        INSERT([Date], [Precision], RangeStart, RangeEnd)
        VALUES(s.[Date], s.[Precision], dbo.ApproxDateLowerBound(s.[Date], s.[Precision]), dbo.ApproxDateUpperBound(s.[Date], s.[Precision]))
    
    WHEN NOT MATCHED BY SOURCE THEN
        DELETE;
    
    -- Query
    
    DECLARE @targetDate DATE = '2023-12-01';
    
    SELECT s.*
    FROM Sample s
    JOIN ApproximateDateRanges r ON s.ApproxDate = r.[Date] AND s.ApproxPrec = r.[Precision]
    WHERE @targetDate BETWEEN r.RangeStart AND r.RangeEnd;
    

    This gives a nice balance between query speed (faster than calling the UDFs in the query) and storage (smaller than storing the ranges in-record, especially when there are a lot of collisions). It is however more sensitive to missed updates of the ApproximateDateRanges table.

    Pre-Calculated Ranges

    In some cases where the full range of valid dates is restricted - say to the dates in the first half of this century or the last 70 years or something - then you can pre-calculate all of the date ranges and add them to a table. This way you don't have to worry about updating the ApproximateDateRanges table after every edit to the data in other tables... but you do have to fill in about 2.6K records for each valid year you want to cover.

    Might not be for you. Not everyone wants to store a table with 0.25 million records that will mostly go unused. It's pretty fast though. (Check out how and why date dimensions are used in Data Warehouses.)