Search code examples
sqlpostgresqlannotationsantlr4

Annotate columns, with metadata in SQL


I want to annotate my DDL commands with metadata for mutiple data stores' SQL like PostgreSQL, SnowflakeSQL...

ex:

CREATE TABLE dummy (col_name type ... `*metadata_annotation*`, ...);

And I also want to make it mandatory that every column has a metadata annotation. Later I want to remove these annotaions and create a map out of it and restore the original DDL SQL commands. ex_map:
table: [col_name: metadata,...]
stripped out sql:

CREATE TABLE dummy (col_name type ... , ...);
  • I thought of making changes to the SQL grammar in antlr and configure everything through the parse tree visitor.

  • Or having a regex that can do the same, as there are no unambigious grammar available.


Solution

  • The SQL standard includes nothing for storing metadata.

    You can either:

    Comment your SQL source

    With your own defined syntax that you'll parse, inside standard SQL comments:

    CREATE TABLE dummy (col_name type ... /*@metadata_annotation*/, ...);
    

    but you've got not guaranty that the RDBMS will retain them (in information_schema for example).

    It is a case by case (RDBMS-by-RDBMS) solution, unless you control the DDL workflows.

    Use the RDBMS-provided facility

    There too, define your own naming.

    Example in PostgreSQL:

    CREATE TABLE dummy (col_name type ..., ...);
    
    COMMENT ON COLUMN dummy.col_name IS '@metadata_annotation';
    

    There you are even more tight to the RDBMS you want to run on.

    Have your own meta-schema
    CREATE TABLE my_annotations (s text, t text, c text, comment text);
    INSERT INTO my_annotations VALUES ('my_schema', 'dummy', 'col_name', 'metadata_annotation');
    

    This has the benefits of:

    • being cross-RDBMS
      you'll find some software vendors use it extensively to avoid being locked to a specific database… even when they are the RDBMS vendor too (e.g. Siebel).
      On a related topic, database schema migration tools do store their metadata this way too
    • well organized comments
      instead of a fragile syntax that a single character can make unparseable
    • being separated
      you're not on competition for a single field, between the metadata writers and those who would like more human-friendly comments
    • fine authorizations on who can write the schema
      or even a stored procedure to validate
    • automatization
      all your checking (is each column tagged?) can be done in SQL
    Dissociating input from storage

    Note that you can combine this solution (for storing) and solution 1 (for annotating), if you control the DDL workflow and can intercept annotations in the SQL to transform them to INSERTs to your metaschema.

    This would allow to validate the DDL SQL (any unannotated DDL would be rejected), but requires an extensive parser (not only on CREATE TABLE, but on ALTER TABLE too, be it to modify columns or to add new ones).

    On the other hand, the declarative way in a custom schema-describing table necessarily decorrelates the annotations from the creation,
    but you can run your validator (ensure that each column of information_schema.columns has a corresponding entry in your metaschema table) after each DDL file is played, and even on a regular basis (to ensure nothing bypassed your DDL workflow, creating columns either manually or via a store proc you couldn't catch).

    Redundancy of information

    Of course your metainformation will concentrate on functional annotations,
    not technical ones that you can infer from the RDBMS-provided catalog,
    like types and even constraints (information_schema.key_column_usage on PostgreSQL, MySQL, SQL Server and so on, all_cons_columns on Oracle).

    If you really need to cross functional annotations to technical reality,
    either have a view that aggregates both,
    or, if you get them dedicated entries in your metaschema (for example you need to dump your DB with all its description in order to later restore it on another RDBMS where the description will serve to reconstruct all constraints in that new RDBMS' specific DDL),
    make frequent consistency checks between the stored version and the database's catalog views.