Search code examples
databaserelational-databasedatabase-schemahierarchical-data

Modeling a data-store browser


I have a connection-object browser that I want to allow a user to view of various data sources they are connected to.The viewer of the objects look something like this:

  • Connection: Remote.1234.MySQL (3 level source)

    • Database: Sales
      • Table: User
        • Field: Name -- CHAR(80)
        • Field: Age -- INT32
      • Table: Product
        • ...
      • Table: Purchase
        • ...
    • Database: Other
      • ...
  • Connection: Remote.abc.ElasticSearch (2 level source)

    • Index: Inventory
      • Field: ID -- INTEGER
      • Field: Product -- STRING
      • ...
  • Connection: Local.xyz.MongoDB (3 level source)

    • Database: Mail
      • Collection: Users
        • Field: MailboxID -- INTEGER
        • Field: Name -- STRING
      • Collection: Documents
        • ...
  • Connection: Local.xyz.SQLServer (4 level source)

    • Database: Main
      • Schema: Public
        • Table: user
          • Field: Name -- STRING
    • Database: History
      • ...

In other words, a 'Source' is a hierarchy of a known number of levels and a known 'name' for each level. While the entire hierarchy is variable, the hierarchy of any given source will always have the same number of levels and name. What might be a good way to model this relationally? My thought was to have the following:

Connection:

  • id
  • host
  • (other details)

SourceType:

  • id
  • Name

SourceTypeLevelMapping:

  • SourceTypeID
  • level (int)
  • name

ThreeLevelSource_Level1: # e.g., Database

  • ID
  • ParentID (ConnectionID)
  • Name
  • (other details)

ThreeLevelSource_Level2: # e.g., Table

  • ID
  • ParentID (Level1ID)
  • Name
  • (other details)

ThreeLevelSource_Level3: # e.g., Field

  • ID
  • ParentID (Level2ID)
  • FieldName
  • FieldType
  • (other details)

Then do the same for the other level-ed hierarchies:

  • TwoLevelSource_Level1, TwoLevelSource_Level2
  • FourLevelSource_Level1, FourLevelSource_Level2, FourLevelSource_Level3, FourLevelSource_Level4

So basically define the known hierarchies and for each new source that we add we would attach it to one of the known hierarchy levels. The alternative approach I was thinking of doing is to create a new hierarchy for each new source, but then we would be looking at literally hundreds of tables if we were to allow access to 25-50 sources.

What might be a good way to model this type of hierarchical data?

(Also, yes I am familiar with the existing general approaches for modeling hierarchical data as delineated here -- What are the options for storing hierarchical data in a relational database?, How can you represent inheritance in a database? -- the below is not a duplicate.)


Solution

  • Relational Solution

    Responding to the relational-database and hierarchic-data tags, the latter being pedestrian in the former.

    1.1 Preliminary

    Due to the requirement for, and the difference between:

    • the genuine SQL Platforms (conformation to the Standard; server architecture, unified language; etc) and
    • the pretend "SQL" programs (no architecture; bits of language spread across those programs; no Transactions; no ACID; etc) that provide no compliance to the Standard, and therefore use the term incorrectly, and
    • the non-SQLs

    Thus I use Record and Field to cover all possibilities, instead of the Relational terms, which would convey Relational definitions.

    All possibilities are catered for, but a Relational and SQL-compliant approach (eg. MS SQL Server) is taken as the best method, due to its 40-year establishment and maturity, and the absence of alternatives.

    • The collection of SQL Platforms; pretend "SQL" applications; and non-SQL suites, are labelled DataSource.

    1.2 Compliance

    This solution is 100% Relational: Codd's Relational Model, not the substandard alternatives marketed by the academics as "relational":

    • It can be implemented in any SQL compliant Platform

    • It has Relational Integrity (which is logical, beyond Referential Integrity, which is SQL and physical); Relational Power; and Relational Speed.

    • All update interaction is simple, via SQL ACID Transactions.

    • No warranties are given for pretend "SQLs" and non-SQLs.


    2 Solution

    2.1 Concept

    I appreciate that as a developer, your focus in on the data values and how to retrieve it. However, two levels of definition are required first, in order to support the third data level:

    1. Catalogue Potential
      Blue (Reference cluster).
      The DtaSources and definition that is available in the market, which the organisation might use. Let's say 42, as per your descriptions.

      • I would entrust this only to a developer, not an user_admin, because the set up it critical (the lower levels depend on it), and it describes the physical capability and limitations of each DataSource.
    2. Catalogue Actual
      Green (Identification cluster).
      The DataSources and definition that are actually contracted and used by the organisation. Let's say 12. At this point we have connection addresses; ports; and users. It is constrained CataloguePotential, directly, and via CHECKS that call Functions.

      This level defines the content (the tables that actually exist), it contains no data values.

      • Maintaining an SQL mindset, because that would be the most prudent, given that it is an established Standard, with 40 years of maturity, because it gives us the most flexibility: the CatalogueActual forms the SQL Catalogue.

      • Likewise, I have used the terms Record and Field for the objects in the collective, rather than Table and Column, which would imply Relational and SQL meanings.

      • SQL Platform
        This level can be populated automatically by the program querying the SQL Catalogue.

      • "SQL" applications and non-SQL suites
        Population is manual due to the absence of a Catalogue. It can be done by an user_admin. The constraint would be your program attempting a trial query to validate the user-supplied table definition.

    3. Current Data
      Yellow (Transaction cluster)
      The current data, that the user has queried from the DataSources, via his Connection, for the webpage. The assumption is, I have taken the user::webpage to be central, and governing (one user per Connection; one user per webpage), not the OO Object.

      • if the OO Objects are not reliable (depends on the library you use), or there is one set of Objects across all user-webpages, more Constraints need to be added.

    2.2 Method

    You need:

    1. Simple Hierarchy
      a single-parent hierarchy to replicate the fixed levels of definition in the Catalogue in the SQL servers, as well as the variable levels in the constructed catalogue for the pretend "SQLs" and the non-SQLs.

      • Relational Hierarchies are fully defined, along with SQL implementation details, in the Hierarchy doc. The simple or single-parent model is given in [§ 2.2].
      • The Root level (not the Anchor) is the Potential DataSource
      • The Leaf level is that which contains data, either a Record or a Struct (for those in the collective that allow one).
        • In the Potential Datasource, it is representative, truly a RecordType and FieldType
        • In the Actual DataSource, it is an actual Record, which is an instance of RecordType, and actual Field, which is a narrower definition of FieldType.
    2. Method/Struct
      In order to handle a Struct, which in definition terms is identical to a Record, and to allow a Struct to contain a Struct, we need a level of abstraction, which is ...

      • Article
        is either

        • a Field, which is the atomic unit of storage, xor
        • a Struct, which contains Articles
      • that requires an Exclusive Subtype cluster, fully defined along with SQL implementation details, in the Subtype doc

    3. Method/Array
      To support an Array of Fields:

      • These are multi-valued dependencies on Field, thus implemented as child tables.
        • For scalars the NumElement is 1.
        • That makes the Exclusive Subtype cluster on Field that is otherwise required for scalars redundant.

    2.3 Relational Data Model

    This is the progress after seven iterations.  It shows the Table-Relation level (the Attribute level is too large for an inline graphic).

    foo

    1. Assumption
      That the JS (or whatever) objects are local to the webpage/user.  If your objects are global, the value tables need to be constrained to Connection.

    2. The data model is given in a single PDF:

      • Table Relation level
      • Table Relation level + sample data
      • Table Attribute level + sample data.

    2.4 Notation

    • All my data models are rendered in IDEF1X, available from the early 1980's, the one and only notation for Relational Data Modelling, the Standard since 1993.

    • The IDEF1X Introduction is essential reading for those who are new to Codd's Relational Model, or its modelling method. Note that IDEF1X models complete, they are rich in detail and precision, showing all required details, whereas a home-grown model, being unaware of the imperatives of the Standard, have far less definition. Which means, the notation needs to be fully understood.