Search code examples
databasedatabase-normalization

Database Design: Multiple tables vs a single table for research related items


I am making a research repository where there are different types of research items such as conferences, publications, patents, keynotes etc. The data will be inserted after getting from the relevant sources, processing it and then inserting in a batch from excel sheet. The main operation will be querying the data according to the logged-in user like researcher related information for an individual, department/unit related information (mainly summing up the rows) for the chairperson and so on. Now when I approach this, I see two options:

    Make two tables, one for the research item type and the other for the actual item 
    Make individual tables for all type of objects 

The problem with the 1st structure is that I will a huge main table with empty/null columns. But it will allow me to easily add another research item in future, since I can simply add the new item in the "type" table and then add the actual data in the common table.

However, the second approach allows me to only query the relevant table to get the information, hence no empty/null values. The drawback is I will not be able to add new research item in this structure, and I need to add new table for the new item type.

If I may ask, which of the two strategies would you recommend and why?

The 1st one entails multiple database queries, and the second one entails a large single table.

If it helps, I am using MS SQL server.


Solution

  • The problem you're facing is the resolution of a hierarchy in an ER model. You have a parent entity, or generalization (RESEARCH_ITEM) that can be instantiated in different ways (your child entities, like PUBLICATION, PATENT, and so on).

    To implement this hierarchy in the physical layer (i.e. creating the tables) you have to consider which properties this hierarchy has. In particular, you have to ask yourself:

    • Overlap constraint: can an instance of the parent entity belong to more than one child entity?
    • Covering constraint: do the child entities cover all the possbile instances of the parent entity?

    Combining this two criteria, we have four possbile cases:

    1. Total disjoint: the child entities cover all the possibile instances without overlap;
    2. Partial disjoint: the child entities don't cover all the possibile instances and there's no overlap;
    3. Total overlapping: the child entities cover all the possibile instances with potential overlap;
    4. Partial overlapping: the child entities don't cover all the possibile instances and there's possible overlap.

    The resolution of the hierarchy depends upon the scenario. If your hierarchy is a total-disjoint one, the best thing to do would be to eliminate the parent entity and to incorporate its attributes in the child entities (faster queries, cleaner tables). On the other hand, if there is overlapping, this solution is not optimal, because you'd have duplication of data (the same row in two child tables). In this case you could opt for the incorporation of the children in the parent, with possible NULL fields for child-specific attributes.

    Moreover, in order to design the better implementation, you'd have to consider how the data are accessed (Is there a child that I know will be queried against very often? In this case a separate table would be good).