I am making a research repository where there are different types of research items such as conferences, publications, patents, keynotes etc. The data will be inserted after getting from the relevant sources, processing it and then inserting in a batch from excel sheet. The main operation will be querying the data according to the logged-in user like researcher related information for an individual, department/unit related information (mainly summing up the rows) for the chairperson and so on. Now when I approach this, I see two options:
Make two tables, one for the research item type and the other for the actual item
Make individual tables for all type of objects
The problem with the 1st structure is that I will a huge main table with empty/null columns. But it will allow me to easily add another research item in future, since I can simply add the new item in the "type" table and then add the actual data in the common table.
However, the second approach allows me to only query the relevant table to get the information, hence no empty/null values. The drawback is I will not be able to add new research item in this structure, and I need to add new table for the new item type.
If I may ask, which of the two strategies would you recommend and why?
The 1st one entails multiple database queries, and the second one entails a large single table.
If it helps, I am using MS SQL server.
The problem you're facing is the resolution of a hierarchy in an ER model. You have a parent entity, or generalization (RESEARCH_ITEM) that can be instantiated in different ways (your child entities, like PUBLICATION, PATENT, and so on).
To implement this hierarchy in the physical layer (i.e. creating the tables) you have to consider which properties this hierarchy has. In particular, you have to ask yourself:
Combining this two criteria, we have four possbile cases:
The resolution of the hierarchy depends upon the scenario. If your hierarchy is a total-disjoint one, the best thing to do would be to eliminate the parent entity and to incorporate its attributes in the child entities (faster queries, cleaner tables). On the other hand, if there is overlapping, this solution is not optimal, because you'd have duplication of data (the same row in two child tables). In this case you could opt for the incorporation of the children in the parent, with possible NULL fields for child-specific attributes.
Moreover, in order to design the better implementation, you'd have to consider how the data are accessed (Is there a child that I know will be queried against very often? In this case a separate table would be good).