Search code examples
sqldatabase-designconceptual

ER Diagram for Government Statistics


I am new to database design and I am trying to practice with available government statistics for a small country. I have found almost 100 tables that store information collected for given years and months from a specific region. Some tables are updated monthly, while others are updated annually. I believe this means that in each table, there will be a natural composite PK made up of the year and month, or simply a PK made up of the year.

ER Diagram ER Diagram

In the above image, each parent attribute of Trip Survey represents one of the many data tables I have collected from public databanks specific to the region being researched (e.g. satisfaction_level, motivation_level, amount_spent all represent different surveys on the same population). Does it make sense to combine all of the tables into one table (e.g. Trip Survey)?

I'm not sure if my relationships are accurate (total and partial participation). My goal is to be able to queries the data to find points of correlation and make predictions about the future. I want to try and connect all of the tables over time.

The surveys collected can cover nearly any topic, but the common thread is they represent a moment in time, either monthly or annually. I want to eventually add a table of significant political events that may reflect outliers from trends.

Example Result: When motivation levels were low in 2018, spending was also down and length of stay was shorter relative to 'n' period.

As a newbie, any and all help is greatly appreciated.

Thank you


Solution

  • Simplify simplify simplify.

    Start with one table, with at least some columns you comprehend. Load it into some dbms (pick one with geospatial capabilities and windowing functions, you may want them later: recent versions of MariaDB, MySQL and PostreSQL are fine choices). Import your table. This can be a pain in the axx neck to get right, but do your best to get it right anyhow.

    Don't worry about primary keys or unique indexes when starting out. You're just exploring the data, not building it. Don't worry about buying or renting a server: most laptops can handle this kind of exploration just fine.

    Pick a client program that keeps a history of the queries you put into it. HeidiSQL is a good choice. The relatively new Datagrip from Jetbrains is worth a look. Avoid Microsoft's SQL Server Management Studio: no history feature. (You'll often want to go back to something you tried a few hours or days ago when you're exploring, so the query history feature is vital.)

    Then fiddle around with queries, especially aggregates ... e.g.

     SELECT COUNT(*), year, origin, destination
       FROM trip
      GROUP BY year, origin, destination;
    

    Look for interesting stuff you can glean from the one table. Get the hang of it. Then add another table that can be JOINed easily to the first table. Repeat your exploration.

    That should get you started. Once you begin to understand your dataset, you can start ranking stuff, working out quintiles, and all that.

    And, when you have to update or augment your data without reloading it, you'll need various primary / unique keys. That's in the future for you.