Search code examples
databasecsvrelational

How do you rebild a relational database from .csv files?


So I've got a little over 1.5TB of data, which I have been told came from sort sort of relational database. Unfortunately, somewhere along the line, everything that makes a relational database relational was stripped out and what we're left with is almost fifty .csv files, each one corresponding to one of the tables in the database. There's no schema, no notes, and only one piece of helpful documentation. Compounding the problem is that we don't have access to the people who maintain the database and can't go and ask them what X_ATTRIB_14 in CX_CUST_O means.

We DO have a file that lists, for each attribute, name and datatype, and this covers all tables, so we at least know what the attributes are. It looks something like

TABLE_NAME

ROW_ID.....................VARCHAR2(32)

CREATED....................VARCHAR2(16)

LAST_MODIFIED.........DATE

etc etc etc

The data is very patchy, with many attributes without data listed in any of the fields, and lots of unhelpful names like ATTRIB_3. Each table does have a ROW_ID, CREATED, and CREATED_BY, as well as a CONFLICT_ID. Then, there are the data fields, which are usually patchy and only partially filled as well, with lots of empty values.

So far I've done some basic pre-processing by eliminating attributes which have no values and checking to see if there are simple naming conventions that would give us a hint - for example, if X_ATTRIB_3 in TABLE_A is the same as X_ATTRIB_3 in TABLE_B, and it appears that the only common attributes are those that are trivial - LONGITUDE, for example, is common across a few tables, but that is unlikely to help. I'm unsure as how to proceed; the sheer size of the data makes it impossible to look at everything by hand.

Are there any tools or techniques that have been developed that would aid in reconstructing the relationships between the tables, or is this sufficiently rare that we're going to have to start from scratch?

Thank you for your time.


Solution

  • It is very difficult to rebuild relationships for complex data structures without intimate knowledge about the nature of the data itself. The relationships would have to be rebuilt by hand and in order to do it correctly the data should NOT be modified. In most database schema designs, the designer would create relationships over record IDs and easily indexable data which is typically an arbitrary number. The first thing to do would be to add the content to the database in raw form, AS IS! Then based on your heuristic understanding of data, create relationships that make sense based on queries you intend to conduct on the data. You probably need professional assistance to do this :-) — You are actually better off NOT using any automated build tools to process your raw data; Loss of information can be very subtle when you do datadumps to a format like CSV and then try to rebuild.