database, star-schema

How to efficiently utilize 10+ computers to import data


We have flat files (CSV) with more than 200,000,000 rows, which we import into a star schema with 23 dimension tables. The biggest dimension table has 3 million rows. At the moment we run the import process on a single computer and it takes around 15 hours. As this is too long, we want to utilize something like 40 computers to do the importing.

My question

How can we efficiently utilize the 40 computers to do the importing? The main worry is that a lot of time will be spent replicating the dimension tables across all the nodes, as they need to be identical on all nodes. This could mean that if we utilized 1000 servers to do the importing in the future, it might actually be slower than using a single one, due to the extensive network communication and coordination between the servers.

Does anyone have a suggestion?

EDIT:

The following is a simplification of the CSV files:

"avalue";"anothervalue"
"bvalue";"evenanothervalue"
"avalue";"evenanothervalue"
"avalue";"evenanothervalue" 
"bvalue";"evenanothervalue"
"avalue";"anothervalue"

After importing, the tables look like this:

dimension_table1

id  name
1   "avalue"
2   "bvalue"

dimension_table2

id  name
1   "anothervalue"
2   "evenanothervalue"

Fact table

  dimension_table1_ID    dimension_table2_ID
  1                      1
  2                      2
  1                      2
  1                      2
  2                      2
  1                      1
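
For illustration, here is a minimal Python sketch (not part of the original post) of the mapping shown above; the file path and the exact ';'-separated, quoted format are assumptions based on the simplified sample:

    import csv

    # Build one value -> ID map per dimension and the fact rows that reference them.
    # Assumes the simplified two-column, ';'-separated format shown above.
    def import_sample(path):
        dim1, dim2 = {}, {}      # dimension value -> surrogate key
        facts = []               # (dimension_table1_ID, dimension_table2_ID) pairs
        with open(path, newline="") as f:
            for col1, col2 in csv.reader(f, delimiter=";"):
                id1 = dim1.setdefault(col1.strip(), len(dim1) + 1)
                id2 = dim2.setdefault(col2.strip(), len(dim2) + 1)
                facts.append((id1, id2))
        return dim1, dim2, facts

    # For the sample above this reproduces dimension_table1, dimension_table2
    # and the six fact rows shown in the tables.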

Solution

  • Loading CSV data into a database is slow because it needs to read, split and validate the data.

    So what you should try is this:

    1. Set up a local database on each computer. This will get rid of the network latency.

    2. Load a different part of the data on each computer. Try to give each computer chunks of the same size (see the sketch after this list). If that isn't easy for some reason, give each computer, say, 10,000 rows at a time; when it is done, give it the next chunk.

    3. Dump the data with the DB tools

    4. Load all dumps into a single DB
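
    The chunking in step 2 can be as simple as the sketch below (Python, which is not part of the original answer; the chunk size and output file naming are assumptions):

       import itertools

       # Split the huge CSV into fixed-size chunk files that the worker machines can pull.
       # Chunk size and output file naming are illustrative assumptions.
       def split_csv(path, rows_per_chunk=10_000):
           with open(path) as src:
               for n in itertools.count():
                   chunk = list(itertools.islice(src, rows_per_chunk))
                   if not chunk:
                       break
                   with open(f"{path}.chunk{n:05d}", "w") as out:
                       out.writelines(chunk)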

    Make sure that your loader tool can import data into a table which already contains data. If it can't, check your DB documentation for "remote tables". A lot of databases allow you to make a table from another DB server visible locally.

    That allows you to run commands like:

       insert into TABLE (....) select .... from REMOTE_SERVER.TABLE

    If you need primary keys (and you should), you will also have the problem of assigning PKs during the import into the local DBs. I suggest adding the PKs to the CSV file.
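
    One possible way to pre-assign the PKs in the CSV so that the 40 local imports never collide is to interleave the key space by machine number; this is only a sketch under that assumption, not something from the original answer:

       # Prepend a globally unique primary key to every row of one machine's chunk.
       # Interleaving the key space by machine number avoids collisions between
       # machines without any coordination.
       def add_primary_keys(in_path, out_path, machine_no, total_machines=40):
           with open(in_path) as src, open(out_path, "w") as dst:
               for row_no, line in enumerate(src):
                   pk = machine_no + row_no * total_machines   # machine 0: 0, 40, 80, ...
                   dst.write(f"{pk};{line}")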

    [EDIT] After reading your edit, here is what you should try:

    1. Write a small program which extracts the unique values from the first and second columns of the CSV file. That could be a simple script like:

       cut -d";" -f1 | sort -u | nawk ' { print FNR";"$0 }'
      

      This is a pretty cheap process (a couple of minutes even for huge files). It gives you ID-value files.

    2. Write a program which reads the new ID-value files, caches them in memory, and then reads the huge CSV files and replaces the values with the IDs (a sketch follows at the end of this answer).

      If the ID-value files are too big to fit in memory, just do this step for the small files and load the huge ones into all 40 per-machine DBs.

    3. Split the huge file into 40 chunks and load one chunk on each machine.

      If you had huge ID-value files, you can use the tables created on each machine to replace the remaining values.

    4. Use backup/restore or remote tables to merge the results.

      Or, even better, keep the data on the 40 machines and use algorithms from parallel computing to split the work and merge the results. That's how Google can create search results from billions of web pages in a few milliseconds.

    See here for an introduction.
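
    A minimal sketch of step 2, assuming the ID-value files produced by the cut/sort/nawk step fit in memory and keep its "ID;value" output format:

       import csv

       def load_id_map(path):
           """Read an "ID;value" file into a value -> ID dictionary."""
           with open(path, newline="") as f:
               return {value: id_ for id_, value in csv.reader(f, delimiter=";")}

       def rewrite_facts(csv_path, out_path, id_maps):
           """Stream the huge CSV and replace each column's value with its dimension ID."""
           with open(csv_path, newline="") as src, open(out_path, "w", newline="") as dst:
               writer = csv.writer(dst, delimiter=";")
               for row in csv.reader(src, delimiter=";"):
                   writer.writerow(id_maps[i][value] for i, value in enumerate(row))

    Here id_maps would be one dictionary per CSV column, for example [load_id_map("column1.ids"), load_id_map("column2.ids")]; the file names are hypothetical.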