Tags: postgresql, normalization, etl, star-schema, fact-table

PostgreSQL: Loading data into Star Schema efficiently


Imagine a table with the following structure on PostgreSQL 9.0:

create table raw_fact_table (text varchar(1000));

For the sake of simplicity I mention only one text column; in reality there are a dozen. This table has 10 billion rows, and each column has lots of duplicates. The table is created from a flat file (CSV) using COPY FROM.

To increase performance I want to convert to the following star schema structure:

create table dimension_table (id int, text varchar(1000));

The raw fact table would then be replaced with a fact table like the following:

create table fact_table (dimension_table_id int);

My current method is essentially to run the following query to create the dimension table:

create table dimension_table (id serial, text varchar(1000), primary key(id));

Then, to fill the dimension table, I use:

insert into dimension_table (text) select text from raw_fact_table group by text;

Afterwards I need to run the following query:

select dimension_table.id as dimension_table_id into fact_table from dimension_table inner join raw_fact_table on (dimension_table.text = raw_fact_table.text);

Just imagine the horrible performance I get by comparing all strings to all other strings several times.

On MySQL I could run a stored procedure during the COPY FROM that creates a hash of each string, so that all subsequent string comparisons are done on the hash instead of on the long raw string. This does not seem to be possible on PostgreSQL; what do I do then?
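
To make the idea concrete, the hash-first comparison described above would look roughly like this in SQL; md5() is a built-in PostgreSQL function, but the expression index and the literal value below are illustrative assumptions, not part of the schema above:

-- Sketch: compare short md5 digests before falling back to the raw strings.
create index dimension_table_text_md5_idx on dimension_table (md5(text));

select id
from dimension_table
where md5(text) = md5('lots and lots of text')
  and text = 'lots and lots of text';  -- recheck to guard against collisions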

Sample data would be a CSV file containing something like this (I use quotes around integers and doubles as well):

"lots and lots of text";"3";"1";"2.4";"lots of text";"blabla"
"sometext";"30";"10";"1.0";"lots of text";"blabla"
"somemoretext";"30";"10";"1.0";"lots of text";"fooooooo"

Solution

  • Just two questions: is it necessary to convert your data in one or two steps? And may we modify the table while converting?

    Running more, simpler queries may improve your performance (and reduce the server load while doing it).

    One approach would be:

    1. generate dimension_table (if I understand it correctly, you don't have performance problems with this), maybe with an additional temporary boolean field...
    2. repeat: choose one previously unprocessed entry from dimension_table, select every row from raw_fact_table containing it, and insert them into fact_table. Mark the dimension_table record as done, then move on to the next. You can write this as a stored procedure that converts your data in the background, using minimal resources; see the sketch after this list.
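
    A sketch of such a stored procedure, assuming the temporary boolean field from step 1 is named done (an assumed name, defaulting to false); each call converts one dimension entry:

    -- Sketch: convert one dimension entry per call; returns false when done.
    create or replace function convert_one_dimension_entry() returns boolean as $$
    declare
      dim record;
    begin
      select id, text into dim
        from dimension_table
       where not done
       limit 1;
      if not found then
        return false;  -- every entry has been processed
      end if;
      insert into fact_table (dimension_table_id)
        select dim.id from raw_fact_table where text = dim.text;
      update dimension_table set done = true where id = dim.id;
      return true;
    end;
    $$ language plpgsql;

    A driver loop or a scheduled job can call this repeatedly until it returns false, so the conversion runs in the background in small pieces.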

    Or another approach (probably better):

    1. create fact_table with EVERY record from raw_fact_table AND a dimension_id column (so it temporarily holds both the dimension text and the dimension_id)
    2. create dimension_table
    3. create an after insert trigger for fact_table which (see the sketch after this list):
      • searches for dimension_text in dimension_table
      • if not found, creates a new record in dimension_table
      • sets the row's dimension_id to this id
    4. in a simple loop, insert every record from raw_fact_table into fact_table
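
    A sketch of the trigger in step 3. One caveat: an AFTER INSERT trigger cannot change the row being inserted, so the idiomatic PostgreSQL form is a BEFORE INSERT trigger that fills in dimension_id and returns the row. The column name dimension_text is an assumption, and dimension_table.id is assumed to be serial:

    -- Sketch: look up (or create) the dimension row, then set dimension_id.
    create or replace function fill_dimension_id() returns trigger as $$
    declare
      dim_id int;
    begin
      select id into dim_id
        from dimension_table
       where text = new.dimension_text;
      if not found then
        -- First time we see this text: create the dimension record.
        insert into dimension_table (text)
          values (new.dimension_text)
          returning id into dim_id;
      end if;
      new.dimension_id := dim_id;
      return new;
    end;
    $$ language plpgsql;

    create trigger fact_table_fill_dimension
      before insert on fact_table
      for each row execute procedure fill_dimension_id();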