Imagine a table with the following structure on PostgreSQL 9.0:
create table raw_fact_table (text varchar(1000));
For the sake of simplification I only mention one text column, in reality it has a dozen. This table has 10 billion rows and each column has lots of duplicates. The table is created from a flat file (csv) using COPY FROM.
To increase performance I want to convert to the following star schema structure:
create table dimension_table (id int, text varchar(1000));
The fact table would then be replaced with a fact table like the following:
create table fact_table (dimension_table_id int);
My current method is to essentially run the following query to create the dimension table:
Create table dimension_table (id int, text varchar(1000), primary key(id));
then to create fill the dimension table I use:
insert into dimension_table (select null, text from raw_fact_table group by text);
Afterwards I need to run the following query:
select id into fact_table from dimension inner join raw_fact_table on (dimension.text = raw_fact_table.text);
Just imagine the horrible performance I get by comparing all strings to all other strings several times.
On MySQL I could run a stored procedure during the COPY FROM. This could create a hash of a string and all subsequent string comparison is done on the hash instead of the long raw string. This does not seem to be possible on PostgreSQL, what do I do then?
Sample data would be a CSV file containing something like this (I use quotes also around integers and doubles):
"lots and lots of text";"3";"1";"2.4";"lots of text";"blabla"
"sometext";"30";"10";"1.0";"lots of text";"blabla"
"somemoretext";"30";"10";"1.0";"lots of text";"fooooooo"
Just to questions: - it neccessary to convert your data in 1 or 2 steps? - May we modify the table while converting?
Running more simplier queries may improve your performance (and the server load while doing it)
One approach would be:
Or another (probably better):