Tags: postgresql, csv, foreign-data-wrapper

Can this particular INSERT be optimized?


I am importing a CSV into a PostgreSQL 9.5.7 database. The problem is that the CSV is partially malformed: some rows are missing commas and thus entire columns, some have too many, and some values are invalid.

So I could either clean up the CSV with an external tool before the import, or let the database itself do the filtering.

I prefer the second approach, since it seems less dependent on external CSV-cleaning scripts and since all the data validation takes place directly at the persistence level.

While it is usually impossible to deal with malformed rows during a CSV import, I nevertheless found a way around this issue by:

  1. Including the CSV in the database as a foreign table that has only a single text column, so each whole row, commas and all, ends up in that one column.

  2. Inserting from that foreign table into a clean target table by splitting the single text column at its commas.

But the import of a 200 MB CSV file with 33 million lines took around 6 hours on my test machine, so surely the INSERT statement can be optimized further? I am pretty new to Postgres, so this is entirely likely. Please point out where I have made decisions that could be improved.

Now, briefly explaining the domain to be modelled: it is about sensors whose locations are recorded via their signal strengths to various stations at particular points in time, and those timestamps are recorded with millisecond precision.

So all the commands issued to get this working are as follows.

Create the fdw server:

CREATE EXTENSION file_fdw;
CREATE SERVER csv_import_server FOREIGN DATA WRAPPER file_fdw;

Next, create the foreign CSV table, with only one text column containing all the data. A clean row would look like this:

'1465721143588,-83,55,1361'

where the first value is a unix timestamp with millisecond precision, then the RSSI signal strength, then the id of the station that picked up the signal, then the sensor's id.

CREATE FOREIGN TABLE signals_csv ( 
    value TEXT )
SERVER csv_import_server OPTIONS( 
filename '<path_to_file>/signals.csv', format 'text');
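
For example, a quick sanity check could be to query the foreign table directly and confirm that it exposes the raw lines:

-- Peek at a few raw lines through the FDW; the LIMIT is arbitrary
SELECT value FROM signals_csv LIMIT 5;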

Target table:

CREATE TABLE signals (
    timestamp TIMESTAMP NOT NULL,
    rssi INTEGER NOT NULL,
    stations_id INTEGER NOT NULL,
    distributed_tags_id INTEGER NOT NULL,
    PRIMARY KEY(timestamp, stations_id, distributed_tags_id),
    FOREIGN KEY(stations_id) REFERENCES stations(stations_id),
    FOREIGN KEY(distributed_tags_id) REFERENCES tags(id) );
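
The referenced tables stations and tags are not shown here; a minimal sketch of definitions compatible with the foreign keys above (the column types are assumptions) could look like this:

-- Assumed definitions of the referenced tables; only the keyed columns matter for the import
CREATE TABLE stations (
    stations_id INTEGER PRIMARY KEY );

CREATE TABLE tags (
    id INTEGER PRIMARY KEY,
    active_from TIMESTAMP,
    active_to TIMESTAMP );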

Now the INSERT:

INSERT INTO signals (timestamp, rssi, stations_id, distributed_tags_id)
SELECT
    TO_TIMESTAMP( tmp.timestamp::double precision / 1000),
    tmp.rssi::INTEGER,
    tmp.stations_id::INTEGER,
    tmp.distributed_tags_id::INTEGER
FROM ( SELECT
    SPLIT_PART ( value, ',', 1) AS timestamp,
    SPLIT_PART ( value, ',', 2) AS rssi,
    SPLIT_PART ( value, ',', 3) AS stations_id,
    SPLIT_PART ( value, ',', 4) AS distributed_tags_id
    FROM signals_csv ) AS tmp
WHERE tmp.timestamp ~ '^[0-9]+$' AND
      tmp.rssi ~ '^-[0-9]+$' AND
      tmp.stations_id ~ '^[0-9]+$' AND
      tmp.distributed_tags_id ~ '^[0-9]+$' AND
      EXISTS ( SELECT 1 FROM tags t WHERE t.id::TEXT = tmp.distributed_tags_id ) AND
      EXISTS ( SELECT 1 FROM stations s WHERE s.stations_id::TEXT = tmp.stations_id )
ON CONFLICT (timestamp, stations_id, distributed_tags_id) DO NOTHING;
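
To illustrate what the inner SELECT and the casts produce, here is the same conversion applied to the clean sample row from above as a standalone query (the exact timestamp rendering depends on the session time zone):

-- Split the sample row and convert the unix-millisecond timestamp,
-- just as the INSERT above does for every row
SELECT
    TO_TIMESTAMP( SPLIT_PART('1465721143588,-83,55,1361', ',', 1)::double precision / 1000 ) AS timestamp,
    SPLIT_PART('1465721143588,-83,55,1361', ',', 2)::INTEGER AS rssi,
    SPLIT_PART('1465721143588,-83,55,1361', ',', 3)::INTEGER AS stations_id,
    SPLIT_PART('1465721143588,-83,55,1361', ',', 4)::INTEGER AS distributed_tags_id;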

I guess that the main performance hits are (one way to check is sketched below):

  • the conversion of the unix timestamps to double precision and their subsequent division,
  • the regex checks on the split strings,
  • the foreign-key lookups.
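
To see how much time goes into the splitting and regex work as opposed to the insert itself and its index and foreign-key maintenance, one option (not from the original post) is to time the SELECT part on its own; note that EXPLAIN ANALYZE actually executes the query:

-- Measure only the parse-and-filter part; no rows are inserted
EXPLAIN ANALYZE
SELECT tmp.*
FROM ( SELECT
    SPLIT_PART ( value, ',', 1) AS timestamp,
    SPLIT_PART ( value, ',', 2) AS rssi,
    SPLIT_PART ( value, ',', 3) AS stations_id,
    SPLIT_PART ( value, ',', 4) AS distributed_tags_id
    FROM signals_csv ) AS tmp
WHERE tmp.timestamp ~ '^[0-9]+$' AND
      tmp.rssi ~ '^-[0-9]+$' AND
      tmp.stations_id ~ '^[0-9]+$' AND
      tmp.distributed_tags_id ~ '^[0-9]+$';

Comparing its runtime with and without the regex conditions, and then against the full INSERT, gives a rough idea of which of the three suspects dominates.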

As I see it, though, there is no way around those steps if I want to keep the data modelled consistently while also storing the millisecond precision in a human-readable way.

The imported data is clean and consistent, though, and in that respect I am satisfied with my approach; the only downside is its poor performance. So if anyone could give me pointers on how to improve this, I'd be very thankful.

Cheers!


Solution

  • I solved it in a different manner and was able to reduce the import time from 7 hours to just 1.

    So, instead of validating the data before the INSERT (in the WHERE clause of my initial post here), I let the INSERT operation itself validate the data (since I have defined the columns' types in the CREATE TABLE anyway).

    However, since the INSERT throws an exception when it encounters an unexpected data type, I do the INSERT row by row in a loop, so that an exception only aborts the current iteration and not the transaction as a whole.

    The working code looks like this:

    CREATE OR REPLACE FUNCTION import_tags_csv( path_to_csv TEXT ) RETURNS VOID AS $$
    
    DECLARE
    
        cursor SCROLL CURSOR FOR SELECT 
            SPLIT_PART ( value, ',', 1) AS id, 
            SPLIT_PART ( value, ',', 2) AS active_from,
            SPLIT_PART ( value, ',', 3) AS active_to
            FROM csv_table;
        i BIGINT := 0;
    
    BEGIN 
    
        -- create the whole foreign data wrapper for integrating the csv:
        CREATE EXTENSION file_fdw;
        CREATE SERVER csv_import_server FOREIGN DATA WRAPPER file_fdw;
        EXECUTE '
            CREATE FOREIGN TABLE csv_table ( value TEXT )
            SERVER csv_import_server OPTIONS( filename ''' || path_to_csv || ''', format ''text'')';
    
        -- Iterating through the rows, converting the text data and inserting it into table tags
        FOR csv_row IN cursor LOOP
        BEGIN
    
            i := i +1;
            INSERT INTO tags ( 
                id, 
                active_from, 
                active_to) 
                VALUES (
                csv_row.id::INTEGER,
                TO_TIMESTAMP( csv_row.active_from::double precision / 1000),
                TO_TIMESTAMP( csv_row.active_to::double precision / 1000) );
    
            --If invalid data is read, the table constraints throw an exception. The faulty line is dismissed
            EXCEPTION WHEN OTHERS THEN 
                RAISE NOTICE E'% \n\t line %: %\n', SQLERRM, i, csv_row;
    
        END;
        END LOOP;
    
        -- Dropping the foreign table which had the csv integrated
        DROP FOREIGN TABLE csv_table;
        DROP SERVER csv_import_server;
        DROP EXTENSION file_fdw;
    
    END;
    $$ LANGUAGE plpgsql;
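
    Calling the function then looks like this (the path is a placeholder); faulty lines only raise NOTICEs and are skipped, and the function creates and drops its own file_fdw objects:

    -- Import a tags CSV; malformed lines are reported and discarded
    SELECT import_tags_csv('/path/to/tags.csv');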