I am importing a CSV file into a PostgreSQL 9.5.7 database. The problem is that the CSV is partially malformed: some rows are missing commas and thus entire columns, some have too many, and some values are invalid.
So I can either clean up the CSV with an external tool before the import, or let the database itself do the filtering.
I prefer the second approach, since it does not depend on external CSV-cleaning scripts and since all the data validation takes place directly at the persistence level.
While a plain CSV import usually cannot deal with malformed rows, I found a way around this issue by:
1. Including the CSV in the database as a foreign table, but text only, with a single text column that holds each whole row, commas included.
2. Inserting from that foreign table into a clean target table, splitting the single text column at its commas.
But the import of a 200 MB CSV file with 33 million lines took around 6 hours on my test machine, so surely the INSERT statement can be optimized further? I am pretty new to Postgres, so this is entirely likely. Please correct me where I have made decisions that could be done better.
Now, briefly explaining the domain to be modelled: it is about sensors whose locations are tracked via their signal strengths to various stations at particular time intervals. Those intervals are recorded with millisecond precision.
So all the commands issued to get this working are as follows.
Create the fdw server:
CREATE EXTENSION file_fdw;
CREATE SERVER csv_import_server FOREIGN DATA WRAPPER file_fdw;
Next, create the foreign CSV table with only one text column containing all the data. A clean row looks like this:
'1465721143588,-83,55,1361'
where the first value is a Unix timestamp with millisecond precision, followed by the rssi signal strength value, the id of the station that picked up the signal, and the sensor's id.
CREATE FOREIGN TABLE signals_csv (
    value TEXT
)
SERVER csv_import_server OPTIONS (
    filename '<path_to_file>/signals.csv', format 'text'
);
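At this point the foreign table can be queried like any other table, which is handy as a quick sanity check that the file is being read (an illustrative query only, not part of the import itself):

-- Peek at the first few raw CSV lines through the foreign table
SELECT value FROM signals_csv LIMIT 5;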
Target table:
CREATE TABLE signals (
    timestamp           TIMESTAMP NOT NULL,
    rssi                INTEGER NOT NULL,
    stations_id         INTEGER NOT NULL,
    distributed_tags_id INTEGER NOT NULL,
    PRIMARY KEY (timestamp, stations_id, distributed_tags_id),
    FOREIGN KEY (stations_id) REFERENCES stations (stations_id),
    FOREIGN KEY (distributed_tags_id) REFERENCES tags (id)
);
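For reference, here are simplified sketches of the two referenced lookup tables, reduced to the columns that actually appear in this post (the real tables may well have more):

CREATE TABLE stations (
    stations_id INTEGER PRIMARY KEY
    -- further station attributes omitted
);

CREATE TABLE tags (
    id          INTEGER PRIMARY KEY,
    active_from TIMESTAMP,
    active_to   TIMESTAMP
    -- further tag attributes omitted
);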
Now the INSERT:
INSERT INTO signals (timestamp, rssi, stations_id, distributed_tags_id)
SELECT
    TO_TIMESTAMP( tmp.timestamp::double precision / 1000),
    tmp.rssi::INTEGER,
    tmp.stations_id::INTEGER,
    tmp.distributed_tags_id::INTEGER
FROM (
    SELECT
        SPLIT_PART ( value, ',', 1) AS timestamp,
        SPLIT_PART ( value, ',', 2) AS rssi,
        SPLIT_PART ( value, ',', 3) AS stations_id,
        SPLIT_PART ( value, ',', 4) AS distributed_tags_id
    FROM signals_csv
) AS tmp
WHERE
    tmp.timestamp ~ '^[0-9]+$' AND
    tmp.rssi ~ '^-[0-9]+$' AND
    tmp.stations_id ~ '^[0-9]+$' AND
    tmp.distributed_tags_id ~ '^[0-9]+$' AND
    EXISTS ( SELECT 1 FROM tags t WHERE t.id::TEXT = tmp.distributed_tags_id ) AND
    EXISTS ( SELECT 1 FROM stations s WHERE s.stations_id::TEXT = tmp.stations_id )
ON CONFLICT (timestamp, stations_id, distributed_tags_id) DO NOTHING;
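To illustrate the timestamp handling: dividing the millisecond Unix timestamp by 1000 and passing it to TO_TIMESTAMP keeps the fractional seconds, so the millisecond precision survives the conversion. For the sample row above:

-- 1465721143588 ms since the epoch, converted to a timestamp
SELECT TO_TIMESTAMP(1465721143588::double precision / 1000);
-- 2016-06-12 08:45:43.588+00 (shown here in UTC; display depends on the session time zone)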
I guess the bulk of the performance hit comes from the per-row validation in the WHERE clause (the regular expressions and the two EXISTS subqueries) together with the ON CONFLICT check against the primary key. As I see it though, there is no way around those restrictions if I want to keep the data modelled in a consistent manner while also storing the millisecond precision in a human-readable way.
The imported data is clean and consistent though, and in that respect I am satisfied with my approach; the only downside is its poor performance. So if anyone could give me pointers on how to improve this, I'd be very thankful.
Cheers!
I solved it in a different manner and was able to reduce the import time from 7 hours to only 1.
So, instead of validating the data before the INSERT (in the WHERE clause of my initial post above), I let the INSERT operation itself validate the data, since the columns' types are defined in the CREATE TABLE anyway.
Since the INSERT throws an exception when it encounters an unexpected data type, I do the INSERT row by row in a loop and trap the exception per row; a block with an EXCEPTION clause runs in its own subtransaction, so a faulty line only aborts the current iteration and not the transaction as a whole.
The working code looks like this:
CREATE OR REPLACE FUNCTION import_tags_csv( path_to_csv TEXT ) RETURNS VOID AS $$
DECLARE
    cursor SCROLL CURSOR FOR
        SELECT
            SPLIT_PART ( value, ',', 1) AS id,
            SPLIT_PART ( value, ',', 2) AS active_from,
            SPLIT_PART ( value, ',', 3) AS active_to
        FROM csv_table;
    i BIGINT := 0;
BEGIN
    -- Create the whole foreign data wrapper for integrating the CSV:
    CREATE EXTENSION file_fdw;
    CREATE SERVER csv_import_server FOREIGN DATA WRAPPER file_fdw;
    -- The file name is injected safely as a quoted literal via %L:
    EXECUTE format('
        CREATE FOREIGN TABLE csv_table ( value TEXT )
        SERVER csv_import_server OPTIONS( filename %L, format ''text'')', path_to_csv);

    -- Iterate through the rows, converting the text data and inserting it into the table tags
    FOR csv_row IN cursor LOOP
        BEGIN
            i := i + 1;
            INSERT INTO tags (
                id,
                active_from,
                active_to)
            VALUES (
                csv_row.id::INTEGER,
                TO_TIMESTAMP( csv_row.active_from::double precision / 1000),
                TO_TIMESTAMP( csv_row.active_to::double precision / 1000) );
        -- If invalid data is read, the casts or table constraints throw an exception;
        -- the faulty line is logged and dismissed, and only this iteration is rolled back.
        EXCEPTION WHEN OTHERS THEN
            RAISE NOTICE E'% \n\t line %: %\n', SQLERRM, i, csv_row;
        END;
    END LOOP;

    -- Drop the foreign table which had the CSV integrated
    DROP FOREIGN TABLE csv_table;
    DROP SERVER csv_import_server;
    DROP EXTENSION file_fdw;
END;
$$ LANGUAGE plpgsql;
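For completeness, the import is then kicked off by calling the function with the path to the CSV file (the path below is just a placeholder):

SELECT import_tags_csv('<path_to_file>/tags.csv');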