I want to extract information from a text file (almost 1GB) and store it in PostgreSQL database. Text file is in following format:
DEBUG, 2017-03-23T10:02:27+00:00, ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
DEBUG, 2017-03-24T12:06:23+00:00, ghtorrent-49 -- ghtorrent.rb:Repo Shikanime/print exists
...
and I want to extract 'DEBUG', timestamp, 'ghtorrent-40', 'ghtorrent' and "Repo EFForg/https-everywhere exists" from each line and store it in database.
I have done it in using other languages like python (psycopg2) and C++ (libpqxx) but is it possible to write a function in PostgreSQL itself to import the whole data itself.
I am currenly using pgAdmin4 tool for the PostgreSQL. I thinking of using something like pg_read_file in function to read the file but one line at a time and insert it into the table.
An approach I use with my large XML files - 130GB or bigger - is to upload the whole file into a temporary unlogged table and from there I extract the content I want. Unlogged tables
are not crash-safe, but are much faster than logged ones, which totally suits the purpose of a temporary table ;-)
Considering the following table ..
CREATE UNLOGGED TABLE tmp (raw TEXT);
.. you can import this 1GB file using a single psql
line from your console (unix)..
$ cat 1gb_file.txt | psql -d db -c "COPY tmp FROM STDIN"
After that all you need is to apply your logic to query and extract the information you want. Depending on the size of your table, you can create a second table from a SELECT
, e.g.:
CREATE TABLE t AS
SELECT
trim((string_to_array(raw,','))[1]) AS operation,
trim((string_to_array(raw,','))[2])::timestamp AS tmst,
trim((string_to_array(raw,','))[3]) AS txt
FROM tmp
WHERE raw LIKE '%DEBUG%' AND
raw LIKE '%ghtorrent-40%' AND
raw LIKE '%Repo EFForg/https-everywhere exists%'
Adjust the string_to_array
function and the WHERE
clause to your logic! Optionally you can replace these multiple LIKE
operations to a single SIMILAR TO
.
.. and your data would be ready to be played with:
SELECT * FROM t;
operation | tmst | txt
-----------+---------------------+------------------------------------------------------------------
DEBUG | 2017-03-23 10:02:27 | ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
(1 Zeile)
Once your data is extracted you can DROP TABLE tmp;
to free some disk space ;)
Further reading: COPY
, PostgreSQL array functions
and pattern matching