Search code examples
sqlpostgresqlpgadmin-4

Is there a way to load text data to database in PostgreSQL?


I want to extract information from a text file (almost 1GB) and store it in PostgreSQL database. Text file is in following format:

DEBUG, 2017-03-23T10:02:27+00:00, ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
DEBUG, 2017-03-24T12:06:23+00:00, ghtorrent-49 -- ghtorrent.rb:Repo Shikanime/print exists
...

and I want to extract 'DEBUG', timestamp, 'ghtorrent-40', 'ghtorrent' and "Repo EFForg/https-everywhere exists" from each line and store it in database.

I have done it in using other languages like python (psycopg2) and C++ (libpqxx) but is it possible to write a function in PostgreSQL itself to import the whole data itself.

I am currenly using pgAdmin4 tool for the PostgreSQL. I thinking of using something like pg_read_file in function to read the file but one line at a time and insert it into the table.


Solution

  • An approach I use with my large XML files - 130GB or bigger - is to upload the whole file into a temporary unlogged table and from there I extract the content I want. Unlogged tables are not crash-safe, but are much faster than logged ones, which totally suits the purpose of a temporary table ;-)

    Considering the following table ..

    CREATE UNLOGGED TABLE tmp (raw TEXT);
    

    .. you can import this 1GB file using a single psql line from your console (unix)..

    $ cat 1gb_file.txt | psql -d db -c "COPY tmp FROM STDIN" 
    

    After that all you need is to apply your logic to query and extract the information you want. Depending on the size of your table, you can create a second table from a SELECT, e.g.:

    CREATE TABLE t AS
    SELECT 
      trim((string_to_array(raw,','))[1]) AS operation,
      trim((string_to_array(raw,','))[2])::timestamp AS tmst,
      trim((string_to_array(raw,','))[3]) AS txt
    FROM tmp
    WHERE raw LIKE '%DEBUG%' AND
          raw LIKE '%ghtorrent-40%' AND 
          raw LIKE '%Repo EFForg/https-everywhere exists%'
    

    Adjust the string_to_array function and the WHERE clause to your logic! Optionally you can replace these multiple LIKE operations to a single SIMILAR TO.

    .. and your data would be ready to be played with:

    SELECT * FROM t;
    
     operation |        tmst         |                               txt                                
    -----------+---------------------+------------------------------------------------------------------
     DEBUG    | 2017-03-23 10:02:27 | ghtorrent-40 -- ghtorrent.rb:Repo EFForg/https-everywhere exists
    (1 Zeile)
    

    Once your data is extracted you can DROP TABLE tmp; to free some disk space ;)

    Further reading: COPY, PostgreSQL array functions and pattern matching