Search code examples
sqlpostgresqlmatrixfile-iopostgresql-copy

Load matrix from file into PostgreSQL table


I have a file universities.txt which looks like this:

Alabama

Air University
Alabama A&M University
Alabama State University
Concordia College-Selma
Faulkner University
Huntingdon College
Jacksonville State University
Judson College
Miles College
Oakwood College
Samford University
Southeastern Bible College
Southern Christian University
Spring Hill College
Stillman College
Talladega College
University of North Alabama
University of South Alabama
University of West Alabama

Alaska

Alaska Bible College
Alaska Pacific University
Sheldon Jackson College
University of Alaska - Anchorage
University of Alaska - Fairbanks
University of Alaska - Southeast

Arizona

American Indian College of the Assemblies of God
Arizona State University
Arizona State University East
Arizona State University West
DeVry University-Phoenix
Embry-Riddle Aeronautical University
Grand Canyon University
Northcentral University
Northern Arizona University

.. and so on, where in this case Alabama, Alaska and Arizona are locations and everything else are universities. What I want to do is load the location into a table called Location and the Universities into a table called University, where the Id of the Location table is a FK to the University table, like this:

CREATE TABLE Location (
Id          SERIAL PRIMARY KEY,
Name        TEXT
);

CREATE TABLE University (
Id          SERIAL PRIMARY KEY,
Location    INTEGER REFERENCES Location (Id) NOT NULL,
Name        TEXT
);

So what I want to do in Postgres is something like this:

for (int i=0 until i = universities.size()  i++){
//each entry in the universities vector is a tuple with the first entry being the country/state
//and the second entry being a vector of the universities as String's
Vector tuple = (Vector)universities.get(i);
//insert into location table
String state = (String)tuple.get(0); 
Vector u = (Vector)tuple.get(1);
for(int j=0; until j =u.size(); j++){
//insert into university table with i as FK to location table

Anyone knows how to do this?


Solution

  • Here is a pure SQL solution.

    Use COPY to import your file into a temporary table and one DML statement with data modifying CTEs (requires PostgreSQL 9.1 or later) to do the rest. Should be fast for both steps:

    Temporary table with a single text column ( dropped automatically at end of session):

    CREATE TEMP TABLE tmp (txt text);
    

    Import data from file:

    COPY tmp FROM '/path/to/file.txt'
    

    If you are doing this from a remote client, use the meta command \copy of psql instead.

    My solution depends on the data format displayed in the question. I.e.: there is an empty row before and after a city. I assume actual empty strings in the import file. Make sure to have a leading row with an empty string before the first city to avoid a special case.

    Rows will be inserted in order. I use that for the following window functions without ordering.

    WITH x AS (
        SELECT txt
              ,row_number() OVER () AS rn
              ,lead(txt) OVER () = '' AND
                lag(txt) OVER () = '' AS city
        FROM   tmp                -- don't remove empty rows just yet
        ), y AS (
        SELECT txt, city
              ,sum(city::int) OVER w AS id
        FROM   x
        WHERE  txt <> ''          -- remove empty rows now
        WINDOW w AS (ORDER BY rn)
        ), l AS (
        INSERT INTO location (id, name)
        SELECT id, txt
        FROM   y
        WHERE  city
        ), u AS (
        INSERT INTO university u (location, name)
        SELECT id, txt
        FROM   y
        WHERE  NOT city
        )
    SELECT setval('location_id_seq', max(id))
    FROM   y;
    

    Voilá.

    • CTE x marks cities based on an empty string value in the rows before and after them.

    • CTE y adds a running sum of cities (id), thereby forming a perfectly valid id for each city and its unis.

    • CTEs l and u do the inserting, which is now easy.

    • The final SELECT sets the next value for the sequence attached to location.id. We have not been using it, so we must set it to the current maximum value or we run into duplicate key errors with future INSERTs to location.