postgresql, csv, schema, postgresql-copy

How to generate a schema from a CSV for a PostgreSQL Copy


Given a CSV with several dozen or more columns, how can a 'schema' be created that can be used in a CREATE TABLE SQL expression in PostgreSQL for use with the COPY tool?

I see plenty of examples for the COPY tool, and basic CREATE TABLE expressions, but nothing that covers cases where the number of columns makes writing the schema by hand impractical.


Solution

  • If the CSV is not excessively large and is available on your local machine, then csvkit is the simplest solution. It also contains a number of other utilities for working with CSVs, so it is a useful tool to know in general.

    At its simplest, typing into the shell:

    $ csvsql myfile.csv
    

    will print out the required CREATE TABLE SQL command, which can be saved to a file using output redirection.
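    For example, given a hypothetical myfile.csv with id, name, and price columns, redirecting the output to a file might produce something like the following (the column names and inferred types here are purely illustrative):

    $ csvsql -i postgresql myfile.csv > create_myfile.sql
    $ cat create_myfile.sql
    CREATE TABLE myfile (
        id INTEGER NOT NULL,
        name VARCHAR(32) NOT NULL,
        price FLOAT
    );

    You can then create the table and bulk-load the CSV with psql and COPY, as the question intends (psql also accepts a plain postgresql:// URI as its connection argument):

    $ psql "$MY_DB_URI" -f create_myfile.sql
    $ psql "$MY_DB_URI" -c "\copy myfile FROM 'myfile.csv' WITH (FORMAT csv, HEADER)"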

    If you also provide a connection string, csvsql will create the table and upload the file in one go:

    $ csvsql --db "$MY_DB_URI" --insert myfile.csv
    
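    The connection string is a SQLAlchemy database URL. For PostgreSQL it typically takes the form shown below (the credentials, host, and database name are placeholders):

    $ csvsql --db "postgresql://user:password@localhost:5432/mydb" --insert myfile.csv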

    There are also options to specify the flavor of SQL and the CSV dialect you are working with. They are documented in the built-in help:

    $ csvsql -h
    usage: csvsql [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u {0,1,2,3}] [-b]
                  [-p ESCAPECHAR] [-z MAXFIELDSIZE] [-e ENCODING] [-S] [-H] [-v]
                  [--zero] [-y SNIFFLIMIT]
                  [-i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}]
                  [--db CONNECTION_STRING] [--query QUERY] [--insert]
                  [--tables TABLE_NAMES] [--no-constraints] [--no-create]
                  [--blanks] [--no-inference] [--db-schema DB_SCHEMA]
                  [FILE [FILE ...]]
    
    Generate SQL statements for one or more CSV files, or execute those
    statements directly on a database, and execute one or more SQL queries.

    positional arguments:
      FILE                  The CSV file(s) to operate on. If omitted, will accept
                            input on STDIN.
    
    optional arguments:
      -h, --help            show this help message and exit
      -d DELIMITER, --delimiter DELIMITER
                            Delimiting character of the input CSV file.
      -t, --tabs            Specifies that the input CSV file is delimited with
                            tabs. Overrides "-d".
      -q QUOTECHAR, --quotechar QUOTECHAR
                            Character used to quote strings in the input CSV file.
      -u {0,1,2,3}, --quoting {0,1,2,3}
                            Quoting style used in the input CSV file. 0 = Quote
                            Minimal, 1 = Quote All, 2 = Quote Non-numeric, 3 =
                            Quote None.
      -b, --doublequote     Whether or not double quotes are doubled in the input
                            CSV file.
      -p ESCAPECHAR, --escapechar ESCAPECHAR
                            Character used to escape the delimiter if --quoting 3
                            ("Quote None") is specified and to escape the
                            QUOTECHAR if --doublequote is not specified.
      -z MAXFIELDSIZE, --maxfieldsize MAXFIELDSIZE
                            Maximum length of a single field in the input CSV
                            file.
      -e ENCODING, --encoding ENCODING
                            Specify the encoding of the input CSV file.
      -S, --skipinitialspace
                            Ignore whitespace immediately following the delimiter.
      -H, --no-header-row   Specifies that the input CSV file has no header row.
                            Will create default headers.
      -v, --verbose         Print detailed tracebacks when errors occur.
      --zero                When interpreting or displaying column numbers, use
                            zero-based numbering instead of the default 1-based
                            numbering.
      -y SNIFFLIMIT, --snifflimit SNIFFLIMIT
                            Limit CSV dialect sniffing to the specified number of
                            bytes. Specify "0" to disable sniffing entirely.
      -i {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}, --dialect {access,sybase,sqlite,informix,firebird,mysql,oracle,maxdb,postgresql,mssql}
                            Dialect of SQL to generate. Only valid when --db is
                            not specified.
      --db CONNECTION_STRING
                            If present, a sqlalchemy connection string to use to
                            directly execute generated SQL on a database.
      --query QUERY         Execute one or more SQL queries delimited by ";" and
                            output the result of the last query as CSV.
      --insert              In addition to creating the table, also insert the
                            data into the table. Only valid when --db is
                            specified.
      --tables TABLE_NAMES  Specify one or more names for the tables to be
                            created. If omitted, the filename (minus extension) or
                            "stdin" will be used.
      --no-constraints      Generate a schema without length limits or null
                            checks. Useful when sampling big tables.
      --no-create           Skip creating a table. Only valid when --insert is
                            specified.
      --blanks              Do not coerce empty strings to NULL values.
      --no-inference        Disable type inference when parsing the input.
      --db-schema DB_SCHEMA
                            Optional name of database schema to create table(s)
                            in.
    

    Several other tools also do schema inference, including:

    • Apache Spark
    • Pandas (Python)
    • Blaze (Python)
    • read.csv + your favorite db package in R

    Each of these has functionality to read a CSV (and other formats) into a tabular data structure, usually called a DataFrame or similar, inferring the column types in the process. They then have other commands to either write out an equivalent SQL schema or upload the DataFrame directly into a specified database. The choice of tool will depend on the volume of data, how it is stored, the idiosyncrasies of your CSV, the target database, and the language you prefer to work in.
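    As a concrete sketch of the DataFrame route using pandas with SQLAlchemy (the file name, table name, and connection details below are placeholders, and pd.io.sql.get_schema is technically an internal pandas helper, though widely used for exactly this):

    import pandas as pd
    from sqlalchemy import create_engine

    # Read the CSV, letting pandas infer column types from the data
    df = pd.read_csv("myfile.csv")

    # Print a CREATE TABLE statement derived from the inferred dtypes
    # (get_schema is an internal helper, not formal public API)
    print(pd.io.sql.get_schema(df, "myfile"))

    # Or create the table and insert the rows in one step
    engine = create_engine("postgresql://user:password@localhost:5432/mydb")
    df.to_sql("myfile", engine, index=False, if_exists="fail")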