Search code examples
vertica

How Vertica handles semi-structured data even if loaded from different file formats


My understanding about semi-structured data handling in Vertica is that if data is say like this (in json)

{
"f1":1,
"f2":"hello",
"f3":false,
"f4":2
}

then a flextable is created with two columns __identity__ and __raw__. __identify__ will have 4 fields (I suppose integers 1,2,3,4) and __raw__ will be raw representation of data (1, hello,false and 2).

I can also load data in a csv file in the same flextable eg 2, hello2, true, 3. How does Vertica decide which field maps to which column (eg. both f1 and f4) are int.


Solution

  • Well, nothing beats having a Vertica SQL prompt ready (and the privilege to create a database object ...) to try and find out.

    With JSON, the field names are in the structure: key-value pairs.

    With CSV, the first line of the data file needs to have the column names - which I add below ...

    -- connecting with VSQL, 
    $ vsql -h localhost -d sbx -U dbadmin -w pwd
    $ vsql -h localhost -d sbx -U dbadmin -w pwd
    Welcome to vsql, the Vertica Analytic Database interactive terminal.
    
    Type:  \h or \? for help with vsql commands
           \g or terminate with semicolon to execute query
           \q to quit
    
    sbx=> -- create the flex table
    sbx=> CREATE FLEX TABLE flx();
    CREATE TABLE
    sbx=> -- load the flex table from stdin - data handed in-line - using your input
    sbx=> COPY flx FROM stdin PARSER fjsonparser();
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> {                                        
    >> "f1":1,
    >> "f2":"hello",
    >> "f3":false,
    >> "f4":2
    >> }
    >> \.
    -- test the load ...
    sbx=> SELECT f1,f2,f3,f4 FROM flx;
     f1 |  f2   |  f3   | f4 
    ----+-------+-------+----
     1  | hello | false | 2
    sbx=>-- load the CSV file - note that we need the title line, 
    sbx=>-- which I add, to have same values in the same fields
    sbx=> COPY flx FROM stdin PARSER fcsvparser();
    Enter data to be copied followed by a newline.
    End with a backslash and a period on a line by itself.
    >> f1,f2,f3,f4
    >> 2, hello2, true, 3
    >> \.
    sbx=>-- check the contents now
    sbx=> SELECT f1,f2,f3,f4 FROM flx;
     f1 |   f2   |  f3   | f4 
    ----+--------+-------+----
     1  | hello  | false | 2
     2  | hello2 | true  | 3
    sbx=>-- resulting table definition in catalog ...
    sbx=> \d flx
                                                 List of Fields by Tables
     Schema  | Table |    Column    |          Type          |  Size  | Default | Not Null | Primary Key | Foreign Key 
    ---------+-------+--------------+------------------------+--------+---------+----------+-------------+-------------
     dbadmin | flx   | __identity__ | int                    |      8 |         | t        | f           | 
     dbadmin | flx   | __raw__      | long varbinary(130000) | 130000 |         | t        | f           | 
    (2 rows)
    
    sbx=> -- check the contents of __identity__ and (after visualising) __raw__
    sbx=> SELECT __identity__,REPLACE(MAPTOSTRING(__raw__),CHR(10),' ') FROM flx;
     __identity__ |                                REPLACE                                 
    --------------+------------------------------------------------------------------------
                1 | {     "f1": "1",     "f2": "hello",     "f3": "false",     "f4": "2" }
                2 | {     "f1": "2",     "f2": "hello2",     "f3": "true",     "f4": "3" }