Search code examples
csvhiveclouderabeeline

Beeline splits data row on csv export


My csv output file has few rows that are split into two cells because data in Hive table has string entries with ; symbol which causes the split.


PROPER ROW
'xxx-xxxxxx','value','NULL','value','v','value','value','1.0','value','0.0','value',value,'value','value'

ODD ROW
'xxx-xxxxx','value','NULL','value','v','value','value','value','value','value','value','value','VERY LONG NAME; 

SECOND CELL OF ODD ROW
 ;VERY LONG NAME CONTINUED','VERY LONG NAME '

I tried to change output format and field size but not success. Row still gets split into two cells.

beeline -u jdbc:hive2://host:10000/default -n user -p xxxxx --silent=true --outputformat=csv -f sql_code.sql > output.csv

I want my rows not being split into multiple cells. This can be done by escaping ; symbol

'xxx-xxxxxx','value','NULL','value','v','value','value','1.0','value','0.0','value',value,'value','value'

'xxx-xxxxx','value','NULL','value','v','value','value','value','value','value','value','value','VERY LONG NAME VERY LONG NAME CONTINUED','value'


Solution

  • You can use the option "--outputformat=dsv".By default, the delimiter is a pipe (|). You can change it using the option --delimiterForDSV=','.

    The next example works:

    create table temp.test_name           (                                                                                                                                                                                     
    first_name  string
    ,last_name  string
    ,zipcode   string
    )
    partitioned by (part string)
    stored as parquet location '../temp.db/test_name' tblproperties("parquet.compression=SNAPPY")
    ;
    
    insert into temp.test_name partition(part='A') values
    ('David','David','00')
    ,('Ellen', 'Ellen','00')
    ,('David','David','00')
    ,('David', 'VERY LONG NAME; VERY LONG NAME CONTINUED','00');
    
    

    Beeline command that returns 4 rows, the same as the contained in the table:

    beeline --color=true -u "jdbc..." --outputformat=dsv --delimiterForDSV=',' -f sql_code.hql > output.csv
    

    For more information, check next documentation: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-Separated-ValueOutputFormats