Search code examples
hivespecial-charactershiveql

Hive columns-newline


Some of the columns in the hive has multiple lines of values which comes as newline

for example

table preview

Empid Empname Dept company year month day
1234 ASD Finance qqq null null null
2015 6 3

But when I query the table with year it gives the correct answer

select year from tbl_name where year='2015'

what could be the reason for these multiline values and how to align these values in a proper column?


Solution

  • Depending on how the table is stored it can be possible to fix or not using SQL.

    If table is based on text file (STORED AS TEXTFILE or using OpenCSVSerDe or JSON...) then rows being read by SerDe using newlines as delimiter and if column contain newline it is being split by newline on the lowest level.

    If the table storage is binary format like ORC, it is not stored as rows delimited by newline. Values with newlines being read without splitting rows but newlines causing split of rows on output, the same happens if storage format is JSON and it contains combinations slash + n (\n), such combinations are being interpreted as newlines on output. It is possible to replace newlines with spaces or empty string using regexp_replace:

    insert overwritre table tbl_name 
    select
    Empid, 
    Empname, 
    Dept, 
     regexp_replace(company, '\\n',' ')  company, --replace newline with space
    `year`, 
    `month`, 
    `day`
    from tbl_name ;
    

    Also if column contains TABs it is also better to replace with spaces or remove them because \t causing columns shift. Use regexp_replace(col_name, '\\t',' ')