Search code examples
hadoophivehiveqlhive-queryregexserde

Insert data in hive using multidelimeter


how to insert data in hive using multidelimeter and between the column the delimiter is not specified.

Below is my data :

25380 20130101  2.514 -135.69   58.43     8.3     1.1     4.7     4.9     5.6     0.01 C     1.0    -0.1     0.4    97.3    36.0    69.4 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0
25380 20130102  2.514 -135.69   58.43     3.0    -0.3     1.4     1.2     0.0     0.35 C     1.3    -1.0    -0.1   100.0    89.5    98.2 -99.000 -99.000 -99.000 -99.000 -99.000 -9999.0 -9999.0 -9999.0 -9999.0 -9999.0

I want to insert only first 7 columns in my hive table and I have a txt file for above data.

create table script :

CREATE TABLE hotcold (a int,b int,c float,d float,e float,f float,g float,h string,i string,j string,k string,l string,m string,n string,o string,p string,q string,
r string,s string,t string,u string,v string,w string,x string,y string,z string,aa string,bb string,cc string,dd string,ee string
) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ('input.regex'='\\s+'); 

My data insert script is below :

LOAD DATA LOCAL INPATH '/home/cloudera/WeatherData.txt' into table hotcold;

Below is my select statement and error:

select * from hotcold;

Error :

Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns

Solution

  • Each column should have corresponding capturing group () in the regexp.

    In the example below first group in the beginning of the string ^(\\d+)-corresponds positive integer,

    \\s+ - one or more whitespace delimiters,

    second capturing group corresponds positive integer (\\d+),

    again \\s+ - one or more whitespace delimiters,

    third capturing group - ([+-]?[0-9.]+) - a float, not very strict format, allowing +- sign and any combination of didits and dots

    and .* at the end allowing any characters at the end of the string, not captured, add all other columns to the regex, my example contains regex for three columns (three capturing groups):

    WITH SERDEPROPERTIES ('input.regex'='^(\\d+)\\s+(\\d+)\\s+([+-]?[0-9.]+).*') 
    

    Read this for more details: Using Regular Expressions to Extract Fields for Hive Tables