Search code examples
hadoopimpala

Impala external table with tab separated values and field names


I have some data in HDFS that I would like to create an external table and query via Impala. The data is tab separated but also contains the field name. Example data:

state:IL     city:chicago     population:2714856
state:NY     city:New York     population:8336697

I know how to create a table and specify the data is tab delimited, but is there a way to handle the fields in the data?


Solution

  • Impala

    The solution in Impala uses the same REGEXP_EXTRACT logic as the Pig example that I posted earier.

    --csp.txt (input file, residing in /user/cloudera/csp)

    state:New York  city:New York   population:8336697
    state:California        city:Los Angeles        population:3857799
    state:Illinois  city:Chicago    population:2714856
    state:Texas     city:Houston    population:2160821
    state:Pennsylvania      city:Philadelphia       population:1547607
    state:Arizona   city:Phoenix    population:1488750
    state:Texas     city:San Antonio        population:1382951
    state:California        city:San Diego  population:1338348
    state:Texas     city:Dallas     population:1241162
    state:California        city:San Jose   population:982765
    state:Texas     city:Austin     population:842592
    



    Create Database and External Table

    CREATE DATABASE IF NOT EXISTS CSP COMMENT 'City, State, Population';
    
    DROP TABLE IF EXISTS CSP.original;
    
    CREATE EXTERNAL TABLE IF NOT EXISTS CSP.original 
    (
        st STRING COMMENT 'State', 
        ct STRING COMMENT 'City', 
        po STRING COMMENT 'Population'
    ) 
    COMMENT 'Original Table' 
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' 
    LOCATION '/user/cloudera/csp';
    



    Select statement to regexp out the "state:", "city:" and "population:" text

    SELECT 
      regexp_extract(st, '.*:(\\w.*)', 1) AS state, 
      regexp_extract(ct, '.*:(\\w.*)', 1) AS city, 
      regexp_extract(po, '.*:(\\w.*)', 1) AS population 
    FROM original;
    



    Query Results

    [localhost.localdomain:21000] > select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11;
    Query: select regexp_extract(st, '.*:(\\w.*)', 1) AS state, regexp_extract(ct, '.*:(\\w.*)', 1) AS city, regexp_extract(po, '.*:(\\w.*)', 1) AS population FROM original limit 11
    +--------------+--------------+------------+
    | state        | city         | population |
    +--------------+--------------+------------+
    | New York     | New York     | 8336697    |
    | California   | Los Angeles  | 3857799    |
    | Illinois     | Chicago      | 2714856    |
    | Texas        | Houston      | 2160821    |
    | Pennsylvania | Philadelphia | 1547607    |
    | Arizona      | Phoenix      | 1488750    |
    | Texas        | San Antonio  | 1382951    |
    | California   | San Diego    | 1338348    |
    | Texas        | Dallas       | 1241162    |
    | California   | San Jose     | 982765     |
    | Texas        | Austin       | 842592     |
    +--------------+--------------+------------+
    Returned 11 row(s) in 0.22s
    





    Pig

    The easiest way for me to conceptualize the process was actually in Pig first, so I mocked up a data file using your syntax, and created the program in Pig. The output of the program is a csv formatted file, which can be used to create the Impala external table, if you like.

    --csp.pig

    REGISTER piggybank.jar
    
    A = LOAD 'csp.txt' USING PigStorage('\t') AS (st:chararray,ct:chararray,po:chararray);
    
    data = FOREACH A GENERATE 
    REGEX_EXTRACT(st, '.*:(\\w.*)', 1) AS (state:chararray),
    REGEX_EXTRACT(ct, '.*:(\\w.*)', 1) AS (city:chararray),
    REGEX_EXTRACT(po, '.*:(\\w.*)', 1) AS (population:int);
    
    STORE data INTO 'csp' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE');
    

    --csp.txt (input)

    state:New York  city:New York   population:8336697
    state:California        city:Los Angeles        population:3857799
    state:Illinois  city:Chicago    population:2714856
    state:Texas     city:Houston    population:2160821
    state:Pennsylvania      city:Philadelphia       population:1547607
    state:Arizona   city:Phoenix    population:1488750
    state:Texas     city:San Antonio        population:1382951
    state:California        city:San Diego  population:1338348
    state:Texas     city:Dallas     population:1241162
    state:California        city:San Jose   population:982765
    state:Texas     city:Austin     population:842592
    

    --csp (output)

    New York,New York,8336697
    California,Los Angeles,3857799
    Illinois,Chicago,2714856
    Texas,Houston,2160821
    Pennsylvania,Philadelphia,1547607
    Arizona,Phoenix,1488750
    Texas,San Antonio,1382951
    California,San Diego,1338348
    Texas,Dallas,1241162
    California,San Jose,982765
    Texas,Austin,842592