Search code examples
hadoophivemapreduceapache-pig

PIG mapreduce output and HIVE


I have a file called test.txt with the records as below (disregard the dots):

"1" "a" "x" 
"2" "b" "y" 
"3" "c" "z"

(tab as field separator)

My pig script (test.pig):

a=LOAD '/Analytics/warehouse/SF/test.txt' as (fullrecord:chararray);

b=FOREACH a generate REPLACE($0,'\t',',');

STORE b INTO 'hdfs://localhost:9000/Analytics/warehouse/SF/sf.out' USING PigStorage(',');

I run the script with: pig -x mapreduce test.pig

the output:

.../warehouse/SF/sf.out

part-m-0000

And the content is only:

"1"
"2"
"3"

Q1- What happened with the other fields?

Q2- Why the tab characters wasn't changed by ","?

Q3- How can i achieve the next result?

"1","a","x" 
"2","b","y" 
"3","c","z"

Q4- How can i query that result with HIVE?


Solution

  • What happened with the other fields

    LOAD defaults to using tab delimiter. Your GENERATE only grabbed the first column. You need USING PigStorage('\n') to load the whole line. But you can also not do this, and just remove the FOREACH, then STORE with PigStorage(',')

    Why the tab characters wasn't changed

    Related to above, there is nothing to replace when you only have one element.

    How can i query that result with HIVE?

    Use HCatalog, not PigStorage -

    STORE data_alias INTO 'tablename' USING org.apache.hcatalog.pig.HCatStorer();
    

    Then query the table.

    Or you would need to define a Hive External table over the HDFS data.

    You can also skip using Pig altogether; Hive can query files that have tabs - STORED AS TEXT FIELDS DELIMITED BY '\t'