How can I add a header row to files created from Pig (Hadoop)?


I'm writing a Pig Latin script similar to the following:

A = load 'data' using PigStorage('\t');
store A into 'my_data' using PigStorage();

This outputs

(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

I'd like to add a header row as the first line of each file stored in HDFS:

(Name, Age, GPA)
(Bob, 10, 4.0)
(Jim, 11, 3.25)
(Paul, 9, 2.75)

Any ideas?


Solution

  • This doesn't really make sense for Pig. Each line is a separate record of data, so unless there really is a person named Name, with an age of Age and a GPA of GPA, such a line is wrong. Also, Pig makes no guarantees about the order in which records are output (unless you use ORDER BY), so your header row might show up anywhere.

    What you are really asking for is a way to keep your schema around after Pig is done with its work, so that you don't have to remember it or look it up somewhere. Starting with Pig 0.10, PigStorage can do this when given the -schema option: it stores the schema of the relation as a JSON file, .pig_schema, in the same directory as the output. See this page for more detailed information about what that is and how to use it.
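
    As a minimal sketch of that approach, the script from the question could be changed along these lines. The field names and types (name, age, gpa) are assumptions taken from the header the question asks for, not something the original script declares, and the tab delimiter on the store is likewise a choice made here for illustration.

    ```pig
    -- Declare a schema on load so Pig has field names to record.
    -- The names and types below are assumed from the desired header (Name, Age, GPA).
    A = LOAD 'data' USING PigStorage('\t') AS (name:chararray, age:int, gpa:double);

    -- The '-schema' option asks PigStorage to write a hidden .pig_schema JSON file
    -- into the output directory, next to the part files.
    STORE A INTO 'my_data' USING PigStorage('\t', '-schema');
    ```

    The part files themselves still contain only data records. As far as I know, the same option also writes a hidden .pig_header file holding the delimited column names, which you could prepend yourself when exporting the data out of HDFS, but check the output directory for your Pig version before relying on it.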