hadoop apache-pig

How is PigStorage used in Hadoop, and why?


I am quite confused about why we need another storage layer, PigStorage, on top of Hadoop HDFS when using Pig to process data in Hadoop. Are files stored in PigStorage distributed? Can anyone please help explain?

Thank you.


Solution

  • PigStorage is not a storage layer. Nothing is stored "in" it; it is a load/store function that reads (and can write) delimited plaintext files.

    Storing the data as Avro or ORC is almost always a better choice than plaintext.

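    A Pig alias is just metadata (a name and a schema) over data that stays on the filesystem, which does not have to be HDFS. The files are therefore exactly as distributed as the underlying filesystem makes them.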

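    -- read tab-delimited plaintext from the filesystem (tab is PigStorage's default delimiter)
    A = LOAD '/path/file.txt' USING PigStorage();
    -- placeholder: do something with A
    B = FILTER A BY $0 IS NOT NULL;
    -- store ORC back on the same filesystem (STORE, not LOAD)
    STORE B INTO '/path_orc' USING OrcStorage();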
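
    As a further sketch of the same idea, PigStorage also accepts a field delimiter, and a schema can be attached with AS. The paths, alias names, and field names below are hypothetical and only for illustration:

    -- read comma-delimited text and attach an illustrative schema
    users = LOAD '/path/users.csv' USING PigStorage(',') AS (id:int, name:chararray, age:int);
    -- keep only the rows of interest
    adults = FILTER users BY age >= 18;
    -- write the result back to the filesystem as tab-delimited text
    STORE adults INTO '/path/adults_tsv' USING PigStorage('\t');

    In both directions the data lives in files on the filesystem; PigStorage only decides how each line is split into fields or joined back together.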