Search code examples
hadoopmapreducehiveapache-pig

Better to read from Hive or flat file in Pig?


Couldn't find a straight answer on this anywhere. I'm joining an incoming dataset to several large tables that formerly lived in MySQL tables behind a web service. I dumped the tables to flat CSV files in Hadoop, and I'm using Pig to load the incoming dataset and table files, and to perform the joins.

It's slow going, because there are several table files to join with, and because the files themselves are so large. I'm just going for LEFT OUTER joins on a single field, nothing fancy.

So, my question is, is there any performance benefit to loading the CSV files into Hive tables and using HCatLoader within Pig instead of just loading CSV files? It doesn't seem like Hive provides any benefit besides a SQL-like interface to query tables, which doesn't matter when I'm just joining a dataset to the whole thing.


Solution

  • As per Oreilly's Programming Pig(Page 165):

    Hive provides SQL, it is a better tool for doing traditional data analytics. Most data analysts are already familiar with SQL, and business intelligence tools expect to speak to data sources in SQL. Pig Latin is a better choice when building a data pipeline or doing research on raw data.

    So in simple words, Hive for better querying. PIG for building data pipelines. PIG over Hive for optimized querying.


    PIG over Hive is useful if you are going to do many joins where partitioning maybe utilized.

    If you only intend to do left joins it may not be required. Then again for further querying and analysis Hive would be better.

    Read: difference b/w hive and pig and also read this.