hadoop, apache-pig, udf

Pig UDF or Pig Latin or both?


In which cases should we use a Pig UDF, and in which cases should we use Pig Latin?

Context: I'm working on a project to rebuild a SQL "logs" database, and I have to design the new NoSQL database. I'm learning NoSQL and have little knowledge of Hadoop/Cloudera.

  1. I want to use Pig to load data
  2. I'm not using Cloudera but might use it

Thanks for your answers.


Solution

  • If you can do it in Pig (or Hive), do it in Pig (or Hive).

    Otherwise, do it in Java MapReduce.

    Benefits of Pig:

      • Structured data like CSV is REALLY easy to load and use (a quick sketch follows this list)
      • Not that much slower than Java
      • Not prone to Java-level bugs
      • Easier to read and write
      • No need to compile: easier to maintain, easier to deploy
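    As a sketch of that first point, here is roughly what loading and summarizing a CSV log file looks like in Pig Latin. The path, field names, and schema are hypothetical:

        -- Load a comma-separated log file with an explicit schema
        logs = LOAD '/data/logs.csv' USING PigStorage(',')
               AS (ts:chararray, user:chararray, status:int);
        -- Keep only server errors and count them per user
        errors  = FILTER logs BY status >= 500;
        by_user = GROUP errors BY user;
        counts  = FOREACH by_user GENERATE group AS user, COUNT(errors) AS n;
        STORE counts INTO '/out/error_counts';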

    There are a few things you may think you can't do in Pig at first and want to use Java for, but you can do them in Pig once you know more about it:

      • You can write user-defined loaders in Java. You are going to write some Java to parse out that complicated data format anyway, so why not do it in a Pig Loader?
      • Nested map and bag datatypes can model hierarchical data structures pretty well, though you'll probably have to write a ton of UDFs.
      • You can use Java MapReduce in Pig. This lets you do the hard operation in Java MapReduce, but keep the easier stuff in Pig.

    There are a few more, but you get the point: Pig is very customizable, and you'll end up writing less Java in general. A sketch of wiring in custom Java code follows.
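    To make the loader and UDF points concrete, here is a hedged sketch of how custom Java code plugs into an otherwise ordinary script. The jar name and the com.example classes are hypothetical stand-ins for your own LoadFunc and EvalFunc implementations:

        -- Make the jar containing the custom loader and UDF visible to Pig
        REGISTER myudfs.jar;
        -- A custom Java loader parses the complicated format once, up front
        raw = LOAD '/data/app.log' USING com.example.MyLogLoader()
              AS (ts:chararray, payload:map[]);
        -- A Java EvalFunc is then invoked like any built-in function
        parsed = FOREACH raw GENERATE ts,
                 com.example.ExtractSession(payload) AS session;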

    Basic stuff is easy. Things like hierarchical data structures and custom loading can be handled with a bit of effort. OK, so what's left?

      • Exotic uses of partitioners to do something MapReduce isn't intended for.
      • Really nasty data formats or completely unstructured data (video, audio, raw human-readable text).
      • Complex operations in the DistributedCache (basic cases are covered by JOIN ... USING 'replicated'; see the sketch after this list).

    Hopefully others can add things they couldn't do in Pig in the comments.
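    For the replicated-join case mentioned above, a minimal sketch (paths and schemas hypothetical): the small relation is loaded into memory on every mapper, which covers many situations where you might otherwise reach for the DistributedCache by hand:

        big   = LOAD '/data/events'   AS (user:chararray, action:chararray);
        small = LOAD '/data/user_dim' AS (user:chararray, country:chararray);
        -- 'replicated' ships the second (small) relation to each mapper
        joined = JOIN big BY user, small BY user USING 'replicated';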