Tags: python, hadoop, mapreduce, etl, luigi

How to write output to a partitioned table in ORC format with luigi?


Suppose we have a job like this:

class MRjob(JobTask):
  def output(self):
    return ...

  def requires(self):
    return ...

  def mapper(self, line):
    # some line process
    yield key, (...information, stored in hashable type...)

  def reducer(self, key, values):
    # some reduce logic... for example this
    unique = set(values)
    for elem in unique:
      yield key, elem[0], elem[1]

What should I do inside the output method to insert data into an existing table partition (the table is stored in ORC format)? I'd like to skip the step of converting the data to ORC myself, so I tried:

return HivePartitionTarget(self.insert_table, database=self.database_name, partition=partition)

but this didn't work. I also found that luigi tries to write the output to a file. With HivePartitionTarget, luigi raises an error like 'object has no attribute write', so my assumption is that HivePartitionTarget simply has no write method. I therefore think I'm doing something wrong and should use another approach, but I haven't managed to find a single example.


Solution

  • I don't know much about how this can be achieved in luigi directly. What I can suggest is a simple approach: write the output of the luigi job in a plain delimited format (say, comma-delimited).
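As a minimal sketch of the "plain delimited format" idea, the reducer from the question could serialize each row into one comma-delimited line. The helper names `to_delimited_line` and `reduce_to_lines` are hypothetical, and the code deliberately does not import luigi; how the yielded line ends up in the job's output file depends on your luigi setup. It also assumes that a plain `ROW FORMAT DELIMITED` table does not interpret CSV-style quoting, so fields containing the delimiter are rejected instead of quoted.

```python
def to_delimited_line(fields, delimiter=","):
    """Join the fields of one row into a single delimited line.

    Raises ValueError if a field contains the delimiter or a newline,
    because such a value would be split into extra columns or rows.
    """
    parts = [str(field) for field in fields]
    for part in parts:
        if delimiter in part or "\n" in part:
            raise ValueError("field %r cannot be stored safely" % part)
    return delimiter.join(parts)


def reduce_to_lines(key, values):
    """Standalone version of the reducer logic from the question.

    Deduplicates the values for one key and yields one delimited line
    per unique value. Sorting only makes the output order deterministic.
    """
    for elem in sorted(set(values)):
        yield to_delimited_line([key, elem[0], elem[1]])
```

Inside the job, the body of `reduce_to_lines` would replace the body of `reducer(self, key, values)`, so that each output record is already in the layout the external table expects.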

    Create an external hive table on top of that:

    CREATE EXTERNAL TABLE temp_table (
    <col_name> <col_type>,
    <col_name2> <col_type>
    .......
    .......
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/hive/data/weatherext';
    

    Insert the data into the original table (which stores its data in ORC format) using a simple Hive INSERT INTO ... SELECT query.

    INSERT INTO TABLE target_table
    PARTITION( xxx )
    SELECT 
    COL_NAME1,
    COL_NAME2
    FROM temp_table;
    

    Your target table will then hold the data in ORC format, and Hive handles the conversion for you.

    For detailed syntax, refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries
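If the insert step should be driven from Python as well, the statement can be assembled as a string first. This is only a sketch: `build_insert_query` is a hypothetical helper, the partition column `dt` in the usage note is an invented example, and the values are interpolated without any escaping, so they are assumed to come from trusted task parameters.

```python
def build_insert_query(target_table, staging_table, columns, partition):
    """Build an INSERT INTO ... PARTITION (...) SELECT ... statement.

    `partition` maps partition column names to their static values.
    No escaping is done; all inputs are assumed to be trusted.
    """
    partition_spec = ", ".join(
        "{}='{}'".format(name, value) for name, value in partition.items()
    )
    column_list = ", ".join(columns)
    return (
        "INSERT INTO TABLE {target} PARTITION ({spec}) "
        "SELECT {cols} FROM {staging}"
    ).format(
        target=target_table,
        spec=partition_spec,
        cols=column_list,
        staging=staging_table,
    )
```

For example, `build_insert_query("target_table", "temp_table", ["col_name1", "col_name2"], {"dt": "2020-01-01"})` produces a single-line statement for that partition. One possible way to run it, assuming your luigi version ships `luigi.contrib.hive`, is to return the string from the `query()` method of a `HiveQueryTask` subclass that requires the MapReduce job.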