Suppose we have a job like this:

    class MRjob(JobTask):
        def output(self):
            return ...

        def requires(self):
            return ...

        def mapper(self, line):
            # some line processing
            yield key, (...information, stored in a hashable type...)

        def reducer(self, key, values):
            # some reduce logic, for example:
            unique = set(values)
            for elem in unique:
                yield key, elem[0], elem[1]
What should I do inside the output method to insert the data into an existing table partition? (The table is stored in ORC format.) I'd like to skip the step of converting the data to ORC myself, so I tried

    return HivePartitionTarget(self.insert_table, database=self.database_name, partition=partition)

but this didn't work. I also found that luigi tries to write its output to a file: with HivePartitionTarget, luigi raises an error like 'object has no attribute write', so my assumption is that HivePartitionTarget simply doesn't implement a write method. I therefore think I'm doing something wrong and should be using another approach, but I didn't manage to find a single example.
I don't have much idea of how this can be achieved directly in luigi. What I can suggest is a simple approach: have the luigi script write its output in a normal delimited format (say, comma-delimited).
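One way to produce such comma-delimited rows is to join the fields yourself before the task writes them out. This is a minimal sketch, not luigi API: `format_row` is a hypothetical helper, and it assumes the fields themselves contain no commas (quote or escape them if they can).

```python
def format_row(key, elem):
    """Join a reducer key and a value tuple into one comma-delimited line.

    Hypothetical helper: assumes none of the fields contain commas.
    """
    return ",".join(str(field) for field in (key, elem[0], elem[1]))

# For example, for key "k1" and elem ("a", 3):
line = format_row("k1", ("a", 3))
# line == "k1,a,3"
```

Rows in this shape can then be read by the external table declared below via FIELDS TERMINATED BY ','.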
Create an external Hive table on top of that output:

    CREATE EXTERNAL TABLE temp_table (
        <col_name> <col_type>,
        <col_name2> <col_type>
        .......
        .......
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/hive/data/weatherext';
Then insert the data into the original table (which stores its data in ORC format) using a simple Hive insert-into-select query:
INSERT INTO TABLE target_table
PARTITION( xxx )
SELECT
COL_NAME1,
COL_NAME2
FROM temp_table;
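If you want to issue that statement from Python (for example, before handing it to a Hive client), you could assemble it with a small helper. This is a hypothetical sketch; `build_insert_query`, the table names, and the partition spec are illustrative, not part of luigi or Hive's API.

```python
def build_insert_query(target_table, partition_spec, columns, source_table):
    """Build the Hive insert-into-select statement shown above.

    Hypothetical helper; all names passed in are illustrative and
    should not come from untrusted input (no escaping is done here).
    """
    cols = ",\n    ".join(columns)
    return (
        f"INSERT INTO TABLE {target_table}\n"
        f"PARTITION ({partition_spec})\n"
        f"SELECT\n    {cols}\n"
        f"FROM {source_table};"
    )

# Example with a made-up partition spec:
query = build_insert_query(
    "target_table", "dt='2020-01-01'",
    ["col_name1", "col_name2"], "temp_table",
)
```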
Your target table will then have the data in ORC format, and Hive will handle the conversion for you. For the detailed syntax, refer to https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries