hadoop, hive, user-defined-functions, hiveql, ambari

Python UDF in Hive with a JSON file


I have two problems with the Hive view in Ambari.

Problem 1: I have this script:

DELETE FILE /user/admin/hive/scripts/MKT_UDF/fb_audit_ads_creatives.py;
ADD FILE /user/admin/hive/scripts/MKT_UDF/fb_audit_ads_creatives.py;
SELECT TRANSFORM (line) USING 'python fb_audit_ads_creatives.py' AS (ad_id) FROM stg_fb_audit_ads_creatives_json WHERE date_time='2018-05-05';

I have run it many times and it usually runs smoothly, but sometimes I get an error:

You can see it here: Log error
I think it is caused by configuration (Hive, Ambari, timeouts, etc.).

Problem 2: I have this JSON file:

{
  "body": "https://www.facebook.com/groupkiemhieptinhduyen #truongsinhquyet #sapramat #tsq #game3d",
  "thumbnail_url": "https://external.xx.fbcdn.net/safe_image.php?d=AQDU01asRxdnCObW&w=64&h=64&url=https%3A%2F%2Fscontent.xx.fbcdn.net%2Fv%2Ft15.0-10%",
  "campaign_id": "23842841688740666"
}

I use the HQL script above with this Python UDF:

import json
import sys

# Each input row is expected to be one JSON document on a single line.
for line in sys.stdin:
    data = json.loads(line)
    print(data)
    print(data['thumbnail_url'])


This runs fine.

But with this Python UDF:

import json
import sys

# Same loop as before, but reading the 'body' field instead.
for line in sys.stdin:
    data = json.loads(line)
    print(data)
    print(data['body'])


I get an error: Log error


Can you help me?


Solution

  • Instead of working with Python, I recommend trying this UDTF, which allows working on JSON columns within Hive. It is then possible to manipulate large JSON documents and fetch the needed data in a distributed and optimized way.
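
If the Python script has to stay, a more defensive version of the transform logic may at least help narrow the failure down. Since the error log is not shown, the following is only a sketch under assumptions: that `python` on the worker nodes is Python 3, that each input row is one JSON object on a single line, and that the failure comes from a malformed row or from tabs or newlines inside a value such as `body`. The names `extract_field` and `transform` are hypothetical helpers, and emitting `\N` for unusable rows assumes Hive's default text format, which reads `\N` as NULL.

```python
import json


def extract_field(line, field):
    """Return `field` from one JSON line as text, or None if the row is unusable."""
    try:
        data = json.loads(line)
    except ValueError:  # malformed, truncated, or empty row
        return None
    if not isinstance(data, dict) or data.get(field) is None:
        return None
    # Hive splits TRANSFORM output on tabs and newlines, so flatten them.
    return str(data[field]).replace("\t", " ").replace("\n", " ")


def transform(lines, field):
    """Yield exactly one output row per input row (hypothetical helper)."""
    for line in lines:
        value = extract_field(line, field)
        # Assumption: \N is read back as NULL by Hive's default text format.
        yield value if value is not None else "\\N"
```

In the script passed to `USING`, the last step would be to loop over `transform(sys.stdin, "body")` and print each row. Emitting exactly one row per input row keeps the output aligned with the single column declared in `AS (...)`, and rows that come back as NULL point to the input lines worth inspecting.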