I'm trying to read a movie dataset I got from Kaggle using Apache Pig. One of the .csv files is named "keywords.csv" and it has tuples like this:
862,[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]
8844,[{'id': 10090, 'name': 'board game'}, {'id': 10941, 'name': 'disappearance'}, {'id': 15101, 'name': "based on children's book"}, {'id': 33467, 'name': 'new home'}, {'id': 158086, 'name': 'recluse'}, {'id': 158091, 'name': 'giant insect'}]
.
.
.
The first field is the movie id, and the second field is a JSON-like string listing the keywords related to that movie along with their ids. Every .csv file in the dataset uses a comma as the field separator, but this causes a problem when loading keywords.csv, because the second field itself contains commas. Here is how I'm trying to load the table:
keywords = load 'dataset/keywords.csv' USING PigStorage(',') as (id:int, keywords:chararray);
fltr = filter keywords by id == 862;
DUMP fltr;
That only prints (862,"[{'id': 931)
when I was expecting it to print something like this:
(862,[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}])
I wanted that output so that I could save the keywords column to a new file with the .json extension and then use JsonLoader() to extract the keywords.
How should I go about doing this? Or is it possible to read the keywords directly, without having to save them to an external .json file? Thank you.
I just found out about maps in Apache Pig; here is my latest attempt:
keywords = load 'dataset/keywords.csv' USING PigStorage(',') as (id:int, keywords:[{keyId:int,name:chararray}]);
That throws an error: Syntax error, unexpected symbol at or near 'int'
I think you need to use Twitter's Elephant Bird to parse a single JSON column in Pig. (If you wanted to parse files that are JSON-only, you could simply use Pig's JsonLoader.)
Here is a related question. It looks like your JSON is also an array, so what's written there should apply to you, too.
In case that doesn't work, here's a blog post describing how to write a Python UDF for a more specific case of JSON parsing. You can of course do the same thing with a Java UDF.
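As a rough illustration of the Python UDF route, here is a minimal sketch of the parsing logic only. It rests on a few assumptions: the function names split_row and keyword_names are hypothetical, the second column is double-quoted in the raw file (the stray " in your DUMP output suggests so), and the keyword list is really Python-literal syntax (single-quoted keys) rather than strict JSON, which is why it uses ast.literal_eval instead of a JSON parser.

```python
import ast
import csv


def split_row(line):
    """Split one raw line of keywords.csv into (movie_id, keywords_text).

    Assumption: the keyword list is either double-quoted, or it is simply
    everything after the first comma on the line.
    """
    # csv.reader honours double-quoted fields, so commas inside a quoted
    # keyword list do not end the field early.
    fields = next(csv.reader([line]))
    # If the list was not quoted, it was split on every comma;
    # re-joining the remaining pieces restores the original text.
    return int(fields[0]), ",".join(fields[1:])


def keyword_names(keywords_text):
    """Return the 'name' of every keyword in the list-of-dicts string."""
    # The sample rows use single quotes, which strict JSON does not allow,
    # so parse them as a Python literal. literal_eval only accepts literals
    # and does not execute arbitrary code.
    entries = ast.literal_eval(keywords_text)
    return [entry["name"] for entry in entries]


if __name__ == "__main__":
    # Shortened version of the first sample row, in its assumed quoted form.
    sample = "862,\"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]\""
    movie_id, text = split_row(sample)
    print(movie_id, keyword_names(text))
```

To call this from Pig, the functions would still need to be registered as a UDF (Pig supports Jython UDFs through REGISTER ... USING jython) and given an output schema, and each line would have to reach the UDF unsplit. That wiring is not shown here.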