Search code examples
hadoopapache-pigcross-product

Pig - trying to avoid CROSS


I'll refer to my previous question . Basically I have these two data sets. And using the venue names I want to output how many times each venue occurred in the tweet messages. The answer that I get was good for small data sets, but imagine I have 10000 venues and 20000 tweet messages using CROSS will give me a relation with 200m records, which is quite a lot.

The simple data set is presented in the previous question and the PIG script that I'm using at the moment is as suggested in the answer. I'm looking for ideas how to do this counting without the CROSS product. Thanks!

REGISTER piggybank.jar
venues = LOAD 'venues_mid' USING org.apache.hcatalog.pig.HCatLoader();
tweets = LOAD 'tweets_mid' USING org.apache.hcatalog.pig.HCatLoader();

tweetsReduced = foreach tweets generate text;
venuesReduced = foreach venues generate name;

/* Create the Cartesian product of venues and tweets */
crossed = CROSS venuesReduced, tweetsReduced;

/* For each record, create a regex like '.*name.*' */
regexes = FOREACH crossed GENERATE *, CONCAT('.*', CONCAT(venuesReduced::name, '.*')) AS regex;


/* Keep tweet-venue pairs where the tweet contains the venue name */
venueMentions = FILTER regexes BY text MATCHES regex;

venueCounts = FOREACH (GROUP venueMentions BY venuesReduced::name) GENERATE group, COUNT($1) as counter;
venueCountsOrdered = order venueCounts by counter;

STORE venueCountsOrdered INTO 'Pig_output/venueCountsOrdered_mid.csv'
USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'WINDOWS');

tweets.csv

created_at,text,location
Sat Nov 03 13:31:07 +0000 2012, Sugar rush dfsudfhsu, Glasgow
Sat Nov 03 13:31:07 +0000 2012, Sugar rush ;dfsosjfd HAHAHHAHA, London
Sat Apr 25 04:08:47 +0000 2009, at Sugar rush dfjiushfudshf, Glasgow
Thu Feb 07 21:32:21 +0000 2013, Shell gggg, Glasgow
Tue Oct 30 17:34:41 +0000 2012, Shell dsiodshfdsf, Edinburgh
Sun Mar 03 14:37:14 +0000 2013, Shell wowowoo, Glasgow
Mon Jun 18 07:57:23 +0000 2012, Shell dsfdsfds, Glasgow
Tue Jun 25 16:52:33 +0000 2013, Shell dsfdsfdsfdsf, Glasgow

venues.csv

city,name
Glasgow, Sugar rush
Glasgow, ABC
Glasgow, University of Glasgow
Edinburgh, Shell
London, Big Ben

Solution

  • Instead of CROSS you might want to do "JOIN tweets BY location, venues BY city".

    Another try:

    The best I can think of is "To write UDF which loads all 10K venues and compile one regex pattern of all venue names (should fit in main memory = 10K*500bytes). The UDF would take the tweet message and output name of the venue matched. For each tweet message you will call this UDF. Because loading 10K venues in each mapper will take time, you might want to give more tweet messages to each mapper else you will be spending most of your time in loading the venues. I think what you are really gaining by doing this is not producing that 200M intermediate output.