hadoop, apache-pig, log-analysis

Log analysis with Apache Pig


I have logs with rows like this:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

where the first column (in24.inetnebr.com) is the host, the second (01/Aug/1995:00:00:01 -0400) is the timestamp, and the third (GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0) is the downloaded page.
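For reference, the three fields described above can be pulled out of a row with a regular expression. This is a minimal Python sketch, separate from any Pig job; the pattern and the names `LOG_PATTERN` and `parse_line` are illustrative assumptions based only on the single sample row shown, and `%b` assumes an English month abbreviation.

```python
import re
from datetime import datetime

# Hypothetical pattern for the sample row: host, two ignored fields,
# bracketed timestamp, quoted request, then status code and size.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" \S+ \S+$'
)

def parse_line(line):
    """Return (host, timestamp, request), or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    # Example timestamp text: 01/Aug/1995:00:00:01 -0400
    timestamp = datetime.strptime(match.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("host"), timestamp, match.group("request")
```

Rows that do not fit the pattern come back as `None`, so they can be filtered out before any per-host analysis.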

How can I find the last two pages downloaded by each host with Pig?

Thank you very much for your help!


Solution

  • I've solved the problem, FYI:

    REGISTER piggybank.jar
    DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();
    
    raw = LOAD 'nasa' USING org.apache.hcatalog.pig.HCatLoader();
    
    -- Cast the fields to chararray so that string functions can be used
    rawCasted = FOREACH raw GENERATE (chararray)host as host, (chararray)xdate as xdate, (chararray)address as address;
    
    -- Cut the date out of the timestamp field and keep only the columns used
    rawParsed = FOREACH rawCasted GENERATE host, SUBSTRING(xdate,1,20) as xdate, address;
    
    -- Make sure that incomplete rows (no timestamp) are omitted
    rawFiltered = FILTER rawParsed BY xdate IS NOT NULL;
    
    -- Convert the timestamp string to a datetime
    analysisTable = FOREACH rawFiltered GENERATE host, ToDate(xdate, 'dd/MMM/yyyy:HH:mm:ss') as xdate, address;
    
    aTgrouped = GROUP analysisTable BY host;
    
    resultsB = FOREACH aTgrouped {
        elems = ORDER analysisTable BY xdate DESC;
        two = LIMIT elems 2; -- Choose the last two pages
    
        fstB = ORDER two BY xdate DESC;
        fst = LIMIT fstB 1; -- Choose the last page
    
        sndB = ORDER two BY xdate ASC;
        snd = LIMIT sndB 1; -- Choose the previous page
    
        GENERATE FLATTEN(group), fst.address, snd.address; -- Put together the pages
    };
    DUMP resultsB;
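
To sanity-check the Pig output on a small sample, the same group, sort and take-two logic can be written in plain Python. This is only a sketch, assuming the records are already parsed into (host, timestamp, address) tuples; `last_two_pages` is a hypothetical helper and not part of Pig or piggybank.

```python
from collections import defaultdict

def last_two_pages(records):
    """Map each host to its latest and previous address, newest first."""
    by_host = defaultdict(list)
    for host, timestamp, address in records:
        by_host[host].append((timestamp, address))

    result = {}
    for host, visits in by_host.items():
        # Newest first, mirroring ORDER ... BY xdate DESC followed by LIMIT 2.
        newest = sorted(visits, key=lambda visit: visit[0], reverse=True)[:2]
        result[host] = [address for _, address in newest]
    return result
```

One difference to keep in mind when comparing: for a host with a single request this sketch returns a one-element list, whereas the nested FOREACH above should yield that same address in both the last and the previous column, since both are taken from the same one-row bag.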