
Apache-pig Number Extraction After a specific String


I have a file with 10,1900 lines in which each line contains five '|' delimiters (so six columns). The sixth column holds a statement like "Dropped 12 (0.01%)". I want to extract the number inside the parentheses that follows "Dropped":

Actual -- Dropped 12 (0.01%)

Expected -- 0.01

I need a solution using Apache Pig.


Solution

  • You are looking for the REGEX_EXTRACT function.

    Let's say you have a relation A whose sixth column, col6, looks like:

    +--------------------+
    |        col6        |
    +--------------------+
    | Dropped 12 (0.01%) |
    | Dropped 24 (0.02%) |
    +--------------------+
    

    You can extract the number in parentheses with the following:

    B = FOREACH A GENERATE REGEX_EXTRACT(col6, '.*\\((.*)%\\)', 1) AS percent;
    
    +---------+
    | percent |
    +---------+
    | 0.01    |
    | 0.02    |
    +---------+
    

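    I'm specifying a regex capture group for whatever characters are between ( and %). Notice that I'm using \\ as the escape so that the regex matches the literal opening and closing parentheses: ( and ) are regex metacharacters, and the backslash is doubled because it sits inside a Pig string literal. The last argument, 1, selects the first capture group.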
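
    For completeness, here is a minimal end-to-end sketch of how this could fit the file described in the question. The path 'input.txt' and the field names col1 through col6 are placeholders I made up, not anything from the question. The pattern is also a slightly stricter variant that only captures digits and dots, and the cast to double is optional if you would rather keep the value as text.

    ```pig
    -- 'input.txt' is a placeholder path; col1..col6 are assumed field names.
    -- PigStorage('|') splits each line on the pipe delimiter.
    A = LOAD 'input.txt' USING PigStorage('|')
        AS (col1:chararray, col2:chararray, col3:chararray,
            col4:chararray, col5:chararray, col6:chararray);

    -- Capture the digits and dots between '(' and '%)'.
    -- Rows whose sixth column does not match the pattern yield null.
    B = FOREACH A GENERATE
        (double) REGEX_EXTRACT(col6, '.*\\(([0-9.]+)%\\).*', 1) AS percent;

    DUMP B;
    ```

    The leading and trailing .* are there so the pattern covers the whole field, which should keep the result the same whether REGEX_EXTRACT treats the regex as a full-string match or a substring search in your Pig version.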