Search code examples
pentahokettle

Test multiple regex on each document


I am getting all documents from a mongodb collection (millions), and I have a lot of regex in a postgreSQL. I wanted to test each regex until one match on multiple fields containded in documents.

Do you have any idea how to do that ?

I tried with a Filter Row step, but I can't figure how to loop over all regex from postgreSQL.


Solution

  • You can solve your problem by using a Join rows (Cartesian Product) component. One of your inputs will have to read in the docs, the other will have to read in the regular expressions. The join component will create a outer product from these resulting in every possible combination of regex expressions and docs. This stream you will have to feed into the Filter Rows component and send the result to some output.

    The following transformation will mimick this approach (it reads from CSV files but that should not make any difference to reading it from postgreSQL or MungoDB):

    transformation

    The input data for "documents" is configured as follows:

    csv input for docs

    The input data for "regular expressions" is configured as follows:

    csv input for regular expressions

    The Join Rows does not have to be configured at all since we will NOT provide a join condition and hence making it effectively an full outer join.

    In the Filter component you will have to use the DOC_TEXT and the REGEX_TEXT fields to execute the check base upon REGEXP operator.

    enter image description here

    For this document input

    DOC_ID;DOC_TEXT
    1;DFGBGGG
    2;UHLLJAL
    3;JJJJHHH
    4;FGAKKBL
    

    and this regex input

    REGEX_ID;REGEX_TEXT
    1;.*A.*
    2;.*B.*
    

    the transformation will output the following result:

    DOC_ID;DOC_TEXT;REGEX_ID;REGEX_TEXT
    1;DFGBGGG;2;.*B.*
    2;UHLLJAL;1;.*A.*
    4;FGAKKBL;1;.*A.*
    4;FGAKKBL;2;.*B.*