I am getting all documents from a mongodb collection (millions), and I have a lot of regex in a postgreSQL. I wanted to test each regex until one match on multiple fields containded in documents.
Do you have any idea how to do that ?
I tried with a Filter Row
step, but I can't figure how to loop over all regex from postgreSQL.
You can solve your problem by using a Join rows (Cartesian Product) component. One of your inputs will have to read in the docs, the other will have to read in the regular expressions. The join component will create a outer product from these resulting in every possible combination of regex expressions and docs. This stream you will have to feed into the Filter Rows component and send the result to some output.
The following transformation will mimick this approach (it reads from CSV files but that should not make any difference to reading it from postgreSQL or MungoDB):
The input data for "documents" is configured as follows:
The input data for "regular expressions" is configured as follows:
The Join Rows does not have to be configured at all since we will NOT provide a join condition and hence making it effectively an full outer join.
In the Filter component you will have to use the DOC_TEXT
and the REGEX_TEXT
fields to execute the check base upon REGEXP
operator.
For this document input
DOC_ID;DOC_TEXT
1;DFGBGGG
2;UHLLJAL
3;JJJJHHH
4;FGAKKBL
and this regex input
REGEX_ID;REGEX_TEXT
1;.*A.*
2;.*B.*
the transformation will output the following result:
DOC_ID;DOC_TEXT;REGEX_ID;REGEX_TEXT
1;DFGBGGG;2;.*B.*
2;UHLLJAL;1;.*A.*
4;FGAKKBL;1;.*A.*
4;FGAKKBL;2;.*B.*