
Delete portions of a string using regex in PDI KETTLE


I'm trying to clean a string with the "Replace in string" step in PDI KETTLE.

The input string looks like this:

<p class="MsoNormal" style="FONT-SIZE: 11pt; mso-ansi-language: ES"> AAA <p></p></span></p> <p class="MsoNormal" style="FONT-SIZE: 11pt; mso-ansi-language: ES"> BBB <personname w:st="on"> CCC.

The desired output is the string with every portion between a '<' and the next '>' deleted (delimiters included), like this:

AAA  BBB  CCC.

Looking at similar questions, I tried the approach from this one: "Replace string using regular expression in KETTLE".

In a "Replace in string" step, I turn on RegEx, search for (<(.*)>), and replace with nothing.

The problem is that it deletes everything between the first '<' and the last '>' chars, and the output is:

CCC.

How should I build the RegEx expression?


Solution

  • The problem is that your (.*) is greedy, so it matches as much as possible and runs all the way to the last >.

    To stop at the first > instead, you can either:

    • make the quantifier lazy, by using (<(.*?)>)
    • explicitly restrict the characters allowed inside the tag, by using (<([^>]*)>)

    Either should work and produce this output:

     AAA   BBB  CCC.
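
    To see the difference outside of PDI, here is a minimal sketch in Python that runs the three patterns against the input from the question. It assumes Python's re module as a stand-in for the Java regex engine that PDI uses; for these particular patterns the two should behave the same. The variable names are only illustrative, and the capture parentheses are dropped because nothing refers to the groups when the replacement is empty.

```python
import re

# Sample input copied from the question.
html = (
    '<p class="MsoNormal" style="FONT-SIZE: 11pt; mso-ansi-language: ES"> AAA '
    '<p></p></span></p> '
    '<p class="MsoNormal" style="FONT-SIZE: 11pt; mso-ansi-language: ES"> BBB '
    '<personname w:st="on"> CCC.'
)

# Greedy: .* grabs as much as it can, so one match spans from the
# first '<' to the last '>' and swallows AAA and BBB with it.
greedy = re.sub(r"<.*>", "", html)

# Lazy: .*? grabs as little as it can, so each match ends at the
# first '>' that follows the opening '<'.
lazy = re.sub(r"<.*?>", "", html)

# Negated class: [^>]* cannot step over a '>', so each match is
# confined to a single tag without relying on laziness.
negated = re.sub(r"<[^>]*>", "", html)

print(repr(greedy))   # ' CCC.'
print(repr(lazy))     # ' AAA   BBB  CCC.'
print(repr(negated))  # ' AAA   BBB  CCC.'
```

    One practical difference: by default . does not match a line break, so the lazy version will not remove a tag that is split across two lines, while [^>]* will. If the leftover runs of spaces matter, they need a separate clean-up pass.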