
Talend DI: extract many pairs of rows from a log file (using regex)


I am using Talend Data Integration.

I have a log file like this:

I - Fab - 392 - 2014/12/20 22:09:15:200 - XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Begin : 

I - Fab - 392 - 2014/12/20 22:12:15:438 - XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Bus / Before :

500|00104|002PL|0036364043        |005PL

809|001BBG|00365   |005-0200|006+0000|007000|0080000|0240|0250|0260|0270|0280|0290|033STK|034063100       |0441

830|0093100       |0441

I - Fab - 392 - 2014/12/20 22:12:19:766 - XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Bus / After : 

500|00104|002PL|0036364043        |005PL

510|001BBG|00365   |005-0200|006+0000|007000|0080000|0240|0250|0260|0270|0280|0290|033STK|034063100       |0441

I want to extract lines 2 and 3, and lines 6 and 7 (the pairs do not always fall on an even line followed by an odd one). I tried this regular expression:

"I - (Fab|Opt) - \\d+ - (\\d{4}/\\d{2}/\\d{2}) (\\d{2}:\\d{2}:\\d{2}:\\d{3}) - .+ Bus / (.+) : \\n500|.+|003(\\d{7}).+"

I am using it in a tFileInputRegex component, but I don't know what to set as the row separator (the default is "\n").

I want my output to be a CSV file containing the data extracted from the first and second lines of each pair.

I used a tMap to generate the CSV file, but the problem is that I cannot extract the data I want.

Once I can extract the data, I will be able to generate the file, so I need help with the regex part. Is there a way in Talend DI to match across multiple rows (two, in my case) using tFileInputRegex?

EDIT:

I have now specified I - as the row separator, so that I can use \n in the regex without any ambiguity, but the regex still doesn't seem to work.


Solution

  • The \n delimiter for multiline rows should work, so this is more likely an issue with your regex as a whole. Try a pattern such as this one, which should capture the groups correctly:

    I.+(Fab|Opt).+(\\d{4}\\/\\d{2}\\/\\d{2}).+(\\d{2}:\\d{2}:\\d{2}:\\d{3}).+Bus\\s\\/\\s(\\w+)\\s:\\W+\\n500.+003(\\d{7}).+
    

    Example:

    https://www.regex101.com/r/nL2xT7/
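
To check the pattern outside Talend, here is a minimal sketch in plain Java (java.util.regex), run against the sample lines from the question. It does not show how tFileInputRegex applies its row separator; it only tests the regex itself. The class name LogPairExtractor, the extract method and the comma-joined output are hypothetical choices for illustration. The pattern is the one above with one assumed adaptation: "\\s*\\n" replaces "\\W+\\n", so a header line with no trailing space after the colon still matches. Note also that an unescaped | is alternation in a regex, so the original "\\n500|.+|003(\\d{7}).+" is read as three separate alternatives rather than as a literal pipe-separated line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogPairExtractor {

    // Adapted from the answer's pattern. Assumption: "\\s*\\n" is used instead
    // of "\\W+\\n" so that a missing trailing space after ':' is tolerated.
    static final Pattern PAIR = Pattern.compile(
        "I.+(Fab|Opt).+(\\d{4}/\\d{2}/\\d{2}).+(\\d{2}:\\d{2}:\\d{2}:\\d{3})"
        + ".+Bus\\s/\\s(\\w+)\\s:\\s*\\n500.+003(\\d{7}).+");

    // Sample text copied from the question, with the blank lines removed.
    static final String SAMPLE = String.join("\n",
        "I - Fab - 392 - 2014/12/20 22:09:15:200 - XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Begin : ",
        "I - Fab - 392 - 2014/12/20 22:12:15:438 - XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Bus / Before :",
        "500|00104|002PL|0036364043        |005PL",
        "809|001BBG|00365   |005-0200|006+0000|007000|0080000|0240|0250|0260|0270|0280|0290|033STK|034063100       |0441",
        "830|0093100       |0441",
        "I - Fab - 392 - 2014/12/20 22:12:19:766 - XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX Bus / After : ",
        "500|00104|002PL|0036364043        |005PL",
        "510|001BBG|00365   |005-0200|006+0000|007000|0080000|0240|0250|0260|0270|0280|0290|033STK|034063100       |0441");

    /** Returns one comma-joined line per "Bus / ..." header plus "500" line pair. */
    static List<String> extract(String log) {
        List<String> rows = new ArrayList<>();
        Matcher m = PAIR.matcher(log);
        while (m.find()) {
            // Groups: 1 = Fab/Opt, 2 = date, 3 = time, 4 = Before/After, 5 = 7-digit id
            rows.add(String.join(",",
                m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)));
        }
        return rows;
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach(System.out::println);
    }
}
```

On the sample above this prints two rows, one for the "Before" pair and one for the "After" pair:

    Fab,2014/12/20,22:12:15:438,Before,6364043
    Fab,2014/12/20,22:12:19:766,After,6364043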