regex edifact

Matching numbers after the nth occurrence of a certain symbol in a line


I'm not sure if regex is the right tool here, but I wanted to try solving this with regex first (if it's possible).

I have an EDIFACT file where the data (in bold) in certain fields of some segments needs to be replaced (with different dates, same format):

UNA:+,? '  
UNB+UNOC:3+000000000+000000000+20190801:1115+00001+DDMP190001'  
UNH+00001+BRKE:01+00+0'    
INV+ED Format 1+Brustkrebs+19880117+E000000001+**20080702**+++1+0'       
FAL+087897044+0000000++name+000000000+0+**20080702**++1+++J+N+N+N+N+N+++0'   
INL+181095200+385762115+++0'   
BEE+20080702++++0'   
BAA+++J+J++++++J+++++++J++0'   
BBA++++++++J++++++J+J++++++J+++++J+++J+J++++++++J+0'   
BHP+J+++++J+++++J+++++0'   
BLA+++J+++++++++0'   
BFA++++++++++++J++0'   
BSA++J+++J+J+++0'    
BAT+20190801+0'    
DAT+**20080702**++++0'   
UNT+000014+00001'   
UNZ+00001+00001'   

At first I was able to match those fields using a positive lookahead and a lookbehind (I had a different expression for matching each date).

Here, for example, is the expression I initially used to match the date in the "FAL" segment: (?<=\+[\d]{1}\+)\d{8}(?=\+\+). Then I saw that this date is sometimes preceded by 9 digits and sometimes by 1 (depending on the version), and followed by either ++ or a + and a date, so I added a logical OR like this: (?<=\+[\d]{9}\+|\+[\d]{1}\+)\d{8}(?=\+[\d]{8}\+|\+\+). I quickly realized this is not sustainable, because these EDIFACT files vary far beyond just 9 or 1 digits.

(I have 6 versions for each type, and I have 6 types in total.)

Because I have a scheme/map that describes how each version should be built, I know at which position (counted by the + separator) the date is written in each version. So I thought about matching the date based on the +: after the 7th occurrence of a plus in a given line (say, in the FAL segment), match the next 8 digits.

Is this possible to achieve with regex? If so, could someone please tell me how?


Solution

  • I suggest using a pattern like

    ^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)
    

    where {7} can be adjusted to the value you need for each segment type, and replace with a backreference to Group 1 followed by the new date. In Python, the replacement is \g<1>20200101 (where 20200101 is your new date); in PHP/.NET, it is ${1}20200101. In JS, a plain $1 followed by the new date is enough.

    To run on multiline text, use the m flag. In Python regex, you may embed it like (?m)^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+).
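As a minimal sketch of the Python replacement described above, the snippet below applies the pattern to two segments copied from the question (without the bold markers). The name NEW_DATE and its value are assumptions for illustration only.

```python
import re

# Pattern from the answer; (?m) makes ^ match at the start of every line.
PATTERN = re.compile(r"(?m)^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)")

# Two segments from the question, bold markers removed.
sample = (
    "FAL+087897044+0000000++name+000000000+0+20080702++1+++J+N+N+N+N+N+++0'\n"
    "INL+181095200+385762115+++0'\n"
)

NEW_DATE = "20200101"  # hypothetical replacement date

# \g<1> re-inserts everything up to and including the 7th plus sign;
# the unambiguous \g<1> form is needed because digits follow the group number.
result = PATTERN.sub(r"\g<1>" + NEW_DATE, sample)
print(result)
```

The INL line is left alone because it contains fewer than 7 plus signs, and [^+\n] prevents the match from running on into the next line.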


    Details

    • ^ - start of string/line
    • ((?:[^+\n]*\+){7}) - Group 1: 7 repetitions of zero or more chars other than + and newline, followed by a +
    • \d{8} - 8 digits
    • (?=\+(?:\d{8})?\+) - a positive lookahead requiring that the 8 digits are followed by a +, an optional chunk of 8 digits, and another +.
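Since the question mentions a map of date positions per version, one way to combine it with the pattern above is to build one regex per segment tag. This is only a sketch under assumptions: the DATE_POSITION dictionary and the replace_dates helper are hypothetical names, and the counts are read off the sample message in the question. Anchoring on the tag is an addition to the original pattern, so that other segments with 8 digits at the same position are left alone.

```python
import re

# Hypothetical map: segment tag -> number of "+" separators before the date.
# The counts are taken from the sample message in the question.
DATE_POSITION = {"INV": 5, "FAL": 7, "DAT": 1}

def replace_dates(message: str, new_date: str) -> str:
    """Replace the 8-digit date at the mapped position of each listed segment."""
    for tag, n in DATE_POSITION.items():
        # The tag and its "+" count as the first of the n separators,
        # so only n - 1 further "field plus separator" repetitions follow.
        pattern = re.compile(
            rf"(?m)^({re.escape(tag)}\+(?:[^+\n]*\+){{{n - 1}}})"
            rf"\d{{8}}(?=\+(?:\d{{8}})?\+)"
        )
        message = pattern.sub(rf"\g<1>{new_date}", message)
    return message

# Segments from the question, bold markers removed.
sample = "\n".join([
    "INV+ED Format 1+Brustkrebs+19880117+E000000001+20080702+++1+0'",
    "FAL+087897044+0000000++name+000000000+0+20080702++1+++J+N+N+N+N+N+++0'",
    "BEE+20080702++++0'",
    "DAT+20080702++++0'",
])

updated = replace_dates(sample, "20200101")
print(updated)
```

With this map, the BEE segment keeps its date because its tag is not listed, and the earlier date 19880117 in the INV segment is untouched because it sits after the 3rd separator rather than the 5th.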