regex, regex-group

Regex that extracts a string whose length is encoded in the string


I have the following string to parse:

X4IitemX6Nabc123

that is structured as follows:

  • X... marker for 'field identifier'
  • 4... length of item (name), will change according to length of item name
  • I... identifier for item name, must not be extracted, fixed
  • item... value that should be extracted as "name"
  • X... marker for 'field identifier'
  • 6... length of item number, will change according to length of item number
  • N... identifier for item number, must not be extracted, fixed
  • abc123... value that should be extracted as "num"

Only these two values will be contained in the string, and the sequence is always the same (name, number).

What I have so far is

\AX(?<namelen>\d+)I(?<name>.+)X(?<numlen>\d+)N(?<num>.+)$

But that does not take into account that the length of the name is encoded in the string itself. Somehow the .+ in the name group should be replaced by .{4}, with the 4 taken from the namelen group. I tried {$1} and {${namelen}}, but neither yields the result I expect (on rubular.com or regex101.com).

Any ideas or further references?


Solution

  • Doing this in a single pattern is only possible in regex flavors that allow code to be embedded in the pattern.

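    Here is a Perl example, where (??{...}) builds the .{4} subpattern at match time from $^N, the text captured by the most recently closed group (namelen):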

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    my $text = "X4IitemX6Nabc123";
    if ($text =~ m/^X(?<namelen>[0-9]+)I(?<name>(??{".{".$^N."}"}))X(?<numlen>[0-9]+)N(?<num>.+)$/) {
        print $text . ": PASS!\n";
    } else {
        print $text . ": FAIL!\n";
    }
    # -> X4IitemX6Nabc123: PASS!
    

    In other languages, use a two-step approach:

    1. Extract the number after X,
    2. Build a regex dynamically using the result of the first step.

    See a JavaScript example:

    const text = "X4IitemX6Nabc123";
    // Step 1: read the declared name length that follows the leading X
    const rx1 = /^X(\d+)/;
    const m1 = rx1.exec(text);
    if (m1) {
      // Step 2: use that length as a fixed quantifier for the name group
      const rx2 = new RegExp(`^X(?<namelen>\\d+)I(?<name>.{${m1[1]}})X(?<numlen>\\d+)N(?<num>.+)$`);
      if (rx2.test(text)) {
        console.log(text, '-> MATCH!');
      } else {
        console.log(text, '-> FAIL!');
      }
    } else {
      console.log(text, '-> FAIL!');
    }

    See the Python demo:

    import re
    text = "X4IitemX6Nabc123"
    rx1 = r'^X(\d+)'
    m1 = re.search(rx1, text)
    if m1:
      rx2 = fr'^X(?P<namelen>\d+)I(?P<name>.{{{m1.group(1)}}})X(?P<numlen>\d+)N(?P<num>.+)$'
      if re.search(rx2, text):
        print(text, '-> MATCH!')
      else:
        print(text, '-> FAIL!')
    else:
      print(text, '-> FAIL!')
    
    # => X4IitemX6Nabc123 -> MATCH!
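
The demos above only report whether the string matches, and they only enforce the first length prefix. As a rough sketch of how the same two-step idea could return the extracted values and also honor the second length, something like the following might work. The function name parse_item is hypothetical, and checking the number length after the match (rather than inside the pattern) is an assumption made here for simplicity, not part of the original answer.

```python
import re

def parse_item(text):
    """Return (name, num) if text follows X<len>I<name>X<len>N<num>, else None.

    parse_item is a hypothetical helper name used for illustration only.
    """
    # Step 1: read the declared length of the name field.
    m1 = re.match(r'X(\d+)I', text)
    if not m1:
        return None
    name_len = int(m1.group(1))

    # Step 2: build the second pattern with that length as a fixed quantifier.
    rx2 = re.compile(
        rf'X\d+I(?P<name>.{{{name_len}}})X(?P<numlen>\d+)N(?P<num>.+)'
    )
    m2 = rx2.fullmatch(text)
    if not m2:
        return None

    # Assumption: the second length prefix is verified after matching,
    # which avoids building a third pattern.
    if len(m2.group('num')) != int(m2.group('numlen')):
        return None
    return m2.group('name'), m2.group('num')

print(parse_item("X4IitemX6Nabc123"))  # ('item', 'abc123')
```

Because the name group has a fixed width, a name that itself looks like a field header (for example "X1N" in "X3IX1NX2Nab") is still captured whole instead of being mistaken for the start of the number field.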