regex perl regex-greedy lookbehind negative-lookbehind

How to match words separated by a single space vs. words separated by multiple spaces


I need to separate the keys and values in text that looks like this:

Student ID:  0
Department ID =          18432
Name                        XYZ

Subjects:
Computer Architecture
Advanced Network Security 2

In the above example, Student ID, Department ID and Name are the keys, and 0, 18432 and XYZ are the values. The keys are separated from the values by :, = or multiple spaces. I tried a regex such as

    $line =~ /(([\w\(\)]*\s)*)([=:\s?]?)\s*(\S.*)?$/;
    $key   = $2;
    $colon=$3;
    $value = $4;

The problem I am facing is telling when two words are separated by a single space and when they are separated by more than one.

For the line Student ID: 0, the output I get is key = Student and value = ID: 0, while I want key = Student ID and value = 0. For lines like Subjects: and Computer Architecture, the whole line should be the key (Subjects and Computer Architecture, respectively). Later logic handles lines with no value or colon: I append the string to the previous key, so the result looks like Subjects=Computer Architecture;Advanced Network Security 2

Update: Thanks to Ikegami for suggesting that I use the lookbehind operator, but I still can't get it to work.

    $line=~/^(?: ( [^:=]+ ) (?<!\s\s)\s* [:=]\s*|\s*)(.*)$/x;

By (?<!\s\s)\s* [:=]\s*|\s* I mean: when there are two or more spaces, consume all the spaces; when there are not two consecutive spaces, look for : or = and consume the spaces around it. So if I pass the line below to the expression, shouldn't I get $1=Name and $2=ABC XYZ?

Name         ABC XYZ

What I actually get is an empty key, and the value is Name ABC XYZ.
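
The behaviour can be reproduced with a short script. This is only a sketch: try_split is a hypothetical helper name, and the sample lines are taken from the text above. Note that the first alternative always requires a : or =, so a line containing neither can only match through the bare \s* branch, which matches zero characters at the start of the line and leaves everything to (.*).

```perl
use strict;
use warnings;

# Hypothetical helper: applies the pattern from the update and returns
# ($1, $2); returns an empty list if the pattern does not match.
sub try_split {
    my ($line) = @_;
    return unless $line =~ /^(?: ( [^:=]+ ) (?<!\s\s)\s* [:=]\s*|\s*)(.*)$/x;
    return ($1, $2);
}

for my $line ('Student ID:  0', 'Name         ABC XYZ') {
    my ($key, $value) = try_split($line);
    # $1 is undef when the second alternative matched, so print a marker.
    printf "key=[%s] value=[%s]\n", $key // '(undef)', $value // '(undef)';
}
```

The first line gives key=[Student ID] value=[0]; the second leaves the key undefined and puts the whole line into the value.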


Solution

  • If

    Name Eric Brine
    Computer Architecture x86
    

    means

    key: Name Eric               value: Brine
    key: Computer Architecture   value: x86
    

    then you want

    # Requires Perl 5.10 (named captures)
    if (/
       ^
       (?: (?<key> [^:=]+ (?<!\s) ) \s* [:=] \s* (?<val> .*  )
       |   (?<key> .+     (?<!\s) ) \s+          (?<val> \S+ )
       )
       \s* $
    /x) {
       my $key = $+{key};
       my $val = $+{val};
       ...
    }
    

    or

    if (/
       ^
       (?: ( [^:=]+ (?<!\s) ) \s* [:=] \s* ( .*  )
       |   ( .+     (?<!\s) ) \s+          ( \S+ )
       )
       \s* $
    /x) {
       my ($key,$val) = defined($1) ? ($1,$2) : ($3,$4);
       ...
    }
    

  • If

    Name Eric Brine
    Computer Architecture x86
    

    means

    key: Name       value: Eric Brine
    key: Computer   value: Architecture x86
    

    then you want

    # Requires Perl 5.10 (named captures)
    if (/
       ^
       (?: (?<key> [^:=]+ (?<!\s) ) \s* [:=]
       |   (?<key> \S+ ) \s
       )
       \s*
       (?<val> .* )
    /x) {
       my $key = $+{key};
       my $val = $+{val};
       ...
    }
    

    or

    if (/
       ^
       (?: ( [^:=]+ (?<!\s) ) \s* [:=]
       |   ( \S+ ) \s
       )
       \s*
       ( .* )
    /x) {
       my $key = defined($1) ? $1 : $2;
       my $val = $3;
       ...
    }
    

    Note that you can remove all the whitespace and line breaks from the pattern, provided you also drop the /x flag. For example, the last snippet can be written as:

    if (/^(?:([^:=]+(?<!\s))\s*[:=]|(\S+)\s)\s*(.*)/) {
       my $key = defined($1) ? $1 : $2;
       my $val = $3;
       ...
    }
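
To compare the two interpretations, the patterns above can be wrapped in small helpers and run against lines shaped like those in the question. This is a minimal sketch: the helper names split_last_word and split_first_word are hypothetical and not part of the answer, and the first helper uses the positional-capture form anchored with \s* $.

```perl
use strict;
use warnings;

# Interpretation 1 (hypothetical helper): without ":" or "=", the last
# word is the value and everything before it is the key.
sub split_last_word {
    my ($line) = @_;
    return unless $line =~ /
        ^
        (?: ( [^:=]+ (?<!\s) ) \s* [:=] \s* ( .*  )
        |   ( .+     (?<!\s) ) \s+          ( \S+ )
        )
        \s* $
    /x;
    return defined($1) ? ($1, $2) : ($3, $4);
}

# Interpretation 2 (hypothetical helper): without ":" or "=", the first
# word is the key and the rest of the line is the value.
sub split_first_word {
    my ($line) = @_;
    return unless $line =~ /^(?:([^:=]+(?<!\s))\s*[:=]|(\S+)\s)\s*(.*)/;
    return (defined($1) ? $1 : $2, $3);
}

my @samples = (
    'Student ID:  0',
    'Department ID =          18432',
    'Name                        XYZ',
    'Name Eric Brine',
);

for my $line (@samples) {
    my ($k1, $v1) = split_last_word($line);
    my ($k2, $v2) = split_first_word($line);
    print "$line\n";
    print "  last word as value: key=[$k1] value=[$v1]\n";
    print "  first word as key:  key=[$k2] value=[$v2]\n";
}
```

For Name Eric Brine the first helper yields key Name Eric and value Brine, while the second yields key Name and value Eric Brine. The other three sample lines come out the same under both helpers.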