
Regex [Python]: extract parameters from a URL path


I have URLs from an access log. Examples:

/someService/US/getPersonFromAllAccessoriesByDescription/67814/alloy%20nudge%20w

/someService/NZ/asdNmasdf423-asd342e/getDealerFromSomethingSomething/FS443GH/front%20parking%20sen

I cannot make any assumptions about the service name or the function name.

I'm trying to find a regex that matches only the following in the first log:

67814
alloy%20nudge%20w

and in the second:

asdNmasdf423-asd342e
FS443GH
front%20parking%20sen

As a heuristic, I tried [a-zA-Z0-9_%-]{15,}|[A-Z0-9]{5,} to match only long strings, but the function names (getPersonFromAllAccessoriesByDescription, getDealerFromSomethingSomething) were also caught.
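
The over-matching can be reproduced with a minimal Python sketch. The names `HEURISTIC`, `paths`, and `matches` are illustrative only; the two paths are the log examples above.

```python
import re

# The heuristic from the question: long runs of URL-safe characters,
# or shorter runs made only of uppercase letters and digits.
HEURISTIC = re.compile(r"[a-zA-Z0-9_%-]{15,}|[A-Z0-9]{5,}")

paths = [
    "/someService/US/getPersonFromAllAccessoriesByDescription/67814/alloy%20nudge%20w",
    "/someService/NZ/asdNmasdf423-asd342e/getDealerFromSomethingSomething/FS443GH/front%20parking%20sen",
]

# One list of matches per path; the long function names show up
# because they satisfy the {15,} branch.
matches = [HEURISTIC.findall(p) for p in paths]
for m in matches:
    print(m)
```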

I was thinking of a regex that does the same as [a-zA-Z0-9_%-]{15,} but with the condition that the match must contain at least one digit, so that the function names are skipped.

Thank you


Solution

  • Your heuristic is fine; use

    \b(?=[a-zA-Z_%-]*[0-9])[a-zA-Z0-9_%-]{5,}
    

    See proof.

    Explanation

    --------------------------------------------------------------------------------
      \b                       the boundary between a word char (\w) and
                               something that is not a word char
    --------------------------------------------------------------------------------
      (?=                      look ahead to see if there is:
    --------------------------------------------------------------------------------
        [a-zA-Z_%-]*             any character of: 'a' to 'z', 'A' to
                                 'Z', '_', '%', '-' (0 or more times
                                 (matching the most amount possible))
    --------------------------------------------------------------------------------
        [0-9]                    any character of: '0' to '9'
    --------------------------------------------------------------------------------
      )                        end of look-ahead
    --------------------------------------------------------------------------------
      [a-zA-Z0-9_%-]{5,}       any character of: 'a' to 'z', 'A' to 'Z',
                               '0' to '9', '_', '%', '-' (at least 5
                               times (matching the most amount possible))
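
    Usage

As a rough sketch of how the pattern could be applied in Python, assuming each log line is already available as a plain path string: the helper name `extract_params` and the constant `PARAM_RE` are hypothetical, introduced here only for illustration.

```python
import re

# Lookahead requires a digit somewhere after an optional run of
# non-digit allowed characters; the main class then consumes the
# whole segment (5 or more characters).
PARAM_RE = re.compile(r"\b(?=[a-zA-Z_%-]*[0-9])[a-zA-Z0-9_%-]{5,}")


def extract_params(path):
    """Return the path segments that look like parameters (hypothetical helper)."""
    return PARAM_RE.findall(path)


first = "/someService/US/getPersonFromAllAccessoriesByDescription/67814/alloy%20nudge%20w"
second = "/someService/NZ/asdNmasdf423-asd342e/getDealerFromSomethingSomething/FS443GH/front%20parking%20sen"

print(extract_params(first))
# ['67814', 'alloy%20nudge%20w']
print(extract_params(second))
# ['asdNmasdf423-asd342e', 'FS443GH', 'front%20parking%20sen']
```

Since the character class excludes `/`, neither the lookahead nor the match can cross a segment boundary, so the digit has to be in the same segment. Being a heuristic, it will still skip parameters shorter than five characters or without any digit, and it would catch a function name that happens to contain a digit.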