
Regex pattern to strip all the numeric characters from the URL (except for the version number)


I am writing a Java program and need to "normalize" URIs, meaning that two URIs should be treated as the same regardless of the query parameter values for timestamp, portalId, timeout, app version, etc.

Here's my regex pattern: (?<=/)[0-9]+

It works for the following URI: https://app.url.com/user/1234567

However, it doesn't work for the URI below. Is it possible to have one regex pattern that handles both scenarios?

https://api.url.com/logging/v1/log/analytics-multi/no-auth?clientSendTimestamp=1622719272795&id=863256543&clienttimeout=14000&hs_static_app=automation-ui&hs_static_app_version=1.3520


Solution

  • The digits in the examples appear directly after a / or an =, including the = in version=.

    You could match 1 or more digits while asserting that either / or = is directly to the left, but that version= is not.

    (?<=[/=])(?<!version=)\d+
    

    The pattern matches:

    • (?<=[/=]) Positive lookbehind, assert either / or = directly to the left
    • (?<!version=) Negative lookbehind, assert not version= directly to the left
    • \d+ Match 1+ digits
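
    Since the question mentions Java, here is a minimal sketch of how the pattern might be applied there. The class name UriNormalizer and the method normalize are made up for illustration, and replacing each match with an empty string is an assumption about what "normalize" should mean; note that the backslash in \d has to be doubled inside a Java string literal.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class UriNormalizer {
    // Digits directly after '/' or '=', unless that '=' is the end of "version=".
    private static final Pattern VARIABLE_DIGITS =
            Pattern.compile("(?<=[/=])(?<!version=)\\d+");

    // Assumption: "normalizing" means removing the matched digits entirely.
    static String normalize(String uri) {
        return VARIABLE_DIGITS.matcher(uri).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("https://app.url.com/user/1234567"));
        System.out.println(normalize(
                "https://api.url.com/logging/v1/log/analytics-multi/no-auth"
                + "?clientSendTimestamp=1622719272795&id=863256543"
                + "&clienttimeout=14000&hs_static_app=automation-ui"
                + "&hs_static_app_version=1.3520"));
    }
}
```

    With the two URIs from the question, the user id and the timestamp, id and clienttimeout values are removed, while v1 in the path and the 1.3520 after version= are left as they are, because the digits there are not directly preceded by / or a permitted =.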
