regex pcre splunk

Regex breaks with "/" character instead of newline


I tried making a RegEx for a Splunk search that should extract the TLD from URLs. The source is Panorama logs.

RegEx: ^(?:https?:\/\/)?(?<host>[^\/]+)?(?<tld>\.[^.?\/\n]+).*$

Test data:

https://example.org/
qq.com
https://border.example.com/?bridge=basket&blood=animal
360.cn
http://example.com/?brother=bike
smugmug.com
shop-pro.jp

The RegEx and test data are on Regex101.com; I generated the test data using randomlists.com to anonymize the source data. Only the capture group <tld> is needed; <host> is there just for readability.

Describe what you tried,

Matching the TLDs from a set of URLs; some with a preceding protocol, some without. Input records should be separated by newlines, and matches should never be longer than one record.

what you expected to happen,

All TLDs are matched and in the capture-group <tld>.

and what actually resulted.

Lines ending with a / work, but lines without one don't.
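
A minimal Python sketch that reproduces the symptom on the test data above. It assumes Python's `re` module behaves like PCRE for this pattern; the group syntax is changed to `(?P<name>...)`, and the helper name `find_multiline_matches` is made up for illustration. The host class `[^/]+` excludes only the slash, so it is free to run across a newline:

```python
import re

# Test data from the question, one record per line.
SAMPLE = "\n".join([
    "https://example.org/",
    "qq.com",
    "https://border.example.com/?bridge=basket&blood=animal",
    "360.cn",
    "http://example.com/?brother=bike",
    "smugmug.com",
    "shop-pro.jp",
])

# The pattern from the question, with Python's (?P<name>...) group syntax.
ORIGINAL = re.compile(
    r"^(?:https?://)?(?P<host>[^/]+)?(?P<tld>\.[^.?/\n]+).*$",
    re.MULTILINE,
)

def find_multiline_matches(pattern, text):
    """Return the full text of every match that spans more than one record."""
    return [m.group(0) for m in pattern.finditer(text) if "\n" in m.group(0)]

# [^/]+ does not exclude "\n", so the host group swallows a line break.
for bad in find_multiline_matches(ORIGINAL, SAMPLE):
    print(repr(bad))  # 'smugmug.com\nshop-pro.jp'
```

With this sample, the last two records are merged into a single match, so `smugmug.com` never gets its own `<tld>`.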


Solution

  • To get the TLD you can use:

    /^[^.\n]*[^\/\n]*\.\K[^\/\n]+/gm
    


    (If you only search a single line, the \n's aren't needed; you can remove them all from the character classes.)

    Explanation:

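    Because it doesn't exclude the slash, the first character class, with its greedy quantifier, "jumps" over the slashes of the scheme part (if there is one) and stops at the first dot.
    The second character class, which does exclude the slash, runs up to the first slash of the path part (or to the end of the line). Backtracking then comes into play: it gives characters back until the \. that follows can match the last dot of the domain.
    \K then discards everything matched so far, so only the TLD (without the dot) remains in the match result.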
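
The pattern above can be tried out with a short Python sketch. This is an illustration under assumptions, not the answer's own code: Python's built-in `re` module does not support `\K`, so the TLD is captured in a named group instead, and the names `TLD_PATTERN` and `extract_tlds` are made up for the example.

```python
import re

# Python's built-in re module has no \K, so the TLD is captured in a named
# group instead; the rest of the pattern is the one from the answer.
TLD_PATTERN = re.compile(r"^[^.\n]*[^/\n]*\.(?P<tld>[^/\n]+)", re.MULTILINE)

def extract_tlds(text):
    """Return one TLD (without the leading dot) per matching line."""
    return [m.group("tld") for m in TLD_PATTERN.finditer(text)]

# Test data from the question, one record per line.
sample = "\n".join([
    "https://example.org/",
    "qq.com",
    "https://border.example.com/?bridge=basket&blood=animal",
    "360.cn",
    "http://example.com/?brother=bike",
    "smugmug.com",
    "shop-pro.jp",
])

print(extract_tlds(sample))
# ['org', 'com', 'com', 'cn', 'com', 'com', 'jp']
```

Because every character class excludes `\n`, no match can grow past its own record. A named group is also closer to what the question asks for (a `<tld>` capture group); in Splunk the group would presumably be written as `(?<tld>...)`, which is not tested here.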