regex pyspark

Regular expression to extract hyphenated text, including other special characters, between multiple hyphens


I have cases where I need to extract everything between two runs of multiple hyphens. For example:

------- General Data ------------------------------------------------------------
O/p: General Data

------- Protocol and Sequence Data ----------------------------------------------

O/p: Protocol and Sequence Data

------- Start Check Data - (before measurement - may be cached data) ------------

O/p: Start Check Data - (before measurement - may be cached data)

I tried the following regex to achieve this:

(?<=-)[^-]+(?=)

This regex handles the first two cases correctly but not the third. It does match the third case, but only partially. For example, for

------- Start Check Data - (before measurement - may be cached data) ------------
it gives the o/p "Start Check Data" instead of the whole text, because the match stops at the first hyphen inside the text.

Solution

  • Assuming a space always appears before and after each series of hyphens, and keeping in mind that many regex engines require a lookbehind to have a fixed length, here is a possible regex:

    (?<=-- ).+(?= -{2,})
    

    Lookbehind for 2 hyphens and a space:

    (?<=-- )
    

    Match everything greedily:

    .+
    

    (Switch to a lazy match, if needed, by adding a '?'):

    .+?
    

    Lookahead for a space followed by at least 2 hyphens:

    (?= -{2,})
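
    As a quick check, here is a minimal sketch that runs the pattern above against the three sample lines from the question, using Python's built-in re module. The names HEADER, HEADER_LAZY and extract_title are hypothetical helpers introduced only for this illustration, and the two-header line used for the greedy/lazy comparison is made up rather than taken from the question.

```python
import re

# Pattern from the answer: text preceded by "-- " and followed by
# a space plus at least two hyphens.
HEADER = re.compile(r"(?<=-- ).+(?= -{2,})")
# Lazy variant: stops at the first " --" instead of the last one.
HEADER_LAZY = re.compile(r"(?<=-- ).+?(?= -{2,})")


def extract_title(line, pattern=HEADER):
    """Return the text between the hyphen runs, or None if nothing matches."""
    match = pattern.search(line)
    return match.group(0) if match else None


samples = [
    "------- General Data ------------------------------------------------------------",
    "------- Protocol and Sequence Data ----------------------------------------------",
    "------- Start Check Data - (before measurement - may be cached data) ------------",
]

for sample in samples:
    print(extract_title(sample))

# Greedy vs. lazy only differ when a line holds more than one header
# (illustrative input, not from the question).
two_headers = "-- A -- B --"
print(extract_title(two_headers))               # greedy: "A -- B"
print(extract_title(two_headers, HEADER_LAZY))  # lazy:   "A"
```

    The single hyphens inside the third sample are not followed by a second hyphen, so they do not satisfy the lookahead and the whole text is returned. In PySpark, the same pattern can presumably be passed to regexp_extract with group index 0, since Java's regex engine also supports this lookbehind, but that is not tested here.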