I'm trying to extract certain data from LookML, a specific markup language. If this is example code:
explore: explore_name {}
explore: explore_name1 {
label: "name"
join: view_name {
relationship: many_to_one
type: inner
sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
}
}
explore: explore_name3 {}
Then I would receive a list looking like:
explore: character_balance {}
label: "name"
join: activity_type {
relationship: many_to_one
type: inner
sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
}
explore: explore_name4 {}
Essentially, I start a match at "explore" and end it when I find another "explore" - which would then begin the next match.
Here's what I had before, which matches across all the lines until it finds a ';', and this works perfectly fine: 'explore:\s[^;]*'. But this stops at a ';', assuming there is one.
How would I change this so that it takes out everything between 'explore' and 'explore'? Simply replacing the ';' in my regex with 'explore' instead stops whenever it finds a letter that matches anything in [e,x,p,l,o,r,e] - which is not the behavior I want. Removing the square brackets and the ^ ends up breaking everything so that it can't query across multiple lines.
What should I do here?
A naive approach consists of matching up to the next "explore" word. But if, for any reason, a string value contains this word, you will get wrong results. You hit the same problem if you try to stop at a closing curly bracket, because the block can contain nested brackets or a string that contains brackets.
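For illustration, here is a minimal sketch of that naive approach using only the standard re module. The variable names (sample, naive, tricky) and the tricky input are made up for this example, and the lookahead-based pattern is just one possible way to write it:

```python
import re

# Sample taken from the question.
sample = '''explore: explore_name {}
explore: explore_name1 {
label: "name"
join: view_name {
relationship: many_to_one
type: inner
sql_on: ${activity_type.activity_name}=${activity_type.activity_name} ;;
}
}
explore: explore_name3 {}
'''

# Lazily match from each "explore:" up to (but not including) the next
# "explore:" or the end of the text; surrounding whitespace is left out.
naive = re.compile(r'\bexplore:.*?(?=\s*(?:explore:|\Z))', re.DOTALL)

blocks = naive.findall(sample)
for block in blocks:
    print(block)
    print('-----')

# Hypothetical input where a string value contains "explore:".
# The single block is wrongly cut into two pieces.
tricky = 'explore: d {\nlabel: "see explore: e"\n}\n'
broken = naive.findall(tricky)
print(broken)
```

On the sample from the question this returns one item per explore, but the second input shows how a quoted "explore:" splits a block in the wrong place.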
That's why I suggest a more precise description of the syntax of your string, one that takes into account strings and nested curly brackets. Since the re module doesn't have the recursion feature (needed to deal with nested structures), I will use the pypi/regex module instead:
import regex

pat = r'''(?xms)
\b explore:
[^\S\r\n]*   # optional horizontal whitespace
[^\n{]*      # possible content on the same line
# followed by one of two possibilities
(?:  # the content stops at the end of the line with a ;
    ; [^\S\r\n]* $
  |  # or it contains curly brackets and possibly spans multiple lines
    (  # group 1
        {
            [^{}"]*+  # everything that is neither a curly bracket nor a double quote
            (?:
                " [^\\"]*+ (?: \\. [^\\"]* )*+ "  # content between quotes
                [^{}"]*
              |
                (?1)  # nested curly brackets, recursion into group 1
                [^{}"]*
            )*+
        }
    )
)'''

# yourstring holds the LookML source text
results = [x.group(0) for x in regex.finditer(pat, yourstring)]
To be more rigorous, you can add support for single-quoted strings, and also make sure that the "explore:" at the start of the pattern isn't inside a string by using a (*SKIP)(*FAIL) construct.
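As a rough sketch of that last idea, assuming the regex module's (*SKIP)(*FAIL) verbs behave like their PCRE counterparts: quoted strings are consumed first and thrown away, so only an "explore:" outside quotes can start a match. The names skip_pat, text and starts, as well as the sample text, are made up for this illustration:

```python
import regex

skip_pat = regex.compile(r'''(?x)
    " [^\\"]* (?: \\. [^\\"]* )* " (*SKIP)(*FAIL)   # consume and discard a double-quoted string
  | ' [^\\']* (?: \\. [^\\']* )* ' (*SKIP)(*FAIL)   # same for a single-quoted string
  | \b explore:                                     # an "explore:" outside any string
''')

# Hypothetical input: the first "explore:" sits inside a string value.
text = 'label: "explore: fake"\nexplore: real {}'

starts = [m.start() for m in skip_pat.finditer(text)]
print(starts)
```

Only the offset of the second "explore:" should be reported. In the full pattern, the same two string alternatives would be placed in front of the \b explore: branch.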