I'm currently writing a filter that converts simple input text, such as Markdown or plain text, into HTML. The idea is to let end users add videos to the content. The input could contain simple Markdown plus some tags that look like this:
[video url:"https://www.youtube.com/watch?v=EkluES9Rvak" width=100% ratio='16/9'
autoplay:1 caption:"Lea Verou - Regexplained"]
I want the syntax to be fairly lenient and allow either : or = between the attribute name and the value. As in HTML, values can optionally be single- or double-quoted to handle spaces or special characters. And this is where I start to struggle!
For the moment, I wrote this regex in PHP:
/(?(DEFINE)
# This sub-routine will match an attribute value with or without the quotes around it.
# If the value isn't quoted then we can't accept spaces, quotes or the closing ] tag.
(?<attr_value_with_delim>(?:(?<delimiter>["']).*?(?:\k<delimiter>)|[^"'=\]\s]+))
)
\[
\s*video\s+
(?=[^\]]*\burl[=:](?<url>\g<attr_value_with_delim>)) # Mandatory URL
(?=[^\]]*\bwidth[=:](?<width>\g<attr_value_with_delim>))? # Optional width
(?=[^\]]*\bratio[=:](?<ratio>\g<attr_value_with_delim>))? # Optional ratio
(?=[^\]]*\bautoplay[=:](?<autoplay>\g<attr_value_with_delim>))? # Optional autoplay
(?=[^\]]*\bcaption[=:](?<title>\g<attr_value_with_delim>))? # Optional caption
[^\]]*
\]/guxs
You can test it here: https://regex101.com/r/hVsav8/1
The optional attribute values are captured so that I don't need to reparse the matched tags a second time.
My questions:
1. How can I handle the problem of a ] inside the value of an attribute?
2. Would it be possible to capture the value without the quotes? It's not very important as I can get rid of it later with trim(..., '"\'') in the callback but I would be interested to see if there's a pattern solution.
Subroutines:
(?(DEFINE)
# Match quote-delimited values
(?<attr_value_with_delim>
'(?:\\.|[^'])*'
|
"(?:\\.|[^"])*"
)
# Match non-quote-delimited values
(?<attr_value_without_delim>[^'"\s[\]]+)
# Match both types
(?<attr_value>
\g<attr_value_with_delim>
|
\g<attr_value_without_delim>
)
# Match attr - value pairs in the following forms:
## attr:value
## attr=value
## attr:"value'[]=:"
## attr='value"[]=:'
## attr:"val\"ue"
## attr:'val\'ue'
(?<attr_with_value>
\s+[a-zA-Z]+[:=]\g<attr_value>
)
)
Actual matching pattern:
\[\s* # Opening bracket followed by optional whitespaces
video # Literal 'video'
\g<attr_with_value>* # 0+ attribute - value pairs
(?: #
\s+ # Preceding whitespaces
url[:=] # Literal 'url' followed by either ':' or '='
(?: #
'\s*(?:\\.|[^'\s])+\s*' # Single or double quote-delimited,
| # space-surrounded (optional)
"\s*(?:\\.|[^"\s])+\s*" # URL that doesn't contain whitespaces
| #
\g<attr_value_without_delim> # or a non-quote-delimited value
) #
) #
\g<attr_with_value>* # 0+ attribute - value pairs
\s*\] # Optional whitespaces followed by closing bracket
This regex matches a video notation, which can then be parsed further using legitimate and non-devilish means. It proves that this can be done, but parsing HTML-like content with regex is strongly discouraged.
Try it on regex101.com.
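To show how the two pieces above might fit together in PHP, here is a minimal sketch, not a definitive implementation. It assumes a two-pass approach: the first pass uses the pattern above to locate a well-formed tag (so a ] inside a quoted value is handled), and the second pass walks the attributes with a branch-reset group, (?|...), so the value always lands in group 2 without its quotes. The function name parseVideoTag and the choice to return an array keyed by lower-cased attribute name are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original post): returns the attributes of
// the first [video ...] tag found in $input, or null if no valid tag exists.
function parseVideoTag(string $input): ?array
{
    // Pass 1: the subroutines and matching pattern from above, compacted.
    // Nowdoc syntax keeps backslashes and quotes literal.
    $tagPattern = <<<'REGEX'
/(?(DEFINE)
  (?<attr_value_with_delim> '(?:\\.|[^'])*' | "(?:\\.|[^"])*" )
  (?<attr_value_without_delim> [^'"\s[\]]+ )
  (?<attr_value> \g<attr_value_with_delim> | \g<attr_value_without_delim> )
  (?<attr_with_value> \s+[a-zA-Z]+[:=]\g<attr_value> )
)
\[\s*video
\g<attr_with_value>*
(?: \s+url[:=]
  (?: '\s*(?:\\.|[^'\s])+\s*' | "\s*(?:\\.|[^"\s])+\s*" | \g<attr_value_without_delim> )
)
\g<attr_with_value>*
\s*\]/x
REGEX;

    if (!preg_match($tagPattern, $input, $tag)) {
        return null; // No tag, or the mandatory url attribute is missing.
    }

    // Pass 2: one name/value pair per match. The branch-reset group (?|...)
    // gives every alternative the same group number, so group 2 holds the
    // value with the surrounding quotes already removed.
    $attrPattern = <<<'REGEX'
/\s([a-zA-Z]+)[:=]
(?| "((?:\\.|[^"])*)" | '((?:\\.|[^'])*)' | ([^'"\s[\]]+) )/x
REGEX;

    preg_match_all($attrPattern, $tag[0], $pairs, PREG_SET_ORDER);

    $attrs = [];
    foreach ($pairs as $pair) {
        // Backslash escapes inside quoted values are kept as written.
        $attrs[strtolower($pair[1])] = $pair[2];
    }
    return $attrs;
}

// Usage with the tag from the question.
$input = <<<'TEXT'
Intro [video url:"https://www.youtube.com/watch?v=EkluES9Rvak" width=100% ratio='16/9'
autoplay:1 caption:"Lea Verou - Regexplained"] outro
TEXT;

$attrs = parseVideoTag($input);
print_r($attrs);
```

Because the second pass only runs on text that the first pass has already validated, it can stay simple. Running it directly on untrusted input could pick up name=value look-alikes inside quoted values.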