php regex filter pcre

Regex to convert video from custom input into HTML


I'm writing a filter that converts simple input text, such as Markdown or plain text, into HTML. The goal is to let end users embed videos in their content, so the input may contain simple Markdown plus tags that look like this:

[video url:"https://www.youtube.com/watch?v=EkluES9Rvak" width=100% ratio='16/9'
autoplay:1 caption:"Lea Verou - Regexplained"]

I want the syntax to be lenient and allow either : or = between the attribute name and the value. As in HTML, values can optionally be wrapped in single or double quotes to handle spaces or special characters. This is where I start to struggle!

For the moment, I wrote this regex in PHP:

/(?(DEFINE)
# This sub-routine will match an attribute value with or without the quotes around it.
# If the value isn't quoted then we can't accept spaces, quotes or the closing ] tag.
(?<attr_value_with_delim>(?:(?<delimiter>["']).*?(?:\k<delimiter>)|[^"'=\]\s]+))
)

\[
\s*video\s+
(?=[^\]]*\burl[=:](?<url>\g<attr_value_with_delim>))      # Mandatory URL
(?=[^\]]*\bwidth[=:](?<width>\g<attr_value_with_delim>))? # Optional width
(?=[^\]]*\bratio[=:](?<ratio>\g<attr_value_with_delim>))? # Optional ratio
(?=[^\]]*\bautoplay[=:](?<autoplay>\g<attr_value_with_delim>))? # Optional autoplay
(?=[^\]]*\bcaption[=:](?<title>\g<attr_value_with_delim>))? # Optional caption
[^\]]*
\]/uxs

You can test it here: https://regex101.com/r/hVsav8/1

The optional attribute values are captured so that I don't need to reparse the matched tags a second time.

My questions:

  • How can I handle the problem of a ] inside the value of an attribute?

  • Would it be possible to capture the value without the quotes?

    It's not critical, since I can strip the quotes later with trim(..., '"\'') in the callback, but I'd be interested to see whether there's a pattern-only solution.
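For reference, the trim() fallback mentioned in the question could look like the sketch below. The helper name `stripOuterQuotes` is hypothetical. It removes exactly one pair of matching outer quotes, which is slightly stricter than `trim($value, '"\'')`: trim() strips every leading and trailing quote character, so it can also eat a quote that belongs to the value.

```php
<?php
// Hypothetical helper: remove one pair of matching outer quotes, if present.
function stripOuterQuotes(string $value): string
{
    // Group 1 is the opening quote, \1 requires the same quote at the end.
    if (preg_match('~^(["\'])(.*)\1$~s', $value, $m)) {
        return $m[2];
    }
    return $value; // Unquoted values are returned unchanged.
}
```

Either approach runs after the match, so it does not affect how the tag itself is recognized.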


Solution

  • Subroutines:

    (?(DEFINE)
    
    # Match quote-delimited values
    
      (?<attr_value_with_delim>
        '(?:\\.|[^'])*'
      |
        "(?:\\.|[^"])*"
      )
    
    # Match non-quote-delimited values
    
      (?<attr_value_without_delim>[^'"\s[\]]+)
    
    # Match both types
    
      (?<attr_value>
        \g<attr_value_with_delim>
      |
        \g<attr_value_without_delim>
      )
    
    # Match attr - value pairs in the following forms:
    ## attr:value
    ## attr=value
    ## attr:"value'[]=:"
    ## attr='value"[]=:'
    ## attr:"val\"ue"
    ## attr:'val\'ue'
    
      (?<attr_with_value>
        \s+[a-zA-Z]+[:=]\g<attr_value>
      )
    )
    

    Actual matching pattern:

    \[\s*                             # Opening bracket followed by optional whitespace
    video                             # Literal 'video'
    \g<attr_with_value>*              # 0+ attribute-value pairs
    (?:                               #
      \s+                             # Leading whitespace (required)
      url[:=]                         # Literal 'url' followed by either ':' or '='
      (?:                             #
        '\s*(?:\\.|[^'\s])+\s*'       # Single-quoted URL without inner whitespace,
      |                               # optionally padded with whitespace, or
        "\s*(?:\\.|[^"\s])+\s*"       # the same, double-quoted,
      |                               #
        \g<attr_value_without_delim>  # or a non-quote-delimited value
      )                               #
    )                                 #
    \g<attr_with_value>*              # 0+ attribute-value pairs
    \s*\]                             # Optional whitespace followed by closing bracket
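To use the two parts together in PHP, they can be combined into a single pattern with the `x` (extended) modifier. The following is a minimal sketch, not the answerer's own code: the constant `VIDEO_TAG_PATTERN` and the function `findVideoTags` are hypothetical names, and `~` is chosen as the delimiter only because it does not occur in the pattern. A nowdoc keeps the backslashes exactly as written above.

```php
<?php
// The (?(DEFINE)...) block and the matching pattern from above, joined.
const VIDEO_TAG_PATTERN = <<<'REGEX'
~
(?(DEFINE)
  (?<attr_value_with_delim>
    '(?:\\.|[^'])*'
  |
    "(?:\\.|[^"])*"
  )
  (?<attr_value_without_delim>[^'"\s[\]]+)
  (?<attr_value>
    \g<attr_value_with_delim>
  |
    \g<attr_value_without_delim>
  )
  (?<attr_with_value>
    \s+[a-zA-Z]+[:=]\g<attr_value>
  )
)
\[\s*video
\g<attr_with_value>*
(?:
  \s+url[:=]
  (?:
    '\s*(?:\\.|[^'\s])+\s*'
  |
    "\s*(?:\\.|[^"\s])+\s*"
  |
    \g<attr_value_without_delim>
  )
)
\g<attr_with_value>*
\s*\]
~x
REGEX;

// Hypothetical helper: return every complete [video ...] tag found in $text.
function findVideoTags(string $text): array
{
    preg_match_all(VIDEO_TAG_PATTERN, $text, $matches);
    return $matches[0]; // Index 0 holds the full matches.
}
```

Because quoted values are consumed as a unit, a `]` inside quotes no longer ends the tag, and a tag without a `url` attribute is not matched at all.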
    

    This regex matches a video tag, which can then be parsed further using legitimate, non-devilish means. It works, but parsing HTML-like content with regex is strongly discouraged.
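    One possible shape for that second pass is sketched below, assuming the tag has already been validated by the pattern above. The names `ATTRIBUTE_PAIR_PATTERN` and `parseVideoAttributes` are hypothetical. A branch reset group `(?|...)` places the value in the same capture group for all three alternatives, with the quotes left outside the group, which also covers the question about capturing values without their quotes.

```php
<?php
const ATTRIBUTE_PAIR_PATTERN = <<<'REGEX'
~
\s+(?<name>[a-zA-Z]+)[:=]     # attribute name, then ':' or '='
(?|                           # branch reset: each alternative fills group 2
    '((?:\\.|[^'])*)'         # single-quoted value, quotes outside the group
  |
    "((?:\\.|[^"])*)"         # double-quoted value
  |
    ([^'"\s[\]]+)             # unquoted value
)
~x
REGEX;

// Hypothetical helper: turn one matched tag into a name => value array.
function parseVideoAttributes(string $tag): array
{
    preg_match_all(ATTRIBUTE_PAIR_PATTERN, $tag, $pairs, PREG_SET_ORDER);
    $attributes = [];
    foreach ($pairs as $pair) {
        // Group 2 is the value without its surrounding quotes.
        $attributes[strtolower($pair['name'])] = $pair[2];
    }
    return $attributes;
}
```

    Escaped quotes inside a value (such as `\"`) are returned as written; unescaping them is left out of this sketch.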

    Try it on regex101.com.