php, preg-match-all, substr, strpos

Regular expression alternative for scraping certain syntax inside html


I have functions embedded in HTML code. These functions follow these syntax rules:

  1. A '#' symbol serves as the opening tag.
  2. The function name follows the opening '#' tag. It can contain digits (1, 2, 3), letters (a, b, c), and underscores (_).
  3. After the function name comes a pair of parentheses containing the parameter. The parameter can contain anything, including alphanumeric characters, operators (<, >, =, !), and these characters: @, #, $, %, ^, &, (, ), ?, *, /, [, ]
  4. After the parameter comes HTML code enclosed in curly brackets.
  5. Finally, the function is closed with a '#' tag.

This is not my real function, but it illustrates the rules above:

<html>
#v123w(r(!@3o=?w){
<div></div>
}#
#131ie_w(13gf$>&*()(*&){
<div></div>
}#
</html>

Until now, I have been using this regex to capture all of the function names, parameters, and the HTML strings inside the functions:

#(\w+)\(*([\w\d\s\=\>\<\[\]\"\'\)\(\&\|\*\+\-\%\@\^\?\/\$\.\!]*)\)\)*{((?:(?R)|.)*?)}#

The result and the match details are available in the regex tester: https://regex101.com/r/HdCeeV/1

I recently found that the preg_match_all function in PHP does not work on long strings, so I cannot use this regex when the HTML code inside a function is too long. I need to capture the function name, the function parameter, and the HTML string inside the function. Is there an alternative to this regex, perhaps using PHP string functions such as substr and strpos?


Solution

  • Here is an improved version of your regex that is a little more efficient:

    #(\w+)\(([\w\s=><[\]"')(&|*+%@^?\/$.!-]*)\){(.+?)}#
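
To see how this pattern might be applied in PHP, here is a minimal sketch rather than a definitive implementation. The function name extractWithRegex, the constant FUNCTION_PATTERN, and the returned array keys are hypothetical. The "~" delimiter and the "s" modifier (so that "." also matches newlines in the HTML body) are assumptions that are not stated above. When preg_match_all fails on long input, one likely cause is a PCRE limit such as pcre.backtrack_limit, so the sketch reports the error code instead of silently returning nothing.

```php
<?php
// "~" is the delimiter so "#" stays literal; "s" lets "." span newlines (assumption).
const FUNCTION_PATTERN = '~#(\w+)\(([\w\s=><[\]"\')(&|*+%@^?\/$.!-]*)\){(.+?)}#~s';

/**
 * Returns one entry per matched block: name, params, and body.
 * Throws if PCRE itself fails (for example when a backtrack limit is hit).
 */
function extractWithRegex(string $html): array
{
    $count = preg_match_all(FUNCTION_PATTERN, $html, $matches, PREG_SET_ORDER);
    if ($count === false) {
        // preg_last_error() returns a PREG_*_ERROR constant describing the failure.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }

    $result = [];
    foreach ($matches as $m) {
        $result[] = ['name' => $m[1], 'params' => $m[2], 'body' => $m[3]];
    }
    return $result;
}
```

Comparing the thrown error code against PREG_BACKTRACK_LIMIT_ERROR or PREG_JIT_STACKLIMIT_ERROR would show which limit, if any, is responsible for the failure on long strings.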
    

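
Since the question asks about substr and strpos, a regex-free scanner is one possible alternative for very long input. The following is a minimal sketch and not part of the answer above; the function name extractFunctions and the returned array keys are hypothetical. It assumes that blocks are not nested, that the parameter never contains the sequence "){", and that the HTML body never contains "}#". Unlike the regex, it does not validate which characters appear in the parameter.

```php
<?php
/**
 * Scans for blocks shaped like  #name(params){ body }#  using only string functions.
 * Assumptions: no nesting, no "){" inside params, no "}#" inside the body.
 */
function extractFunctions(string $html): array
{
    $wordChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_';
    $result = [];
    $offset = 0;
    $length = strlen($html);

    while ($offset < $length && ($start = strpos($html, '#', $offset)) !== false) {
        // Length of the run of name characters right after "#".
        $nameLength = strspn($html, $wordChars, $start + 1);
        $open = $start + 1 + $nameLength;

        // A block needs a non-empty name followed directly by "(".
        if ($nameLength === 0 || $open >= $length || $html[$open] !== '(') {
            $offset = $start + 1;
            continue;
        }

        // The parameter ends at the first "){" after the opening parenthesis.
        $paramsEnd = strpos($html, '){', $open + 1);
        if ($paramsEnd === false) {
            break; // no later block can be opened either
        }

        // The body ends at the first "}#" after the opening curly bracket.
        $bodyStart = $paramsEnd + 2;
        $bodyEnd = strpos($html, '}#', $bodyStart);
        if ($bodyEnd === false) {
            break; // no later block can be closed either
        }

        $result[] = [
            'name'   => substr($html, $start + 1, $nameLength),
            'params' => substr($html, $open + 1, $paramsEnd - $open - 1),
            'body'   => substr($html, $bodyStart, $bodyEnd - $bodyStart),
        ];
        $offset = $bodyEnd + 2;
    }

    return $result;
}
```

Because strpos and substr work in a single forward pass without backtracking, this approach should not be affected by PCRE limits, at the cost of the looser validation noted above.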