Tags: regex, security, wysihtml5

Regular expression to find all DOM event listeners


I am currently trying to make the output of a wysihtml5 editor secure.
Basically, users must not be able to enter script, form, and similar tags.

I cannot remove all tags, as some of them are needed to display the content as intended
(e.g. <h1> to display a title).

The problem is that users can still add DOM event listeners bound to unwanted code
(e.g. <h1 onclick="alert('Houston, got a problem');"></h1>).

I would like to remove all event listeners inside a div (for all descendants of that div).
The approach I have tried so far is to treat the markup as a string and find and replace unwanted content, which worked for the unwanted tags.

What I need now is a regex matching all event listeners inside all tags,
something like "select all [on*] between < and >".
Examples:
<h1 onclick=""></h1> => Should match
<h1 onnewevent=""></h1> => Should match
<h1>onclick=""</h1> => Should NOT match

Thanks in advance for your help ;)


Solution

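  • You shouldn't parse HTML with a regex.
    If you really want to, though, here is a quick and dirty way
    (by no means complete).

    It only looks for an opening tag that carries an 'on...=' attribute, with
    its closing tag right after it. If something may sit in between, add a
    .*? between the two tags (with the /s modifier if that content can span
    lines). It also matches any attribute name that merely contains "on",
    such as data-once.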

     #  <([^<>\s]+)\s[^<>]*on[^<>="]+=[^<>]*></\1\s*>
     # /<([^<>\s]+)\s[^<>]*on[^<>="]+=[^<>]*><\/\1\s*>/
    
     < 
     ( [^<>\s]+ )                    # (1), 'Tag'
     \s 
     [^<>]* on [^<>="]+ = [^<>]*     # On... = event
     >
     </ \1 \s* >                     # Backref to 'Tag'
    

    Perl test case

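    use strict;
    use warnings;

    # Slurp the whole __DATA__ section into one string.
    my $str = do { local $/; <DATA> };

    # Print every element whose opening tag carries an on...= attribute.
    while ( $str =~ /<([^<>\s]+)\s[^<>]*on[^<>="]+=[^<>]*><\/\1\s*>/g )
    {
        print "'$&'\n";
    }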
    
    
    __DATA__
    (eg : <h1 onclick="alert('Houston, got a problem');"></h1>) 
    
    I would like to remove all event listeners inside a div
    (for all descendants inside that div).
    The solution I actually tried to use is to check the code as
    a string to find and replace unwanted content,
    which worked for the unwanted tags. 
    
    What I actually need is a regex matching all event
    listeners inside all tags.
    Something like "select all [on*] between < and >".
    Examples :
    <h1 onclick=""></h1> => Should match
    <h1 onnewevent=""></h1> => Should match
    <h1>onclick=""</h1> => Should NOT match 
    

    Output >>

    '<h1 onclick="alert('Houston, got a problem');"></h1>'
    '<h1 onclick=""></h1>'
    '<h1 onnewevent=""></h1>'
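
Following up on the advice above not to parse HTML with a regex, here is a minimal sketch of the parser-based alternative: walk the markup with a real tokenizer and re-emit it without any attribute whose name starts with "on". It is written in Python with the standard-library html.parser module purely for illustration; the names EventAttributeStripper and strip_event_attributes are hypothetical and do not come from the original answer or from wysihtml5. In the browser the same idea would presumably be applied to the DOM rather than to a string. This only drops on* attributes and is not a complete sanitizer (for example, it does nothing about javascript: URLs).

```python
from html import escape
from html.parser import HTMLParser


class EventAttributeStripper(HTMLParser):
    """Re-emit HTML, dropping every attribute whose name starts with 'on'.

    Hypothetical helper for illustration only. Doctypes and processing
    instructions are silently dropped because their handlers are not overridden.
    """

    def __init__(self):
        # Keep character references as written instead of decoding them in text.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _emit_tag(self, tag, attrs, closing):
        # HTMLParser lowercases attribute names, so 'OnClick' is caught too.
        kept = [(name, value) for name, value in attrs if not name.startswith("on")]
        rendered = "".join(
            f" {name}" if value is None else f' {name}="{escape(value, quote=True)}"'
            for name, value in kept
        )
        self.parts.append(f"<{tag}{rendered}{closing}")

    def handle_starttag(self, tag, attrs):
        self._emit_tag(tag, attrs, ">")

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(tag, attrs, " />")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text content is passed through untouched, even if it looks like onclick="".
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")

    def handle_comment(self, data):
        self.parts.append(f"<!--{data}-->")


def strip_event_attributes(html):
    """Return html with all on* attributes removed from every tag."""
    parser = EventAttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    # The three examples from the question.
    print(strip_event_attributes('<h1 onclick=""></h1>'))     # <h1></h1>
    print(strip_event_attributes('<h1 onnewevent=""></h1>'))  # <h1></h1>
    print(strip_event_attributes('<h1>onclick=""</h1>'))      # <h1>onclick=""</h1>
```

Because the decision is made per attribute name after tokenizing, text such as onclick="" between tags is left alone, and attributes like data-once are kept since their names do not start with "on".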