html regex robotframework

Robot Framework - how to strip JavaScript script tags out of HTML using the Remove String Using Regexp keyword


I am working on a test case that visits a page, gets the page source, and saves it to an HTML file. Before saving the source, I need to strip out all JavaScript, from each "<script>" to its "</script>". I've gone through numerous online resources and come up with <script type="text/javascript">([\\s\\S]*?)<\\/script> but the regular expression I enter into the test case does not seem to work. Does anyone have any suggestions?

More info: the page source contains many instances of JavaScript, some of which span multiple lines, so I believe I need to prefix the expression with inline flags such as (?ims). In my attempt above, you'll also see that I've escaped the backslashes, since I read somewhere that it was necessary.

Example of the source code:

<html>
<script type="text/javascript">
some multiline javascript
  </script>
<script type="text/javascript"> some single line javascript  </script>
<body>
body content
</body>
<script type="text/javascript">
some more javascript
</script>


Solution

  • Here is my try:

    "<script[^>]*>[^\0]*?<\/script>", gi
    


    Explanation:

    #   <script              # match the start of the opening tag
    #   [^>]*>               # match any attributes, up to and including the ">" that ends the opening tag
    #   [^\0]*?<\/script>    # lazily match any characters (except NUL), newlines included, up to the first closing tag
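
    Since Robot Framework's String library hands the pattern to Python's re module, here is a minimal Python sketch of the same idea, run against the sample source from the question. It is an adaptation rather than the exact expression above: the JavaScript-style "gi" suffix is replaced by the inline flags (?is), and [^\0]*? is swapped for .*? (which matches newlines once the s flag is set). The names SCRIPT_BLOCK and strip_scripts are made up for this illustration.

```python
import re

# Assumption: inline flags stand in for the JavaScript-style "gi" suffix.
#   (?i) = ignore case, (?s) = let "." match newlines as well.
# No "g" flag is needed because re.sub replaces every match by default.
SCRIPT_BLOCK = re.compile(r"(?is)<script[^>]*>.*?</script>")


def strip_scripts(html: str) -> str:
    """Return html with every <script>...</script> block removed."""
    return SCRIPT_BLOCK.sub("", html)


# Sample page source copied from the question.
sample = """<html>
<script type="text/javascript">
some multiline javascript
  </script>
<script type="text/javascript"> some single line javascript  </script>
<body>
body content
</body>
<script type="text/javascript">
some more javascript
</script>
"""

cleaned = strip_scripts(sample)
print(cleaned)
```

    In a test case the same pattern should carry over as something like `${clean}=    Remove String Using Regexp    ${source}    (?is)<script[^>]*>.*?</script>`, where ${source} and ${clean} are placeholder variable names. This form contains no backslashes, so the question of escaping them in Robot Framework test data does not come up. Treat it as a sketch: a regex like this will miss unusual markup such as "</script >" with a space before the bracket, so an HTML parser is the safer choice for untrusted pages.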