How to remove all script tags from an HTML file


How do I remove all script tags from an HTML file using sed?

I tried the command below, but it doesn't work: it doesn't remove any script tags from test1.html.

$ sed -e 's/<script[.]+<\/script>//g' test1.html > test1_output.html

My goal is to turn test1.html into test1_output.html.

test1.html:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>My Website</h1>

        <div class="row">
            some text
        </div>

        <script  type="text/javascript"> utmx( 'url', 'A/B' );</script>

        <script src="ga_exp.js" type="text/javascript" charset="utf-8"></script>    
        <script type="text/javascript">
            window.exp_version = 'control';
        </script>        
    </body>
</html>

test1_output.html:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
    </head>
    <body>
        <h1>My Website</h1>

        <div class="row">
            some text
        </div>

    </body>
</html>

Solution

  • If I understood your question correctly and you want to delete everything inside <script></script>, I think you have to split the sed command into parts (you can still write it as a one-liner by separating the parts with ;):

    Using:

    sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g'
    

    The first part (s/<script>.*<\/script>//g) handles scripts that open and close on the same line.

    The second part (/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}}) is taken almost verbatim from @akingokay's answer, except that I keep the lines on which the tags themselves occur (in case they have something before or after the tag). There is a great explanation of this technique in "Using sed to delete all lines between two matching patterns".

    The last two parts (s/<script>.*//g and s/.*<\/script>//g) take care of the remaining lines: those where a script opens but does not close, and those where a script closes without having opened.
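One caveat: the literal pattern <script> only matches a tag without attributes, while the tags in the question carry attributes such as type="text/javascript". A minimal sketch of the same four steps with <script[^>]*> substituted for <script> follows. The sample input and the strip_scripts function name are made up for illustration, and the script is written one command per line only for readability.

```shell
# Hypothetical sample input, modelled on the test1.html from the question.
html='<body>
<h1>My Website</h1>
<script  type="text/javascript"> utmx( "url", "A/B" );</script>
<script src="ga_exp.js" type="text/javascript" charset="utf-8"></script>
<script type="text/javascript">
    window.exp_version = "control";
</script>
</body>'

# Same four steps as the one-liner, but the opening tag may carry attributes.
strip_scripts() {
    sed '
        # 1. scripts that open and close on the same line
        s/<script[^>]*>.*<\/script>//g
        # 2. inside a multi-line script, delete every line except the tag lines
        /<script[^>]*>/,/<\/script>/{
            /<script[^>]*>/!{
                /<\/script>/!d
            }
        }
        # 3. cut from an opening tag to the end of its line
        s/<script[^>]*>.*//g
        # 4. cut from the start of a line through a closing tag
        s/.*<\/script>//g
    '
}

result=$(printf '%s\n' "$html" | strip_scripts)
printf '%s\n' "$result"
```

Note that .* is greedy, so if one line holds two separate scripts with text between them, that text is removed as well.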

    Now if you have an index.html that has:

    <html>
      <body>
            foo
            <script> console.log("bar") </script>
      <div id="something"></div>
            <script>
                    // Multiple Lines script
                    // Blah blah
            </script>
            foo <script> //Some
            console.log("script")</script> bar
      </body>
    </html>
    

    and you run this sed command, you will get:

    $ sed 's/<script>.*<\/script>//g;/<script>/,/<\/script>/{/<script>/!{/<\/script>/!d}};s/<script>.*//g;s/.*<\/script>//g' index.html
    <html>
      <body>
        foo
    
    
            <div id="something"></div>
    
    
    
    
        foo 
     bar
      </body>
    
    </html>
    

    You will end up with a lot of blank lines, but the code should work as expected. Of course, you could easily remove those with sed as well.
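As a rough sketch of that cleanup step, a second sed pass can delete every line that is empty or contains only whitespace. The sample text below is a hypothetical stand-in for the leftover output, not the exact result shown above.

```shell
# Hypothetical leftover output with empty and whitespace-only lines.
leftover=$(printf '%s\n' '<body>' '    foo' '    ' '' '  <div id="something"></div>' '' '</body>')

# Delete every line that is empty or consists only of whitespace.
cleaned=$(printf '%s\n' "$leftover" | sed '/^[[:space:]]*$/d')
printf '%s\n' "$cleaned"
```

This also removes blank lines that were in the original file on purpose, so it only suits cases where that is acceptable.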

    Hope it helps.

    PS: I think @l0b0 is right, and this is not the right tool for the job.