Tags: xml, regex, sed, well-formed

sed: use regex to replace '<' char with '&lt;' in many XML files


I have thousands of XML files that are not well-formed and need patching up.

Many of them contain the following issue: <someTag attr='text [< 99]'/> (note the left angle bracket inside the square brackets).

I would like to write a sed expression to replace all instances of [< with [&lt; for *.xml.

sed -n 19p myFile.xml returns <someTag attr='text [<99]'/> as expected.

echo '[<45' | sed -n '/\[</p' returns [<45 as expected.

However, sed -n '/\[</p' myFile.xml returns nothing so apparently I need a different syntax when using that expression against a file as opposed to echo. What syntax do I need to use?

Also, once I have this done, my plan is to do something like

sed -i -n 's/correct expression/\[&lt;/g/p' *.xml to run it against all matches in all files and output the new version to help me debug. Does that seem reasonable?

BTW, sed seemed like the tool to use, but I'm perfectly fine using any other solution that runs on Linux.

Thanks!


Solution

  • However, sed -n '/\[</p' myFile.xml returns nothing so apparently I need a different syntax when using that expression against a file as opposed to echo.

    Hm, works for me:

    echo '[<45' > test.xml
    sed -n '/\[</p' test.xml
    

    returns [<45.
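If the same pattern still matches nothing in the real file, one possibility (an assumption on my part, nothing in the question confirms it) is that invisible bytes sit between [ and <, for example NUL bytes in a UTF-16 encoded file. A minimal sketch for checking that with od, using a made-up test file in a scratch directory:

```shell
# Scratch directory and file name are hypothetical, for illustration only.
tmpdir=$(mktemp -d)

# Hypothetical sample: a NUL byte sits between '[' and '<'.
# A terminal usually shows this line as "[<45", hiding the extra byte.
printf '[\0<45\n' > "$tmpdir/test.xml"

# od -c prints every byte, so anything hidden between '[' and '<' shows up.
dump=$(od -c "$tmpdir/test.xml")
printf '%s\n' "$dump"
```

On the real data, piping the output of sed -n 19p myFile.xml into od -c would give the same kind of byte dump for the line in question.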

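    That said, if you want to do the replacement, use something like the following (the & is escaped because a bare & in the replacement stands for the matched text)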

    sed 's/\[</[\&lt;/g'
    

    For example, to modify all xml files directly, do

    sed -i 's/\[</[\&lt;/g' *.xml
    

    (The -i switch edits the files in place; without it, the modified contents are written to stdout.)
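Since thousands of files are involved, it may be worth keeping backups on the first run. A minimal sketch, assuming GNU sed on Linux; the scratch directory and sample.xml are made up for illustration:

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' "<root>" "<someTag attr='text [<99]'/>" "</root>" > "$tmpdir/sample.xml"

# -i.bak edits in place and keeps each original as <name>.bak.
# \& in the replacement is a literal ampersand.
sed -i.bak 's/\[</[\&lt;/g' "$tmpdir"/*.xml

# Show the patched file; the untouched original is in sample.xml.bak.
cat "$tmpdir/sample.xml"
```

Once the results look right, the .bak files can be deleted.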

    Does that seem reasonable?

    Sure, that is what sed is for.
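One caveat about the planned command: the flags would be written gp rather than /g/p, and with GNU sed, combining -i with -n writes back only the lines that are explicitly printed, so each file would be reduced to its changed lines. A safer sketch for debugging is to preview without -i first (the scratch directory and sample.xml are made up for illustration):

```shell
# Work in a scratch directory so no real files are touched.
tmpdir=$(mktemp -d)
printf '%s\n' "<root>" "<someTag attr='text [<99]'/>" "</root>" > "$tmpdir/sample.xml"

# Dry run: no -i, so the files stay as they are.
# -n suppresses the default output, and the p flag prints only
# the lines where a substitution actually happened.
preview=$(sed -n 's/\[</[\&lt;/gp' "$tmpdir"/*.xml)
printf '%s\n' "$preview"
# prints: <someTag attr='text [&lt;99]'/>
```

When the preview looks right, drop -n and p and add -i to apply the change.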