xml, replace, sed, placeholder, strip

Strip out XML tags inside placeholders


I would like to use sed (or another tool) to strip out XML tags, but only in specific locations marked with '{{' and '}}' placeholders. Example:

<ok><ok2>{{TextShouldStay<not_ok>this_should_be_out</not_ok>
<sthelse/>ThisShouldBeAgain}}</ok2></ok>

Expected result:

<ok><ok2>{{TextShouldStayThisShouldBeAgain}}</ok2></ok>

Any ideas how to achieve that?


Solution

  • Command:

    tr '\n' ' ' < file.xml | sed -r 's/(.*\{\{)([A-Za-z0-9]*)(<.*\/>)(.*)/\1\2\4\n/g'
    

    Output:

    sdlcb@Goofy-Gen:~/AMD$ cat file.xml
    <ok><ok2>{{TextShouldStay<not_ok>this_should_be_out</not_ok>
    <sthelse/>ThisShouldBeAgain}}</ok2></ok>
    sdlcb@Goofy-Gen:~/AMD$ tr '\n' ' ' < file.xml | sed -r 's/(.*\{\{)([A-Za-z0-9]*)(<.*\/>)(.*)/\1\2\4\n/g'
    <ok><ok2>{{TextShouldStayThisShouldBeAgain}}</ok2></ok>
    sdlcb@Goofy-Gen:~/AMD$
    
    
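    Here we first replace the newlines with spaces using 'tr', so that sed sees the whole input as a single line, and then split that line into groups using '(' and ')':
    First group - from the beginning of the line up to and including '{{'
    Second group - the letters and digits that directly follow '{{'
    Third group - everything from the next '<' through the last '/>'
    Fourth group - the remaining characters.
    
    The replacement prints groups 1, 2 and 4, which drops the third group, and appends a newline to make up for the one replaced by 'tr'. Because '.*' is greedy, this relies on the input containing a single placeholder and on the last tag to be removed being self-closing, as in the example.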