regex, bash, sed, alphanumeric, non-alphanumeric

sed: remove all non-alphanumeric characters inside quotations only


Say I have a string like this:

Input:
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"

I want to remove non-alphanumeric characters only inside the quotation marks, keeping commas, periods, and spaces:

Desired Output:    
I have some-non-alphanumeric % characters remain here, I "also, have some  .here"

I tried the following sed command, which matches lines containing a string and is meant to delete the unwanted characters inside the quotes, but it deletes everything inside the quotes, including the quotes themselves:

sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'

Any help is appreciated, preferably using sed, to get the desired output. Thanks in advance!


Solution

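  • Your command deletes the whole quoted string because its replacement is empty, so the captured groups are never put back. Beyond that, a single s command removes only one run of unwanted characters per quoted string, so the substitution has to be repeated until nothing is left to remove. Such a loop in sed requires a label and the b and t commands: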

    sed '
    # If the line contains /characters/, jump to label repremove
    /characters/ b repremove
    # else, jump to end of script
    b
    # labels are introduced with colons
    :repremove
    # This s command says: find a quote mark and some stuff we do not want
    # to remove, then some stuff we do want to remove, then the rest until
    # a quote mark again. Replace it with the two things we did not want to
    # remove
    s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
    # The t command jumps back to the label if the s command above replaced
    # something, so the loop repeats until we have gotten everything
    t repremove
    '
    

    (This works even without the second bracket expression, [^"a-zA-Z0-9,. ]*, but it is slower on lines that contain many non-alphanumeric characters in a row, because each pass then removes only one character.)
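To check the loop against the sample line from the question, the script can be wrapped in a shell function and fed the input on stdin. This is only a sketch: the function name strip_in_quotes is made up for illustration, and the sed script is the one above with the comments stripped.

```shell
# Hypothetical wrapper name; the sed script is the loop from the answer above.
strip_in_quotes() {
  sed '
    /characters/ b repremove
    b
    :repremove
    s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
    t repremove
  '
}

printf '%s\n' 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' |
  strip_in_quotes
# prints: I have some-non-alphanumeric % characters remain here, I "also, have some  .here"
```

Note that this assumes a single quoted string per line, as in the question. With several quoted strings on one line, the pattern can also start matching at a closing quote, so characters between two quoted strings may be removed as well.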

    That said, the other answer is right that doing this in Perl is much easier.