Say I have a string like this:
Output:
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"
I want to only remove non-alphanumeric characters inside the quotations except commas, periods, or spaces:
Desired Output:
I have some-non-alphanumeric % characters remain here, I "also, have some .here"
I have tried the following sed command, which matches a string and deletes inside the quotes, but it deletes everything inside the quotes, including the quotes themselves:
sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'
Any help getting the desired output is appreciated, preferably using sed. Thanks in advance!
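Your command deletes the entire quoted string because its replacement is empty. Even with the kept parts put back, a single s command removes only one run of unwanted characters per quoted string, so you need to repeat the substitution until nothing is left to remove. Doing such a loop in sed requires a label and use of the b and t commands: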
sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'
(This will work even without the trailing [^"a-zA-Z0-9,. ]*, but it will be slower on lines that contain many non-alphanumeric characters in a row, because each pass through the loop then removes only one character.)
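As a quick check, here is a minimal sketch of the simplified loop mentioned above (the variant without the trailing bracket expression), run against the sample line from the question. The variable names input and result are only for illustration, and /characters/!b is used as a shorthand for "skip lines that do not match"; treat this as one possible way to write it rather than the only one.

```shell
input='I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"'

# /characters/!b  : lines without "characters" branch to the end of the script
# :repremove      : label marking the top of the loop
# s/.../\1\2/     : drop ONE unwanted character inside the quotes
# t repremove     : if the s command changed anything, loop again
result=$(printf '%s\n' "$input" | sed \
  -e '/characters/!b' \
  -e ':repremove' \
  -e 's/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ]\([^"]*"\)/\1\2/' \
  -e 't repremove')

printf '%s\n' "$result"
# prints: I have some-non-alphanumeric % characters remain here, I "also, have some  .here"
```

This variant takes three passes on the sample (one each for _, + and &), where the full version takes two. Note that the two spaces that surrounded & both remain, since spaces are on the keep list.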
Though the other answer is right in that doing this in perl is much easier.