unicode, sed, text-files, non-printing-characters

How do I get rid of this Unicode character?


Any idea how to get rid of the irritating character U+0092 from a bunch of text files? I've tried all of the commands below, but none of them work. The character map calls it "U+0092+control".

sed -i 's/\xc2\x92//' *
sed -i 's/\u0092//' *
sed -i 's///' *

Ah, I've found a way:

CHARS=$(python2 -c 'print u"\u0092".encode("utf8")')
sed 's/['"$CHARS"']//g'

But is there a direct sed method for this?


Solution

  • Try sed "s/\`//g" *. (I added the g so it will remove all the backticks it finds).


    EDIT: It's not a backtick that OP wants to remove.

    Following the solution in this question, this ought to work:

    sed 's/\xc2\x92//g'
    

    To demonstrate it does:

    $ CHARS=$(python2 -c 'print u"asdf\u0092asdf".encode("utf8")')
    
    $ echo $CHARS
    asdf<funny glyph symbol>asdf
    
    $ echo $CHARS | sed 's/\xc2\x92//g'
    asdfasdf
    

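    Since this is something you have already tried, perhaps the character in your text files is not actually U+0092? Note also that your attempt omits the g flag, so it would remove only the first occurrence on each line.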
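
One way to check that last point is to look at the raw bytes before and after filtering. This is a minimal sketch, assuming a POSIX shell with `od` and `sed` available; `show_bytes` and `strip_u0092` are hypothetical helper names, not standard tools. The bytes are produced with `printf` octal escapes (`\302\222` is the UTF-8 encoding of U+0092), so the sketch needs neither Python nor a sed that understands `\x` escapes:

```shell
# Sketch only: helper names are made up for illustration.

# The two bytes C2 92 (octal 302 222) are the UTF-8 encoding of U+0092.
u0092=$(printf '\302\222')

# Hypothetical helper: delete every UTF-8 encoded U+0092 read from stdin.
strip_u0092() {
    # LC_ALL=C makes sed compare raw bytes, whatever the current locale is.
    LC_ALL=C sed "s/$u0092//g"
}

# Hypothetical helper: dump stdin as hex bytes to see what is really there.
show_bytes() {
    od -An -tx1
}

# Sample line with U+0092 in the middle, built without Python.
sample=$(printf 'asdf\302\222asdf')

# The dump should contain "c2 92" between the two "asdf" runs.
printf '%s\n' "$sample" | show_bytes

cleaned=$(printf '%s\n' "$sample" | strip_u0092)
printf '%s\n' "$cleaned"   # prints asdfasdf
```

Running `show_bytes` on a few lines of one of the real files (for example `head -n 5 somefile | od -An -tx1`, where `somefile` is a placeholder) shows whether the sequence `c2 92` is present at all. If the dump shows a lone `92` with no `c2` in front of it, the file is probably not UTF-8 (in Windows-1252, byte 0x92 is the right single quotation mark), and the pattern would have to match that single byte instead.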