bash unix awk sed non-printing-characters

Trying to remove non-printable characters (junk values) from a UNIX file


I am trying to remove non-printable characters (e.g. ^@) from records in my file. Since the volume of records in the file is very large, looping over the output of cat is not an option because the loop takes too much time. I tried using

sed -i 's/[^@a-zA-Z 0-9`~!@#$%^&*()_+\[\]\\{}|;'\'':",.\/<>?]//g' FILENAME

but still the ^@ characters are not removed. Also I tried using

awk '{ sub("[^a-zA-Z0-9\"!@#$%^&*|_\[](){}", ""); print }' FILENAME > NEWFILE

but it also did not help.

Can anybody suggest some alternative way to remove non-printable characters?

I also tried tr -cd, but it removes accented characters, which are required in the file.


Solution

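  • Perhaps you could delete the complement of [:print:], the class that contains all printable characters. Add \n to the kept set, because newline is not in [:print:] and would otherwise be deleted too:

    tr -cd '[:print:]\n' < file > newfile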
    

    If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):

    sed 's/[^[:print:]]//g' file
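
A locale-independent alternative, offered as a sketch under stated assumptions rather than as part of the answer above: if the junk consists of ASCII control bytes such as NUL (shown as ^@), tr can delete just those bytes by octal value. In the C locale tr works byte by byte, and every byte of a multi-byte UTF-8 character is 0x80 or higher, so accented characters should pass through untouched. The function name strip_controls is hypothetical, and the choice to keep tab, newline, and carriage return is an assumption about what the file needs.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the original answer).
# Deletes ASCII control bytes 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F and 0x7F.
# Keeps tab (\011), newline (\012), carriage return (\015) and all bytes
# >= 0x80, so UTF-8 encoded accented letters are left alone.
strip_controls() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037\177'
}

# Demo: one record with an accented letter (é as bytes \303\251),
# a NUL byte (\000) and a stray control byte (\001).
printf 'caf\303\251\000 au lait\001\n' | strip_controls
```

To apply it to a file, redirect as in the tr example above: strip_controls < file > newfile. Because tr streams its input, this avoids a shell loop over the records.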