r regex bash ascii non-ascii-characters

Bash/Linux Find non-ASCII character in a .txt file and replace it with an ASCII character


I have a set of files offloaded from oceanographic instruments. For some reason, a non-ASCII character is occasionally inserted where an ASCII character should be. For example, I have found an E with a grave accent (È) where there should be a W denoting the western hemisphere in longitude records.

Here's what the data looks like:

CUMSECS Date UTC    Time UTC    Date Local  Time local  Z (m)   Target Z    Z Bot   Temp    PAR Salin   Ang VelX    Ang VelY    Ang VelZ    Pump +  Pump -  Gctr    Fix secs    Date UTC    Time UTC    Date Local  Time Local  Lat LatD    Latm        Lon LonD    Lonm        DOP Temp    PAR Salin   Batt V      CMD secs    Date Local  Time Local  No. Cmds
526068034   09/01/16    18:00:34    09/01/16    11:00:34     3.75    2.69    
3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068039   09/01/16    18:00:39    09/01/16    11:00:39     3.75    2.69    
3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068044   09/01/16    18:00:44    09/01/16    11:00:44     3.74    2.69    
3.75     0.29    0.000000    0.00   -12 -70 -50 0   5   10
526068049   09/01/16    18:00:49    09/01/16    11:00:49     3.73    2.69    
3.75     0.29    0.000000    0.00   -30732  13588   31909   60399   7538    -82
543622771   03/23/17    22:19:31    03/23/17    15:19:31    38.31877    38  
19.1262 N   123.07136   123  4.2812 È   23.6    115.06     0.0000   96.00   
121.718 
547764151   05/10/17    20:42:31    05/10/17    13:42:31     0.03   16.00   
127.00  13.68   1074.904320 33.56   -4908   -3976   261 1   0   0
547764152   05/10/17    20:42:32    05/10/17    13:42:32     0.00   16.00   
127.00  13.68   1074.904320 33.56   -4908   -3976   261 1   0   0

I can find the non-ASCII characters using the following Bash line:

    pcregrep -n '[^\x00-\x7F]' 170510_ocean_Copepod.txt

I would like to loop through a series of files, find these characters, and replace them with a 'W' so that I can subsequently read them into R and process them en masse. Alternatively, a workaround to the error returned by R in trying to read these files ("multibyte string in location...") would be equally effective for my purposes. Any help much appreciated.


Solution

  • I think the problem is that È in UTF-8 is a multibyte character consisting of the bytes \xc3 and \x88, and sed does not seem to handle it, for reasons I could not pin down. As @Jack suggested, tr might be a better tool for the job (tested in Bash for Windows, which doesn't have pcregrep, so grep -P is used instead):

    user@PC:~$ grep -P '[^\x00-\x7f]' 170510_ocean_Copepod.txt | tr 'È' 'W'
    19.1262 N   123.07136   123  4.2812 WW   23.6    115.06     0.0000   96.00
    

    Notice that tr works byte by byte, so it converts each of the two bytes to W separately, giving WW rather than a single W.

    Another method is to convert the whole file to a single-byte encoding with iconv, so that È becomes one byte and can be replaced one-for-one. ISO-8859-15 (Latin-9) is one example of such an encoding. The command to convert the file using iconv would be:

    iconv -f utf-8 -t iso-8859-15 -o <converted-file> <input-file>
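To apply either idea to a whole series of files, something along these lines might work. This is only a sketch under a few assumptions: the function names are made up for illustration, GNU sed is assumed for the `\xHH` escapes, and every run of non-ASCII bytes is assumed to stand for a single W (which matches the sample data but may not hold for every file). Setting `LC_ALL=C` makes sed and tr treat the input as raw bytes, which sidesteps the multibyte handling altogether.

```shell
# Hypothetical helpers; the names are illustrative, not from the original answer.

# Replace each run of non-ASCII bytes with a single W, editing files in place.
# Assumes GNU sed (\xHH escapes). Under LC_ALL=C sed sees raw bytes, so the
# two UTF-8 bytes of È (\xc3 \x88) collapse into one W instead of "WW".
# The original of each file is kept as FILE.bak.
replace_non_ascii() {
    local f
    for f in "$@"; do
        LC_ALL=C sed -i.bak 's/[\x80-\xFF]\{1,\}/W/g' "$f"
    done
}

# Variant following the iconv route: convert UTF-8 to ISO-8859-15, where È is
# the single byte 0xC8 (octal 310), then map that one byte to W with tr.
# Writes FILE.ascii next to each input and leaves the input untouched.
convert_then_replace() {
    local f
    for f in "$@"; do
        iconv -f utf-8 -t iso-8859-15 "$f" | LC_ALL=C tr '\310' 'W' > "$f.ascii"
    done
}

# Usage (assumed file pattern):
#   replace_non_ascii ./*.txt
#   convert_then_replace ./*.txt
```

The sed variant does not care whether the stray character is stored as one byte or two, so it may be the safer choice if the files turn out not to be valid UTF-8. The iconv variant will stop with an error on such a file, which can itself be a useful check of what encoding the instrument actually wrote.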