I have a list of files offloaded from oceanographic instruments. For some reason, there is occasionally a non-ASCII character inserted where an ASCII character should be. I have found grave-E (È) where there should be a W to denote the western hemisphere in longitude records.
Here's what the data looks like:
CUMSECS Date UTC Time UTC Date Local Time local Z (m) Target Z Z Bot Temp PAR Salin Ang VelX Ang VelY Ang VelZ Pump + Pump - Gctr Fix secs Date UTC Time UTC Date Local Time Local Lat LatD Latm Lon LonD Lonm DOP Temp PAR Salin Batt V CMD secs Date Local Time Local No. Cmds
526068034 09/01/16 18:00:34 09/01/16 11:00:34 3.75 2.69
3.75 0.29 0.000000 0.00 -12 -70 -50 0 5 10
526068039 09/01/16 18:00:39 09/01/16 11:00:39 3.75 2.69
3.75 0.29 0.000000 0.00 -12 -70 -50 0 5 10
526068044 09/01/16 18:00:44 09/01/16 11:00:44 3.74 2.69
3.75 0.29 0.000000 0.00 -12 -70 -50 0 5 10
526068049 09/01/16 18:00:49 09/01/16 11:00:49 3.73 2.69
3.75 0.29 0.000000 0.00 -30732 13588 31909 60399 7538 -82
543622771 03/23/17 22:19:31 03/23/17 15:19:31 38.31877 38
19.1262 N 123.07136 123 4.2812 È 23.6 115.06 0.0000 96.00
121.718
547764151 05/10/17 20:42:31 05/10/17 13:42:31 0.03 16.00
127.00 13.68 1074.904320 33.56 -4908 -3976 261 1 0 0
547764152 05/10/17 20:42:32 05/10/17 13:42:32 0.00 16.00
127.00 13.68 1074.904320 33.56 -4908 -3976 261 1 0 0
I can find the non-ASCII characters using the following Bash line:
pcregrep -n '[^\x00-\x7F]' 170510_ocean_Copepod.txt
I would like to loop through a series of files, find these characters, and replace them with a 'W' so that I can subsequently read them into R and process them en masse. Alternatively, a workaround to the error returned by R in trying to read these files ("multibyte string in location...") would be equally effective for my purposes. Any help much appreciated.
I think the problem is that È in UTF-8 is a multibyte character consisting of \xc3 and \x88, and sed can't seem to deal with that for whatever reason. As @Jack suggested, tr might be a better tool for the job (tested in Bash for Windows, which doesn't have pcregrep):
user@PC:~$ grep -P '[^\x00-\x7f]' <input-file> | tr 'È' 'W'
19.1262 N 123.07136 123 4.2812 WW 23.6 115.06 0.0000 96.00
Notice that tr converts each of the two bytes to W separately, which is why the output shows WW rather than a single W.
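One possible way to get a single W is to run sed in the C locale, where it treats the input as raw bytes and can match the two-byte sequence as a unit. The following is a minimal sketch, assuming a POSIX-style shell with mktemp available. The function name fix_hemisphere is hypothetical, and it only handles È, not other non-ASCII characters:

```bash
# Replace every UTF-8 encoded È (bytes 0xC3 0x88) with an ASCII W, in place.
# fix_hemisphere is a hypothetical helper name, not an existing tool.
fix_hemisphere() {
    # Build the two-byte sequence with octal escapes so this script stays pure ASCII.
    bad=$(printf '\303\210')
    for f in "$@"; do
        tmp=$(mktemp) || return 1
        # LC_ALL=C makes sed work byte by byte, so both bytes are matched together.
        if LC_ALL=C sed "s/${bad}/W/g" "$f" > "$tmp"; then
            cat "$tmp" > "$f"   # overwrite contents, keep the original file's permissions
        fi
        rm -f "$tmp"
    done
}
```

It could then be called on a whole batch, for example fix_hemisphere *.txt, before reading the files into R. Working on copies first is probably wise, since the files are rewritten in place.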
Another method could be to convert the whole file to a single-byte character encoding using iconv, so that È becomes one byte instead of two. ISO-8859-15 (Latin-9) is one example of such an encoding. The command to convert the file using iconv would be:
iconv -f utf-8 -t iso-8859-15 -o <converted-file> <input-file>
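Note that the converted file still contains a non-ASCII byte: in ISO-8859-15, È is the single byte 0xC8 (octal 310). Because it is now one byte, tr can map it to exactly one W. Here is a sketch combining the two steps, assuming iconv and tr are installed and that every character in the input exists in ISO-8859-15 (iconv stops with an error otherwise). The name to_ascii_w is hypothetical:

```bash
# Convert a UTF-8 file to ISO-8859-15, then map the single byte for È (octal 310) to W.
# Usage: to_ascii_w <input-file> <output-file>   (to_ascii_w is a hypothetical helper name)
to_ascii_w() {
    # LC_ALL=C keeps tr strictly byte-oriented regardless of the session locale.
    iconv -f UTF-8 -t ISO-8859-15 "$1" | LC_ALL=C tr '\310' 'W' > "$2"
}
```

Unlike the tr command on the UTF-8 file, this should give one W per È, since only one byte is being replaced.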