Forgive me for the clumsy way I'm approaching this question; everything I've learnt so far on the topic of character encoding has been in the last few hours, and I'm aware I'm out of my depth. This may be answered elsewhere on the site, such as in my linked questions, but if it has, those answers are too dense for me to understand exactly what's being concluded in them.
I often need to grep through folders of excessively large text files (totalling more than 100GB). I've read about how using LC_ALL=C can speed this up considerably, but I want to be sure that doing so won't compromise the accuracy of my searches.
The files are old and have passed through many different online sources, so are likely to contain a jumble of characters from many different encodings, including UTF-8. (As an aside, is it possible for a single file to contain characters from multiple encodings?)
The bulk of what concerns me is this: if I want to search for a given b in my data, can I expect every letter b that's present in the data to be encoded as ASCII, or can the same letter also be encoded as UTF-8?
Or to put it another way, are ASCII characters always and exclusively ASCII? If even standard English characters can be encoded as UTF-8, and using LC_ALL=C grep would disregard all UTF-8 characters, then this would have the implication that my searches would miss search terms that are not in ASCII, which would obviously not be the behaviour that I want, and would be a considerable obstacle to adopting LC_ALL=C for grep.
For understanding UTF-8 versus ASCII, the following are very good:
http://kunststube.net/encoding/
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
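To make the central point of those articles concrete: UTF-8 was designed so that every ASCII character is stored as the same single byte it has in ASCII, and the bytes of a multibyte UTF-8 character are all 0x80 or above, so they can never be mistaken for an ASCII letter. Below is a minimal sketch you can run yourself. The sample file and its contents are made up for illustration, and it only covers UTF-8 (other encodings, UTF-16 for example, behave differently).

```shell
# Hypothetical sample file; the name and contents are made up for illustration.
sample=$(mktemp)

# Write one line of UTF-8: "café bar".
# The é is stored as the two bytes 0xC3 0xA9 (octal 303 251).
printf 'caf\303\251 bar\n' > "$sample"

# Hex dump of the file. The bytes are 63 61 66 c3 a9 20 62 61 72 0a:
# the letter b is the single byte 62, exactly as it would be in plain ASCII.
od -An -tx1 "$sample"

# Count matching lines in the C locale and in whatever locale is currently set.
c_count=$(LC_ALL=C grep -c 'bar' "$sample")
default_count=$(grep -c 'bar' "$sample")
echo "C locale: $c_count, current locale: $default_count"
```

Both counts should come out the same, because an ASCII-only search term is matched byte for byte either way. The locale starts to matter when the pattern itself contains non-ASCII characters, or uses things like . or -i that need to know where one character ends and the next begins.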
As for timing: when grepping UTF-8 files that contain only a small number of non-ASCII characters, there is basically no difference between using LC_ALL=C or LANG=C and using the standard LANG=en_US.UTF-8 or similar.
Test performed on 64-bit Cygwin, repeating the search 1000 times over 20GB of text:
$ time for i in $(seq 1000) ; do grep -q LAPTOP-82F08ILC wia-*.log ; done
real 0m53.289s
user 0m7.813s
sys 0m31.635s
$ time for i in $(seq 1000) ; do LC_ALL=C grep -q LAPTOP-82F08ILC wia-*.log ; done
real 0m53.027s
user 0m7.497s
sys 0m31.010s
$ ls -sh wia-*
10G wia-1024.log 160M wia-16.log 2.5G wia-256.log 40M wia-4.log 639M wia-64.log
1.3G wia-128.log 20M wia-2.log 320M wia-32.log 5.0G wia-512.log 80M wia-8.log
The difference is within the run-to-run variation between repetitions, which ranged from 53 to 55 seconds in both cases.