utf-8 character-encoding cygwin text-files ascii

Understanding LC_ALL=C and its implications for standard English characters


Forgive the clumsy way I'm approaching this question. Everything I've learnt about character encoding has been in the last few hours, and I'm aware I'm out of my depth. This may be answered elsewhere on the site, such as in my linked questions, but if so, those answers are too dense for me to understand exactly what they conclude.

I often need to grep through folders of excessively large text files (totalling more than 100GB). I've read about how using LC_ALL=C can speed this up considerably, but I want to be sure that doing so won't compromise the accuracy of my searches.

The files are old and have passed through many different online sources, so they are likely to contain a jumble of characters from many different encodings, including UTF-8. (As an aside, is it possible for a single file to contain characters from multiple encodings?)

The bulk of what concerns me is this: if I want to search for the letter b in my data, can I expect every b that's present in the data to be encoded as ASCII, or can the same letter also be encoded as UTF-8?

Or to put it another way, are ASCII characters always and exclusively ASCII? If even standard English characters can be encoded as UTF-8, and LC_ALL=C grep disregards all UTF-8 characters, then my searches would miss any search terms that are not stored as ASCII. That is obviously not the behaviour I want, and it would be a considerable obstacle to adopting LC_ALL=C for grep.


Solution

  • For understanding UTF-8 vs ASCII, the following are very good:
    http://kunststube.net/encoding/
    https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

    As for the difference in grep time on UTF-8 files containing a small number of non-ASCII characters, there is basically no difference between using LC_ALL=C or LANG=C and the standard LANG=en_US.UTF-8 or similar.

    Test performed on 64-bit Cygwin, repeating the search 1000 times over 20GB of text:

    $ time for i in $(seq 1000) ; do  grep -q LAPTOP-82F08ILC wia-*.log ; done
    
    real    0m53.289s
    user    0m7.813s
    sys     0m31.635s
    
    $ time for i in $(seq 1000) ; do  LC_ALL=C grep -q LAPTOP-82F08ILC wia-*.log ; done
    
    real    0m53.027s
    user    0m7.497s
    sys     0m31.010s
    
    $ ls -sh wia-*
     10G wia-1024.log  160M wia-16.log  2.5G wia-256.log   40M wia-4.log    639M wia-64.log
    1.3G wia-128.log    20M wia-2.log   320M wia-32.log   5.0G wia-512.log   80M wia-8.log
    

    The difference is within run-to-run variation: repeated runs took between 53 and 55 seconds in both cases.
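
On the accuracy part of the question: UTF-8 stores every ASCII character (U+0000 to U+007F) as the same single byte it has in plain ASCII, and every byte of a multi-byte UTF-8 sequence is 0x80 or above. So a letter b is the byte 0x62 either way, and a byte-wise search under LC_ALL=C should still find it. Here is a minimal sketch to check this yourself, assuming a POSIX shell with printf, od, tr and grep available; the sample strings and variable names are made up for illustration:

```shell
# Bytes of the ASCII letter "b": one byte, 0x62, whether the file is
# labelled ASCII or UTF-8.
b_bytes=$(printf 'b' | od -An -tx1 | tr -d ' \n')

# Bytes of e-acute (U+00E9) in UTF-8, written as octal escapes so the
# example does not depend on this script's own encoding.
# It is two bytes, and both are >= 0x80, so neither can be mistaken for ASCII.
e_acute_bytes=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')

# A UTF-8 line mixing ASCII and non-ASCII text ("cafe" with e-acute, then
# " bar"), searched byte-wise in the C locale. grep -c counts matching lines.
c_locale_hits=$(printf 'caf\303\251 bar\n' | LC_ALL=C grep -c 'b')

echo "b       -> $b_bytes"
echo "e-acute -> $e_acute_bytes"
echo "hits    -> $c_locale_hits"
```

This only covers ASCII search terms in ASCII-compatible encodings such as UTF-8 or Latin-1. A pattern that itself contains non-ASCII characters, or data stored as UTF-16 (which puts a zero byte next to each ASCII letter), needs more care.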