redhat, file-format, windows-1252

Formatting a file changes its encoding on a Red Hat system


I have a bash script which extracts data from an Oracle database using spool. After extraction I format the file by removing and replacing some characters. My problem is that after formatting, the files are in ANSI encoding instead of UTF-8.

  1. Extraction with spool. The file is UTF-8.
  2. Formatting with the cat and tr commands, redirecting the output to another file. This file is ANSI.

The same process works fine on an AIX system. I tried iconv but it doesn't work. Do you have any idea why the encoding changes from UTF-8 to ANSI, and how to correct it?


Solution

  • You should consistently use either ISO-8859-1 or UTF-8. In the latter case, don't use tr, as it doesn't (yet?) support multi-byte characters (GNU tr, as shipped on Red Hat, works byte by byte); use sed instead (e.g. sed 's/deletethis//g').

    ISO-8859-1:

    export LC_CTYPE=fr_FR.ISO-8859-1
    export NLS_LANG=French_France.WE8ISO8859P1
    
    # fetch data from Oracle, emulated by the following line
    echo 'âêîôû' >test.latin1 # 5 bytes (+lineend)
    
    # perform formatting, e.g.:
    sed 's/ê/[e-circumflex]/g' test.latin1
    
    # or the same with hex-codes:
    sed $'s/\xea/[e-circumflex]/g' test.latin1
    

    UTF-8:

    export LC_CTYPE=fr_FR.UTF-8
    export NLS_LANG=French_France.AL32UTF8
    
    # fetch data from Oracle, emulated by the following line
    echo 'âêîôû' >test.utf8 # 10 bytes (+lineend)
    
    # perform formatting, e.g.:
    sed 's/ê/[e-circumflex]/g' test.utf8
    
    # or the same with hex-codes:
    sed $'s/\xc3\xaa/[e-circumflex]/g' test.utf8
    

    Note: no conversion (iconv, recode, etc.) is required; just make sure NLS_LANG and LC_CTYPE are compatible. (Also, your terminal (emulator) should be set accordingly; for PuTTY the setting is Configuration/Category/Window/Translation/Remote-character-set.)
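
To make "compatible" concrete, here is a minimal sketch of a check that the character set in NLS_LANG corresponds to the codeset in LC_CTYPE. The helper names normalize_codeset and nls_matches_locale are hypothetical, and the mapping covers only the two Oracle character sets mentioned in this answer (AL32UTF8 and WE8ISO8859P1); anything else is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: upper-case a codeset name and drop hyphens,
# so that "utf8", "UTF-8" and "utf-8" all compare equal.
normalize_codeset() {
    printf '%s' "$1" | tr '[:lower:]' '[:upper:]' | tr -d '\055'
}

# Hypothetical helper: succeed if the Oracle character set in $1 (an
# NLS_LANG value) matches the codeset in $2 (a locale name).
# Returns 0 on match, 1 on mismatch, 2 if the Oracle name is not in the table.
nls_matches_locale() {
    nls_charset=${1##*.}                            # text after the last dot
    locale_codeset=$(normalize_codeset "${2##*.}")  # e.g. fr_FR.UTF-8 -> UTF8
    case "$nls_charset" in
        AL32UTF8)     expected=UTF8 ;;
        WE8ISO8859P1) expected=ISO88591 ;;
        *)            return 2 ;;                   # unknown: do not guess
    esac
    [ "$locale_codeset" = "$expected" ]
}

# Example with the UTF-8 pair from this answer:
if nls_matches_locale 'French_France.AL32UTF8' 'fr_FR.UTF-8'; then
    echo "NLS_LANG and LC_CTYPE agree"
fi
```

In a real script the arguments would be "$NLS_LANG" and "$LC_CTYPE". Note that the effective locale may instead come from LC_ALL or LANG, so comparing against the output of `locale charmap` may be more reliable.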

    Original answer:

    I cannot tell what is wrong with the formatting you perform, but here is one way to damage UTF-8-encoded text:

    $ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | xxd
    00000000: c381 5256 c38d 5a54 c5b0 52c5 9020 54c3  ..RV..ZT..R.. T.
    00000010: 9c4b c396 5246 c39a 52c3 9347 c389 500a  .K..RF..R..G..P.
    
    $ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | tr -d $'\200-\237' | xxd
    00000000: c352 56c3 5a54 c5b0 52c5 2054 c34b c352  .RV.ZT..R. T.K.R
    00000010: 46c3 52c3 47c3 500a                      F.R.G.P.
    

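    Here the tr -d $'\200-\237' part deleted the second byte of every UTF-8 sequence whose second byte lies in the range 0x80-0x9F (c381 became c3, c590 became c5, while c5b0 survived), leaving invalid sequences and rendering the text unusable.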
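
This kind of damage can be reproduced and detected without depending on the terminal encoding. The following is a minimal sketch, assuming an iconv that exits with a non-zero status on invalid input (as GNU iconv does). The helper name is_valid_utf8 and the sample word are illustrative and not part of the original answer; the bytes are written as octal escapes so the script itself stays pure ASCII.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if stdin is well-formed UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "fête" in UTF-8: the e-circumflex is the two bytes 0xC3 0xAA (octal 303 252).
sample=$(printf 'f\303\252te')

# Byte-wise deletion of 0xAA with tr leaves a dangling lead byte 0xC3.
damaged=$(printf '%s' "$sample" | LC_ALL=C tr -d '\252')

# sed replaces the whole two-byte sequence, so no partial character remains.
e_circ=$(printf '\303\252')
fixed=$(printf '%s\n' "$sample" | LC_ALL=C sed "s/$e_circ/e/g")

printf '%s' "$sample"  | is_valid_utf8 && echo "sample:  valid UTF-8"
printf '%s' "$damaged" | is_valid_utf8 || echo "damaged: invalid UTF-8"
printf 'fixed:   %s\n' "$fixed"
```

Running a check like is_valid_utf8 on the file before and after each formatting step should show which step breaks the encoding.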