
How to convert a string to a sequence of integers corresponding to the Unicode code points, and vice versa?


Suppose that I have a file (in UTF-8) that contains a line like this:

𝑥≈𝑚+𝑛

There are five characters in this line:

  1. U+1D465: MATHEMATICAL ITALIC SMALL X;
  2. U+2248: ALMOST EQUAL TO;
  3. U+1D45A: MATHEMATICAL ITALIC SMALL M;
  4. U+002B: PLUS SIGN;
  5. U+1D45B: MATHEMATICAL ITALIC SMALL N.

Can gawk convert this line to a sequence of decimal integers corresponding to the five Unicode code points, that is, 119909, 8776, 119898, 43, 119899? If so, how? Conversely, given a string like 119909, 8776, 119898, 43, 119899, how can it be interpreted as a sequence of decimal Unicode code points and converted to the corresponding characters (𝑥≈𝑚+𝑛)?

EDIT

Reformulating the first problem: is gawk able to output the binary representation of a given string? For example, consider the string á à. Its UTF-8 encoding is 0xc3 0xa1 0x20 0xc3 0xa0. Is it possible to convert this three-character string to a sequence of five numbers (in the range 0-255) corresponding to its five bytes?


Solution

  • The following uses the encoding table from https://en.wikipedia.org/wiki/UTF-8. It makes no attempt at efficiency, does not avoid redundant code, and does not check that the input is valid UTF-8:

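    LC_ALL=C gawk '
       # LC_ALL=C makes gawk treat the input as bytes; ord[] maps each byte to its value 0-255
       BEGIN { for(i=0;i<256;++i) ord[sprintf("%c",i)]=i }
       {
          n=split($0,a,//)   # one array element per byte
          for(i=1;i<=n;){
    
             # the lead byte A determines how many continuation bytes (B, C, D) follow
             A =           ord[ b1=a[i++] ]
             B = (A>191) ? ord[ b2=a[i++] ] : 0
             C = (A>223) ? ord[ b3=a[i++] ] : 0
             D = (A>239) ? ord[ b4=a[i++] ] : 0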
    
             if (D) {
                A10 = A%4
                A2 = (int(A/4))%2
                B3210 = B%16
                B54 = (int(B/16))%4
                C10 = C%4
                C5432 = (int(C/4))%16
                D3210 = D%16
                D54 = (int(D/16))%4
                cpint = (1048576*A2) + (65536*((4*A10)+B54)) + (4096*B3210) + (256*C5432) + (16*((4*C10)+D54)) + (D3210)
                bytes = b1 b2 b3 b4
                bytehex = sprintf("0x%02x 0x%02x 0x%02x 0x%02x", A,B,C,D)
    
             } else if (C) {
                A3210 = A%16
                B10 = B%4
                B5432 = (int(B/4))%16
                C3210 = C%16
                C54 = (int(C/16))%4
                cpint = (4096*A3210) + (256*B5432) + (16*((4*B10)+C54)) + (C3210)
                bytes = b1 b2 b3
                bytehex = sprintf("0x%02x 0x%02x 0x%02x", A,B,C)
    
             } else if (B) {
                A10 = A%4
                A432 = (int(A/4))%8
                B3210 = B%16
                B54 = (int(B/16))%4
                cpint = (256*A432) + (16*((4*A10)+B54)) + (B3210)
                bytes = b1 b2
                bytehex = sprintf("0x%02x 0x%02x", A,B)
    
             } else {
                cpint = A
                bytes = b1
                bytehex = sprintf("0x%02x", A)
             }
         
             printf "%s\t= U+%04X\t= %-12d: %s\n", bytes, cpint, cpint, bytehex
          }
          print "--------"
       }
    ' file.txt
    

    Running on input:

    á à
    GBP £10
    𝑥≈𝑚+𝑛
    푥≈푚+푛
    😶‍🌫️
    🇬🇧
    ¯\_(ツ)_/¯
    الكل في المجمو عة (5)
    

    gives:

    á       = U+00E1    = 225         : 0xc3 0xa1
            = U+0020    = 32          : 0x20
    à       = U+00E0    = 224         : 0xc3 0xa0
    --------
    G       = U+0047    = 71          : 0x47
    B       = U+0042    = 66          : 0x42
    P       = U+0050    = 80          : 0x50
            = U+0020    = 32          : 0x20
    £       = U+00A3    = 163         : 0xc2 0xa3
    1       = U+0031    = 49          : 0x31
    0       = U+0030    = 48          : 0x30
    --------
    𝑥       = U+1D465   = 119909      : 0xf0 0x9d 0x91 0xa5
    ≈       = U+2248    = 8776        : 0xe2 0x89 0x88
    𝑚       = U+1D45A   = 119898      : 0xf0 0x9d 0x91 0x9a
    +       = U+002B    = 43          : 0x2b
    𝑛       = U+1D45B   = 119899      : 0xf0 0x9d 0x91 0x9b
    --------
    푥       = U+D465    = 54373       : 0xed 0x91 0xa5
    ≈       = U+2248    = 8776        : 0xe2 0x89 0x88
    푚       = U+D45A    = 54362       : 0xed 0x91 0x9a
    +       = U+002B    = 43          : 0x2b
    푛       = U+D45B    = 54363       : 0xed 0x91 0x9b
    --------
    😶       = U+1F636   = 128566      : 0xf0 0x9f 0x98 0xb6
    ‍       = U+200D    = 8205        : 0xe2 0x80 0x8d
    🌫       = U+1F32B   = 127787      : 0xf0 0x9f 0x8c 0xab
    ️       = U+FE0F    = 65039       : 0xef 0xb8 0x8f
    --------
    🇬       = U+1F1EC   = 127468      : 0xf0 0x9f 0x87 0xac
    🇧       = U+1F1E7   = 127463      : 0xf0 0x9f 0x87 0xa7
    --------
    ¯       = U+00AF    = 175         : 0xc2 0xaf
    \       = U+005C    = 92          : 0x5c
    _       = U+005F    = 95          : 0x5f
    (       = U+0028    = 40          : 0x28
    ツ       = U+30C4    = 12484       : 0xe3 0x83 0x84
    )       = U+0029    = 41          : 0x29
    _       = U+005F    = 95          : 0x5f
    /       = U+002F    = 47          : 0x2f
    ¯       = U+00AF    = 175         : 0xc2 0xaf
    --------
    ا       = U+0627    = 1575        : 0xd8 0xa7
    ل       = U+0644    = 1604        : 0xd9 0x84
    ك       = U+0643    = 1603        : 0xd9 0x83
    ل       = U+0644    = 1604        : 0xd9 0x84
            = U+0020    = 32          : 0x20
    ف       = U+0641    = 1601        : 0xd9 0x81
    ي       = U+064A    = 1610        : 0xd9 0x8a
            = U+0020    = 32          : 0x20
    ا       = U+0627    = 1575        : 0xd8 0xa7
    ل       = U+0644    = 1604        : 0xd9 0x84
    م       = U+0645    = 1605        : 0xd9 0x85
    ج       = U+062C    = 1580        : 0xd8 0xac
    م       = U+0645    = 1605        : 0xd9 0x85
    و       = U+0648    = 1608        : 0xd9 0x88
            = U+0020    = 32          : 0x20
    ع       = U+0639    = 1593        : 0xd8 0xb9
    ة       = U+0629    = 1577        : 0xd8 0xa9
            = U+0020    = 32          : 0x20
    (       = U+0028    = 40          : 0x28
    5       = U+0035    = 53          : 0x35
    )       = U+0029    = 41          : 0x29
    --------
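
The question also asks for the reverse direction: turning a list of decimal code points back into characters. The script above does not cover that, so here is a minimal sketch that applies the same Wikipedia encoding table in the other direction. The function name `codepoints_to_utf8` is hypothetical. The sketch assumes the code points arrive as a single argument separated by commas and/or spaces, and like the decoder above it does no validation (surrogates and values above U+10FFFF are not rejected). To avoid depending on how a particular awk handles `printf "%c"` for values above 127, awk only emits octal escapes and the shell's `printf` turns them into bytes.

```shell
# Hypothetical helper: decimal code points (comma/space separated) -> UTF-8 text.
codepoints_to_utf8() {
    # awk writes every output byte as a \ooo octal escape (plain ASCII text).
    _escapes=$(printf '%s\n' "$1" | awk '
        function oct(b) { return sprintf("\\%03o", b) }
        {
            n = split($0, cp, /[, ]+/)
            out = ""
            for (i = 1; i <= n; i++) {
                if (cp[i] == "") continue          # skip empty fields
                c = cp[i] + 0
                if (c < 128)                       # 1 byte:  0xxxxxxx
                    out = out oct(c)
                else if (c < 2048)                 # 2 bytes: 110xxxxx 10xxxxxx
                    out = out oct(192 + int(c/64)) oct(128 + c%64)
                else if (c < 65536)                # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                    out = out oct(224 + int(c/4096)) oct(128 + int(c/64)%64) oct(128 + c%64)
                else                               # 4 bytes: 11110xxx then three 10xxxxxx
                    out = out oct(240 + int(c/262144)) oct(128 + int(c/4096)%64) \
                              oct(128 + int(c/64)%64) oct(128 + c%64)
            }
            print out
        }')
    # The shell printf expands the octal escapes into raw bytes, then adds a newline.
    printf "$_escapes\n"
}
```

For example, `codepoints_to_utf8 '119909, 8776, 119898, 43, 119899'` prints 𝑥≈𝑚+𝑛 on a UTF-8 terminal. As far as I know, gawk in a UTF-8 locale can also emit the character directly with `printf "%c"` and a numeric argument, which would allow a shorter gawk-only version; the sketch does not rely on that behavior.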