Suppose that I have a file (in UTF-8) that contains a line like this:
𝑥≈𝑚+𝑛
There are five characters in this line.
Does gawk allow converting this line to a sequence of decimal integers corresponding to its five Unicode code points, that is, 119909, 8776, 119898, 43, 119899? If yes, how? And vice versa, given a string like 119909, 8776, 119898, 43, 119899, how can it be interpreted as a sequence of decimal Unicode code points and converted to the sequence of corresponding characters (𝑥≈𝑚+𝑛)?
EDIT
Reformulating the first problem: is gawk able to output the binary representation of a given string? For example, consider the string á à. Its UTF-8 encoding is 0xc3 0xa1 0x20 0xc3 0xa0. Is it possible to convert this three-character string to a sequence of five numbers (in the range 0-255) that correspond to its five bytes?
Using the encoding table from https://en.wikipedia.org/wiki/UTF-8 and not aiming for any kind of efficiency, nor avoidance of redundant code, nor doing any checks that the input is valid UTF-8:
LC_ALL=C gawk '
# Map every single-byte character to its numeric value (0-255).
BEGIN { for(i=0;i<256;++i) ord[sprintf("%c",i)]=i }
{
    # Under LC_ALL=C each array element holds exactly one byte.
    n=split($0,a,//)
    for(i=1;i<=n;){
        # The lead byte A determines how many continuation bytes follow.
        A = ord[ b1=a[i++] ]
        B = (A>191) ? ord[ b2=a[i++] ] : 0
        C = (A>223) ? ord[ b3=a[i++] ] : 0
        D = (A>239) ? ord[ b4=a[i++] ] : 0
        # Names such as B54 mean "bits 5-4 of byte B".
        if (D) {
            # 4-byte sequence: 11110aaa 10bbbbbb 10cccccc 10dddddd
            A10 = A%4
            A2 = (int(A/4))%2
            B3210 = B%16
            B54 = (int(B/16))%4
            C10 = C%4
            C5432 = (int(C/4))%16
            D3210 = D%16
            D54 = (int(D/16))%4
            cpint = (1048576*A2) + (65536*((4*A10)+B54)) + (4096*B3210) + (256*C5432) + (16*((4*C10)+D54)) + (D3210)
            bytes = b1 b2 b3 b4
            bytehex = sprintf("0x%02x 0x%02x 0x%02x 0x%02x", A,B,C,D)
        } else if (C) {
            # 3-byte sequence: 1110aaaa 10bbbbbb 10cccccc
            A3210 = A%16
            B10 = B%4
            B5432 = (int(B/4))%16
            C3210 = C%16
            C54 = (int(C/16))%4
            cpint = (4096*A3210) + (256*B5432) + (16*((4*B10)+C54)) + (C3210)
            bytes = b1 b2 b3
            bytehex = sprintf("0x%02x 0x%02x 0x%02x", A,B,C)
        } else if (B) {
            # 2-byte sequence: 110aaaaa 10bbbbbb
            A10 = A%4
            A432 = (int(A/4))%8
            B3210 = B%16
            B54 = (int(B/16))%4
            cpint = (256*A432) + (16*((4*A10)+B54)) + (B3210)
            bytes = b1 b2
            bytehex = sprintf("0x%02x 0x%02x", A,B)
        } else {
            # Single byte: the code point is the byte value itself.
            cpint = A
            bytes = b1
            bytehex = sprintf("0x%02x", A)
        }
        printf "%s\t= U+%04X\t= %-12d: %s\n", bytes, cpint, cpint, bytehex
    }
    print "--------"
}
' file.txt
Running on input:
á à
GBP £10
𝑥≈𝑚+𝑛
푥≈푚+푛
😶‍🌫️
🇬🇧
¯\_(ツ)_/¯
الكل في المجمو عة (5)
gives:
á = U+00E1 = 225         : 0xc3 0xa1
  = U+0020 = 32          : 0x20
à = U+00E0 = 224         : 0xc3 0xa0
--------
G = U+0047 = 71          : 0x47
B = U+0042 = 66          : 0x42
P = U+0050 = 80          : 0x50
  = U+0020 = 32          : 0x20
£ = U+00A3 = 163         : 0xc2 0xa3
1 = U+0031 = 49          : 0x31
0 = U+0030 = 48          : 0x30
--------
𝑥 = U+1D465 = 119909      : 0xf0 0x9d 0x91 0xa5
≈ = U+2248 = 8776        : 0xe2 0x89 0x88
𝑚 = U+1D45A = 119898      : 0xf0 0x9d 0x91 0x9a
+ = U+002B = 43          : 0x2b
𝑛 = U+1D45B = 119899      : 0xf0 0x9d 0x91 0x9b
--------
푥 = U+D465 = 54373       : 0xed 0x91 0xa5
≈ = U+2248 = 8776        : 0xe2 0x89 0x88
푚 = U+D45A = 54362       : 0xed 0x91 0x9a
+ = U+002B = 43          : 0x2b
푛 = U+D45B = 54363       : 0xed 0x91 0x9b
--------
😶 = U+1F636 = 128566      : 0xf0 0x9f 0x98 0xb6
‍ = U+200D = 8205        : 0xe2 0x80 0x8d
🌫 = U+1F32B = 127787      : 0xf0 0x9f 0x8c 0xab
️ = U+FE0F = 65039       : 0xef 0xb8 0x8f
--------
🇬 = U+1F1EC = 127468      : 0xf0 0x9f 0x87 0xac
🇧 = U+1F1E7 = 127463      : 0xf0 0x9f 0x87 0xa7
--------
¯ = U+00AF = 175         : 0xc2 0xaf
\ = U+005C = 92          : 0x5c
_ = U+005F = 95          : 0x5f
( = U+0028 = 40          : 0x28
ツ = U+30C4 = 12484       : 0xe3 0x83 0x84
) = U+0029 = 41          : 0x29
_ = U+005F = 95          : 0x5f
/ = U+002F = 47          : 0x2f
¯ = U+00AF = 175         : 0xc2 0xaf
--------
ا = U+0627 = 1575        : 0xd8 0xa7
ل = U+0644 = 1604        : 0xd9 0x84
ك = U+0643 = 1603        : 0xd9 0x83
ل = U+0644 = 1604        : 0xd9 0x84
  = U+0020 = 32          : 0x20
ف = U+0641 = 1601        : 0xd9 0x81
ي = U+064A = 1610        : 0xd9 0x8a
  = U+0020 = 32          : 0x20
ا = U+0627 = 1575        : 0xd8 0xa7
ل = U+0644 = 1604        : 0xd9 0x84
م = U+0645 = 1605        : 0xd9 0x85
ج = U+062C = 1580        : 0xd8 0xac
م = U+0645 = 1605        : 0xd9 0x85
و = U+0648 = 1608        : 0xd9 0x88
  = U+0020 = 32          : 0x20
ع = U+0639 = 1593        : 0xd8 0xb9
ة = U+0629 = 1577        : 0xd8 0xa9
  = U+0020 = 32          : 0x20
( = U+0028 = 40          : 0x28
5 = U+0035 = 53          : 0x35
) = U+0029 = 41          : 0x29
--------