I am trying to accomplish the following:
For an arbitrary Perl string (whether or not it is internally encoded in UTF-8, and whether or not it has the UTF-8 flag set), scan the string from left to right, and for every character, print the Unicode code point for that character in hex format. To make myself absolutely clear: I do not want to print UTF-8 byte sequences or something; I just would like to print the Unicode code point for every character in the string.
At first, I came up with the following solution:
#!/usr/bin/perl -w
use warnings;
use utf8;
use feature 'unicode_strings';
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDIN, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');
$Text = "\x{3B1}\x{3C9}";
print $Text."\n";
printf "%vX\n", $Text;
# Prints the following to the console (the console is UTF8):
# αω
# 3B1.3C9
Then I saw some examples, without any real explanation, which made me doubt that my solution is correct. I now have questions about both my own solution and the examples.
1) Perl's documentation about the v flag in (...)printf says:
"This flag tells Perl to interpret the supplied string as a vector of integers, one for each character in the string. [...]"
It does not say what it exactly means by "a vector of integers", though. When looking at the output of my example, it seems that those integers are the Unicode code points, but I would like to have this confirmed by somebody who knows for sure.
Hence the question:
1) Can we be sure that every integer which is pulled from the string that way is the respective character's Unicode code point (and not some other byte sequence)?
Secondly, regarding an example which I have found (slightly modified; I can't remember where I got it from, maybe from the Perl docs):
#!/usr/bin/perl -w
use warnings;
use utf8;
use feature 'unicode_strings';
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDIN, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)');
$Text = "\x{3B1}\x{3C9}";
print $Text."\n";
printf "%vX\n", $Text for unpack('C0A*', $Text);
# Prints the following to the console (the console is UTF8):
# αω
# 3B1.3C9
Being a C and assembly guy, I just don't get why somebody would write the printf statement as shown in the example. As I understand it, that line is equivalent to:
for $_ (unpack('C0A*', $Text)) {
    printf "%vX\n", $Text;
}
As far as I understand, unpack() takes $Text, unpacks it (whatever that means in detail) and returns a list which in this case has one element, namely the unpacked string. Then $_ runs through that one-element list (without being used anywhere), so the block (i.e. the printf()) is executed once. In summary, the only thing the above snippet does is execute printf "%vX\n", $Text; one time.
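To check this reading, here is a minimal sketch (assuming Perl 5.10 or later; @list and $iterations are names I made up for illustration). It counts the elements unpack() returns and how often the statement-modifier loop runs:

```perl
use strict;
use warnings;

my $Text = "\x{3B1}\x{3C9}";

# The list returned by unpack(): with the template 'C0A*' it should
# hold a single element, the whole string.
my @list = unpack('C0A*', $Text);

# "STATEMENT for LIST" runs the statement once per list element,
# with $_ aliased to that element (unused here, as in the example).
my $iterations = 0;
$iterations++ for unpack('C0A*', $Text);

printf "elements: %d, iterations: %d\n", scalar(@list), $iterations;
printf "same string as \$Text: %s\n", ($list[0] eq $Text ? 'yes' : 'no');
```

If my understanding is right, this reports one element, one iteration, and an unchanged string.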
Hence the question:
2) What could be the reason for wrapping this into a for loop as shown in the example?
Final questions:
3) If the answer to question 1) is "yes", why do most examples I have seen use unpack() at all?
4) In the three-line snippet above, the parentheses which surround the unpack() call are necessary (leaving them out leads to syntax errors). In contrast, in the example, the unpack() call does not need to be enclosed in parentheses (though it does no harm to add them). Could anybody explain the reason?
Edit / Update in reply to ikegami's answer below:
Of course, I know that strings are sequences of integers. But
a) There are many different encodings for those integers, and the bytes which are in a certain string's memory area depend on the encoding, i.e. if I have two strings which contain exactly the same character sequence, but I store them in memory using different encodings, the byte sequences at the strings' memory locations are different.
b) I strongly suppose that (besides Unicode) there are many other systems / standards which map characters to integers / code points. For example, the Unicode code point 0x3B1 is the Greek letter α, but in some other system, it may be the German letter Ö.
Under these circumstances, the question makes perfect sense IMHO, but I possibly should be more precise and reword it:
If I have a string $Text which only contains characters which are Unicode code points, and if I then execute printf "%vX\n", $Text; will it print the Unicode code point in hex for every character under all circumstances, notably (but not limited to):
use 'unicode_strings' is active
If the answer is yes, what sense do all the examples make which are using unpack(), notably the example above? By the way, I have now remembered where I got that one from: the original form is in Perl's pack() documentation, in the section about the C0 and U0 modes. Since they are using unpack(), there must be a good reason for doing so.
Edit / Update No. 2
I have done further research. The following proves that the UTF8 flag plays an important role:
use Encode;
use Devel::Peek;
$Text = "\x{3B1}\x{3C9}";
Dump $Text;
printf("\nSPRINTF: %vX\n", $Text);
print("UTF8 flag: ".((Encode::is_utf8($Text)) ? "TRUE" : "FALSE")."\n\n");
Encode::_utf8_off($Text);
Dump $Text;
printf "\nSPRINTF: %vX\n", $Text;
print("UTF8 flag: ".((Encode::is_utf8($Text)) ? "TRUE" : "FALSE")."\n\n");
# This prints the following lines:
#
# SV = PV(0x1750c20) at 0x1770530
# REFCNT = 1
# FLAGS = (POK,pPOK,UTF8)
# PV = 0x17696b0 "\316\261\317\211"\0 [UTF8 "\x{3b1}\x{3c9}"]
# CUR = 4
# LEN = 16
#
# SPRINTF: 3B1.3C9
# UTF8 flag: TRUE
#
# SV = PV(0x1750c20) at 0x1770530
# REFCNT = 1
# FLAGS = (POK,pPOK)
# PV = 0x17696b0 "\316\261\317\211"\0
# CUR = 4
# LEN = 16
#
# SPRINTF: CE.B1.CF.89
# UTF8 flag: FALSE
We can see that _utf8_off indeed removes the UTF8 flag, but leaves the string's bytes untouched. sprintf() with the v flag outputs different results, depending solely on the string's UTF8 flag, even though the string's bytes remain the same.
sprintf '%vX' has no knowledge of code points or UTF-8. It just returns a string representation of the characters of the string. In other words,
sprintf('%vX', $s)
is equivalent to
join('.', map { sprintf('%X', ord($_)) } split(//, $s))
That means it outputs s[0], s[1], s[2], ..., s[length(s)-1], in hex, separated by dots.
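The equivalence can be checked directly. This is only a sketch: hex_chars is a helper name made up for the comparison, and the three sample strings are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: the join/map/ord spelling of sprintf('%vX', $s).
sub hex_chars {
    my ($s) = @_;
    return join('.', map { sprintf('%X', ord($_)) } split(//, $s));
}

# Compare both spellings on a few sample strings.
for my $s ("\x{3B1}\x{3C9}", "\xC9ric", "abc") {
    my $via_v   = sprintf('%vX', $s);
    my $via_ord = hex_chars($s);
    printf "%-12s %s\n", $via_v, ($via_v eq $via_ord ? 'same' : 'DIFFERENT');
}
```

Each line should end in "same", since both forms walk the string character by character and print ord() of each one.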
It returns the characters (integers) of the string regardless of the state of the UTF8 flag. That means that how the string is stored (e.g. whether the UTF8 flag is set or not) has no effect on the output.
use Encode;
$Text1 = "\xC9ric";
# Force the one-byte-per-character internal storage (UTF8 flag off).
utf8::downgrade($Text1);
printf("Text1 is a string of %1\$d characters (a vector of %1\$d integers)\n",
    length($Text1));
print("UTF8 flag: ".((Encode::is_utf8($Text1)) ? "TRUE" : "FALSE")."\n");
printf("SPRINTF: %vX\n\n", $Text1);
$Text2 = $Text1;
utf8::upgrade($Text2);
print($Text1 eq $Text2
? "Text2 is identical to Text1\n\n"
: "Text2 differs from Text1\n\n");
printf("Text2 is a string of %1\$d characters (a vector of %1\$d integers)\n",
length($Text2));
print("UTF8 flag: ".((Encode::is_utf8($Text2)) ? "TRUE" : "FALSE")."\n");
printf "SPRINTF: %vX\n\n", $Text2;
Output:
Text1 is a string of 4 characters (a vector of 4 integers)
UTF8 flag: FALSE
SPRINTF: C9.72.69.63
Text2 is identical to Text1
Text2 is a string of 4 characters (a vector of 4 integers)
UTF8 flag: TRUE
SPRINTF: C9.72.69.63
Let's change the code in your question to show relevant information:
use Encode;
$Text1 = "\x{3B1}\x{3C9}";
printf("Text1 is a string of %1\$d characters (a vector of %1\$d integers)\n",
length($Text1));
printf("SPRINTF: %vX\n\n", $Text1);
$Text2 = $Text1;
Encode::_utf8_off($Text2);
print($Text1 eq $Text2
? "Text2 is identical to Text1\n\n"
: "Text2 differs from Text1\n\n");
printf("Text2 is a string of %1\$d characters (a vector of %1\$d integers)\n",
length($Text2));
printf "SPRINTF: %vX\n\n", $Text2;
Output:
Text1 is a string of 2 characters (a vector of 2 integers)
SPRINTF: 3B1.3C9
Text2 differs from Text1
Text2 is a string of 4 characters (a vector of 4 integers)
SPRINTF: CE.B1.CF.89
It shows that sprintf '%vX' will have different output for different strings, which is no surprise, since sprintf '%vX' simply outputs the characters of the string. You could just as easily have used uc instead of _utf8_off.
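If what you actually hold is a string of UTF-8 encoded bytes (say, read from a handle without a decoding layer), then it is a different string from the decoded text, and it has to be decoded before its characters are code points. A minimal sketch, assuming the bytes are valid UTF-8 and using the Encode module's encode and decode functions rather than poking at the flag:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{3B1}\x{3C9}";          # decoded text: 2 characters
my $bytes = encode('UTF-8', $chars);   # encoded form: 4 characters, each a byte value

printf "%d chars: %vX\n", length($chars), $chars;
printf "%d chars: %vX\n", length($bytes), $bytes;

# Decoding turns the byte string back into the 2-character text.
my $decoded = decode('UTF-8', $bytes);
printf "%d chars: %vX\n", length($decoded), $decoded;
```

The first and third lines should show 3B1.3C9, the second CE.B1.CF.89, in each case simply because of which characters the string contains.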
If sprintf '%vX' altered its output based on the UTF8 flag, it would be considered to suffer from The Unicode Bug. Most instances of that bug have been fixed (though sprintf never suffered from it).