Using Java:
File file = new File("C:/Users/Administrator/Desktop/es.txt");
List<String> lines = FileUtils.readLines(file, "utf-8");
for (String line : lines) {
    // Split on U+007C (|) followed by U+001C
    String[] arr = line.split("\\u007C\\u001C");
    System.out.println(arr.length);
    System.out.println(Arrays.toString(arr));
}
How can I do this in the shell (awk, tr, or sed)? I tried this, but it didn't work:
awk -F\u007c\u001c '{print $1}' es.txt
Thanks.
U+007C and U+001C are plain old 7-bit ASCII characters, so splitting on them doesn't actually require any Unicode support. The only caveat is a file in an ASCII-incompatible Unicode encoding: UTF-16, for example, would require the splitting tool to be specifically aware of and compatible with the encoding. Your question indicates that your data is in UTF-8, so that does not seem to be the case here.
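For the specific separator in the question, a minimal sketch with plain awk might look like the following. It assumes an awk that expands octal escapes in the -F argument (POSIX awk does), and the sample line is made up for illustration. The pipe is put in a bracket expression because a multi-character field separator is treated as a regular expression, where a bare | would mean alternation.

```shell
# Made-up sample line: three fields separated by "|" (U+007C) followed by
# U+001C (octal 034).
sample=$(printf 'a|\034b|\034c')

# awk expands \034 itself; [|] keeps the pipe literal inside the regex.
count=$(printf '%s\n' "$sample" | awk -F '[|]\034' '{print NF}')
first=$(printf '%s\n' "$sample" | awk -F '[|]\034' '{print $1}')

echo "$count $first"   # prints "3 a"
```

Against the real file, the same separator would be used as awk -F '[|]\034' '{print $1}' es.txt.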
Assuming your question can be paraphrased as "if I know the numeric Unicode code point I want to split on, how do I pass that to a tool which is capable of splitting on it", my recommendation would be Perl.
perl -CSD -aF'\N{U+1f4a9}' -nle 'print $F[0]' es.txt
This uses U+1F4A9 as the separator. (Perl's arrays are zero-based, so $F[0] corresponds to Awk's $1. The -a option requests field splitting to the array @F; normally, Perl does not explicitly split the input into fields. The -CSD option makes Perl treat the standard streams and opened files as UTF-8.) If the hex code for the code point you want to use as the field separator is in a shell variable, use double quotes instead of single quotes around the -F argument, so that the shell expands the variable.
PIPE='007C'
FS='001C'
perl -CSD -aF"\N{U+$PIPE}\N{U+$FS}" -nle 'print $F[0]' es.txt
Alternatively, if the tool you want to use handles UTF-8 transparently, you can use the ANSI C quoting facility of Bash to specify the separator. Support for the \u and \U escapes seems only to have been introduced in Bash 4.2, so e.g. Debian Squeeze (currently oldoldstable) does not have it.
awk -F$'\U0001f4a9' '{print $1}' es.txt # or $'\u007c' for 4-digit code points
However, because the quoting facility is a form of single quotes, you can't (easily) have the separator's code point value in a variable.
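One possible workaround, sketched here on the assumption of Bash 4.2 or later, is to let the printf builtin decode the escape, since it accepts \u in its format string and the format can be double-quoted. The variable names and the sample input are illustrative only. For code points outside ASCII, the result also depends on running in a UTF-8 locale.

```shell
# Hex code point kept in a variable (illustrative name); U+001C as in the question.
cp='001C'

# The double quotes let the shell expand $cp; printf then decodes the \u escape.
sep=$(printf "\\u$cp")

# A single-character FS (other than a space) is taken literally by awk.
first=$(printf 'a\034b\n' | awk -F "$sep" '{print $1}')
echo "$first"   # prints "a"
```

With the real file, this would become awk -F "$sep" '{print $1}' es.txt.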