Tags: shell, unicode, awk

How to split by Unicode characters in the shell


Using Java:

// Split each line on U+007C ("|") followed by U+001C.
File file = new File("C:/Users/Administrator/Desktop/es.txt");
List<String> lines = FileUtils.readLines(file, "utf-8");
for (String line : lines) {
    String[] arr = line.split("\\u007C\\u001C");
    System.out.println(arr.length);
    System.out.println(Arrays.toString(arr));
}

How can I do this in the shell (awk, tr, or sed)? I tried this, but it didn't work:

awk -F\u007c\u001c '{print $1}' es.txt

Thanks.


Solution

  • U+007C and U+001C are plain 7-bit ASCII characters, so splitting on them doesn't actually require any Unicode support. The only exception would be a file in an ASCII-incompatible encoding such as UTF-16, where the splitting tool would have to be specifically aware of and compatible with the encoding; your question indicates that your data is in UTF-8, so that does not seem to be the case here.
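
    As a minimal sketch of that point, assuming an awk whose -F value expands octal escapes (gawk and mawk do): U+001C is octal 034, and a field separator longer than one character is treated as a regular expression, so the pipe is bracketed to keep it literal. The function names and the sample data are made up for illustration.

```shell
# Sample line: fields "a", "b", "c" joined by "|" followed by U+001C.
# printf expands the octal escape \034 to the control character.
make_sample() {
    printf 'a|\034b|\034c\n'
}

# Print field number $1 of each input line, splitting on "|" + U+001C.
# [|] keeps the pipe literal, since a multi-character FS is a regex.
field() {
    awk -F'[|]\034' -v n="$1" '{print $n}'
}

make_sample | field 1   # prints: a
```

    No \u escape is involved here; awk only sees two ordinary bytes.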

    Assuming your question can be paraphrased as "if I know the numeric Unicode code point I want to split on, how do I pass that to a tool which is capable of splitting on it", my recommendation would be Perl.

    perl -CSD -aF'\N{U+1f4a9}' -nle 'print $F[0]' es.txt
    

    This uses U+1F4A9 as the separator. (Perl's arrays are zero-based, so $F[0] corresponds to Awk's $1. The -a option requests field splitting into the array @F; by default, Perl does not split the input into fields.) If the hex code of the code point you want to use as the field separator is in a shell variable, use double quotes instead of single quotes so that the shell expands the variable.

    PIPE='007C'
    FS='001C'
    perl -CSD -aF"\N{U+$PIPE}\N{U+$FS}" -nle 'print $F[0]' es.txt
    

    Alternatively, if the tool you want to use handles UTF-8 transparently, you can use the ANSI C quoting facility of Bash to specify the separator. Support for the \u and \U escapes was only introduced in Bash 4.2, so e.g. Debian Squeeze (oldoldstable at the time of writing) does not have it.

    awk -F$'\U0001f4a9' '{print $1}' es.txt  # or $'\u007c' for 4-digit code points
    

    However, because the quoting facility is a form of single quotes, you can't (easily) have the separator's code point value in a variable.