
How to use non-displaying characters like newline (\n) and tab (\t) with jq's "join" function


I couldn't find this anywhere on the internet, so figured I'd add it as documentation.

I wanted to join a JSON array on the non-displaying character with ASCII code 30 (0x1E, "Record Separator") so I could safely iterate over it in bash, but I couldn't quite figure out how to do it. I tried echo '["one","two","three"]' | jq 'join("\30")' and a few variations of that, but none of them worked.

Turns out the solution is pretty simple.... (See answer)


Solution

  • Use jq -j so that jq does not print a newline after each output, leaving your own delimiter as the only one in the stream. This works in your simple case:

    #!/usr/bin/env bash
    data='["one","two","three"]'
    sep=$'\x1e' # works only for non-NUL characters, see NUL version below
    while IFS= read -r -d "$sep" rec || [[ $rec ]]; do
      printf 'Record: %q\n' "$rec"
    done < <(jq -j --arg sep "$sep" 'join($sep)' <<<"$data")
    

    ...but it also works in a more interesting scenario where naive answers fail:

    #!/usr/bin/env bash
    data='["two\nlines","*"]'
    while IFS= read -r -d $'\x1e' rec || [[ $rec ]]; do
      printf 'Record: %q\n' "$rec"
    done < <(jq -j 'join("\u001e")' <<<"$data")
    

    This prints the following (when run on Cygwin, hence the CRLF):

    Record: $'two\r\nlines'
    Record: \*
    

    That said, if using this in anger, I would suggest using NUL delimiters, and filtering them out from the input values:

    #!/usr/bin/env bash
    data='["two\nlines","three\ttab-separated\twords","*","nul\u0000here"]'
    while IFS= read -r -d '' rec || [[ $rec ]]; do
      printf 'Record: %q\n' "$rec"
    done < <(jq -j '[.[] | gsub("\u0000"; "@NUL@")] | join("\u0000")' <<<"$data")
    

    NUL is a good choice because it is a character that can't be stored in C strings (like the ones bash uses) at all, so excising NULs from the values loses nothing the shell could have represented faithfully anyway. If they did make it through to the shell, it would (depending on version) either discard them or truncate the string at the point where the first one appears.
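The shell side of the NUL-delimited approach can be tried without jq. The following is a minimal sketch, not part of the original answer: `read_records` is a hypothetical helper name, and `printf` stands in for the output of `jq -j '... | join("\u0000")'` (an assumption made so the snippet is self-contained). It reads NUL-delimited records from stdin into an array, keeping the final record even though no delimiter follows it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read NUL-delimited records from stdin into the
# global array "records". The "|| [[ $rec ]]" clause keeps a non-empty
# final record that is not followed by a delimiter, which is the shape
# of stream that join() produces.
read_records() {
  records=()
  local rec
  while IFS= read -r -d '' rec || [[ $rec ]]; do
    records+=("$rec")
  done
}

# printf stands in for jq here (assumption for illustration); note that
# there is no NUL after the last record.
read_records < <(printf 'two\nlines\0three\ttabs\0*')

# Print each record in shell-quoted form so that embedded newlines and
# tabs are visible.
printf 'Record: %q\n' "${records[@]}"
```

With real data, the process substitution would presumably be replaced by the jq pipeline shown above, for example `read_records < <(jq -j '[.[] | gsub("\u0000"; "@NUL@")] | join("\u0000")' <<<"$data")`. Collecting into an array rather than processing inside the loop is just one design choice; it keeps the parsing in one place and lets the records be reused afterwards.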