Calling comm from system() in R with process substitution


For efficiency reasons, I'd like to call comm in R via system(). I've grown accustomed to using syntax like:

comm -13 <(hadoop fs -cat /path/to/file | gunzip | awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{if($7 ~ /^".*"$/ && $9 ~ /^".*"$/) {print toupper($7),toupper($9)} else if($7 ~ /^[^"]/ && $9 ~ /^["]/) {print "\""toupper($7)"\"",toupper($9)} else if($7 ~ /^[^"]/ && $9 ~ /^[^"]/) {print "\""toupper($7)"\"","\""toupper($9)"\""}}' | sort) <(awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{if($1 ~ /^".*"$/ && $2 ~ /^".*"$/) {print toupper($1),toupper($2)} else if($1 ~ /^[^"]/ && $2 ~ /^["]/) {print "\""toupper($1)"\"",toupper($2)} else if($1 ~ /^[^"]/ && $2 ~ /^[^"]/) {print "\""toupper($1)"\"","\""toupper($2)"\""}}' /path/to/file | sort)

But when using this syntax from system, as in

system("comm -13 <(filea) <(fileb)")

I get the familiar error:

sh: -c: line 0: syntax error near unexpected token `(' 

The error shows that system() runs the command through sh rather than bash, and that sh does not support process substitution. After reading other articles, I tried:

system("bash -c 'comm -13 <(hadoop fs -cat /path/to/file | gunzip | awk -vFPAT='([^,]*)|(\"[^\"]+\")' -vOFS=, '{if($7 ~ /^\".*\"$/ && $9 ~ /^\".*\"$/) {print toupper($7),toupper($9)} else if($7 ~ /^[^\"]/ && $9 ~ /^[\"]/) {print \"\\\"\"toupper($7)\"\\\"\",toupper($9)} else if($7 ~ /^[^\"]/ && $9 ~ /^[^\"]/) {print \"\\\"\"toupper($7)\"\\\"\",\"\\\"\"toupper($9)\"\\\"\"}}' | sort) <(awk -vFPAT='([^,]*)|(\"[^\"]+\")' -vOFS=, '{if($1 ~ /^\".*\"$/ && $2 ~ /^\".*\"$/) {print toupper($1),toupper($2)} else if($1 ~ /^[^\"]/ && $2 ~ /^[\"]/) {print \"\\\"\"toupper($1)\"\\\"\",toupper($2)} else if($1 ~ /^[^\"]/ && $2 ~ /^[^\"]/) {print \"\\\"\"toupper($1)\"\\\"\",\"\\\"\"toupper($2)\"\\\"\"}}' /path/to/file | sort)")

That is, escaping double quotes and backslashes as necessary. However, this returns the same error:

sh: -c: line 0: syntax error near unexpected token `('

I suspect the problem is the single quotes nested inside bash -c '...', which itself sits inside the double-quoted string passed to system(). How should I handle the quoting and escaping at each level?


Solution

  • To solve this, I needed two layers of escaping. First, wrap the command in double quotes:

    bash -c "[within]"

    and escape everything in [within] using bash's double-quote rules (https://www.gnu.org/software/bash/manual/html_node/Double-Quotes.html). Then pass the resulting command, [within2], to system():

    system("[within2]")

    and escape everything in [within2] using R's string escape rules.

    The end result is that backslashes and double quotes are escaped twice (once for bash, once for R), while $ is escaped once (for bash only).
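
As a minimal sketch of the two layers, here is a toy version of the command. The printf inputs are hypothetical stand-ins for the hadoop and awk pipelines in the question, and the R call appears only as a comment showing what the second layer would look like under R's usual string escapes.

```shell
# Toy inputs via printf stand in for the real pipelines (assumption).
# Typed directly at a bash prompt, the command would be:
#   comm -13 <(printf 'a\nb\n') <(printf 'b\nc\n' | awk '{print $1}')

# Layer 1, bash double quotes: $1 becomes \$1 so the outer shell
# passes it through untouched and awk receives a literal $1.
result=$(bash -c "comm -13 <(printf 'a\nb\n') <(printf 'b\nc\n' | awk '{print \$1}')")
echo "$result"   # lines that appear only in the second input

# Layer 2, R string literal: every " becomes \" and every \ becomes \\,
# so the same call written in R source would read:
#   system("bash -c \"comm -13 <(printf 'a\\nb\\n') <(printf 'b\\nc\\n' | awk '{print \\$1}')\"")
```

The single quotes need no escaping at either layer here, because they sit inside double quotes for both the shell and R.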