Here is my bash script for inserting Parquet files into ClickHouse in parallel. It keeps giving me the error in the title, though, and I don't know why. Any help is appreciated.
#!/bin/bash
time (for FILENAME in /mnt/sdc/traces/part-*.snappy.parquet; do
echo $FILENAME
xargs -P 6 -n 1 -0 clickhouse-client --receive_timeout=100000 --query=\"INSERT INTO ethereum.traces FORMAT Parquet\" < $FILENAME
done)
One way to implement this would look like:
#!/bin/bash
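cpu_count=6   # number of sh processes xargs runs in parallel (-P)
batch_size=4  # number of file names passed to each sh invocation (-n)
# Emit the file names as a NUL-delimited list for xargs -0 to read.
printf '%s\0' /mnt/sdc/traces/part-*.snappy.parquet |
  xargs -P"$cpu_count" -n"$batch_size" -0 sh -c '
    for filename in "$@"; do
      echo "$filename"
      clickhouse-client --receive_timeout=100000 --query="INSERT INTO ethereum.traces FORMAT Parquet" <"$filename"
    done
  ' _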
xargs requires its stdin to be a list of arguments to pass to the program it invokes. That wasn't the case at all in your original code, which was passing xargs the contents of the parquet files directly on its stdin, whereas here we're passing it a NUL-delimited list of names of parquet files.

The -n argument to xargs tells it how many files to pass to each copy of sh. Using a low number like 1 reduces the chance that you won't be parallelizing well when the number of files left is below the batch size, but increases the performance overhead of starting up new shells.
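
To see the batching for yourself, here is a minimal sketch that leaves ClickHouse out entirely. The count_batches function and the dummy files in a temporary directory are made up for illustration; each sh simply prints how many file names it received instead of running clickhouse-client.

```shell
#!/bin/bash
# Hypothetical demo: dummy file names stand in for the real parquet files,
# and a plain echo stands in for clickhouse-client.
count_batches() {
  local file_count=$1 batch_size=$2 demo_dir i
  demo_dir=$(mktemp -d)
  for i in $(seq 1 "$file_count"); do
    touch "$demo_dir/part-$i.snappy.parquet"
  done
  # xargs starts one sh per batch of at most $batch_size names;
  # each sh prints the number of names it was given.
  printf '%s\0' "$demo_dir"/part-*.snappy.parquet |
    xargs -0 -n "$batch_size" sh -c 'echo "$#"' _
  rm -r "$demo_dir"
}

count_batches 5 2
```

With five files and a batch size of 2, this prints 2, 2 and 1 on separate lines: the last shell gets only the leftover file. That leftover is the case the trade-off above is about, since a large batch size can leave most of the parallel slots idle near the end of the run.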