I have a function that processes secondary structure data using awk and appends the result to a final output file. Even though I am using --line-buffer with GNU Parallel, I still occasionally get lines like the following in my output file:
4 GLN A 447 C 1 GLN A 1 T
Or sometimes:
4 GLN A 447
Multiple processes seem to be writing to the file simultaneously, causing lines to be merged or cut off. Here's the relevant part of my code:
calculate_secondary_structure() {
frame_counter=$1
# process only chain A, as it is polyQ
${stride_path}/stride ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb -ca -fsecondary_structure${frame_counter}.txt
rm ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb
awk -v frame_counter="$frame_counter" '
BEGIN { OFS=" " } # 10 spaces as a separator
/^ASG/ {
residue_name = substr($0, 6, 3)
chain_name = substr($0, 10, 1)
residue_number = substr($0, 17, 4)
ss_code = substr($0, 25, 1)
# Print the frame number followed by the extracted fields with 10 spaces between them
printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
}' "secondary_structure${frame_counter}.txt" >> ${final_output_file} # Append directly to the final file
rm secondary_structure${frame_counter}.txt
}
export -f calculate_secondary_structure
seq 1 ${number_of_frames} | parallel --bar --line-buffer --block 1k --round-robin -j192 calculate_secondary_structure {}
Parallel version: GNU parallel 20240822
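The problem is that you do not let GNU Parallel serialize the output. Each job appends directly to $final_output_file behind GNU Parallel's back, so --line-buffer never sees those writes and cannot keep the lines apart.
So you probably want the function to write to stdout, and to redirect the output of parallel instead: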
calculate_secondary_structure() {
frame_counter=$1
# process only chain A, as it is polyQ
"${stride_path}/stride" "${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb" -ca -f"secondary_structure${frame_counter}.txt"
rm "${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb"
awk -v frame_counter="$frame_counter" '
BEGIN { OFS=" " } # 10 spaces as a separator
/^ASG/ {
residue_name = substr($0, 6, 3)
chain_name = substr($0, 10, 1)
residue_number = substr($0, 17, 4)
ss_code = substr($0, 25, 1)
# Print the frame number followed by the extracted fields with 10 spaces between them
printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
}' "secondary_structure${frame_counter}.txt"
rm secondary_structure${frame_counter}.txt
}
export -f calculate_secondary_structure
seq 1 ${number_of_frames} |
parallel --bar --line-buffer -j192 calculate_secondary_structure {} >> "${final_output_file}"
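After rerunning, one way to check that the output is no longer being corrupted is to look for lines that do not have exactly five fields. The sketch below assumes every well-formed row has five non-empty fields (frame, residue name, chain, residue number, secondary structure code). check_rows is a hypothetical helper name, not part of GNU Parallel or of the original script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list lines whose field count is not exactly five.
# Returns 0 when every line is well formed, 1 otherwise.
check_rows() {
    awk 'NF != 5 { printf "line %d has %d fields: %s\n", NR, NF, $0; bad++ }
         END { exit (bad > 0) }' "$1"
}

# Usage sketch: a small sample with the two kinds of damage shown in the question.
sample=$(mktemp)
printf '%s\n' '4 GLN A 446 C' '4 GLN A 447 C 1 GLN A 1 T' '4 GLN A 447' > "$sample"
check_rows "$sample" || echo "malformed lines found"
rm -f "$sample"
```

On the real data this would be called as check_rows "${final_output_file}". A merged line shows up with more than five fields and a truncated one with fewer.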
(--block 1k and --round-robin only make sense if you use --pipe/--pipepart; they do nothing otherwise. Also consider using -j100% (or simply leave it out, as that is the default) if you are running on a 192-core server. That way you do not need to change 192 when you get your new 512-core server.)