gnu-parallel

Issue with concurrent output in GNU Parallel: lines merged or truncated despite using --line-buffer


I have a function that processes secondary structure data using awk and appends the result to a final output file. Even though I am using --line-buffer with GNU Parallel, I still occasionally get lines like the following in my output file:

4          GLN        A           447       C 1          GLN        A             1       T

Or sometimes:

4          GLN        A           447

Multiple processes seem to be writing to the file simultaneously, causing lines to be merged or cut off. Here's the relevant part of my code:

calculate_secondary_structure() {
    frame_counter=$1
    # process only chain A, as it is polyQ
    ${stride_path}/stride ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb -ca -fsecondary_structure${frame_counter}.txt
    rm ${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb

    awk -v frame_counter="$frame_counter" '
    BEGIN { OFS="          " } # 10 spaces as a separator
    /^ASG/ {
        residue_name = substr($0, 6, 3)
        chain_name = substr($0, 10, 1)
        residue_number = substr($0, 17, 4)
        ss_code = substr($0, 25, 1)

        # Print the frame number followed by the extracted fields with 10 spaces between them
        printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
    }' "secondary_structure${frame_counter}.txt" >> ${final_output_file} # Append directly to the final file

    rm secondary_structure${frame_counter}.txt
}

export -f calculate_secondary_structure
seq 1 ${number_of_frames} | parallel --bar --line-buffer --block 1k --round-robin -j192 calculate_secondary_structure {}

GNU Parallel version:

GNU parallel 20240822

Solution

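  • The problem is that you are not letting GNU Parallel serialize the output: every job appends directly to $final_output_file behind GNU Parallel's back. --line-buffer only applies to output that passes through GNU Parallel, so it cannot protect writes the jobs make to the file themselves.

    Let the function print to stdout and redirect the output of parallel instead. So you probably want: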

    calculate_secondary_structure() {
        frame_counter=$1
        # process only chain A, as it is polyQ
        "${stride_path}/stride" "${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb" -ca "-fsecondary_structure${frame_counter}.txt"
        rm "${pdbs_dir}/pdb_for_cluster${frame_counter}.pdb"
    
        # Write to stdout (no >> here) so GNU Parallel can serialize the output.
        # printf pads every field itself, so setting OFS is not needed.
        awk -v frame_counter="$frame_counter" '
        /^ASG/ {
            residue_name = substr($0, 6, 3)
            chain_name = substr($0, 10, 1)
            residue_number = substr($0, 17, 4)
            ss_code = substr($0, 25, 1)
    
            # Print the frame number and the extracted fields, each left-justified in a 10-character column
            printf "%-10s %-10s %-10s %-10s %-10s\n", frame_counter, residue_name, chain_name, residue_number, ss_code
        }' "secondary_structure${frame_counter}.txt"
    
        rm "secondary_structure${frame_counter}.txt"
    }
    
    export -f calculate_secondary_structure
    # Only GNU Parallel appends to the final file; the jobs never touch it
    seq 1 "${number_of_frames}" |
      parallel --bar --line-buffer -j192 calculate_secondary_structure {} >> "${final_output_file}"
    

    (--block 1k --round-robin only make sense together with --pipe/--pipepart; without them they do nothing. Also consider using -j100%, or simply leave -j out since that is the default, if you are running on a 192-core server. That way you do not need to change 192 when you get your new 512-core server.)
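
To confirm that the output file no longer contains merged or truncated lines, a field-count check along the following lines may help. This is only a sketch: check_output is a hypothetical helper name, and it assumes every well-formed line has exactly five non-empty, whitespace-separated fields (frame, residue name, chain, residue number, secondary-structure code), as the printf in the function suggests. The sample lines are the ones quoted in the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GNU Parallel): report every line that does
# not have exactly 5 whitespace-separated fields and return non-zero if any
# such line exists. Assumes none of the five fields is ever blank.
check_output() {
    awk 'NF != 5 { printf "line %d has %d fields: %s\n", NR, NF, $0; bad++ }
         END { exit (bad > 0) }' "$1"
}

# Demonstration on the sample lines from the question:
# one intact line, one merged line, one truncated line.
sample=$(mktemp)
cat > "$sample" <<'EOF'
4          GLN        A           447       C
4          GLN        A           447       C 1          GLN        A             1       T
4          GLN        A           447
EOF

check_output "$sample" || echo "output file contains merged or truncated lines"
rm -f "$sample"
```

On the real data this would be called as check_output "${final_output_file}" after the parallel run has finished. A merged line shows up with more than five fields and a truncated one with fewer.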