Tags: bash, fastq

Writing a script for large text file manipulation (iterative substitution of duplicated lines): weird bugs and very slow performance.


I am trying to write a script that takes a directory of text files (384 of them) and modifies duplicate lines of a specific format so that they are no longer duplicates.

In particular, I have files in which some lines begin with the '@' character and contain the substring 0:0. A subset of these lines are duplicated one or more times. For those that are duplicated, I'd like to replace 0:0 with i:0 where i starts at 1 and is incremented.
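
To make the replacement rule concrete, here is a minimal bash sketch of the substitution on its own (the header line is copied from the sample input below; the counter i is just an illustration of the value that would come from whichever loop walks the duplicates):

line='@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT'
i=1                       # occurrence counter for this duplicated read name
echo "${line/0:0/$i:0}"   # prints @D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT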

So far I've written a bash script that finds the duplicated lines beginning with '@', writes them to a file, then reads them back and uses sed in a while loop to search for and replace the first occurrence of each such line. Here it is:

#!/bin/bash

fdir=$1"*"

# for each fastq file
for f in $fdir
do
    (
    # find duplicated read names and write them to the file $f.txt
    sort $f | uniq -d | grep ^@ > "$f".txt

    # loop over each duplicated read name
    while read in; do
        rname=$in
        i=1

        # while this read name still exists in the file, increment and replace
        while grep -q "$rname" $f; do
            replace=${rname/0:0/$i:0}
            sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
            let "i+=1"
        done

    done < "$f".txt

    rm "$f".txt
    rm "$f".bu

    echo "done" >> progress.txt
    ) &

    # allow at most 40 files to be processed in parallel
    background=( $(jobs -p) )
    if (( ${#background[@]} == 40 )); then
        wait -n
    fi
done

The problem is that it's impractically slow. I ran it on a 48-core machine for over three days and it barely got through 30 files. It also seemed to have deleted about 10 of the files, and I'm not sure why.

My question is: where are the bugs coming from, and how can I do this more efficiently? I'm open to using other programming languages or changing my approach.

EDIT

Strangely, the loop works fine on a single file. Basically, I ran:

sort $f | uniq -d | grep ^@  > "$f".txt


while read in; do
    rname=$in
    i=1

    while grep -q "$rname" $f; do
        replace=${rname/0:0/$i:0}
        sed -i.bu "0,/$rname/s/$rname/$replace/" "$f"
        let "i+=1"

    done

done < "$f".txt

To give you an idea of what the files look like, below are a few lines from one of them. The thing is that even though the loop works for that one file, it's slow: multiple hours for a single file of 7.5 M. I'm wondering if there's a more practical approach.

With regard to the file deletions and the other bugs, I have no idea what was happening. Maybe it was running into memory collisions or something when the files were processed in parallel?

Sample input:

@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:0:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG

Sample output:

@D00269:138:HJG2TADXX:2:1101:1:0 1:N:0:CCTAGAAT+ATTCCTCT
GATAAGGACGGCTGGTCCCTGTGGTACTCAGAGTATCGCTTCCCTGAAGA
+
CCCFFFFFHHFHHIIJJJJIIIJJIJIJIJJIIBFHIHIIJJJJJJIJIG
@D00269:138:HJG2TADXX:2:1101:2:0 1:N:0:CCTAGAAT+ATTCCTCT
CAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCG

Solution

#!/bin/bash

i=4

# collect the duplicated read-name headers
sort $1 | uniq -d | grep ^@ > dups.txt

# walk the file line by line; i%4 == 0 exactly on the header lines
while read in; do

    if [ $((i % 4)) -eq 0 ] && grep -q "$in" dups.txt; then
        # duplicated header: make it unique using the running line counter
        x="$in"
        x=${x/"0:0 "/$i":0 "}
        echo "$x" >> $1"fixed.txt"
    else
        echo "$in" >> $1"fixed.txt"
    fi

    let "i+=1"
done < $1
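
For reference, a minimal usage sketch for running the script above over a whole directory of FASTQ files (the script name fix_dups.sh and the directory argument are assumptions, not part of the original answer). Because the script writes a shared dups.txt in the working directory, the sketch keeps the runs sequential:

#!/bin/bash
# process every file in the directory given as $1, one at a time
for f in "$1"/*; do
    bash fix_dups.sh "$f"    # hypothetical name for the script above; writes "${f}fixed.txt"
done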