I'm trying to write a for loop that unzips fastq.gz files that contain R1 in the file name, determines # of lines in each file, and divides # of lines by 4. Ideally I could also write this into a txt file with two columns (file name and # of lines/4).
This loop unzips R1 fastq files and determines # of lines in each file but does not divide by 4 (or save output into a txt file).
for i in $(ls ./*R1*);
do
gzcat ./$i | wc -l
done;
Other posts on here suggest using bc to divide in bash, but I haven't been able to integrate this into a loop.
Never use for i in $(ls anything); see Bash Pitfalls #1. The output of ls is word-split, so the loop fails for filenames containing spaces or other special characters. In most circumstances you simply iterate over the files with a glob, for i in path/*; do ..., which copes with any filename but leaves the literal pattern in i when nothing matches. If you need to search recursively, use find with NUL-delimited output: while IFS= read -r -d '' name; do ... done < <(find path -type f -name "*.gz" -print0). A plain read -r on find's default output breaks on names that contain the '\n' character. (Note that process substitution, < <(...), and read -d are bash-only; in a POSIX shell, pipe find into the loop and use a plain read -r.)
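As a minimal sketch of the find-based loop, assuming bash and a find that supports -print0 (GNU and BSD find both do): the function name list_gz and the throwaway demo directory are hypothetical, chosen only for illustration. The names are read NUL-delimited so that spaces and even embedded newlines survive.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every regular *.gz file under a directory.
# find emits NUL-terminated names; read -d '' consumes them one at a time.
list_gz() {
    local dir=$1 name
    while IFS= read -r -d '' name; do
        printf '%s\n' "$name"
    done < <(find "$dir" -type f -name '*.gz' -print0)
}

# Demo in a temporary directory (assumed layout, for illustration only)
demo=$(mktemp -d)
: > "$demo/sample one_R1.fastq.gz"   # name with a space
: > "$demo/notes.txt"                # not a .gz file, so it is not listed
list_gz "$demo"
rm -rf "$demo"
```

Only the .gz file is listed, with the space in its name intact.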
Next, to write the name and number of lines / 4 to a new file, wrap your entire loop in a command group, { ...; }, and simply redirect all of its output at once to the new file.
You should also add validations: check whether a match is actually a directory whose name ends in gz and skip it, and skip any empty file (zero file size) as well.
If you put it all together, you could do something like:
{
for i in ./*R1*.gz; do ## files with R1 in the name, as in the question
[ -d "$i" ] && continue ## skip any directories
[ -s "$i" ] || continue ## skip empty files (-s is true when size > 0)
nlines=$(gzcat "$i" | wc -l) ## get number of lines
printf "%s\t%s\n" "$i" $((nlines / 4)) ## output name, nlines / 4
done
} > newfile ## redirect all output to newfile
(output is written with a tab character "\t" separating the name and number / 4; adjust as desired)
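Since the question mentions bc: the shell's own arithmetic expansion, $(( ... )), does integer division, which is enough when every record is 4 lines. Here is a rough sketch of the per-file step as a function, assuming bash. The name count_reads is hypothetical, and gzip -dc is used in place of gzcat only because it is available on more systems. It also warns on stderr when the line count is not a multiple of 4.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<file><TAB><lines / 4>" for one gzipped file.
count_reads() {
    local file=$1 nlines
    nlines=$(gzip -dc < "$file" | wc -l)   # decompress to stdout, count lines
    nlines=$((nlines + 0))                 # normalise (some wc versions pad with spaces)
    if [ $((nlines % 4)) -ne 0 ]; then
        printf 'warning: %s has %d lines, not a multiple of 4\n' "$file" "$nlines" >&2
    fi
    printf '%s\t%d\n' "$file" $((nlines / 4))
}

# Demo: a tiny two-record file in a temporary directory (illustration only)
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n' | gzip -c > "$demo/sample_R1.fastq.gz"
count_reads "$demo/sample_R1.fastq.gz"
rm -rf "$demo"
```

The demo file holds 8 lines, so the line printed for it ends with a tab and 2. Inside the loop above you could call count_reads "$i" instead of the nlines and printf lines.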
Look things over and let me know if you have any questions.