
How to add headers to columns with second column as directory name


I have a count file containing IDs and counts in each of several directories (one directory per accession, SRRXXXXX). I want to add a header line to each file, with "gene_id" as the first column name and the directory name (e.g. SRRabcd) as the second, using a bash loop.

Directory structure like this:

SRRabcd
  count.txt
SRRefgh
  count.txt

MY FILE(s)

gene1 194
gene2 40

WHAT I AM DOING

#!/bin/bash
for dir in /home/path/to/dir/SRR*/
do
sed -i '1s/^/gene_id\t"${dir}"\n/' "$dir"/count.txt
done

MY OUTPUT

gene_id "${dir}"
gene1 194
gene2 40

My Desired Output (for individual files)

gene_id SRRabcd
gene1 194
gene2 40

Solution

  • To replace ${dir} with its actual value, you need to ensure ${dir} is wrapped in double quotes. While you do have "${dir}" in your sed script, it sits inside a pair of single quotes, which turns the inner double quotes into literal characters. The shell therefore never expands the variable, and you end up with the literal string "${dir}" in your output.

    One easy approach would be to append 3 strings together to form your sed script, eg:

    # '1s/^/gene_id\t' + "${dir}" + '\n/' 
    
    sed '1s/^/gene_id\t'"${dir}"'\n/' "$dir"/count.txt
    

    But the simpler (and recommended) approach is to ensure the entire sed script is wrapped in double quotes, eg:

    sed "1s/^/gene_id\t${dir}\n/" "$dir"/count.txt
        ^                       ^
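
    To see why the quoting matters without touching any files, the two forms can be printed side by side. This is only an illustrative sketch: the value assigned to dir and the variable names single and double are made up for the demonstration.

    ```shell
    #!/bin/bash
    dir=SRRabcd   # example value standing in for the loop variable

    # Single quotes: the shell passes the text through untouched,
    # so sed would receive the literal characters "${dir}".
    single='1s/^/gene_id\t"${dir}"\n/'

    # Double quotes: the shell expands ${dir} before sed ever sees the script.
    double="1s/^/gene_id\t${dir}\n/"

    # Print both scripts exactly as sed would receive them.
    printf '%s\n' "$single" "$double"
    ```

    The first line printed still contains "${dir}" verbatim, while the second contains SRRabcd. The shell leaves \t and \n alone in both cases; they are interpreted later by sed (GNU sed assumed).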
    

    Sample data:

    $ head SRR*/count.txt
    ==> SRRabcd/count.txt <==
    gene1 194
    gene2 40
    
    ==> SRRefgh/count.txt <==
    gene1 395
    gene2 17
    

    Modified script:

    for dir in SRR*
    do 
        echo "########## $dir"
        sed "1s/^/gene_id\t${dir}\n/" "$dir"/count.txt
    done
    

    This generates:

    ########## SRRabcd
    gene_id SRRabcd
    gene1 194
    gene2 40
    ########## SRRefgh
    gene_id SRRefgh
    gene1 395
    gene2 17
    

    Once you confirm the results are correct, you can add the -i flag so that sed edits the files in place.
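
    One caveat if you loop over full paths as in the question (/home/path/to/dir/SRR*/): ${dir} then contains slashes, which clash with the / delimiters of the s command, and the header would carry the whole path instead of just the accession. A possible sketch, assuming GNU sed, takes the directory name with basename first. The temporary directory (base) and the helper variable (name) are not part of the original script; they are only there so the example is safe to run.

    ```shell
    #!/bin/bash
    # Build a throwaway copy of the sample layout so nothing real is modified.
    base=$(mktemp -d)
    mkdir -p "$base/SRRabcd" "$base/SRRefgh"
    printf 'gene1 194\ngene2 40\n' > "$base/SRRabcd/count.txt"
    printf 'gene1 395\ngene2 17\n' > "$base/SRRefgh/count.txt"

    for dir in "$base"/SRR*/
    do
        # Drop the leading path and trailing slash: /some/path/SRRabcd/ -> SRRabcd
        name=$(basename "$dir")
        # GNU sed assumed: -i edits in place; \t and \n are expanded in the replacement.
        # ${dir} already ends in a slash because of the glob pattern.
        sed -i "1s/^/gene_id\t${name}\n/" "${dir}count.txt"
    done

    # Show the modified files.
    head "$base"/SRR*/count.txt
    ```

    To use this on real data, replace "$base" with the actual parent directory and drop the lines that create the sample files.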