Search code examples
bashshellterminal

Column editing and renaming of a tab-delimited table in Bash


Apologies for the basic question, but I am pretty stuck. I have a basic tab-delimited text file called map.txt. The first line of the file is column names, preceded with the # character. Note, this is a tab-delimited file. Here is the output of head map.txt:

#SampleID   BarcodeSequence LinkerPrimerSequence    sample_type Description geneticSampleID
OSBS.087.39.M.32.18.20140227    TCCCTTGTCTCC    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A1   OSBS_087-M-32-18-20140227-gen
OSBS.048.41.M.37.33.20140227    ACGAGACTGATT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A2   OSBS_048-M-37-33-20140227-gen
OSBS.048.23.M.15.31.20140227    GCTGTACGGATT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A3   OSBS_048-M-15-31-20140227-gen
OSBS.047.21.M.20.3.20140227 ATCACCAGGTGT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A4   OSBS_047-M-20-3-20140227-gen
OSBS.119.23.M.18.38.20140227    TGGTCAACGATA    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A5   OSBS_119-M-18-38-20140227-gen
OSBS.047.41.M.22.36.20140227    ATCGCACAGTAA    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A6   OSBS_047-M-22-36-20140227-gen
OSBS.087.41.M.40.21.20140227    GTCGTGTAGCCT    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A7   OSBS_087-M-40-21-20140227-gen
OSBS.048.21.M.5.11.20140227 AGCGGAGGTTAG    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A8   OSBS_048-M-5-11-20140227-gen
OSBS.119.39.M.27.5.20140227 ATCCTTTGGTTC    CGGCTGCGTTCTTCATCGATGC  soil    Plate 1A9   OSBS_119-M-27-5-20140227-gen

I would like to change the names geneticSampleID to Name and BarcodeSequence to Index1. I would then like to delete all columns except for the newly named Name and Index1 columns. I'd like the first line of the file to remain the colum names Name and Index1, preceded by the # character. Changing the names of the columns is straightforward enough in vim, deleting columns is not.

Desired output:

#Index1 Name 
TCCCTTGTCTCC OSBS_087-M-32-18-20140227-gen
ACGAGACTGATT OSBS_048-M-37-33-20140227-gen
GCTGTACGGATT OSBS_048-M-15-31-20140227-gen
ATCACCAGGTGT OSBS_047-M-20-3-20140227-gen
TGGTCAACGATA OSBS_119-M-18-38-20140227-gen
ATCGCACAGTAA OSBS_047-M-22-36-20140227-gen
GTCGTGTAGCCT OSBS_087-M-40-21-20140227-gen
AGCGGAGGTTAG OSBS_048-M-5-11-20140227-gen
ATCCTTTGGTTC OSBS_119-M-27-5-20140227-gen

Solution

  • You can do this with shell commands like awk and sed, for example:

    cat map.txt | awk '{ print $2, $7}' | sed 's/BarcodeSequence/#Index1 Name/'
    

    Output:

    #Index1 Name 
    TCCCTTGTCTCC OSBS_087-M-32-18-20140227-gen
    ACGAGACTGATT OSBS_048-M-37-33-20140227-gen
    GCTGTACGGATT OSBS_048-M-15-31-20140227-gen
    ATCACCAGGTGT OSBS_047-M-20-3-20140227-gen
    TGGTCAACGATA OSBS_119-M-18-38-20140227-gen
    ATCGCACAGTAA OSBS_047-M-22-36-20140227-gen
    GTCGTGTAGCCT OSBS_087-M-40-21-20140227-gen
    AGCGGAGGTTAG OSBS_048-M-5-11-20140227-gen
    ATCCTTTGGTTC OSBS_119-M-27-5-20140227-gen
    

    Although now we are using sed to replace 1 column name with 2 new names and that feels hacky to me. So in this case I would prefer the awk-only way:

    awk 'BEGIN{ print "#Name", "Index1"} NR>1 { print $2, $7}' map.txt