linux text-processing vcf-variant-call-format

Within each column of a large file, removing everything after a certain delimiter


I have a file that consists of many columns which look like this:

0/0:7,0:7:21:0,21,245 0/0:9,0:9:27:0,27,339 0/0:13,0:13:39:0,39,524

I want to remove everything after the first colon in each column, so that the output looks like this:

0/0 0/0 0/0

There are far too many columns to apply a solution like awk by hand, where you have to type $1, $2, and so on for each column.

I have tried a number of solutions in R, none of which gave the results I am looking for. They all split the column instead of just retaining the first entry. This is a VCF file, and I have tried using vcf2tsv, but I cannot get the dependencies to work.

For example, I tried this code:

test<-sub('(:<=\\:).*$', '', x, perl=TRUE)

Which gave me the following:

"c(\"0/0:8,0:8:24:0,24,305\", \"0/0:6,0:6:18:0,18,242\", \"0/0:5,0:5:15:0,15,200\",

Clearly I do not understand the code. Any help is appreciated.


Solution

  • With the sample input in the question you can use

    sed 's#:[^ ]*##g' inputfile
    

    to get the output

    0/0 0/0 0/0
    

    The sed script replaces each colon (:) together with the run of non-space characters that follows it ([^ ]*) with an empty string, and the g flag applies this to every occurrence on the line. The # characters are only the delimiters of the s command. The effect is that every space-separated column is cut off at its first colon.
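
The sample in the question is space-separated, but VCF files are normally tab-separated. In that case `[^ ]` also matches the tabs, so the command above would delete everything from the first colon to the end of the line. Below is a minimal sketch of a variant that stops at any whitespace, assuming the columns are separated by spaces or tabs. The function name `strip_after_colon` is hypothetical and only used for illustration.

```shell
# Hypothetical helper: keep only the text before the first colon in each
# whitespace-separated column. Reads the given file(s), or stdin if none.
strip_after_colon() {
    # [^[:space:]]* stops at a space or a tab, so the column separators survive.
    sed 's/:[^[:space:]]*//g' "$@"
}

# Example with two tab-separated columns taken from the question's sample.
printf '0/0:7,0:7:21:0,21,245\t0/0:9,0:9:27:0,27,339\n' | strip_after_colon
```

Note that this applies to every column on every line. If the file still contains VCF header lines or the fixed leading columns, any colons there are cut as well, so it may be necessary to restrict the command to the sample columns.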