linux awk sed extract xargs

How to edit sample IDs using sed


I have a file that contains sample IDs. I want to generate a sample-to-participant lookup table with two tab-separated columns: the sample ID first, then the participant ID, for example GTEX-1117F-0226-SM-5GZZ7 GTEX-1117F. I was able to extract the sample IDs from the file:

grep "GTEX" gene_tpm_2017-06-05_v8_brain_cortex.gct | awk '{$1=$2=$3=$4=""; printf $0 }' | xargs -n1 > ids_bed.txt

Now my ids_bed.txt file looks like this:

GTEX-1117F-3226-SM-5N9CT
GTEX-111FC-3126-SM-5GZZ2
GTEX-1128S-2726-SM-5H12C
GTEX-117XS-3026-SM-5N9CA
GTEX-1192X-3126-SM-5N9BY
GTEX-11DXW-1126-SM-5H12Q

I want to add the participant ID (GTEX-1117F for the first line, and so on) as a second column. I tried this:

sed -re 's/(GTEX-[[:alnum:]]+)_\1/\1/g' ids_bed.txt > ids_bed_1.txt

but it doesn't generate the second column. I want my final file to look like this, with the two columns separated by a tab:

GTEX-1117F-3226-SM-5N9CT GTEX-1117F
GTEX-111FC-3126-SM-5GZZ2 GTEX-111FC

Solution

  • I would use GNU sed for this task in the following way. Let the content of file.txt be

    GTEX-1117F-3226-SM-5N9CT
    GTEX-111FC-3126-SM-5GZZ2
    GTEX-1128S-2726-SM-5H12C
    GTEX-117XS-3026-SM-5N9CA
    GTEX-1192X-3126-SM-5N9BY
    GTEX-11DXW-1126-SM-5H12Q
    

    then

    sed 's/\(GTEX-[^-]*\)\(.*\)/\1\2\t\1/' file.txt
    

    gives output

    GTEX-1117F-3226-SM-5N9CT    GTEX-1117F
    GTEX-111FC-3126-SM-5GZZ2    GTEX-111FC
    GTEX-1128S-2726-SM-5H12C    GTEX-1128S
    GTEX-117XS-3026-SM-5N9CA    GTEX-117XS
    GTEX-1192X-3126-SM-5N9BY    GTEX-1192X
    GTEX-11DXW-1126-SM-5H12Q    GTEX-11DXW
    

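    Explanation: I use two capturing groups, one for GTEX- followed by anything that is not a hyphen, and one for the rest of the line. I replace the whole line with \1\2 (which is the whole line again), a TAB, and then the content of the first group. The \t escape in the replacement is understood by GNU sed; other sed implementations may not support it.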

    (tested in GNU sed 4.7)
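
As an alternative sketch that does not rely on sed's \t handling, the same two-column table can be produced with awk by splitting each ID on hyphens. The function name add_participant_column and the output file name in the comment are hypothetical, chosen only for illustration, and the approach assumes every line has the form GTEX-<participant>-... as in the sample above.

```shell
# Hypothetical helper (not part of the original answer): reads sample IDs
# on stdin and writes "sample_id<TAB>participant_id" on stdout.
add_participant_column() {
    # -F'-' splits each line on hyphens, so $1 is "GTEX" and $2 is the
    # participant code; $1 "-" $2 rebuilds e.g. GTEX-1117F.
    # No field is assigned, so $0 is still the complete, unmodified ID.
    awk -F'-' -v OFS='\t' '{ print $0, $1 "-" $2 }'
}

# Quick demonstration on two IDs taken from the question.
printf 'GTEX-1117F-3226-SM-5N9CT\nGTEX-111FC-3126-SM-5GZZ2\n' | add_participant_column

# With the real file, assuming one sample ID per line:
# add_participant_column < ids_bed.txt > sample_participant_lookup.txt
```

Because awk's -v assignment interprets the \t escape itself, this should behave the same with GNU awk, mawk, and BSD awk.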