Using a regular expression to keep only part of the information in a column


Good morning, I have a file looking like this:

file.txt

G05829  H05037  A   A*02:01:01  A*11:01:01
G05829  H05037  DRA DRA*01:01:01    DRA*01:02:02
G05829  H05037  DPB1    DPB1*04:01:01   DPB1*04:02:01
G05829  H05037  DRB3    DRB3*01:01:02   DRB3*01:01:02
G05829  H05037  B   B*08:01 B*44:02
G05829  H05037  DRB1    DRB1*03:01:01   DRB1*04:01:01
G15526  H12517  B   B*07:02 B*35:01
G15526  H12517  DRB5    DRB5*01:01:01   DRB5*01:01:01
G15526  H12517  DRA DRA*01:02:03    DRA*01:02:03

I need to have columns 4 and 5 in the format

A*01:01  A*01:01
DRA*01:01 DRA*01:01
(...)

That is: the first letters that identify the locus, a star, two digits, a colon, and two more digits.

My problem is that not every entry has the same length. Some are more detailed and have two or three colons (e.g. DPB1*01:02:02 or DQB1*49:34:01:03), while others have only one colon (the intended output, e.g. DPA*01:01).

I have tried a few approaches, but I can only crop from the end (which does not work because the entries have different lengths) or crop from the beginning (which also does not work because the identifier can be one letter, or three letters and a number, e.g. 'A' or 'DPB1'). I was trying with sed, but I end up removing all the colon groups. My attempts:

sed 's/\:[0-9][0-9]//g' file.txt 

This crops every colon + two digits, which is wrong.

sed 's/\:[0-9][0-9]\:[0-9][0-9]\t/\t/g' file.txt 

This crops only the second column and does not account for the differences in length in each column.

I need something that recognises the identifier (A, B, C, DPA1, DQB1), the star (*), the digits after the star (01, 02, 13, ...), the first colon (:), and the digits that follow it, before the next colon (01, 02, 03, ...).

So the desired output is something like this:

niceoutput.txt

G05829  H05037  A   A*02:01 A*11:01
G05829  H05037  DRA DRA*01:01   DRA*01:02
G05829  H05037  DPB1    DPB1*04:01  DPB1*04:02
G05829  H05037  DRB3    DRB3*01:01  DRB3*01:01
G05829  H05037  DRB1    DRB1*03:01  DRB1*04:01
G05829  H05037  B   B*08:01 B*44:02
G15526  H12517  B   B*07:02 B*35:01
G15526  H12517  DRB5    DRB5*01:01  DRB5*01:01
G15526  H12517  DRA DRA*01:02   DRA*01:02

Thank you!


Solution

  • This sed command will give you your desired output:

    sed 's/\([A-Z]\{1,\}[0-9]*\*[0-9][0-9]:[0-9][0-9]\):[0-9][0-9]/\1/g'
    

    Test:

    $ sed 's/\([A-Z]\{1,\}[0-9]*\*[0-9][0-9]:[0-9][0-9]\):[0-9][0-9]/\1/g' file.txt > niceoutput.txt
    $ cat niceoutput.txt
    G05829  H05037  A   A*02:01  A*11:01
    G05829  H05037  DRA DRA*01:01    DRA*01:02
    G05829  H05037  DPB1    DPB1*04:01   DPB1*04:02
    G05829  H05037  DRB3    DRB3*01:01   DRB3*01:01
    G05829  H05037  B   B*08:01 B*44:02
    G05829  H05037  DRB1    DRB1*03:01   DRB1*04:01
    G15526  H12517  B   B*07:02 B*35:01
    G15526  H12517  DRB5    DRB5*01:01   DRB5*01:01
    G15526  H12517  DRA DRA*01:02    DRA*01:02
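
    If your sed supports extended regular expressions (-E, available in GNU and BSD sed), the same substitution can be written in a way that is a bit easier to read. This is only a sketch of the command above with each piece commented; trim_alleles is a made-up wrapper name, not something from the question:

    ```shell
    # Hypothetical helper: the same substitution as above, in extended (ERE) syntax.
    trim_alleles() {
        # [A-Z]+             locus letters (A, B, DRA, DPB, ...)
        # [0-9]*             optional digits in the locus name (DPB1, DRB3, ...)
        # \*                 the literal star
        # [0-9]{2}:[0-9]{2}  the two fields to keep, captured as \1
        # :[0-9]{2}          the third field, which is dropped
        sed -E 's/([A-Z]+[0-9]*\*[0-9]{2}:[0-9]{2}):[0-9]{2}/\1/g' "$@"
    }

    # Example on one tab-separated line; columns 4 and 5 become A*02:01 and A*11:01.
    printf 'G05829\tH05037\tA\tA*02:01:01\tA*11:01:01\n' | trim_alleles
    ```

    The first columns (G05829, H05037) are left alone because they contain no star, so the pattern cannot match there.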
    

    However, your question mentions that the :[0-9][0-9] part can repeat any number of times, but your example has no such test case. If that is so, you will need to change the sed command to this:

    sed 's/\([A-Z]\{1,\}[0-9]*\*[0-9][0-9]:[0-9][0-9]\)\(:[0-9][0-9]\)*/\1/g'
    

    Test2:

    $ cat jose_testcase2.txt
    DPB1*01:02:02 or DQB1*49:34:01:03
    DXX*05:05
    
    $ sed 's/\([A-Z]\{1,\}[0-9]*\*[0-9][0-9]:[0-9][0-9]\)\(:[0-9][0-9]\)*/\1/g' jose_testcase2.txt
    DPB1*01:02 or DQB1*49:34
    DXX*05:05
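
If you would rather touch only columns 4 and 5, as the question asks, an awk variant is one option. This is a sketch that assumes the file is tab-separated (the sed attempts in the question use \t); trim_cols_4_5 is a made-up name. Instead of matching the locus explicitly, it keeps the text up to the second colon:

```shell
# Hypothetical helper: assumes tab-separated input with alleles in columns 4 and 5.
trim_cols_4_5() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        for (i = 4; i <= 5; i++) {
            # split "A*02:01:01" into "A*02", "01", "01"
            n = split($i, part, ":")
            # keep only the first two pieces when there are more
            if (n > 2) $i = part[1] ":" part[2]
        }
        print
    }' "$@"
}

# Example: column 5 has three colons and is cut back to DQB1*49:34.
printf 'G15526\tH12517\tB\tB*07:02\tDQB1*49:34:01:03\n' | trim_cols_4_5
```

It could be run as trim_cols_4_5 file.txt > niceoutput.txt. Entries that already have a single colon are printed unchanged.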