Tags: string, bash, awk, in-place, vcf-variant-call-format

bash file substring add/replace inplace in matches only


I have a VCF file and I need to replace a substring (or add it if absent) on lines that satisfy several conditions, e.g.

head file

### OUTPUT:
1   47746672    .   A   G   .   .   pz_name=GHARTxI16uuT15921;qual=2201;
1   47746672    .   C   G   .   .   pz_name=GHARMALFI17uuM12201;qual=1932;status=RE;
1   47746675    .   C   G   .   .   pz_name=GHARIGANI17uuA10531;qual=1541;
1   47746675    .   C   G   .   .   pz_name=GHARTxI16uuT15921;qual=1440;status=AC;
1   47746675    .   C   G   .   .   pz_name=GHARFSGSI17uuC19091;qual=816;
# ...

I need to check some conditions in order to isolate one specific line for each variant-patient combination (both can be repeated, but their combination is unique). For example, to select the 4th line of the example:

  • that $2 == "47746675" && $4 == "C" && $5 == "G"
  • and pz_name=GHARTxI16uuT15921

In this specific line I then need to:

  • add status=something; if absent
  • replace status=<something-else> with status=something if present

How can I do all of this with some kind of in-place replacement in bash? Is it possible? Suggestions for alternative, more performant approaches would be much appreciated!

Thanks a lot in advance for any help!


Solution

  • Is this what you're trying to do?

    $ awk '{print $0 (/pz_name=GHARFSGSI17uuC19091/ && !/status=/ ? "status=something;" : "")}' file
    1   47746675    .   C   G   .   .   ad_alt=73;ad_ref=65;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.5252;pz_name=GHARMALFI17uuM11471;qual=2201;
    1   47746675    .   C   G   .   .   ad_alt=65;ad_ref=57;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.5242;pz_name=GHARMALFI17uuM12201;qual=1932;status=RE;
    1   47746675    .   C   G   .   .   ad_alt=53;ad_ref=38;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.5824;pz_name=GHARIGANI17uuA10531;qual=1541;
    1   47746675    .   C   G   .   .   ad_alt=48;ad_ref=49;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.4948;pz_name=GHARTxI16uuT15921;qual=1440;status=AC;
    1   47746675    .   C   G   .   .   ad_alt=29;ad_ref=39;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.4265;pz_name=GHARFSGSI17uuC19091;qual=816;status=something;
    

    If you want "inplace" editing then with GNU awk use awk -i inplace '...' file, or with any awk use awk '...' file > tmp && mv tmp file.
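
The portable temp-file route mentioned above can be sketched as follows. This is only a minimal sketch: the two sample rows are a shortened, tab-separated stand-in for the real file, and `workdir`, `file` and `tmp` are hypothetical names. The `&&` ensures the original is only overwritten when awk exits successfully.

```shell
#!/bin/sh
# Build a small sample file (assumed tab-separated, shortened from the question's data).
workdir=$(mktemp -d)
file="$workdir/sample.vcf"
printf '1\t47746675\t.\tC\tG\t.\t.\tpz_name=GHARTxI16uuT15921;qual=1440;status=AC;\n' > "$file"
printf '1\t47746675\t.\tC\tG\t.\t.\tpz_name=GHARFSGSI17uuC19091;qual=816;\n' >> "$file"

# Write the result to a temporary file next to the original, then replace the
# original only if awk succeeded.
tmp=$(mktemp "$workdir/sample.XXXXXX")
awk '{print $0 (/pz_name=GHARFSGSI17uuC19091/ && !/status=/ ? "status=something;" : "")}' "$file" > "$tmp" &&
    mv "$tmp" "$file"

cat "$file"
```

Creating the temporary file in the same directory as the original keeps the final `mv` a rename on one filesystem rather than a copy across filesystems.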

    UPDATE: given your updated question:

    $ awk '$2 == "47746675" && $4 == "C" && $5 == "G" && /pz_name=GHARFSGSI17uuC19091/{ sub(/(status=.*)?$/,"status=something;")} 1' file
    ### OUTPUT:
    1   47746672    .   A   G   .   .   pz_name=GHARTxI16uuT15921;qual=2201;
    1   47746672    .   C   G   .   .   pz_name=GHARMALFI17uuM12201;qual=1932;status=RE;
    1   47746675    .   C   G   .   .   pz_name=GHARIGANI17uuA10531;qual=1541;
    1   47746675    .   C   G   .   .   pz_name=GHARTxI16uuT15921;qual=1440;status=AC;
    1   47746675    .   C   G   .   .   pz_name=GHARFSGSI17uuC19091;qual=816;status=something;
    # ...
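
For repeated use, the values hard-coded in the command above could be passed in as awk variables. The following is a minimal sketch, assuming tab-separated input and that `status=` is the last key on the line (the `status=.*` pattern replaces everything from `status=` to the end of the line). The names `pos`, `ref`, `alt`, `pz` and `new` are hypothetical, and the target here is the line the question singles out.

```shell
#!/bin/sh
workdir=$(mktemp -d)
file="$workdir/sample.vcf"
# Sample rows taken from the question, assumed tab-separated.
{
    printf '1\t47746672\t.\tA\tG\t.\t.\tpz_name=GHARTxI16uuT15921;qual=2201;\n'
    printf '1\t47746675\t.\tC\tG\t.\t.\tpz_name=GHARIGANI17uuA10531;qual=1541;\n'
    printf '1\t47746675\t.\tC\tG\t.\t.\tpz_name=GHARTxI16uuT15921;qual=1440;status=AC;\n'
} > "$file"

# pos, ref, alt, pz and new are hypothetical variable names chosen for this sketch.
awk -v pos=47746675 -v ref=C -v alt=G \
    -v pz='GHARTxI16uuT15921' -v new='RE' '
    $2 == pos && $4 == ref && $5 == alt && index($0, "pz_name=" pz ";") {
        # Replace an existing trailing status=... or append one at end of line.
        sub(/(status=.*)?$/, "status=" new ";")
    }
    1   # print every line, modified or not
' "$file" > "$workdir/out.vcf"

cat "$workdir/out.vcf"
```

Using `index()` compares the patient name as literal text, so characters that are special in regular expressions need no escaping. The replacement string is still interpreted by `sub()`, so a value of `new` containing `&` or a backslash would need escaping first.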