I have a VCF file with many entries, and I need to replace a substring (or add it if absent) on lines that match several conditions. For example:
head file
### OUTPUT:
1 47746672 . A G . . pz_name=GHARTxI16uuT15921;qual=2201;
1 47746672 . C G . . pz_name=GHARMALFI17uuM12201;qual=1932;status=RE;
1 47746675 . C G . . pz_name=GHARIGANI17uuA10531;qual=1541;
1 47746675 . C G . . pz_name=GHARTxI16uuT15921;qual=1440;status=AC;
1 47746675 . C G . . pz_name=GHARFSGSI17uuC19091;qual=816;
# ...
I need to check several conditions in order to isolate a single line for each variant-patient combination (each can be repeated, but their combination is unique). For example, to select the 4th line of the example:
$2 == "47746675" && $4 == "C" && $5 == "G"
pz_name=GHARTxI16uuT15921
On this specific line I then need to:
add status=something; if absent
replace status=<something-else> with status=something if present
How can I do all of this with some kind of in-place replacement in bash? Is it possible? Suggestions for alternative, performance-efficient approaches would be much appreciated!
Thanks a lot in advance for any help!
Is this what you're trying to do?
$ awk '{print $0 (/pz_name=GHARFSGSI17uuC19091/ && !/status=/ ? "status=something;" : "")}' file
1 47746675 . C G . . ad_alt=73;ad_ref=65;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.5252;pz_name=GHARMALFI17uuM11471;qual=2201;
1 47746675 . C G . . ad_alt=65;ad_ref=57;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.5242;pz_name=GHARMALFI17uuM12201;qual=1932;status=RE;
1 47746675 . C G . . ad_alt=53;ad_ref=38;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.5824;pz_name=GHARIGANI17uuA10531;qual=1541;
1 47746675 . C G . . ad_alt=48;ad_ref=49;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.4948;pz_name=GHARTxI16uuT15921;qual=1440;status=AC;
1 47746675 . C G . . ad_alt=29;ad_ref=39;denovo=NA;dp_bin=50;father_dp_bin=NA;father_gt=NA;filter=PASS;gene_name_correct=STIL;gq=99;gt=het;mother_dp_bin=NA;mother_gt=NA;perc_alt=0.4265;pz_name=GHARFSGSI17uuC19091;qual=816;status=something;
If you want "inplace" editing, then with GNU awk use awk -i inplace '...' file, or with any awk use awk '...' file > tmp && mv tmp file.
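To make the portable temp-file variant concrete, here is a minimal sketch. The file names demo.vcf and demo.vcf.tmp are hypothetical and used only for illustration; the two sample lines are taken from the question.

```shell
# Build a small sample file (demo.vcf is a hypothetical name for this sketch).
printf '%s\n' \
  '1 47746675 . C G . . pz_name=GHARTxI16uuT15921;qual=1440;status=AC;' \
  '1 47746675 . C G . . pz_name=GHARFSGSI17uuC19091;qual=816;' > demo.vcf

# Write the edited copy to a temp file; the original is replaced
# only if awk exits successfully (that is what the && guarantees).
awk '{print $0 (/pz_name=GHARFSGSI17uuC19091/ && !/status=/ ? "status=something;" : "")}' demo.vcf > demo.vcf.tmp &&
mv demo.vcf.tmp demo.vcf

cat demo.vcf
```

The first line already has a status entry and is left alone; the second line gains status=something; at the end. Because of the &&, a failing awk leaves the original file untouched.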
UPDATE: given your updated question:
$ awk '$2 == "47746675" && $4 == "C" && $5 == "G" && /pz_name=GHARFSGSI17uuC19091/{ sub(/(status=.*)?$/,"status=something;") } 1' file
### OUTPUT:
1 47746672 . A G . . pz_name=GHARTxI16uuT15921;qual=2201;
1 47746672 . C G . . pz_name=GHARMALFI17uuM12201;qual=1932;status=RE;
1 47746675 . C G . . pz_name=GHARIGANI17uuA10531;qual=1541;
1 47746675 . C G . . pz_name=GHARTxI16uuT15921;qual=1440;status=AC;
1 47746675 . C G . . pz_name=GHARFSGSI17uuC19091;qual=816;status=something;
# ...
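If you need to run this for many variant-patient pairs, a parameterized variant may be handy. The following is only a sketch under some assumptions: the variable names pos, ref, alt, pz and new, and the file names demo2.vcf and demo2.out.vcf, are made up for illustration; each entry is assumed to end with ";" as in the sample; and status= is assumed to occur at most once per line and not as the tail of another key. Unlike the sub() above, it replaces only the status entry itself, so it should also cope with a status= that is not the last entry on the line.

```shell
# Sample input (demo2.vcf is a hypothetical name for this sketch).
printf '%s\n' \
  '1 47746672 . A G . . pz_name=GHARTxI16uuT15921;qual=2201;' \
  '1 47746675 . C G . . pz_name=GHARTxI16uuT15921;qual=1440;status=AC;' \
  '1 47746675 . C G . . pz_name=GHARFSGSI17uuC19091;qual=816;' > demo2.vcf

awk -v pos=47746675 -v ref=C -v alt=G -v pz=GHARTxI16uuT15921 -v new=something '
  # index() is a literal substring test, so the patient name is never
  # interpreted as a regex; the trailing ";" keeps a shorter name from
  # matching the start of a longer one.
  $2 == pos && $4 == ref && $5 == alt && index($0, "pz_name=" pz ";") {
    if (match($0, /status=[^;]*;?/))
      # Replace only the existing status entry, keeping the text around it.
      $0 = substr($0, 1, RSTART - 1) "status=" new ";" substr($0, RSTART + RLENGTH)
    else
      # No status entry yet: append one at the end of the line.
      $0 = $0 "status=" new ";"
  }
  1
' demo2.vcf > demo2.out.vcf

cat demo2.out.vcf
```

Only the second sample line matches all four conditions, so its status=AC; becomes status=something; and the other two lines pass through unchanged. Editing $0 as a whole rather than assigning to a single field keeps the original spacing between columns intact.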