r, regex, bioinformatics, stringr, vcf-variant-call-format

Extract specific word matching the pattern


I have a data frame with a column:

nf1$Info = AC=1;AF=0.500;AN=2;BaseQRankSum=-1.026e+00;ClippingRankSum=-1.026e+00;DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.25;MQRankSum=-1.026e+00;QD=10.18;ReadPosRankSum=1.03;SOR=0.693

I'm trying to extract a specific value from this column.

For example, I'm interested in "MQRankSum", so I used:

str_extract(nf1$Info,"[MQRankSum]+=[:punct:]+[0-9]+[.]+[0-9]+")

It returns the value for BaseQRankSum instead of MQRankSum.


Solution

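  • Enclosing characters in square brackets creates a character class that matches any single one of the listed characters, so [yes]+ matches yyyyyyyyy, eyyyyss, etc. That is why [MQRankSum]+ also matches the QRankSum part of BaseQRankSum, which occurs earlier in the string.

    What you want is to match the literal word MQRankSum, then =, and then any characters other than a semicolon: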

    str_extract(nf1$Info,"MQRankSum=[^;]+")
    

    If you want to exclude MQRankSum= from the match, use a lookbehind:

    str_extract(nf1$Info,"(?<=MQRankSum=)[^;]+")
                          ^^^^^^^^^^^^^^^
    

    The (?<=MQRankSum=) positive lookbehind requires the text MQRankSum= to appear immediately to the left of the current position without making it part of the match; [^;]+ then matches one or more characters other than ;.
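
To see both patterns side by side, here is a minimal sketch that runs them on the string from the question. The variable names (info, with_key, value_chr, value_num, anchored) are placeholders of mine, and info stands in for nf1$Info. The final str_match() call is an optional extra, not something the question asked for: it assumes you may also want to require that the key starts at the beginning of the string or right after a semicolon.

```r
library(stringr)

# Sample value copied from the question; `info` stands in for nf1$Info.
info <- paste0(
  "AC=1;AF=0.500;AN=2;BaseQRankSum=-1.026e+00;ClippingRankSum=-1.026e+00;",
  "DP=4;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.25;",
  "MQRankSum=-1.026e+00;QD=10.18;ReadPosRankSum=1.03;SOR=0.693"
)

# Key and value: literal "MQRankSum=", then everything up to the next ";".
with_key <- str_extract(info, "MQRankSum=[^;]+")

# Value only: the lookbehind checks for the key but does not consume it.
value_chr <- str_extract(info, "(?<=MQRankSum=)[^;]+")

# str_extract() returns a character vector; convert if a number is needed.
value_num <- as.numeric(value_chr)

# Optional (my assumption, not part of the question): require the key to
# start at the beginning of the string or right after ";", and pull the
# value out of capture group 1 with str_match().
anchored <- str_match(info, "(?:^|;)MQRankSum=([^;]+)")[, 2]

print(with_key)
print(value_chr)
print(value_num)
print(anchored)
```

Because str_extract() is vectorised over its first argument, the same calls should work unchanged on the whole nf1$Info column, returning NA for rows where the key is absent.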