regex, r, scientific-notation

How to capture minus sign in scientific notation with regex?


I was trying to answer a question (since deleted) that I think was asking how to extract text representations of numbers in scientific notation. (I'm using R's implementation of regex, which requires doubled backslashes to escape meta-characters inside string literals and can run in either the default extended (TRE) mode or the Perl-compatible (PCRE) mode; I don't really understand the difference between the two.) I've solved most of the task, but I still fail to capture the leading minus sign within a capture group. The only way I've gotten it to succeed is by anchoring on the leading open parenthesis:

> txt <- c("this is some random text (2.22222222e-200)", "other random (3.33333e4)", "yet a third(-1.33333e-40)", 'and a fourth w/o the "e" (2.22222222-200)')
> sub("^(.+\\()([-+]{0,1}[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 

> sub("^(.+\\()([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 
 # But that seems to be "cheating". My failed attempts follow:

> sub("^(.+)([-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 
> sub("^(.+)(-?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 
> sub("^(.+)(-*[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3})(.+$)", "\\2" ,txt)
[1] "2.22222222e-200" "3.33333e4"       "1.33333e-40"     "2.22222222-200" 

I've searched SO to the extent of my patience with terms like `scientific notation regex minus`.


Solution

  • You can try

     library(stringr)
     unlist(str_extract_all(txt, '-?[0-9.]+e?[-+]?[0-9]*'))
     #[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200" 
    

    Alternatively, use a lookbehind to extract everything that follows the opening parenthesis, up to the closing one:

     str_extract(txt, '(?<=\\()[^)]*')
     #[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200"
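
For context on why the attempts in the question lose the sign: the leading `(.+)` is greedy, and because the sign in `[-+]?` is optional, the engine only backtracks far enough to find the first digit, so the `-` stays in group 1. The following is a minimal base R sketch of two possible workarounds that reuse the question's own number pattern; the variable names `num`, `lazy`, and `first` are illustrative, and `perl = TRUE` is an assumption made here so that the lazy quantifier follows PCRE semantics.

```r
txt <- c("this is some random text (2.22222222e-200)",
         "other random (3.33333e4)",
         "yet a third(-1.33333e-40)",
         'and a fourth w/o the "e" (2.22222222-200)')

# The number pattern from the question, unchanged.
num <- "[-+]?[0-9][.][0-9]{1,16}[eE]*[-+]*[0-9]{0,3}"

# Option 1: make the leading group lazy (.+?) so it stops before the sign
# instead of swallowing it.
lazy <- sub(paste0("^(.+?)(", num, ")(.+$)"), "\\2", txt, perl = TRUE)

# Option 2: drop the surrounding groups and extract the first (leftmost)
# match of the number pattern directly.
first <- regmatches(txt, regexpr(num, txt, perl = TRUE))

lazy
#[1] "2.22222222e-200" "3.33333e4"       "-1.33333e-40"    "2.22222222-200"
```

Both options should give the same result for this input. Option 2 assumes every element contains a match, since `regmatches` silently drops elements without one.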