I imported a vector of p-values from an Excel table. The numbers are given as superscripted Unicode strings. After hours of trying I still struggle to convert them into numbers.
See example below. Simple conversion with as.numeric()
doesn't work. I also tried to use Regex to capture the superscripted numbers, but it turned out that each superscripted number has a distinct Unicode code, for which there is no translation.
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484", "0.000223")
as.numeric(test)
Does somebody know of an R-package which could do the translation painlessly, or do I have to translate the codes one by one into digits?
This kind of formatting is definitely not very portable... Here's one possible solution though, for the exercise...
test <- c("0.0126", "0.000289", "4.26x10⁻¹⁴", "6.36x10⁻⁴⁸",
"4.35x10⁻⁹", "0.115", "0.0982", "0.000187", "0.0484",
"0.000223")
library(utf8)
library(stringr)
# normalize, ie everything to "normal text"
testnorm <- utf8_normalize(test, map_case = TRUE, map_compat = TRUE)
# replace exponent part
# \\N{Minus Sign} is the unicode name of the minus sign symbol
# (see [ICU regex](http://userguide.icu-project.org/strings/regexp))
# it is necessary because the "-" is not a plain text minus sign...
testnorm <- str_replace_all(testnorm, "x10\\N{Minus Sign}", "e-")
# evaluate these character strings
p_vals <- sapply(X = testnorm,
FUN = function(x) eval(parse(text = x)),
USE.NAMES = FALSE
)
# everything got adjusted to the "e-48" element...
format(p_vals, digits = 2, scientific = F)