Number as character cannot be converted to numeric in R

I extracted some data from a Chinese pdf file.

The numbers in the columns are extracted as follows (for example): －１２２, ２９４５８, ９.

I copied and pasted the output of some cells. However, these characters are not the same as -122, 29458, 9, respectively.

The class of the column is "character".

Hence parse_number() produces NA in all of these cases.

Any suggestions regarding what I should do?

This is the pdf file in question:

I extracted the data from page 49 (53rd page of the pdf file), using the following code:


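library(pdftools)  # pdf_text()
library(stringr)   # str_replace_all(), str_split(), str_trim()
library(dplyr)     # %>%, mutate_all(), filter()

file <- tempfile()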

url <- paste0("") 

download.file(url, file, headers = c("User-Agent" = "My Custom User Agent"))

pdf_data <- pdf_text(file)

replace_spaces_and_commas <- function(x) {
  str_replace_all(x, "[ ,]", "")
}

pdf <- pdf_data[53:71]

tab_pdf <- str_split(pdf, "\n")

for (i in 1:19) {
  assign(paste0("tab_pdf_", i), tab_pdf[[i]])
}

the_names <- c("country", "year_2013", "year_2014", "year_2015", "year_2016", "year_2017", "year_2018", "year_2019", "year_2020", "year_2021")


pdf_clean1 <- tab_pdf_1[14:60] %>%
  str_trim() %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>%
  mutate_all(.funs = replace_spaces_and_commas) %>%
  filter(country != "")

I tried both as.numeric(pdf_clean1$year_2013) and parse_number(pdf_clean1$year_2013).

Both produced NAs, because "９" == "9", "－１２２" == "-122", and "２９４５８" == "29458" all evaluate to FALSE.

The result of dput(head(pdf_clean1$year_2013)) is

c("－１２２", "２９４５８", "－７４", "１６３５７", "２", "－５３４")


  • A base R approach: use utf8ToInt to get the integer code point of each character, subtract 65248 from code points above the ASCII range (> 126, see an ASCII table) to get the corresponding ASCII character, and convert the result back to a character with intToUtf8.

        as.numeric(unlist(strsplit(paste(
          sapply(unlist(strsplit("－１２２, ２９４５８, ９", "")), \(x)
            ifelse(utf8ToInt(x) > 126, intToUtf8(utf8ToInt(x) - 65248), x)),
          collapse=""), ", ")))
    [1]  -122 29458     9
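
The same idea can be wrapped in a small helper and applied to a whole character vector. This is only a sketch and not part of the original answer: the name fullwidth_to_ascii is hypothetical, and it assumes the problem characters are Unicode full-width forms (U+FF01 to U+FF5E), which sit exactly 65248 code points above their ASCII counterparts. The example input is written with \u escapes so the full-width characters are unambiguous.

```r
# Hypothetical helper (not from the original post): map full-width forms
# (U+FF01..U+FF5E) to ASCII by subtracting the fixed offset 65248 (0xFEE0).
fullwidth_to_ascii <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)                    # code points of every character in s
    fw <- cp >= 0xFF01L & cp <= 0xFF5EL   # TRUE where the character is full-width
    cp[fw] <- cp[fw] - 65248L             # shift only those into the ASCII range
    intToUtf8(cp)                         # reassemble into a single string
  }, character(1), USE.NAMES = FALSE)
}

# Full-width "-122", "29458" and "9"
x <- c("\uFF0D\uFF11\uFF12\uFF12", "\uFF12\uFF19\uFF14\uFF15\uFF18", "\uFF19")

utf8ToInt("\uFF19")               # 65305, i.e. 57 (ASCII "9") + 65248
as.numeric(fullwidth_to_ascii(x)) # -122 29458 9
```

With the data frame from the question, this could presumably be applied to every year column, for example with mutate(across(starts_with("year_"), ~ as.numeric(fullwidth_to_ascii(.x)))). Restricting the shift to the full-width range, rather than to everything above 126, leaves other non-ASCII text such as Chinese country names untouched.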