
Number as character cannot be converted to numeric in R


I extracted some data from a Chinese pdf file.

The numbers in the columns are extracted as follows (for example): －１２２, ２９４５８, ９.

I copy-pasted the output of some cells above. They look like numbers, but these characters are not the same as the ASCII characters in -122, 29458, 9, respectively.

Class of the column is "character".

Hence parse_number() produces NA in all of these cases.

Any suggestions regarding what I should do?

This is the pdf file in question: http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf

I extracted the data from page 49 (53rd page of the pdf file), using the following code:

library(tidyverse)
library(pdftools)

file <- tempfile()

url <- paste0("http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf") 

download.file(url, file, headers = c("User-Agent" = "My Custom User Agent"))



pdf_data <- pdf_text(file)

replace_spaces_and_commas <- function(x) {
  str_replace_all(x, "[ ,]", "")
}

pdf <- pdf_data[53:71]

tab_pdf <- str_split(pdf, "\n")

for (i in 1:19) {
  assign(paste0("tab_pdf_", i), tab_pdf[[i]])
}

the_names <- c("country", "year_2013", "year_2014", "year_2015", "year_2016", "year_2017", "year_2018", "year_2019", "year_2020", "year_2021")

view(tab_pdf_1)

pdf_clean1 <- tab_pdf_1[14:60] %>%
  str_trim %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>% mutate_all(.funs = replace_spaces_and_commas) %>% filter(country != "")

I tried both as.numeric(pdf_clean1$year_2013) and parse_number(pdf_clean1$year_2013).

Both produced NAs, because comparisons such as "９" == "9", "－１２２" == "-122", and "２９４５８" == "29458" all return FALSE.

The result of dput(head(pdf_clean1$year_2013)) is

c("－１２２", "２９４５８", "－７４", "１６３５７", "２", "－５３４")
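
One way to see why such comparisons fail is to look at the underlying code points. The snippet below is only a sketch: it uses a hand-written fullwidth "９" (U+FF19) as a stand-in for a cell value, on the assumption that the PDF text uses fullwidth forms; `fw_nine` is a made-up name.

```r
# Stand-in for one extracted cell: FULLWIDTH DIGIT NINE (assumed, not read from the PDF)
fw_nine <- "\uFF19"

fw_nine == "9"                        # FALSE: different characters that look alike
utf8ToInt(fw_nine)                    # 65305
utf8ToInt("9")                        # 57
utf8ToInt(fw_nine) - utf8ToInt("9")   # 65248, the offset between fullwidth and ASCII forms
```

Running utf8ToInt() on an actual cell, e.g. utf8ToInt(pdf_clean1$year_2013[1]), would show whether the real data follow the same pattern.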

sessionInfo()

Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] countrycode_1.5.0 magrittr_2.0.3 pdftools_3.3.3
[4] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[7] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[10] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
[1] gtable_0.3.3 compiler_4.3.1 qpdf_1.3.2
[4] tidyselect_1.2.0 Rcpp_1.0.11 scales_1.2.1
[7] R6_2.5.1 generics_0.1.3 knitr_1.42
[10] munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[13] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[16] xfun_0.39 timechange_0.2.0 cli_3.6.1
[19] withr_2.5.0 grid_4.3.1 rstudioapi_0.15.0
[22] hms_1.1.3 askpass_1.1 lifecycle_1.0.3
[25] vctrs_0.6.3 glue_1.6.2 fansi_1.0.4
[28] colorspace_2.1-0 tools_4.3.1 pkgconfig_2.0.3

Solution

  • A base R approach: use utf8ToInt to get the Unicode code point of each character, subtract 65248 from code points above 126 (i.e. beyond the printable ASCII range; the fullwidth forms sit exactly 65248 above their ASCII counterparts), and convert the result back to a character with intToUtf8.

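    # The input uses fullwidth digits and a fullwidth hyphen-minus, as in the extracted PDF text
    as.numeric(
      unlist(strsplit(paste(
        sapply(unlist(strsplit("－１２２, ２９４５８, ９", "")), \(x)
          # shift fullwidth characters (code point > 126) down to their ASCII equivalents
          ifelse(utf8ToInt(x) > 126, intToUtf8(utf8ToInt(x) - 65248), x)),
        collapse=""), ", ")))
    [1]  -122 29458     9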
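
The same idea can be wrapped in a small function that works on a whole character vector. This is a sketch, not a tested fix for the PDF in question: `fullwidth_to_ascii` is a hypothetical helper name, and it assumes every offending character lies in the fullwidth block U+FF01 to U+FF5E. A different minus sign, such as U+2212, would be left untouched and would still yield NA.

```r
# Hypothetical helper: map fullwidth forms (U+FF01..U+FF5E) to their ASCII counterparts
fullwidth_to_ascii <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)                       # code points of all characters in s
    is_fw <- cp >= 0xFF01 & cp <= 0xFF5E     # restrict the shift to the fullwidth block
    cp[is_fw] <- cp[is_fw] - 0xFEE0          # 0xFEE0 == 65248
    intToUtf8(cp)
  }, character(1), USE.NAMES = FALSE)
}

# Stand-in values written with escapes: fullwidth -122, 29458, 9
x <- c("\uFF0D\uFF11\uFF12\uFF12", "\uFF12\uFF19\uFF14\uFF15\uFF18", "\uFF19")
as.numeric(fullwidth_to_ascii(x))
# [1]  -122 29458     9
```

With the data frame from the question, this could presumably be applied to all year columns with mutate(across(starts_with("year_"), ~ as.numeric(fullwidth_to_ascii(.x)))). Limiting the shift to the fullwidth block, rather than to everything above 126, avoids mangling other non-ASCII text such as Chinese country names.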