I extracted some data from a Chinese pdf file.
The numbers in the columns are extracted as follows (for example): －１２２, ２９４５８, ９.
I copied and pasted the output of some cells. However, these characters are not the same as the ASCII strings -122, 29458, 9, respectively.
The class of the column is "character".
Hence parse_number() produces NA in all of these cases.
Any suggestions regarding what I should do?
This is the pdf file in question: http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf
I extracted the data from page 49 (53rd page of the pdf file), using the following code:
library(tidyverse)
library(pdftools)

# Download the PDF to a temporary file (mode = "wb" keeps the binary intact on Windows)
file <- tempfile()
url <- "http://images.mofcom.gov.cn/fec/202211/20221118091910924.pdf"
download.file(url, file, mode = "wb", headers = c("User-Agent" = "My Custom User Agent"))

# pdf_text() returns one character string per page
pdf_data <- pdf_text(file)

# Remove spaces and commas (thousands separators) from a character vector
replace_spaces_and_commas <- function(x) {
  str_replace_all(x, "[ ,]", "")
}

# Keep pages 53 to 71 of the file and split each page into lines
pdf <- pdf_data[53:71]
tab_pdf <- str_split(pdf, "\n")

# Create tab_pdf_1, ..., tab_pdf_19, one object per page
for (i in seq_along(tab_pdf)) {
  assign(paste0("tab_pdf_", i), tab_pdf[[i]])
}

the_names <- c("country", "year_2013", "year_2014", "year_2015", "year_2016", "year_2017", "year_2018", "year_2019", "year_2020", "year_2021")

view(tab_pdf_1)

# Lines 14 to 60 of the first page hold the table rows
pdf_clean1 <- tab_pdf_1[14:60] %>%
  str_trim() %>%
  str_replace_all(",", "") %>%
  str_split("\\s{2,}", simplify = TRUE) %>%
  data.frame(stringsAsFactors = FALSE) %>%
  setNames(the_names) %>%
  mutate(across(everything(), replace_spaces_and_commas)) %>%
  filter(country != "")
I tried both as.numeric(pdf_clean1$year_2013)
and parse_number(pdf_clean1$year_2013).
Both produced NAs, because the extracted strings are not equal to their ASCII look-alikes: "９" == "9", "－１２２" == "-122" and "２９４５８" == "29458"
all return FALSE.
The result of dput(head(pdf_clean1$year_2013))
is
c("－１２２", "２９４５８", "－７４", "１６３５７", "２", "－５３４")
sessionInfo()
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] countrycode_1.5.0 magrittr_2.0.3 pdftools_3.3.3
[4] lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[7] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4
[10] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.3 compiler_4.3.1 qpdf_1.3.2
[4] tidyselect_1.2.0 Rcpp_1.0.11 scales_1.2.1
[7] R6_2.5.1 generics_0.1.3 knitr_1.42
[10] munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[13] rlang_1.1.1 utf8_1.2.3 stringi_1.7.12
[16] xfun_0.39 timechange_0.2.0 cli_3.6.1
[19] withr_2.5.0 grid_4.3.1 rstudioapi_0.15.0
[22] hms_1.1.3 askpass_1.1 lifecycle_1.0.3
[25] vctrs_0.6.3 glue_1.6.2 fansi_1.0.4
[28] colorspace_2.1-0 tools_4.3.1 pkgconfig_2.0.3
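A base R approach: use utf8ToInt to get the integer code point of each character, subtract 65248 from the code points that lie above the ASCII range (> 126, see an ASCII table) to get the corresponding ASCII character, and finally convert the integer back to a character with intToUtf8. This works because the extracted digits and minus signs are fullwidth forms (U+FF01 to U+FF5E), which sit exactly 65248 (0xFEE0) code points above their ASCII counterparts (U+0021 to U+007E).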
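The offset can be checked directly on a single character. This is only a small sketch, and it assumes the extracted digits are Unicode fullwidth forms, written here with a \u escape so that the snippet does not depend on how the file is encoded:

```r
# Fullwidth digit nine (U+FF19), as assumed to come out of the PDF
fw <- "\uFF19"

# Distance between the fullwidth code point and the ASCII one
offset <- utf8ToInt(fw) - utf8ToInt("9")
offset                                # 65248

# The two strings look alike but are different characters
fw == "9"                             # FALSE

# Shifting the code point down by the offset gives the ASCII digit
intToUtf8(utf8ToInt(fw) - offset) == "9"   # TRUE
```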
as.numeric(
  unlist(strsplit(paste(
    sapply(unlist(strsplit("－１２２, ２９４５８, ９", "")), \(x)
      ifelse(utf8ToInt(x) > 126, intToUtf8(utf8ToInt(x) - 65248), x)),
    collapse = ""), ", ")))
[1] -122 29458 9
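To apply the same idea to a whole column rather than to one pasted string, the conversion can be wrapped in a small helper. The following is a sketch under assumptions: the name fullwidth_to_ascii is hypothetical, and it only shifts characters in the fullwidth range U+FF01 to U+FF5E, leaving everything else untouched.

```r
# Hypothetical helper: map fullwidth forms (U+FF01..U+FF5E) to ASCII,
# one string at a time, keeping NA as NA.
fullwidth_to_ascii <- function(x) {
  vapply(x, function(s) {
    if (is.na(s)) return(NA_character_)
    cp <- utf8ToInt(s)                    # code points of the string
    fw <- cp >= 0xFF01 & cp <= 0xFF5E     # which ones are fullwidth forms
    cp[fw] <- cp[fw] - 65248L             # shift them down to ASCII
    intToUtf8(cp)                         # back to a single string
  }, character(1), USE.NAMES = FALSE)
}

# Example input written with \u escapes: fullwidth "-122", "29458", "9"
x <- c("\uFF0D\uFF11\uFF12\uFF12",
       "\uFF12\uFF19\uFF14\uFF15\uFF18",
       "\uFF19")

result <- as.numeric(fullwidth_to_ascii(x))
result
```

With the data frame from the question, this could be used as pdf_clean1 %>% mutate(across(starts_with("year_"), ~ as.numeric(fullwidth_to_ascii(.x)))). If the stringi package is available, stringi::stri_trans_nfkc() should give the same result for these characters, since NFKC normalization maps fullwidth forms to their ASCII equivalents.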