I'm trying to scrape Russian media web pages that contain Cyrillic text using the rvest package in R.
However, for some of the pages (not all, for some reason) I'm encountering an encoding issue where the text does not display correctly after scraping. Instead of the expected Cyrillic characters, I see garbled output like:
Ðлава ÐдеÑÑÑ Ð¿ÑоÑÐ¸Ñ Ð²Ð»Ð°ÑÑи ÑÑÑÐ°Ð½Ñ ÑеÑÑÑ Ð·Ð° ÑÑол пеÑеговоÑов Ñ Ð Ð¾ÑÑией
Take this page for example:
url <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"
Both the page header (httr::headers(httr::HEAD(url))) and all the parameters in the HTML tell me the encoding should be UTF-8 and that the page is static.
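For reference, this is roughly how I checked the declared encoding (a sketch; the stringi::stri_enc_detect() call is just used as a second opinion on the raw bytes):
# declared encoding from the HTTP header
httr::headers(httr::HEAD(url))[["content-type"]]
#> "text/html; charset=UTF-8"
# second opinion: let stringi guess the encoding from the raw response bytes
stringi::stri_enc_detect(httr::GET(url)$content)[[1]]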
Originally, my scraper did not specify the encoding (so I guess it falls back to UTF-16 if UTF-8 throws an error?!), resulting in the character mess above.
Specifying Encoding in read_html():
I attempted to specify the encoding directly in the read_html() function:
text <- rvest::read_html(url, encoding = "UTF-8") %>% html_elements(".entry-title") %>% html_text2()
which results in
Error in read_xml.raw(raw, encoding = encoding, base_url = base_url, as_html = as_html, : Input is not proper UTF-8, indicate encoding ! Bytes: 0xD0 0x27 0x20 0x2F [9]
I also tried other encodings like "windows-1251" and "UTF-16" (and looped over all the encodings from stringi::stri_enc_list()), but that didn't get me any closer to resolving the issue.
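Roughly, that loop was a sketch like this (it re-downloads the page for every encoding name stringi knows about and keeps whatever parses without an error):
encs <- unlist(stringi::stri_enc_list())
titles <- lapply(encs, function(enc) {
  tryCatch({
    page <- rvest::read_html(url, encoding = enc)
    rvest::html_text2(rvest::html_elements(page, ".entry-title"))
  }, error = function(e) NA_character_)
})
# none of the encodings that parsed produced readable Cyrillic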
Ex-post string manipulation:
This was the closest I could get to the intended result (though it's still not perfect, and ideally it wouldn't be necessary if the scraper could handle the encoding in the first place).
I did this directly in the MariaDB database I write the scraped text into, using SQL:
CREATE TEMPORARY TABLE ttable (text VARCHAR(255) CHARACTER SET utf8);
INSERT INTO ttable (text) VALUES
('Ðлава ÐдеÑÑÑ Ð¿ÑоÑÐ¸Ñ Ð²Ð»Ð°ÑÑи ÑÑÑÐ°Ð½Ñ ÑеÑÑÑ Ð·Ð° ÑÑол пеÑеговоÑов Ñ Ð Ð¾ÑÑией');
SELECT text, CONVERT(CAST(CONVERT(text USING latin1) AS BINARY) USING utf8) AS corrected_text FROM ttable;
and got close:
??лава ??десс?? п??оси?? влас??и с????ан?? сес???? за с??ол пе??егово??ов с ? оссией
instead of
Глава Одессы просит власти страны сесть за стол переговоров с Россией
as displayed on the website. I could probably get to the right result from here using an LLM or similar, but I'd prefer to avoid scraping messy data in the first place...
Has anyone faced a similar encoding issue when scraping Cyrillic texts? Any help would be hugely appreciated!
[I'm using R 4.3.2 and rvest 1.0.3 on Windows 10]
rvest
Apparently one of the <meta> elements includes an invalid UTF-8 byte sequence.
From the W3C Nu Html Checker report:
Error: Malformed byte sequence: d0. At line 133, column 174.
Line 133:
<meta name="verification" content="f612c7d25f5690ad41496fcfdbf8d1" /><meta name='description' content='<strong>Мэр Одессы Геннадий Труханов призвал киевские власти вести пер�' /> <!-- Yandex.Metrica counter -->
To "fix" or replace incorrect code points for rvest
/ xml2
, we could load content as raw bytes through httr2
, replace offending byte(s) with stringi::stri_conv()
and only then let rvest
parse it.
library(rvest)
library(httr2)
library(stringi)
url_ <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"
resp <-
request(url_) |>
req_perform()
resp_header(resp, "Content-Type")
#> [1] "text/html; charset=UTF-8"
# attempt to parse response as-is
# ( resp_body_html just calls resp_body_raw(resp) |> xml2::read_html() )
resp_body_html(resp) |>
html_elements(".entry-title") |>
html_text2()
#> [1] "Ð\u0093лава Ð\u009eдеÑ\u0081Ñ\u0081Ñ\u008b пÑ\u0080оÑ\u0081иÑ\u0082 влаÑ\u0081Ñ\u0082и Ñ\u0081Ñ\u0082Ñ\u0080анÑ\u008b Ñ\u0081еÑ\u0081Ñ\u0082Ñ\u008c за Ñ\u0081Ñ\u0082ол пеÑ\u0080еговоÑ\u0080ов Ñ\u0081 РоÑ\u0081Ñ\u0081ией"
# replace incorrect code points with 'missing/erroneous' character,
# warning is expected, `?stringi::stri_conv` for details
text_utf8 <-
resp_body_raw(resp) |>
stri_conv(from = "UTF-8", to = "UTF-8")
#> Warning in stri_conv(resp_body_raw(resp), from = "UTF-8", to = "UTF-8"): input
#> data \xffffffd0 in the current source encoding could not be converted to
#> Unicode
read_html(text_utf8) |>
html_elements(".entry-title") |>
html_text2()
#> [1] "Глава Одессы просит власти страны сесть за стол переговоров с Россией"
We can locate the errors by looking for replacements (\ufffd & \u001a) in the output string:
# locate replacement(s)
(err <- stri_locate_all_regex(text_utf8, '[\ufffd\u001a]'))
#> [[1]]
#> start end
#> [1,] 27152 27152
err_elem <-
stri_sub(text_utf8, err[[1]][,"start"] - 200, err[[1]][,"end"] + 10) |>
stri_extract_last_regex("<[^/]+/>")
err_elem
#> [1] "<meta name='description' content='<strong>Мэр Одессы Геннадий Труханов призвал киевские власти вести пер�' />"
# check escape sequences to locate our replacements (\ufffd, \u001a)
stri_escape_unicode(err_elem)
#> [1] "<meta name=\\'description\\' content=\\'<strong>\\u041c\\u044d\\u0440 \\u041e\\u0434\\u0435\\u0441\\u0441\\u044b \\u0413\\u0435\\u043d\\u043d\\u0430\\u0434\\u0438\\u0439 \\u0422\\u0440\\u0443\\u0445\\u0430\\u043d\\u043e\\u0432 \\u043f\\u0440\\u0438\\u0437\\u0432\\u0430\\u043b \\u043a\\u0438\\u0435\\u0432\\u0441\\u043a\\u0438\\u0435 \\u0432\\u043b\\u0430\\u0441\\u0442\\u0438 \\u0432\\u0435\\u0441\\u0442\\u0438 \\u043f\\u0435\\u0440\\ufffd\\' />"
The last character in the content string is now replaced with \ufffd, the replacement character �. Browsers handle this in a somewhat similar way, and the use of \ufffd when handling parsing errors is described in the HTML5 standard.
When extracting content through chromote, as was proposed in the first revision of this answer, we get a string where invalid sequences are already replaced with the same \ufffd.
We can also try to split the response content into lines and check them with stringi::stri_enc_isutf8() to find the offending line(s) where the invalid sequence is still present. While string operations will not work on invalid byte sequences, vroom / readr can still handle the raw vector and split it into lines for us:
l <-
resp_body_raw(resp) |>
vroom::vroom_lines()
(err_idx <- which(!stri_enc_isutf8(l)))
#> [1] 133
l[err_idx]
#> [1] "<meta name=\"verification\" content=\"f612c7d25f5690ad41496fcfdbf8d1\" /><meta name='description' content='<strong>Мэр Одессы Геннадий Труханов призвал киевские власти вести пер\xd0' /> <!-- Yandex.Metrica counter -->"
The last word in the description content occupies 7 bytes, which doesn't look right in this context. And the last byte value is 0xD0 (0b11010000), which indicates that it should be the first byte of a 2-byte character.
The site seems to truncate the description content at a byte boundary rather than at a character or word boundary, so in some cases only the first part of a multibyte character is left hanging there, creating an invalid UTF-8 string.
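A quick way to double-check that reading of 0xD0 (a sketch; bitwAnd() just masks the high bits that mark a byte's role in UTF-8):
# 110xxxxx = lead byte of a 2-byte sequence, 10xxxxxx = continuation byte
bitwAnd(0xD0, 0xE0) == 0xC0  # TRUE: 0xD0 expects one continuation byte after it
bitwAnd(0xD0, 0xC0) == 0x80  # FALSE: 0xD0 is not itself a continuation byte
Since the next byte in the source is the closing quote (0x27, as the read_xml error in the question also shows), the sequence is truncated and therefore invalid.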
chromote
One possible workaround would be to use chromote to fetch the page content. It could be done through rvest::read_html_live(), but that would also load all linked resources and evaluate JavaScript.
To grab only the main page content, we can evaluate our own JavaScript in a Chrome session and use the Fetch API, for example:
library(rvest)
library(chromote)
url_ <- "https://news-front.su/2022/08/29/glava-odessy-prosit-vlasti-strany-sest-za-stol-peregovorov-s-rossiej/"
# evaluate fetch() in the browser session and return the response body as text
fetch_content <- function(chromote_session, url_, timeout = 10000){
chromote_session$Runtime$evaluate(
glue::glue('fetch("{url_}").then(response => response.text());'),
awaitPromise = TRUE,
timeout = timeout
)$result$value
}
b <- ChromoteSession$new()
fetch_content(b, url_) |>
read_html() |>
html_elements(".entry-title") |>
html_text2()
#> [1] "Глава Одессы просит власти страны сесть за стол переговоров с Россией"