Temporarily change locale settings


Actual question

How can I temporarily change/specify the locale settings to be used for certain function calls (e.g. strptime())?

Background

I just ran the following rvest demo:

demo("tripadvisor", package = "rvest")

When it gets to the part where the dates are scraped, I run into problems that are most likely caused by my locale settings: the dates are in a US American format, while I'm on a German locale:

require("rvest")
url <- "http://www.tripadvisor.com/Hotel_Review-g37209-d1762915-Reviews-JW_Marriott_Indianapolis-Indianapolis_Indiana.html"

reviews <- url %>%
  read_html() %>%  # html() was deprecated in rvest 0.3.0 in favor of read_html()
  html_nodes("#REVIEWS .innerBubble")

date <- reviews %>%
  html_node(".rating .ratingDate") %>%
  html_attr("title")
> date
 [1] "December 9, 2014" "December 9, 2014" "December 8, 2014" "December 8, 2014"
 [5] "December 6, 2014" "December 5, 2014" "December 5, 2014" "December 3, 2014"
 [9] "December 3, 2014" "December 3, 2014"

Based on this output, I would use the format %B %e, %Y (or %B%e, %Y, depending on what "with a leading space for a single-digit number" actually means with respect to the leading space; see ?strptime).

Yet both fail:

strptime(date, "%B %e, %Y")
strptime(date, "%B%e, %Y")

I suppose it's due to the fact that %B expects the month names to be in German instead of English:

Full month name in the current locale. (Also matches abbreviated name on input.)


EDIT

Sys.setlocale() lets you change your locale settings. But it seems that the change has no effect once a function relying on locale settings has been called, i.e., you need to start a fresh R session for the locale change to take effect. This makes temporary changes a bit cumbersome. Any ideas how to work around this?

This is my locale:

> Sys.getlocale(category = "LC_ALL")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"

When I change it before running strptime() for the first time, everything works just fine:

Sys.setlocale(category = "LC_ALL", locale = "us")
> strptime(date, "%B %e, %Y")
 [1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
 [6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"

However, if I change it after having run strptime(), the change does not seem to be recognized:

> Sys.setlocale(category = "LC_ALL", locale = "German")
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
> strptime(date, "%B %e, %Y")
 [1] "2014-12-09 CET" "2014-12-09 CET" "2014-12-08 CET" "2014-12-08 CET" "2014-12-06 CET"
 [6] "2014-12-05 CET" "2014-12-05 CET" "2014-12-03 CET" "2014-12-03 CET" "2014-12-03 CET"

If the change back to the German locale had taken effect, this call should have returned a vector of NAs.


Solution

  • parse_date_time() from the lubridate package is what you are looking for. It has an explicit locale option for parsing strings according to a specific locale.

    parse_date_time(date, orders = "B d, Y", locale = "us")
    

    gives you:

    [1] "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-26 UTC" "2016-02-24 UTC" "2016-02-23 UTC" "2016-02-21 UTC"
    [7] "2016-02-21 UTC" "2016-02-21 UTC" "2016-02-20 UTC" "2016-02-20 UTC"
    

    Note that you give the parsing format without the leading % that you would use in strptime().
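
For readers who prefer to stay in base R, the general idea of a temporary locale change can be sketched as a save/set/restore wrapper around strptime(). This is only a minimal sketch under some assumptions: strptime_with_locale() is a hypothetical helper name (it is not part of base R or lubridate), only LC_TIME is touched rather than LC_ALL, and it assumes the running R session picks up LC_TIME changes made via Sys.setlocale(), which the session described in the question apparently did not. The "C" locale is used as the default because it is available on every platform and uses English month names.

```r
# Hypothetical helper (not part of base R): parse dates under a temporary
# LC_TIME setting and restore the previous setting afterwards.
strptime_with_locale <- function(x, format, locale = "C", tz = "UTC") {
  old <- Sys.getlocale("LC_TIME")
  # Restore the caller's LC_TIME on exit, even if strptime() raises an error.
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  strptime(x, format, tz = tz)
}

# A few of the dates from the question, typed in by hand.
date <- c("December 9, 2014", "December 8, 2014", "December 3, 2014")

before <- Sys.getlocale("LC_TIME")
parsed <- strptime_with_locale(date, "%B %e, %Y")

# Print in a locale-independent numeric format.
format(parsed, "%Y-%m-%d")

# Check that the session's LC_TIME is unchanged after the call.
identical(Sys.getlocale("LC_TIME"), before)
```

Because the restore is registered with on.exit(), the caller's locale is put back regardless of how the function exits, so the change stays local to the one parsing call.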