
R digit-expression and unlist doesn't work


So I've bought a book on R and automated data collection, and one of the first examples is leaving me baffled.

I have a table with a date-column consisting of numbers looking like this "2001-". According to the tutorial, the line below will remove the "-" from the dates by singling out the first four digits:

yend_clean <- unlist(str_extract_all(danger_table$yend, "[[:digit:]]4$"))

When I run this command, "yend_clean" is simply set to "character (empty)".

If I remove the "4$", I get all of the dates split into atoms so that the list that originally looked like this "1992", "2003" now looks like this "1", "9" etc.

So I suspect that something around the "4$" is the problem. I can't find any documentation on this that helps me figure out the correct solution.

Was hoping someone in here could point me in the right direction.


Solution

  • This is a regular expression question. Your regular expression is wrong. Use:

    unlist(str_extract_all("2003-", "^[[:digit:]]{4}"))
    

    or equivalently

    sub("^(\\d{4}).*", "\\1", "2003-")
    

    or, if all you really want is to remove the "-":

    sub("-", "", "2003-")
    

    Repetition in regular expressions is controlled by the {} quantifier, which you were missing: without the braces, the 4 is just a literal character. Additionally, $ matches the end of the string, so your expression translates as:

    match any single digit, followed by a 4, followed by the end of the string

    When you remove the "4", then the pattern becomes "match any single digit", which is exactly what happens (i.e. you get each digit matched separately).

    The pattern I propose says instead:

    match the beginning of the string (^), followed by a digit repeated four times.

    The sub variation is a very common technique where we create a pattern that matches what we want to keep in parentheses, and then everything else outside of the parentheses (.* matches anything, any number of times). We then replace the entire match with just the piece in the parens (\\1 means the first sub-expression in parentheses). \\d is equivalent to [[:digit:]].
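
To see the three behaviours described above side by side, here is a minimal sketch. It assumes a made-up vector `yend` standing in for `danger_table$yend`, and it uses base R (`grepl`, `regmatches`, `sub`) rather than stringr so that it runs without extra packages; the patterns themselves are the ones from the question and the answer.

```r
# Hypothetical sample values shaped like the "yend" column in the question.
yend <- c("1992-", "2003-", "2014-")

# Original pattern: one digit, then a literal "4", then end of string.
# Every value ends in "-", so nothing can match.
grepl("[[:digit:]]4$", yend)
# [1] FALSE FALSE FALSE

# Without the "4$": every digit is matched on its own.
single_digits <- unlist(regmatches(yend, gregexpr("[[:digit:]]", yend)))
head(single_digits, 4)
# [1] "1" "9" "9" "2"

# Corrected pattern: exactly four digits anchored at the start of the string.
yend_clean <- regmatches(yend, regexpr("^[[:digit:]]{4}", yend))
yend_clean
# [1] "1992" "2003" "2014"

# Capture-group variant: keep group 1, discard everything else.
yend_sub <- sub("^(\\d{4}).*", "\\1", yend)
yend_sub
# [1] "1992" "2003" "2014"
```

One difference worth knowing: `regmatches()` with `regexpr()` silently drops elements that do not match, whereas `sub()` returns non-matching elements unchanged, so the two can give vectors of different lengths on messier data.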