r date data-conversion redcap

Trying to extract a date from a 5- or 6-digit number


I am trying to extract a date from a number. The date is stored as the first 6 digits of an 11-digit personal ID number (day-month-year). Unfortunately, the cloud-based database (REDCap) exports this field as a number, so the leading zero is dropped for people born on the first nine days of a month, and they end up with a 10-digit ID number instead of an 11-digit one. I managed to extract the 6- or 5-digit number corresponding to the date, e.g. 311230 for 31 December 1930, or 11230 for 1 December 1930. I am left with two problems that I have not been able to solve.

Let's say we use the following numbers:

dato <- c(311230, 311245, 311267, 311268, 310169, 201104, 51230, 51269, 51204)

I convert these to strings and then apply the as.Date() function:

datostr <- as.character(dato)
datofinal <- as.Date(datostr, "%d%m%y")
datofinal

The problems I have are:

  • Five-digit numbers (e.g. 11230) are reported as NA.
  • Six-digit numbers are recognized, but birth dates before 1 January 1969 are reported with 100 years added, e.g. 010160 is converted to 2060-01-01.

I am sure this must be easy for those who are more knowledgeable about R, but I am struggling to solve it. Any help is greatly appreciated.

Greetings Bjorn


Solution

  • If your 5-digit numbers really just need to be zero-padded, then sprintf() with a zero-padded width of 6 does it:

    dato_s <- sprintf("%06d", dato)
    dato_s
    # [1] "311230" "311245" "311267" "311268" "310169" "201104" "051230" "051269" "051204"
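
The same padding idea could also be applied one step earlier, to the full ID number, before the date part is cut out. The following is only a sketch: the `id` values are made up, and it assumes the IDs arrive as plain numeric values. Since 11-digit values are outside R's 32-bit integer range, it uses the "%.0f" format rather than "%d".

```r
# Made-up 11-digit IDs as REDCap might export them (numeric, leading zero lost).
id <- c(31123012345, 1123012345)

# "%011.0f" pads to 11 characters with zeros and prints no decimals.
id_s <- sprintf("%011.0f", id)

# The first six characters are the ddmmyy date part.
dato_s <- substr(id_s, 1, 6)
dato_s
# [1] "311230" "011230"
```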
    

    As for your question about dates before 1969, take a look at ?strptime for the '%y' pattern:

     '%y' Year without century (00-99).  On input, values 00 to 68 are
          prefixed by 20 and 69 to 99 by 19 - that is the behaviour
          specified by the 2018 POSIX standard, but it does also say
          'it is expected that in a future version the default century
          inferred from a 2-digit year will change'.
    

    So if you need a different century for some of those years, add the century yourself before passing the strings to as.Date (which uses strptime patterns).

    dato_d <- as.Date(gsub("([0-4][0-9])$", "20\\1",
                           gsub("([5-9][0-9])$", "19\\1", dato_s)),
                      format = "%d%m%Y")
    dato_d
    # [1] "2030-12-31" "2045-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
    # [7] "2030-12-05" "1969-12-05" "2004-12-05"
    

    In this case, I'm assuming years 50-99 belong to the 1900s and everything else to the 2000s. If you need the 40s or 30s to be 19xx as well, adjust the patterns: add digits to the second pattern (e.g., [3-9]) and remove them from the first pattern (e.g., [0-2]), making sure every decade is covered by exactly one pattern, not "neither" and not "both".
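
If editing two regular expressions in step feels error-prone, the cutoff can be expressed as a single number instead. This is a minimal sketch, not part of the original answer: `add_century` and its `pivot` argument are hypothetical names, and it assumes the input is already zero-padded to six characters.

```r
# Hypothetical helper: two-digit years >= pivot become 19xx, the rest 20xx.
add_century <- function(ddmmyy, pivot = 50L) {
  yy <- as.integer(substr(ddmmyy, 5, 6))
  century <- ifelse(yy >= pivot, "19", "20")
  # Rebuild the string as ddmmYYYY and parse it with a four-digit year.
  as.Date(paste0(substr(ddmmyy, 1, 4), century, substr(ddmmyy, 5, 6)),
          format = "%d%m%Y")
}

dato_s <- sprintf("%06d", c(311230, 51269, 51204))
add_century(dato_s, pivot = 50L)
# [1] "2030-12-31" "1969-12-05" "2004-12-05"
add_century(dato_s, pivot = 25L)
# [1] "1930-12-31" "1969-12-05" "2004-12-05"
```

Because every year falls on exactly one side of the pivot, no decade can end up in "neither" or "both".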

    Borrowing from Allan's answer, I like the assumption that no date can lie after now() (since you did mention "born on"). Without lubridate, try this:

    dato_s <- sprintf("%06d", dato)
    dato_d <- as.Date(dato_s, format = "%d%m%y")
    dato_d[ dato_d > Sys.Date() ] <-
      as.Date(sub("([0-9]{2})$", "19\\1", dato_s[ dato_d > Sys.Date() ]), format = "%d%m%Y")
    dato_d
    # [1] "1930-12-31" "1945-12-31" "1967-12-31" "1968-12-31" "1969-01-31" "2004-11-20"
    # [7] "1930-12-05" "1969-12-05" "2004-12-05"
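
The "no birth date in the future" rule can be wrapped in a small function so that it is easy to reuse. This is only a sketch under stated assumptions: `parse_birthdate` and its `today` argument are hypothetical names, and it uses which() so that values that fail to parse (and are therefore NA) are left alone, since an NA in a logical index can make a subscripted assignment fail.

```r
# Hypothetical wrapper around the approach above.
parse_birthdate <- function(x, today = Sys.Date()) {
  s <- sprintf("%06d", as.integer(x))    # restore the lost leading zero
  d <- as.Date(s, format = "%d%m%y")     # %y maps 00-68 to 20xx, 69-99 to 19xx
  future <- which(d > today)             # which() skips NA entries
  # Re-parse future dates with an explicit "19" in front of the year.
  d[future] <- as.Date(sub("([0-9]{2})$", "19\\1", s[future]),
                       format = "%d%m%Y")
  d
}

# 991399 is not a valid day/month and stays NA.
parse_birthdate(c(311230, 51204, 991399), today = as.Date("2024-01-01"))
# [1] "1930-12-31" "2004-12-05" NA
```

Note that this rule cannot tell apart people born exactly 100 years apart, so anyone older than 100 would still need separate handling.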