r regex date date-formatting

Regex to extract dates of various formats from URLs


I need to extract the date from each URL in a data frame like this one:

id | url
1 | https://www.infobae.com/tecno/2018/08/22/una-plataforma-argentina-entre-las-10-soluciones-de-big-data-mas-destacadas-del-ano/
2 | https://www.infobae.com/2014/08/03/1584584-que-es-data-lake-y-como-transforma-el-almacenamiento-datos/
3 | http://www.ellitoral.com/index.php/diarios/2018/01/09/economia1/ECON-02.html
4 | http://www.ellitoral.com/index.php/diarios/2017/12/01/economia1/ECON-01.html
5 | https://www.cronista.com/contenidos/2017/08/16/noticia_0089.html
6 | https://www.cronista.com/contenidos/2017/04/20/noticia_0090.html
7 | https://www.perfil.com/noticias/economia/supercomputadoras-para-sacarles-provecho-a-los-datos-20160409-0023.phtml
8 | https://www.mdzol.com/sociedad/100-cursos-online-gratuitos-sobre-profesiones-del-futuro-20170816-0035.html
9 | https://www.eldia.com/nota/2018-8-26-7-33-54--pueden-nuestros-datos-ponernos-en-serio-riesgo--revista-domingo
10 | https://www.letrap.com.ar/nota/2018-8-6-13-34-0-lula-eligio-a-su-vice-o-a-su-reemplazante
11 | https://www.telam.com.ar/notas/201804/270831-los-pacientes-deben-conocer-que-tipo-de-datos-usan-sus-medicos-coinciden-especialistas.html
12 | http://www.telam.com.ar/notas/201804/271299-invierten-100-millones-en-plataforma-de-internet-de-las-cosas.html
13 | http://www.telam.com.ar/notas/201308/30404-realizan-jornadas-sobre-tecnologia-para-gestion-de-datos.php
14 | http://www.telam.com.ar/notas/201701/176163-inteligencia-artificial-lectura-de-diarios.html

These URLs carry the date in different formats:
links 1-6 use /yyyy/mm/dd/
links 7-8 use -yyyymmdd-
links 9-10 use /yyyy-m-d-
links 11-14 use /yyyymm/

Luckily, the dates are all numeric (no "Jan" instead of 1).

Is there a regex that could extract them all, or most of them?


Solution

  • I believe the following regular expression does what you want.

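    # Alternatives: yyyymmdd | yyyymm | yyyy<sep>mm | yyyy<sep>m(m)<sep>d(d)
    regex <- "\\d{8}|\\d{6}|\\d{4}[^\\d]\\d{2}|\\d{4}[^\\d]\\d{1,2}[^\\d]\\d{1,2}"
    # regexpr() locates the first match in each URL; regmatches() extracts it
    regmatches(URLData$url, regexpr(regex, URLData$url))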
    # [1] "2018/08/22" "2014/08/03" "2018/01/09" "2017/12/01" "2017/08/16"
    # [6] "2017/04/20" "20160409"   "20170816"   "2018-8-26"  "2018-8-6"  
    #[11] "201804"     "201804"     "201308"     "201701"
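
One caveat with `regmatches()` used this way: it returns only the strings that matched, so a URL without any date-like token would silently shorten the result and misalign it with the rows of the data frame. A minimal sketch of an NA-preserving variant follows. The helper name `extract_date_token` and the variable `date_pattern` are hypothetical, and the pattern restates the regex above with explicit `[0-9]` classes, which is assumed to be equivalent for these URLs.

```r
# Assumption: same four alternatives as the regex above, written with
# explicit [0-9] classes instead of \\d.
date_pattern <- paste0(
  "[0-9]{8}|[0-9]{6}|",
  "[0-9]{4}[^0-9][0-9]{2}|",
  "[0-9]{4}[^0-9][0-9]{1,2}[^0-9][0-9]{1,2}"
)

# Hypothetical helper: one result per input, NA where nothing matched.
extract_date_token <- function(x, pattern = date_pattern) {
  x <- as.character(x)                  # also handles factor columns
  m <- regexpr(pattern, x)              # -1 marks "no match"
  hit <- !is.na(m) & m != -1
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)          # regmatches() returns matches only
  out
}

urls <- c(
  "https://www.cronista.com/contenidos/2017/08/16/noticia_0089.html",
  "https://www.cronista.com/contenidos/",  # hypothetical URL without a date
  "https://www.perfil.com/noticias/economia/supercomputadoras-para-sacarles-provecho-a-los-datos-20160409-0023.phtml"
)
extract_date_token(urls)
# [1] "2017/08/16" NA           "20160409"
```

Because the output always has the same length as the input, it can be assigned straight back to a column of the data frame.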
    

    Edit.

    After reading the answer by @hrbrmstr, I realized that it is probably better to coerce the results to class Date. I will use the external package lubridate to do that.

    d <- regmatches(URLData$url, regexpr(regex, URLData$url))
    # yyyymm matches carry no day; append "01" so ymd() can parse them
    d[nchar(d) < 7] <- paste0(d[nchar(d) < 7], "01")
    d <- lubridate::ymd(d)
    d
    # [1] "2018-08-22" "2014-08-03" "2018-01-09" "2017-12-01" "2017-08-16"
    # [6] "2017-04-20" "2016-04-09" "2017-08-16" "2018-08-26" "2018-08-06"
    #[11] "2018-04-01" "2018-04-01" "2013-08-01" "2017-01-01"
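
If adding a dependency is not desirable, the same coercion can probably be done in base R with `as.Date()`. This is a swapped-in alternative to the lubridate call above, not part of the original answer. The function name `tokens_to_date` is hypothetical, and the sketch assumes the tokens have one of the four shapes found in these URLs.

```r
# Hypothetical helper: convert extracted tokens to Date using base R only.
tokens_to_date <- function(tok) {
  tok <- as.character(tok)
  out <- rep(as.Date(NA), length(tok))

  # Digits-only tokens are yyyymmdd or yyyymm; the rest use a separator.
  compact   <- !is.na(tok) & !grepl("[^0-9]", tok)
  separated <- !is.na(tok) & !compact

  # Pad yyyymm with day "01", then parse as yyyymmdd.
  ct <- tok[compact]
  six <- nchar(ct) == 6
  ct[six] <- paste0(ct[six], "01")
  out[compact] <- as.Date(ct, format = "%Y%m%d")

  # Unify any separator to "-"; %m and %d accept one- or two-digit values.
  st <- gsub("[^0-9]", "-", tok[separated])
  out[separated] <- as.Date(st, format = "%Y-%m-%d")

  out
}

tokens_to_date(c("2018/08/22", "20160409", "2018-8-6", "201804"))
# [1] "2018-08-22" "2016-04-09" "2018-08-06" "2018-04-01"
```

Tokens that do not parse come back as NA rather than raising an error, which keeps the result aligned with the input.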
    

    Data in dput format.

    URLData <-
    structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 
    13, 14), url = structure(c(10L, 9L, 2L, 1L, 7L, 6L, 13L, 12L, 
    8L, 11L, 14L, 5L, 3L, 4L), .Label = c(" http://www.ellitoral.com/index.php/diarios/2017/12/01/economia1/ECON-01.html", 
    " http://www.ellitoral.com/index.php/diarios/2018/01/09/economia1/ECON-02.html", 
    " http://www.telam.com.ar/notas/201308/30404-realizan-jornadas-sobre-tecnologia-para-gestion-de-datos.php", 
    " http://www.telam.com.ar/notas/201701/176163-inteligencia-artificial-lectura-de-diarios.html                      ", 
    " http://www.telam.com.ar/notas/201804/271299-invierten-100-millones-en-plataforma-de-internet-de-las-cosas.html", 
    " https://www.cronista.com/contenidos/2017/04/20/noticia_0090.html", 
    " https://www.cronista.com/contenidos/2017/08/16/noticia_0089.html", 
    " https://www.eldia.com/nota/2018-8-26-7-33-54--pueden-nuestros-datos-ponernos-en-serio-riesgo--revista-domingo", 
    " https://www.infobae.com/2014/08/03/1584584-que-es-data-lake-y-como-transforma-el-almacenamiento-datos/", 
    " https://www.infobae.com/tecno/2018/08/22/una-plataforma-argentina-entre-las-10-soluciones-de-big-data-mas-destacadas-del-ano/", 
    " https://www.letrap.com.ar/nota/2018-8-6-13-34-0-lula-eligio-a-su-vice-o-a-su-reemplazante", 
    " https://www.mdzol.com/sociedad/100-cursos-online-gratuitos-sobre-profesiones-del-futuro-20170816-0035.html", 
    " https://www.perfil.com/noticias/economia/supercomputadoras-para-sacarles-provecho-a-los-datos-20160409-0023.phtml", 
    " https://www.telam.com.ar/notas/201804/270831-los-pacientes-deben-conocer-que-tipo-de-datos-usan-sus-medicos-coinciden-especialistas.html"
    ), class = "factor")), class = "data.frame", row.names = c(NA, 
    -14L))
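
Note that in the dput data above `url` is a factor, and its labels carry a leading space (one also has trailing spaces). The extraction still works because the regex is not anchored, but it may be cleaner to convert the column to trimmed character strings first. A minimal sketch, where the helper name `clean_urls` is hypothetical:

```r
# Hypothetical helper: factor or character in, trimmed character out.
clean_urls <- function(x) {
  trimws(as.character(x))   # drop leading and trailing whitespace
}

# Example with two of the labels from the data above
raw <- factor(c(
  " https://www.cronista.com/contenidos/2017/04/20/noticia_0090.html",
  " http://www.telam.com.ar/notas/201701/176163-inteligencia-artificial-lectura-de-diarios.html                      "
))
clean_urls(raw)
```

With the data frame above this would be applied as `URLData$url <- clean_urls(URLData$url)` before running the extraction.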