r, dataframe, date, data-wrangling

dataframe breakdown by year


I have a dataset on county executives and their year of inauguration. I need to extract the year in which each executive was inaugurated.

The problem is that the notation under the "year" variable is inconsistent.

For instance, let's say I start with this:

df <- data.frame(year= c(2000, "from 2001 to 2002", "01-feb-2003", 2000, "01-jan-2002", "from 2004 to 2005"),
                  executive.name= c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                  district= rep(c(1001, 1002), each=3))

I want it to look like this:

df.neat <- data.frame(year= c(2000, 2001, 2003, 2000, 2002, 2004),
                  executive.name= c("Johnson", "Smith", "Alleghany", "Roberts", "Clarke", "Tollson"),
                  district= rep(c(1001, 1002), each=3))

Note that the inauguration cycles do not always align across districts (2000, 2001, and 2003 for district 1001 versus 2000, 2002, and 2004 for district 1002).


Solution

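    library(dplyr)
    library(stringr)
    
    # str_extract() returns the first run of four digits in each string,
    # which is the inauguration year in every format shown in the question
    df |>
      mutate(year = as.numeric(str_extract(year, "\\d{4}")))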
    #   year executive.name district
    # 1 2000        Johnson     1001
    # 2 2001          Smith     1001
    # 3 2003      Alleghany     1001
    # 4 2000        Roberts     1002
    # 5 2002         Clarke     1002
    # 6 2004        Tollson     1002
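
If you would rather not load extra packages, the same idea can be sketched in base R. This is only a sketch. It assumes, as in the sample data, that the first run of four digits in each string is the inauguration year. The helper names `yr`, `m`, and `first_year` are made up for illustration.

```r
# Same sample data as in the question; mixing numbers and strings in c()
# coerces the whole year column to character.
df <- data.frame(
  year = c(2000, "from 2001 to 2002", "01-feb-2003",
           2000, "01-jan-2002", "from 2004 to 2005"),
  executive.name = c("Johnson", "Smith", "Alleghany",
                     "Roberts", "Clarke", "Tollson"),
  district = rep(c(1001, 1002), each = 3)
)

# as.character() guards against the column being a factor (older R defaults).
yr <- as.character(df$year)

# regexpr() locates the first run of four digits in each string (-1 if none).
m <- regexpr("[0-9]{4}", yr)

# Start from NA so that rows without a match stay missing
# instead of shifting the remaining values.
first_year <- rep(NA_character_, length(yr))
first_year[m != -1] <- regmatches(yr, m)

df$year <- as.numeric(first_year)
df
```

Rows with no four-digit run end up as `NA` here, which mirrors what `str_extract()` does in the answer above.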