Search code examples
rreshape

reshape dataframe from wide to long in R


I have a dataframe including around 300 columns. These columns are grouped. Here is the example:

id  name fixed_2020 fixed_2019 fixed_2018 fixed_2017 fixed_2016 current_2020  current_2019 current_2018 current_2017 current_2016
 1     A       2300       2100       2600       2600       1900         3000          3100         3200         3300         3400

I have around 20000 rows in total in the dataframe. And I would like to reshape this dataframe from wide to long in R. I tried to use function melt:

fixed <- melt(mydata, id.vars = c('id', 'name'), measure.vars = 3:7, variable.name = 'year', value.name = 'fixed')

Then I use gsub to get the year column

fixed$year <- as.character(gsub("[^0-9-]", "", debtors$year))

Here is what I want:

id   name  year  fixed  current
 1   A     2020  2030    3000
 2   A     2019  2100    3100
 3   A     2018  2600    3200
 4   A     2017  2600    3300
 5   A     2016  1900    3400

While it does give me what I want, the process is time-consuming when I have more than 300 columns. Since my dataframe is grouped according to the variables and years (10 years for each variable), I wonder whether there are other functions to do it more efficiently.

Thanks a lot in advance!


Solution

  • Using data.table:

    library(data.table)
    setDT(mydata)
    result <- melt(mydata, id=c('id', 'name'), 
                     measure.vars = patterns(fixed='fixed_', current='current_'), 
                     variable.name = 'year')
    years <- as.numeric(gsub('.+_(\\d+)', '\\1', grep('fixed', names(mydata), value = TRUE)))
    result[, year:=years[year]]
    result[, id:=seq(.N), by=.(name)]
    result
    ##    id name year fixed current
    ## 1:  1    A 2020  2300    3000
    ## 2:  2    A 2019  2100    3100
    ## 3:  3    A 2018  2600    3200
    ## 4:  4    A 2017  2600    3300
    ## 5:  5    A 2016  1900    3400
    

    This should be very fast but your data-set is not very big tbh.

    Note that this assumes the fixed and current columns are in the same order and associated with the same year(s). So if there is a fixed_2020 as the first fixed_* column, there is also a current_2020 as the first current_* column, and so on. Otherwise, the year column will correctly associate with fixed but not current