Search code examples
rtidyversetidyrreshape2

Reshaping a dataframe with a huge character column into several numeric columns in R


I have a dataset with a complex column results as (character class).

Data sample is here.

date_agenda id_question id_event results
2020-09-04  202009043   7426      "2:1:3|3:1:3|4:8:4|5:1:4|6:7:0|7:7:4|8:7:3|9:3:0|10:8:4|11:1:0|12:8:4|13:8:4|14:8:1|15:1:3|16:1:0|17:1:3|18:1:3|"
2020-09-04  202009044   7427      "2:1:3|3:1:3|4:8:4|5:1:4|6:7:0|7:7:4|8:7:3|9:3:0|10:8:4|11:1:0|12:8:4|13:8:4|14:8:3|15:1:3|16:1:0|17:1:3|18:1:3|"
2020-09-04  202009046   7428      "2:1:1|3:1:1|4:8:4|5:1:4|6:7:0|7:7:4|8:7:1|9:3:0|10:8:4|11:1:0|12:8:4|13:8:4|14:8:3|15:1:1|16:1:0|17:1:1|18:1:1|"

I would like to convert results and the whole dataframe from wide to long format, splitting the last column results into three numeric columns.

For example, 2:1:3 is one event in results and there hundreds of such events in this column.

My desired output:

date_agenda id_question id_event mp_id mp_fraction result
2020-09-04  202009043   7426       2         1        3
2020-09-04  202009043   7426       3         1        3
2020-09-04  202009043   7426       4         8        4
2020-09-04  202009043   7426       5         1        4
2020-09-04  202009043   7426       6         7        0

Basically, one row in the initial wide format (4 columns) will be converted into hundreds of rows in the long format (6 columns).

I understand that it should be done with pivot_longer, pivot_wider (or gather and spread) and maybe melt, but not sure that I have the logic now. Whether I should split the character first into hundreds of columns and then make it wide or something else.

Appreciate tips! Thanks!


Solution

  • We can use separate_rows to split results on "|" into different rows and separate to split on ":" in three different columns.

    library(dplyr)
    library(tidyr)
    
    df %>%
      separate_rows(results, sep = '\\|') %>%
      #To remove empty values generated from last "|"
      filter(results != '') %>%
      separate(results,c("mp_id", "mp_fraction", "result"), sep = ":",convert = TRUE)
    
    # A tibble: 51 x 6
    #   date_agenda id_question id_event mp_id mp_fraction result
    #   <chr>             <int>    <int> <int>       <int>  <int>
    # 1 2020-09-04    202009043     7426     2           1      3
    # 2 2020-09-04    202009043     7426     3           1      3
    # 3 2020-09-04    202009043     7426     4           8      4
    # 4 2020-09-04    202009043     7426     5           1      4
    # 5 2020-09-04    202009043     7426     6           7      0
    # 6 2020-09-04    202009043     7426     7           7      4
    # 7 2020-09-04    202009043     7426     8           7      3
    # 8 2020-09-04    202009043     7426     9           3      0
    # 9 2020-09-04    202009043     7426    10           8      4
    #10 2020-09-04    202009043     7426    11           1      0
    # … with 41 more rows