Search code examples
rdataframespread

Spread by Two Columns


I am trying to spread a data frame, but I am not quite familiar with spread() and gather().

Below is a sample of my data. It has 9 rows all with the same Application.Number. I would like to end up with one row per Application.Number-Decicion combination. The remaining variables date_generated date_decided time_to_decision and text have to be repeated for each Application.Number-Decicion combination or the last one should be taken. The data is already sorted by Application.Number and date_generated.

structure(list(Application.Number = c(80749L, 80749L, 80749L, 
80749L, 80749L, 80749L, 80749L, 80749L, 80749L), Decision = c("Invalid", 
"Invalid", "Invalid", "Invalid", "Invalid", "Invalid", "Approved", 
"Approved", "Approved"), date_generated = structure(c(1521810060, 
1521810060, 1523523840, 1523536500, 1524036720, 1524136380, 1524137460, 
1524137460, 1524137460), class = c("POSIXct", "POSIXt"), tzone = ""), 
    date_decided = structure(c(1522155960, 1522155660, 1523534400, 
    1523600520, 1524127140, 1524136740, 1524211800, 1524211740, 
    1524211200), class = c("POSIXct", "POSIXt"), tzone = ""), 
    time_to_decision = c(4.00347222222222, 4, 0.122222222222222, 
    0.740972222222222, 1.04652777777778, 0.00416666666666667, 
    0.860416666666667, 0.859722222222222, 0.853472222222222), 
    text = c("rIUQRmOkyZ", "ZxdYUr16NR", "8IIipoleOV", "nLuIgToxcT", 
    "xYFksrws87", "N2oECMtgQo", "RKcrBcBFI2", "jaH438byVt", "80ggA2hZr7"
    )), row.names = 15880:15888, class = "data.frame")

EDIT: Decided that the output should be just one row and all rows should pivot around Application.Number.

I ended up making a separate data frame with the duplicates and joining it back to the unique rows.

There must be a better way to do it.


Solution

  • Well, you could do it, but I agree with user 42-, it will lead to problems because of data format later:

    > gather(x, "key", "val", -Application.Number, -Decision)
       Application.Number Decision              key                 val
    1               80749  Invalid   date_generated          1521810060
    2               80749  Invalid   date_generated          1521810060
    3               80749  Invalid   date_generated          1523523840
    4               80749  Invalid   date_generated          1523536500
    5               80749  Invalid   date_generated          1524036720
    6               80749  Invalid   date_generated          1524136380
    7               80749 Approved   date_generated          1524137460
    8               80749 Approved   date_generated          1524137460
    9               80749 Approved   date_generated          1524137460
    10              80749  Invalid     date_decided          1522155960
    11              80749  Invalid     date_decided          1522155660
    12              80749  Invalid     date_decided          1523534400
    13              80749  Invalid     date_decided          1523600520
    14              80749  Invalid     date_decided          1524127140
    15              80749  Invalid     date_decided          1524136740
    16              80749 Approved     date_decided          1524211800
    17              80749 Approved     date_decided          1524211740
    18              80749 Approved     date_decided          1524211200
    19              80749  Invalid time_to_decision    4.00347222222222
    20              80749  Invalid time_to_decision                   4
    21              80749  Invalid time_to_decision   0.122222222222222
    22              80749  Invalid time_to_decision   0.740972222222222
    23              80749  Invalid time_to_decision    1.04652777777778
    24              80749  Invalid time_to_decision 0.00416666666666667
    25              80749 Approved time_to_decision   0.860416666666667
    26              80749 Approved time_to_decision   0.859722222222222
    27              80749 Approved time_to_decision   0.853472222222222
    28              80749  Invalid             text          rIUQRmOkyZ
    29              80749  Invalid             text          ZxdYUr16NR
    30              80749  Invalid             text          8IIipoleOV
    31              80749  Invalid             text          nLuIgToxcT
    32              80749  Invalid             text          xYFksrws87
    33              80749  Invalid             text          N2oECMtgQo
    34              80749 Approved             text          RKcrBcBFI2
    35              80749 Approved             text          jaH438byVt
    36              80749 Approved             text          80ggA2hZr7
    Warning:
    attributes are not identical across measure variables;
    they will be dropped 
    

    The warning already is a hint: You have converted all your value columns data_generated, date_decided, time_to_decision and text to the most general data format which can hold all these values: Character strings. See how your dates were converted to seconds since the epoch: You lost for example time zone information.

    So in short, you can do it but I don't think you should. Because you did not show your use case or any context, I can't propose a better solution though.,