Search code examples
rsortingmultiple-columnsxlsxcolumnsorting

How to distribute parts of a column besides each other using R?


I have an .xlsx document containing some data and measurements that is within 3 columns (ImageNumber, ObjectNumber, Intensity) The problem is, that this data is all in just one column as it is shown here:

263     2   347.92942202463746
264     2   340.47059811465442
265     2   626.37256725877523
266     2   352.60785254277289
267     2   1161.9843464940786
268     2   353.31373599730432
269     2   1164.090231411159
270     2   639.38041111640632
271     2   365.32550028897822
272     2   373.7215790450573
273     2   611.34119655750692
274     2   342.07451999932528
275     2   343.72550031356514
276     2   602.51766622252762
277     2   335.52942184358835
278     2   347.39216740056872
279     2   349.49412823654711
280     2   365.96079528704286
281     2   603.77256822399795
282     2   403.58432539924979
283     2   633.00001835078001
284     2   390.50589356571436
285     2   387.1451101154089
1       3   94.176473506726325
2       3   79.400002629496157
3       3   331.84314792603254
4       3   1152.6863025426865
5       3   1186.6627745330334
6       3   470.21962223947048
7       3   513.58432994037867
8       3   501.98040856420994
9       3   497.15687815099955
10      3   440.45099299959838
11      3   442.16471740975976
12      3   1270.5686648786068
13      3   1296.145133793354
14      3   592.69021038152277
15      3   1247.9529772102833
16      3   1304.1843515634537
17      3   1317.5176855623722
18      3   566.2706073410809
19      3   555.8470722027123
20      3   953.59217982552946
21      3   445.65883476100862
22      3   438.89020615816116
23      3   1410.3608229905367
24      3   426.01569781638682
25      3   1424.8588645160198
26      3   1416.5294532775879
27      3   1253.4470970630646
28      3   422.10197120346129
29      3   1272.7372958958149
30      3   498.68629035539925
31      3   464.75687384977937
32      3   374.47452012635767
33      3   402.48628707416356
34      3   508.00393660180271
35      3   405.66275736689568
36      3   498.54511260986328

This is just a very small extract of my measurements. However, to be able to analyse this data, I need it in the following format, where the different data sets are beside each other:

ImageNumber ObjectNumber    Intensity   ImageNumber ObjectNumber    Intensity   ImageNumber ObjectNumber    Intensity   ImageNumber ObjectNumber    Intensity
1   1   2385.494163364172   1   2   30.200001035351306  1   3   522.71766421943903  1   4   1057.6157233268023
2   1   479.47844552993774  2   2   28.882353894878179  2   3   1007.6078772544861  2   4   461.65491861104965
3   1   391.68236282467842  3   2   27.615687115117908  3   3   907.86276851594448  3   4   416.80001404881477
4   1   25.168628226500005  4   2   762.15687701106071  4   3   360.51765584945679  4   4   745.08237412944436
5   1   32.286275403108448  5   2   735.21570462733507  5   3   370.90589284710586  5   4   95.643139901570976
6   1   29.819608668331057  6   2   680.78825412690639  6   3   357.29804827086627  6   4   91.490198460407555
7   1   63.627452164888382  7   2   746.64315531402826  7   3   441.45099106151611  7   4   131.12157137878239
8   1   57.643138359300792  8   2   391.56863641180098  8   3   550.72550706192851  8   4   805.54511855356395
9   1   54.403922646306455  9   2   386.09804813098162  9   3   339.52549867797643  9   4   506.53334950841963
10  1   485.22354453988373  10  2   574.22747141867876  10  3   1228.5804251618683  10  4   1256.5176827311516
11  1   382.1568714408204   11  2   545.84315247740597  11  3   1212.9255175394937  11  4   1363.251015804708
12  1   396.752954300493    12  2   571.52942893654108  12  3   377.77256011217833  12  4   729.58433585613966
13  1   1283.8667007293552  13  2   542.1764866374433   13  3   706.82747261226177  13  4   648.21178455650806
14  1   430.46275778114796  14  2   909.63139714486897  14  3   451.46668002009392  14  4   1028.8941485583782
15  1   602.85491912066936  15  2   347.98432378470898  15  3   396.26667900010943  15  4   406.35295270755887
16  1   421.81961948797107  16  2   837.12943513691425  16  3   478.32942511886358  16  4   1038.5725800022483
17  1   405.13334396108985  17  2   747.52551138773561  17  3   446.17256097495556  17  4   885.80394879728556
18  1   324.09020387381315  18  2   653.02354798838496  18  3   835.43531934171915  18  4   407.7647173628211
19  1   336.67843942344189  19  2   804.93727961182594  19  3   429.20393324270844  19  4   291.43530296906829
20  1   741.53335233777761  20  2   366.6039296798408   20  3   732.14904120564461  20  4   394.81569704227149
21  1   338.82745894789696  21  2   1345.1961118653417  21  3   519.3960902877152   21  4   564.89413283765316
22  1   415.46667850390077  22  2   837.20394962280989  22  3   395.91765884310007  22  4   224.29020238853991
23  1   362.44314773753285  23  2   787.94120307266712  23  3   391.5568740144372   23  4   1794.8980749752373
24  1   789.72158995270729  24  2   374.35295177251101  24  3   708.94512075185776  24  4   381.40393186733127
25  1   386.32942296564579  25  2   687.25100283324718  25  3   373.17255918681622  25  4   430.47844344004989
26  1   319.23530425131321  26  2   564.95687813684344  26  3   429.85099240392447  26  4   289.76079219020903
27  1   312.13726452738047  27  2   480.74903298169374  27  3   440.44314985722303  27  4   373.62746067903936
28  1   630.78825259953737  28  2   470.48236661218107  28  3   288.50981164351106  28  4   270.01961551234126
29  1   340.45099052786827  29  2   648.56472269445658  29  3   427.27059957757592  29  4   1008.9764956980944

As shown in this figure, all the data sets having the same "ObjectNumber" should be in a separate column together with the corresponding value for "Intensity" and "ImageNumber". As there are sometimes more than 100 data sets (with several hundred data points), copying manually one set besides the other is no option, as it takes a lot of time.

I have already solved some other problems regarding data management or alignment by the use of the R package "tidyverse" or "reshape". However, this time I have absolutely no idea how I can solve this problem.

I would be very grateful, if you could help me with this issue.


Solution

  • I've used a function like this in some reports:

    wrap_frame <- function(x, nr, nc, rownames = NULL, byrow = FALSE, sep = "_", unique_names = TRUE) {
      if (!xor(missing(nr), missing(nc))) stop("specify exactly one of 'nr' or 'nc'")
      has_rownames <- isTRUE(is.character(attr(x, "row.names")))
      if (is.null(rownames)) {
        if (missing(rownames) && has_rownames) warning("wrap_frame: row names discarded", call. = FALSE)
      } else {
        x <- cbind.data.frame(list(row.names(x)), x)
        colnames(x)[1] <- rownames
      }
      if (missing(nr)) {
        nr <- ceiling(nrow(x) / nc)
        ind <- c(rep(seq_len(nc), times = nrow(x) %/% nc),
                 head(seq_len(nc), n = nrow(x) %% nc))
      } else {
        nc <- ceiling(nrow(x) / nr)
        ind <- c(rep(seq_len(nrow(x) %/% nr), times = nr),
                 rep(nc, nrow(x) %% nr))
      }
      if (!byrow) ind <- sort(ind)
      lst <- split(x, ind)
      lst <- lapply(lst, lapply, `length<-`, nrow(lst[[1]]))
      cnames <-
        if (unique_names) {
          paste(rep(colnames(x), times = nc), rep(seq_len(nc), each = ncol(x)), sep = sep)
        } else {
          rep(colnames(x), times = nc)
        }
      out <- do.call("cbind.data.frame", lst)
      colnames(out) <- cnames
      out
    }
    

    Some sample data (it's faster this way, sorry I'm not using yours even though you did improve its format!):

    mt <- mtcars[1:3]
    

    And some sample calls, fixed number of rows first:

    wrap_frame(mt, nr = 10)
    # Warning: wrap_frame: row names discarded
    #    mpg_1 cyl_1 disp_1 mpg_2 cyl_2 disp_2 mpg_3 cyl_3 disp_3 mpg_4 cyl_4 disp_4
    # 1   21.0     6  160.0  17.8     6  167.6  21.5     4  120.1  15.0     8    301
    # 2   21.0     6  160.0  16.4     8  275.8  15.5     8  318.0  21.4     4    121
    # 3   22.8     4  108.0  17.3     8  275.8  15.2     8  304.0    NA    NA     NA
    # 4   21.4     6  258.0  15.2     8  275.8  13.3     8  350.0    NA    NA     NA
    # 5   18.7     8  360.0  10.4     8  472.0  19.2     8  400.0    NA    NA     NA
    # 6   18.1     6  225.0  10.4     8  460.0  27.3     4   79.0    NA    NA     NA
    # 7   14.3     8  360.0  14.7     8  440.0  26.0     4  120.3    NA    NA     NA
    # 8   24.4     4  146.7  32.4     4   78.7  30.4     4   95.1    NA    NA     NA
    # 9   22.8     4  140.8  30.4     4   75.7  15.8     8  351.0    NA    NA     NA
    # 10  19.2     6  167.6  33.9     4   71.1  19.7     6  145.0    NA    NA     NA
    wrap_frame(mt, nr = 10, rownames = NULL) # to silence the warning
    

    Fixed number of columns:

    wrap_frame(mt, nc = 7, rownames = NULL)
    #   mpg_1 cyl_1 disp_1 mpg_2 cyl_2 disp_2 mpg_3 cyl_3 disp_3 mpg_4 cyl_4 disp_4 mpg_5 cyl_5 disp_5 mpg_6 cyl_6 disp_6 mpg_7 cyl_7 disp_7
    # 1  21.0     6    160  18.1     6  225.0  17.8     6  167.6  10.4     8  460.0  21.5     4  120.1  19.2     8  400.0  15.8     8    351
    # 2  21.0     6    160  14.3     8  360.0  16.4     8  275.8  14.7     8  440.0  15.5     8  318.0  27.3     4   79.0  19.7     6    145
    # 3  22.8     4    108  24.4     4  146.7  17.3     8  275.8  32.4     4   78.7  15.2     8  304.0  26.0     4  120.3  15.0     8    301
    # 4  21.4     6    258  22.8     4  140.8  15.2     8  275.8  30.4     4   75.7  13.3     8  350.0  30.4     4   95.1  21.4     4    121
    # 5  18.7     8    360  19.2     6  167.6  10.4     8  472.0  33.9     4   71.1    NA    NA     NA    NA    NA     NA    NA    NA     NA
    

    Both of those examples demonstrate some padding: in the first, the number of rows is fixed so the last column is a bit sparse, but only one column has NA values; in the second, the number of columns is fixed, so the last row is half-empty, but only one row has NA values. Both are "balanced" in a sense.

    Same thing, but this time go by-row, meaning the first nc rows of x are spread across the first row of the output:

    wrap_frame(mt, nc = 7, byrow = TRUE, rownames = NULL)
    #   mpg_1 cyl_1 disp_1 mpg_2 cyl_2 disp_2 mpg_3 cyl_3 disp_3 mpg_4 cyl_4 disp_4 mpg_5 cyl_5 disp_5 mpg_6 cyl_6 disp_6 mpg_7 cyl_7 disp_7
    # 1  21.0     6  160.0  21.0     6  160.0  22.8     4  108.0  21.4     6  258.0  18.7     8  360.0  18.1     6  225.0  14.3     8  360.0
    # 2  24.4     4  146.7  22.8     4  140.8  19.2     6  167.6  17.8     6  167.6  16.4     8  275.8  17.3     8  275.8  15.2     8  275.8
    # 3  10.4     8  472.0  10.4     8  460.0  14.7     8  440.0  32.4     4   78.7  30.4     4   75.7  33.9     4   71.1  21.5     4  120.1
    # 4  15.5     8  318.0  15.2     8  304.0  13.3     8  350.0  19.2     8  400.0  27.3     4   79.0  26.0     4  120.3  30.4     4   95.1
    # 5  15.8     8  351.0  19.7     6  145.0  15.0     8  301.0  21.4     4  121.0    NA    NA     NA    NA    NA     NA    NA    NA     NA
    

    And we can make it more aesthetically "happy" by keeping the column names unchanged:

    wrap_frame(mt, nc = 3, rownames = "", unique_names = FALSE)
    #                       mpg cyl  disp                      mpg cyl  disp                   mpg cyl  disp
    # 1          Mazda RX4 21.0   6 160.0          Merc 450SE 16.4   8 275.8      AMC Javelin 15.2   8 304.0
    # 2      Mazda RX4 Wag 21.0   6 160.0          Merc 450SL 17.3   8 275.8       Camaro Z28 13.3   8 350.0
    # 3         Datsun 710 22.8   4 108.0         Merc 450SLC 15.2   8 275.8 Pontiac Firebird 19.2   8 400.0
    # 4     Hornet 4 Drive 21.4   6 258.0  Cadillac Fleetwood 10.4   8 472.0        Fiat X1-9 27.3   4  79.0
    # 5  Hornet Sportabout 18.7   8 360.0 Lincoln Continental 10.4   8 460.0    Porsche 914-2 26.0   4 120.3
    # 6            Valiant 18.1   6 225.0   Chrysler Imperial 14.7   8 440.0     Lotus Europa 30.4   4  95.1
    # 7         Duster 360 14.3   8 360.0            Fiat 128 32.4   4  78.7   Ford Pantera L 15.8   8 351.0
    # 8          Merc 240D 24.4   4 146.7         Honda Civic 30.4   4  75.7     Ferrari Dino 19.7   6 145.0
    # 9           Merc 230 22.8   4 140.8      Toyota Corolla 33.9   4  71.1    Maserati Bora 15.0   8 301.0
    # 10          Merc 280 19.2   6 167.6       Toyota Corona 21.5   4 120.1       Volvo 142E 21.4   4 121.0
    # 11         Merc 280C 17.8   6 167.6    Dodge Challenger 15.5   8 318.0             <NA>   NA  NA    NA
    

    Note: the NA are required by R since a data.frame must be rectangular. If I inserted blanks, then all numbers would be converted to character, not what you want/need (I suspect). However, when exported, you typically have the option to declare how an NA is represented in the output, with things like:

    • write.table(..., na="")
    • readr::write_csv(..., na="")
    • options(knitr.kable.NA=""); knitr::kable(...)

    I've added this as a gist: https://gist.github.com/r2evans/f99f77d253cfbf6431db575f0bf2a7ea