r dataframe distance sample

Reduce large data-frame of samples to ensure maximum variability between samples


I have a list in which each entry is a vector of indices, for example:

list(c(563L, 688L, 630L, 160L, 568L, 908L, 457L, 798L, 3L, 558L, 
56L, 389L, 506L, 106L, 807L, 556L, 809L, 63L, 343L, 242L, 470L, 
894L, 804L, 970L, 406L, 881L, 893L, 952L, 126L, 827L, 282L, 910L, 
61L, 66L, 763L, 787L, 337L, 41L, 712L, 144L, 450L, 12L, 200L, 
574L, 945L, 236L, 336L, 684L, 280L, 721L, 233L, 686L, 64L, 504L, 
174L, 934L, 40L, 850L, 26L, 799L, 853L, 978L), c(85L, 564L, 591L, 
662L, 377L, 536L, 325L, 402L, 72L, 410L, 687L, 216L, 603L, 67L, 
794L, 388L, 627L, 376L, 863L, 491L, 598L, 861L, 991L, 651L, 670L, 
401L, 459L, 39L, 997L, 806L, 623L, 954L), c(427L, 791L, 212L, 
779L, 657L, 740L, 800L, 838L, 104L, 985L, 167L, 486L, 685L, 739L, 
60L, 862L, 130L, 134L, 175L, 375L, 683L, 885L, 575L, 859L, 341L, 
726L, 472L, 802L, 76L, 424L, 177L, 624L, 189L, 334L, 378L, 329L, 
581L, 224L, 851L, 218L, 993L, 678L, 248L, 365L, 188L, 774L, 58L, 
813L, 514L, 59L, 777L, 485L, 606L, 480L, 826L, 350L, 608L, 27L, 
661L, 775L, 340L, 10L, 207L, 260L, 483L, 150L, 205L), c(138L, 
587L, 165L, 1L, 722L, 300L, 500L, 535L, 832L, 392L, 432L, 139L, 
744L, 676L, 839L, 107L, 769L, 589L, 647L, 548L, 704L, 197L, 689L, 
111L, 342L, 319L, 567L, 17L, 925L, 5L, 116L, 493L, 241L, 965L
), c(89L, 440L, 228L, 884L, 88L, 147L, 413L, 821L, 70L, 95L, 
71L, 917L, 463L, 990L, 672L, 981L, 765L, 937L, 75L, 766L, 374L, 
636L, 449L, 816L, 1000L, 356L, 629L), c(421L, 650L, 453L, 666L, 
584L, 717L, 220L, 605L, 182L, 811L, 157L, 523L, 28L, 527L, 737L, 
812L, 263L, 675L, 132L, 879L, 438L, 451L, 883L, 950L, 114L, 466L, 
348L, 711L, 209L, 887L, 593L, 949L, 349L, 764L, 595L, 736L, 660L, 
801L, 118L, 877L), c(23L, 231L, 78L, 988L, 55L, 57L, 753L, 994L, 
437L, 202L, 842L, 190L, 822L, 968L, 331L, 733L, 782L, 886L, 105L, 
943L, 743L, 815L, 311L, 498L, 792L, 795L, 184L, 728L, 573L, 771L, 
117L, 251L, 192L, 735L, 15L, 776L, 295L, 677L, 631L, 235L, 237L, 
705L, 856L, 97L, 725L), c(229L, 671L, 129L, 405L, 115L, 644L, 
98L, 492L, 871L, 935L, 435L, 707L, 773L, 754L, 803L, 120L, 656L, 
345L, 875L, 330L, 533L, 366L, 240L, 408L, 332L, 577L, 550L, 452L, 
963L, 8L, 187L, 226L, 901L, 371L, 426L, 339L, 519L, 86L, 501L, 
274L, 831L), c(16L, 79L, 68L, 477L, 133L, 659L, 2L, 973L, 264L, 
953L, 90L, 234L, 420L, 588L, 21L, 788L, 363L, 539L, 227L, 565L, 
30L, 642L, 786L, 982L, 347L, 680L, 52L, 96L, 592L, 409L, 643L, 
81L, 419L, 245L, 658L, 416L, 590L, 448L, 819L, 277L, 357L, 442L, 
789L, 516L, 980L, 93L, 998L, 149L, 166L, 299L, 454L, 529L, 986L, 
127L, 541L, 45L, 829L, 289L, 418L, 179L, 310L, 113L, 729L), c(429L, 
781L, 303L, 434L, 83L, 259L, 387L, 583L, 393L, 770L, 246L, 428L, 
947L, 976L, 31L, 382L, 710L, 944L, 164L, 868L, 373L, 899L, 74L, 
468L, 614L, 701L, 221L, 645L, 268L, 785L, 293L, 632L, 24L, 749L, 
283L, 741L, 796L, 915L), c(258L, 844L, 649L, 752L, 474L, 613L, 
351L, 551L, 309L, 380L, 497L, 724L, 327L, 992L, 845L, 607L, 818L, 
693L, 914L, 291L, 720L, 633L, 974L, 367L, 639L, 94L, 467L, 92L, 
522L, 141L, 496L, 276L, 542L, 665L, 695L, 634L, 602L, 913L, 396L, 
597L, 443L, 892L, 65L, 394L, 222L, 778L, 169L, 960L, 35L, 655L, 
422L, 927L, 154L, 215L, 262L, 203L, 880L, 217L, 423L, 755L, 904L, 
180L, 620L), c(507L, 628L, 29L, 902L, 738L, 897L, 664L, 967L, 
294L, 682L, 254L, 302L, 128L, 559L, 511L, 526L, 7L, 742L, 464L, 
621L, 265L, 599L, 102L, 546L, 458L, 969L, 751L, 860L, 326L, 873L, 
335L, 580L, 499L, 962L, 290L, 557L, 213L, 716L, 53L, 835L, 600L, 
610L, 321L, 673L, 713L, 876L, 244L, 462L, 136L, 272L, 195L, 447L, 
230L, 679L, 465L, 611L, 297L, 731L, 44L, 824L, 162L, 837L), c(446L, 
561L, 391L, 652L, 857L, 946L, 560L, 784L, 854L, 204L, 512L, 82L, 
455L, 372L, 407L, 328L, 808L, 152L, 178L, 185L, 543L, 108L, 473L, 
490L, 955L, 719L, 757L, 198L, 338L, 223L, 919L, 531L, 653L, 734L, 
923L, 487L, 637L, 398L, 431L, 46L, 848L, 324L, 948L, 43L, 183L, 
288L, 697L, 87L, 307L, 42L, 571L, 360L, 433L, 390L, 569L, 956L, 
534L, 6L, 381L, 549L, 301L, 920L, 69L, 322L, 267L, 503L, 285L, 
961L, 370L, 425L), c(344L, 959L, 364L, 552L, 11L, 481L, 287L, 
891L, 692L, 762L, 47L, 292L, 358L, 810L, 942L, 730L, 746L, 638L, 
750L, 759L, 761L, 140L, 444L, 191L, 805L, 306L, 691L, 170L, 715L, 
508L, 984L, 461L, 911L, 103L, 938L, 718L, 928L), c(124L, 284L, 
123L, 513L, 417L, 933L, 121L, 168L, 208L, 385L, 32L, 273L, 869L, 
932L, 397L, 509L, 239L, 797L, 379L, 723L, 898L, 163L, 320L, 833L, 
151L, 906L, 648L, 732L, 279L, 834L, 489L, 840L, 783L, 971L, 49L, 
145L, 253L, 352L, 137L, 261L, 247L, 143L, 544L, 109L, 921L, 830L, 
972L, 585L, 690L, 609L, 703L, 250L, 708L, 225L, 889L, 181L, 987L, 
54L, 502L, 148L, 355L, 888L, 579L, 983L, 825L, 855L, 62L, 918L, 
979L, 586L, 681L, 384L, 709L, 333L, 758L, 194L, 368L), c(646L, 
930L, 361L, 399L, 13L, 298L, 395L, 975L, 482L, 940L, 596L, 772L, 
700L, 843L, 171L, 537L, 173L, 836L, 767L, 989L, 532L, 890L, 99L, 
865L, 142L, 135L, 271L, 346L, 441L, 48L, 941L, 866L, 201L, 872L, 
36L, 520L, 530L, 77L, 270L), c(238L, 699L, 22L, 50L, 615L, 702L, 
4L, 469L, 101L, 314L, 616L, 995L, 996L, 414L, 566L, 249L, 572L, 
369L, 553L, 158L, 159L, 199L, 317L, 515L, 517L, 524L, 562L, 19L, 
476L, 20L, 146L, 618L, 895L, 312L, 912L), c(768L, 939L, 578L, 
849L, 196L, 640L, 323L, 635L, 304L, 318L, 874L, 977L, 488L, 619L, 
155L, 905L, 9L, 112L, 484L, 847L, 313L, 900L, 494L, 727L, 625L, 
931L, 119L, 846L, 186L, 219L, 471L, 696L, 404L, 460L, 668L, 896L, 
439L, 964L, 275L, 756L, 411L, 878L, 538L, 669L, 478L, 570L, 255L, 
547L, 257L, 841L, 37L, 576L, 456L, 663L, 525L, 817L, 612L, 820L
), c(243L, 594L, 33L, 176L, 415L, 667L, 748L, 852L, 232L, 922L, 
308L, 436L, 153L, 505L, 14L, 281L, 316L, 495L, 540L, 622L, 156L, 
926L, 521L, 698L, 545L, 760L, 84L, 210L, 359L, 131L, 745L, 34L, 
91L, 555L, 858L, 445L, 867L, 125L, 814L, 604L, 706L, 315L, 654L, 
747L, 936L, 269L, 957L), c(80L, 924L, 110L, 193L, 958L, 296L, 
475L, 18L, 907L, 626L, 999L, 278L, 362L, 51L, 641L, 211L, 929L, 
122L, 694L, 73L, 353L, 25L, 100L, 305L, 864L, 214L, 790L, 286L, 
518L, 674L, 206L, 400L, 554L, 903L, 780L, 916L, 38L, 430L, 617L, 
823L, 172L, 966L, 412L, 951L, 510L, 828L, 479L, 909L, 266L, 582L, 
870L, 882L, 161L, 252L, 256L, 383L, 403L, 601L, 386L, 793L, 528L, 
354L, 714L))

Each entry (each vector in the list) represents a group obtained using a clustering method.

I also have the following function, which takes this list and the required number of samples and returns a data frame in which each row is a single sample and each column holds the element drawn from one of the groups.

groups_samples <- function(groups, repetition) {
  return(as.data.frame(sapply(groups, sample, repetition, TRUE)))
}
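
One caveat with `sample()` is that a length-one numeric vector `x` is treated as `1:x`, so a cluster with a single member would not be sampled as intended. A minimal sketch of a variant that samples positions instead; `safe_sample` and `groups_samples_safe` are hypothetical names, not part of the original code:

```r
# Sample positions rather than values, so a one-element group
# always returns that element (hypothetical helper).
safe_sample <- function(x, size) {
  x[sample.int(length(x), size, replace = TRUE)]
}

# Same shape of output as groups_samples(): one row per sample,
# one column per group.
groups_samples_safe <- function(groups, repetition) {
  as.data.frame(sapply(groups, safe_sample, repetition))
}

set.seed(42)
toy_groups <- list(c(1L, 5L, 6L), 10L, c(8L, 7L))
toy_df <- groups_samples_safe(toy_groups, 20)
head(toy_df)
```

With the original function, the second column here would contain values drawn from 1 to 10 rather than only 10.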

Let's take the following as an example:

df <- groups_samples(ll, 100)

    structure(list(V1 = c(106L, 686L, 721L, 200L, 970L, 910L, 556L, 
807L, 908L, 568L, 688L, 389L, 56L, 470L, 630L, 893L, 574L, 236L, 
804L, 798L, 721L, 934L, 763L, 807L, 457L, 568L, 684L, 934L, 787L, 
450L, 688L, 64L, 568L, 934L, 894L, 558L, 568L, 343L, 450L, 853L, 
336L, 64L, 712L, 144L, 934L, 144L, 809L, 763L, 457L, 763L, 558L, 
457L, 688L, 763L, 504L, 66L, 406L, 881L, 3L, 343L, 556L, 799L, 
712L, 568L, 61L, 799L, 908L, 688L, 64L, 881L, 236L, 787L, 66L, 
160L, 853L, 343L, 809L, 200L, 827L, 893L, 894L, 799L, 470L, 406L, 
337L, 389L, 63L, 952L, 236L, 337L, 763L, 41L, 945L, 144L, 56L, 
978L, 233L, 978L, 881L, 910L), V2 = c(72L, 651L, 861L, 651L, 
591L, 72L, 564L, 662L, 402L, 623L, 603L, 377L, 401L, 603L, 598L, 
67L, 991L, 376L, 67L, 325L, 325L, 377L, 536L, 861L, 564L, 670L, 
806L, 377L, 687L, 603L, 954L, 627L, 67L, 388L, 954L, 564L, 991L, 
564L, 591L, 863L, 376L, 991L, 85L, 85L, 564L, 598L, 591L, 687L, 
806L, 564L, 401L, 72L, 603L, 536L, 459L, 603L, 954L, 67L, 216L, 
410L, 687L, 806L, 623L, 388L, 67L, 401L, 491L, 662L, 85L, 627L, 
598L, 954L, 459L, 591L, 997L, 687L, 687L, 536L, 863L, 459L, 670L, 
459L, 603L, 401L, 39L, 687L, 39L, 651L, 991L, 376L, 388L, 954L, 
997L, 85L, 39L, 627L, 861L, 670L, 39L, 459L), V3 = c(424L, 775L, 
862L, 791L, 683L, 826L, 60L, 205L, 802L, 740L, 58L, 985L, 683L, 
341L, 838L, 212L, 993L, 59L, 851L, 657L, 375L, 885L, 150L, 167L, 
218L, 205L, 58L, 260L, 341L, 661L, 791L, 350L, 726L, 378L, 188L, 
150L, 60L, 813L, 774L, 104L, 207L, 207L, 485L, 514L, 424L, 514L, 
859L, 130L, 350L, 188L, 188L, 740L, 859L, 177L, 212L, 802L, 606L, 
104L, 608L, 260L, 329L, 993L, 427L, 427L, 485L, 472L, 859L, 424L, 
661L, 514L, 791L, 678L, 993L, 726L, 188L, 340L, 483L, 150L, 340L, 
514L, 606L, 248L, 205L, 188L, 581L, 813L, 175L, 657L, 862L, 775L, 
212L, 341L, 27L, 885L, 575L, 334L, 350L, 486L, 483L, 340L), V4 = c(138L, 
493L, 111L, 241L, 548L, 107L, 548L, 965L, 839L, 1L, 139L, 1L, 
165L, 769L, 111L, 965L, 548L, 1L, 676L, 319L, 689L, 769L, 567L, 
197L, 139L, 319L, 319L, 832L, 116L, 500L, 392L, 704L, 689L, 500L, 
689L, 832L, 165L, 138L, 116L, 676L, 197L, 589L, 832L, 165L, 925L, 
165L, 647L, 832L, 116L, 744L, 587L, 925L, 500L, 116L, 107L, 832L, 
500L, 319L, 17L, 925L, 116L, 548L, 17L, 107L, 676L, 111L, 832L, 
925L, 111L, 107L, 17L, 722L, 139L, 432L, 319L, 548L, 241L, 769L, 
319L, 17L, 689L, 342L, 165L, 722L, 676L, 319L, 197L, 241L, 139L, 
139L, 111L, 744L, 689L, 722L, 965L, 432L, 647L, 432L, 1L, 111L
), V5 = c(816L, 95L, 884L, 821L, 88L, 374L, 981L, 672L, 70L, 
71L, 89L, 95L, 374L, 75L, 917L, 765L, 917L, 449L, 71L, 884L, 
766L, 70L, 672L, 89L, 816L, 937L, 937L, 440L, 413L, 1000L, 1000L, 
413L, 70L, 356L, 821L, 440L, 990L, 821L, 147L, 356L, 629L, 374L, 
766L, 766L, 71L, 937L, 89L, 95L, 917L, 937L, 937L, 449L, 95L, 
463L, 1000L, 440L, 821L, 884L, 917L, 816L, 89L, 1000L, 766L, 
356L, 765L, 440L, 75L, 463L, 440L, 440L, 765L, 636L, 672L, 629L, 
88L, 356L, 374L, 374L, 463L, 95L, 463L, 75L, 71L, 89L, 449L, 
88L, 990L, 884L, 765L, 463L, 884L, 672L, 463L, 449L, 629L, 821L, 
981L, 75L, 990L, 440L), V6 = c(650L, 675L, 737L, 466L, 883L, 
877L, 209L, 887L, 584L, 263L, 605L, 132L, 584L, 950L, 650L, 451L, 
737L, 453L, 348L, 675L, 949L, 349L, 209L, 584L, 801L, 593L, 711L, 
666L, 466L, 605L, 527L, 666L, 584L, 717L, 114L, 660L, 118L, 466L, 
811L, 595L, 438L, 28L, 593L, 811L, 118L, 711L, 605L, 593L, 466L, 
650L, 801L, 438L, 348L, 349L, 118L, 584L, 114L, 584L, 801L, 209L, 
157L, 466L, 801L, 182L, 812L, 132L, 523L, 666L, 605L, 527L, 950L, 
950L, 812L, 421L, 584L, 801L, 132L, 182L, 737L, 887L, 883L, 605L, 
737L, 711L, 28L, 675L, 220L, 157L, 118L, 887L, 675L, 132L, 736L, 
811L, 887L, 438L, 182L, 717L, 737L, 950L), V7 = c(994L, 202L, 
311L, 725L, 437L, 725L, 776L, 295L, 792L, 57L, 57L, 295L, 842L, 
15L, 776L, 331L, 822L, 795L, 78L, 988L, 498L, 822L, 988L, 782L, 
776L, 728L, 631L, 725L, 735L, 573L, 105L, 295L, 23L, 78L, 202L, 
117L, 190L, 705L, 105L, 57L, 792L, 251L, 251L, 968L, 192L, 23L, 
231L, 822L, 295L, 231L, 631L, 842L, 57L, 235L, 815L, 331L, 117L, 
705L, 331L, 994L, 795L, 237L, 815L, 815L, 23L, 822L, 235L, 631L, 
78L, 97L, 57L, 192L, 677L, 184L, 57L, 231L, 231L, 753L, 733L, 
237L, 743L, 677L, 631L, 988L, 815L, 311L, 815L, 311L, 771L, 728L, 
23L, 988L, 728L, 705L, 97L, 988L, 994L, 57L, 728L, 192L), V8 = c(754L, 
875L, 332L, 935L, 86L, 339L, 86L, 644L, 339L, 501L, 803L, 229L, 
644L, 426L, 550L, 129L, 330L, 129L, 229L, 86L, 773L, 803L, 129L, 
901L, 452L, 8L, 229L, 98L, 129L, 366L, 187L, 8L, 773L, 187L, 
229L, 8L, 98L, 935L, 98L, 345L, 754L, 533L, 332L, 550L, 240L, 
875L, 773L, 229L, 426L, 754L, 120L, 803L, 129L, 901L, 901L, 644L, 
345L, 707L, 707L, 773L, 533L, 120L, 332L, 330L, 803L, 86L, 803L, 
8L, 226L, 345L, 871L, 240L, 550L, 963L, 330L, 345L, 226L, 533L, 
366L, 452L, 803L, 405L, 803L, 405L, 550L, 577L, 8L, 339L, 901L, 
577L, 330L, 229L, 330L, 656L, 452L, 330L, 519L, 226L, 366L, 435L
), V9 = c(643L, 953L, 642L, 21L, 592L, 16L, 127L, 539L, 409L, 
516L, 419L, 277L, 986L, 590L, 45L, 980L, 998L, 516L, 541L, 980L, 
454L, 81L, 149L, 986L, 227L, 45L, 420L, 363L, 986L, 90L, 409L, 
986L, 953L, 45L, 982L, 588L, 68L, 127L, 127L, 16L, 418L, 21L, 
953L, 442L, 418L, 419L, 565L, 980L, 659L, 16L, 149L, 448L, 789L, 
454L, 516L, 2L, 127L, 79L, 277L, 980L, 234L, 357L, 357L, 642L, 
980L, 680L, 729L, 81L, 21L, 454L, 986L, 357L, 980L, 973L, 680L, 
592L, 788L, 2L, 264L, 79L, 680L, 729L, 52L, 986L, 539L, 79L, 
277L, 416L, 786L, 477L, 113L, 454L, 419L, 442L, 953L, 79L, 245L, 
788L, 93L, 234L), V10 = c(31L, 468L, 468L, 387L, 164L, 796L, 
701L, 785L, 915L, 614L, 741L, 770L, 770L, 583L, 373L, 373L, 393L, 
221L, 303L, 83L, 74L, 785L, 387L, 741L, 741L, 393L, 468L, 701L, 
382L, 393L, 387L, 899L, 429L, 947L, 781L, 781L, 645L, 645L, 710L, 
915L, 74L, 796L, 259L, 749L, 373L, 393L, 246L, 632L, 785L, 259L, 
614L, 785L, 428L, 741L, 632L, 382L, 770L, 710L, 781L, 749L, 868L, 
915L, 434L, 221L, 429L, 303L, 393L, 468L, 632L, 976L, 781L, 373L, 
947L, 428L, 781L, 781L, 645L, 868L, 645L, 710L, 283L, 31L, 868L, 
583L, 915L, 246L, 373L, 373L, 781L, 164L, 428L, 710L, 373L, 303L, 
632L, 868L, 614L, 947L, 74L, 382L), V11 = c(351L, 154L, 423L, 
496L, 818L, 913L, 665L, 913L, 380L, 720L, 542L, 380L, 634L, 551L, 
258L, 818L, 634L, 474L, 222L, 639L, 974L, 755L, 262L, 665L, 522L, 
217L, 927L, 351L, 755L, 914L, 380L, 65L, 844L, 633L, 613L, 222L, 
649L, 892L, 752L, 423L, 755L, 169L, 904L, 309L, 639L, 276L, 217L, 
394L, 291L, 522L, 203L, 720L, 35L, 422L, 724L, 423L, 720L, 914L, 
180L, 327L, 92L, 422L, 258L, 467L, 724L, 620L, 665L, 367L, 639L, 
443L, 892L, 724L, 141L, 422L, 327L, 396L, 92L, 309L, 844L, 258L, 
914L, 634L, 497L, 222L, 141L, 880L, 467L, 443L, 496L, 913L, 394L, 
217L, 35L, 396L, 35L, 880L, 351L, 755L, 474L, 215L), V12 = c(102L, 
546L, 682L, 464L, 162L, 876L, 162L, 302L, 682L, 162L, 302L, 53L, 
967L, 679L, 837L, 824L, 44L, 53L, 294L, 738L, 254L, 557L, 546L, 
7L, 902L, 244L, 128L, 499L, 621L, 499L, 458L, 526L, 837L, 465L, 
290L, 969L, 265L, 507L, 835L, 837L, 546L, 136L, 897L, 213L, 195L, 
244L, 465L, 835L, 464L, 621L, 162L, 511L, 969L, 230L, 580L, 335L, 
610L, 969L, 546L, 897L, 835L, 447L, 526L, 302L, 464L, 302L, 682L, 
628L, 610L, 272L, 53L, 254L, 969L, 962L, 511L, 621L, 290L, 458L, 
559L, 860L, 136L, 507L, 462L, 136L, 462L, 731L, 873L, 462L, 335L, 
897L, 580L, 447L, 628L, 731L, 7L, 335L, 102L, 128L, 679L, 742L
), V13 = c(108L, 637L, 757L, 734L, 534L, 42L, 808L, 322L, 757L, 
204L, 808L, 324L, 288L, 82L, 285L, 961L, 955L, 652L, 808L, 961L, 
503L, 549L, 697L, 87L, 734L, 43L, 204L, 455L, 398L, 961L, 183L, 
433L, 431L, 854L, 490L, 69L, 407L, 808L, 398L, 69L, 87L, 338L, 
446L, 178L, 6L, 198L, 82L, 543L, 370L, 534L, 87L, 267L, 455L, 
360L, 534L, 407L, 431L, 446L, 854L, 857L, 46L, 637L, 848L, 923L, 
560L, 531L, 919L, 223L, 307L, 561L, 6L, 719L, 560L, 43L, 734L, 
288L, 324L, 87L, 808L, 322L, 757L, 446L, 425L, 324L, 757L, 857L, 
87L, 848L, 223L, 503L, 307L, 152L, 503L, 757L, 956L, 152L, 43L, 
69L, 719L, 637L), V14 = c(746L, 805L, 191L, 47L, 508L, 508L, 
715L, 461L, 928L, 750L, 140L, 746L, 364L, 552L, 287L, 984L, 481L, 
715L, 762L, 959L, 750L, 344L, 959L, 959L, 306L, 911L, 103L, 638L, 
759L, 761L, 750L, 444L, 692L, 692L, 761L, 481L, 552L, 942L, 810L, 
938L, 306L, 762L, 344L, 942L, 344L, 364L, 552L, 891L, 11L, 103L, 
762L, 287L, 891L, 358L, 730L, 959L, 750L, 191L, 718L, 959L, 358L, 
306L, 287L, 692L, 746L, 461L, 750L, 170L, 358L, 911L, 805L, 938L, 
481L, 759L, 750L, 140L, 715L, 959L, 928L, 692L, 461L, 750L, 306L, 
762L, 691L, 306L, 287L, 481L, 170L, 746L, 810L, 762L, 358L, 292L, 
750L, 191L, 47L, 942L, 344L, 191L), V15 = c(987L, 972L, 151L, 
397L, 250L, 825L, 681L, 825L, 723L, 49L, 585L, 109L, 833L, 137L, 
49L, 690L, 681L, 253L, 385L, 921L, 708L, 151L, 109L, 385L, 54L, 
247L, 979L, 121L, 225L, 124L, 825L, 417L, 320L, 979L, 681L, 918L, 
145L, 397L, 681L, 145L, 586L, 709L, 284L, 840L, 121L, 368L, 250L, 
898L, 840L, 109L, 417L, 513L, 544L, 194L, 417L, 544L, 320L, 987L, 
840L, 987L, 888L, 489L, 855L, 906L, 62L, 579L, 379L, 783L, 368L, 
379L, 49L, 732L, 279L, 509L, 54L, 145L, 797L, 979L, 709L, 840L, 
368L, 830L, 502L, 123L, 681L, 194L, 855L, 703L, 247L, 833L, 609L, 
830L, 708L, 609L, 509L, 397L, 987L, 609L, 320L, 124L), V16 = c(346L, 
48L, 865L, 865L, 173L, 890L, 482L, 13L, 537L, 171L, 482L, 940L, 
843L, 173L, 975L, 866L, 142L, 646L, 482L, 700L, 395L, 298L, 975L, 
890L, 361L, 173L, 890L, 975L, 940L, 271L, 395L, 989L, 395L, 142L, 
865L, 361L, 399L, 441L, 441L, 772L, 142L, 520L, 142L, 520L, 975L, 
930L, 890L, 989L, 530L, 866L, 941L, 530L, 596L, 890L, 36L, 441L, 
346L, 865L, 173L, 646L, 270L, 441L, 866L, 866L, 346L, 441L, 482L, 
872L, 36L, 890L, 271L, 13L, 36L, 836L, 767L, 395L, 890L, 537L, 
395L, 530L, 346L, 346L, 940L, 173L, 865L, 772L, 520L, 171L, 48L, 
866L, 135L, 298L, 135L, 77L, 361L, 872L, 395L, 596L, 772L, 532L
), V17 = c(912L, 146L, 312L, 22L, 618L, 317L, 618L, 199L, 369L, 
101L, 515L, 4L, 476L, 699L, 517L, 317L, 159L, 517L, 553L, 616L, 
995L, 314L, 317L, 314L, 562L, 101L, 249L, 369L, 615L, 562L, 476L, 
702L, 312L, 312L, 515L, 101L, 159L, 572L, 101L, 618L, 895L, 317L, 
616L, 618L, 572L, 562L, 4L, 517L, 312L, 312L, 249L, 699L, 312L, 
158L, 469L, 20L, 524L, 476L, 572L, 249L, 50L, 19L, 249L, 912L, 
469L, 476L, 101L, 146L, 616L, 618L, 476L, 20L, 146L, 249L, 50L, 
101L, 158L, 517L, 238L, 515L, 895L, 553L, 702L, 146L, 312L, 517L, 
158L, 895L, 517L, 101L, 314L, 238L, 22L, 146L, 317L, 895L, 469L, 
912L, 369L, 572L), V18 = c(525L, 635L, 488L, 456L, 878L, 119L, 
119L, 849L, 768L, 817L, 931L, 275L, 460L, 900L, 494L, 669L, 846L, 
488L, 768L, 494L, 570L, 439L, 878L, 275L, 471L, 896L, 768L, 619L, 
727L, 977L, 155L, 155L, 896L, 112L, 817L, 768L, 411L, 304L, 964L, 
612L, 905L, 768L, 456L, 255L, 119L, 404L, 304L, 576L, 219L, 756L, 
612L, 668L, 255L, 768L, 196L, 668L, 155L, 931L, 896L, 878L, 488L, 
576L, 640L, 37L, 846L, 494L, 257L, 37L, 411L, 411L, 625L, 820L, 
304L, 112L, 619L, 9L, 669L, 494L, 471L, 323L, 318L, 570L, 817L, 
578L, 878L, 696L, 977L, 768L, 896L, 525L, 669L, 841L, 471L, 727L, 
619L, 304L, 874L, 931L, 37L, 619L), V19 = c(926L, 281L, 957L, 
308L, 315L, 814L, 622L, 153L, 858L, 315L, 867L, 176L, 555L, 210L, 
867L, 540L, 555L, 867L, 622L, 852L, 540L, 436L, 269L, 505L, 436L, 
505L, 654L, 505L, 91L, 125L, 131L, 706L, 243L, 125L, 922L, 281L, 
91L, 359L, 33L, 957L, 232L, 698L, 555L, 540L, 667L, 34L, 545L, 
698L, 555L, 308L, 926L, 445L, 316L, 748L, 243L, 14L, 521L, 232L, 
654L, 243L, 232L, 359L, 156L, 131L, 555L, 359L, 521L, 852L, 706L, 
957L, 308L, 125L, 91L, 852L, 315L, 604L, 604L, 760L, 604L, 936L, 
521L, 747L, 922L, 555L, 243L, 521L, 316L, 867L, 84L, 176L, 814L, 
232L, 315L, 316L, 555L, 505L, 745L, 505L, 232L, 540L), V20 = c(554L, 
882L, 823L, 386L, 966L, 694L, 286L, 354L, 214L, 25L, 25L, 110L, 
353L, 475L, 479L, 252L, 582L, 999L, 266L, 211L, 18L, 278L, 828L, 
412L, 528L, 386L, 296L, 353L, 412L, 80L, 206L, 714L, 18L, 211L, 
475L, 554L, 38L, 882L, 25L, 362L, 510L, 110L, 206L, 823L, 362L, 
694L, 256L, 479L, 582L, 25L, 828L, 193L, 951L, 80L, 793L, 999L, 
882L, 903L, 38L, 386L, 354L, 214L, 916L, 25L, 110L, 864L, 882L, 
25L, 353L, 780L, 296L, 864L, 510L, 38L, 386L, 400L, 694L, 793L, 
999L, 122L, 278L, 475L, 916L, 903L, 958L, 161L, 828L, 73L, 790L, 
73L, 430L, 18L, 958L, 828L, 582L, 383L, 51L, 278L, 18L, 122L)), class = "data.frame", row.names = c(NA, 
-100L))

Now I wish to reduce the number of rows, say from 100 to 50, where each row is a set of indices, one from each group. I tried computing a distance matrix using several methods and choosing the most distant rows, but on inspection the result was not very informative.

Is there a way to do this, perhaps by taking the list of groups into account or by using a more sophisticated method?

Would appreciate some help/insights

Edit - Clarifying the objective

Let's say I sampled 100 groups, where each group contains one element from each vector in the list.

Some of the groups are close to others: if only one element differs between two groups, I will probably want to discard one of them, and likewise if only two elements differ, and so on. In the end I wish to keep the K groups that are as "distant" from each other as possible.

It would also be nice to take into account the number of elements in each vector of the list, through some sort of weighting procedure.
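
To make "only one element differs" concrete, the number of positions at which two rows differ can be counted as a (optionally weighted) Hamming distance. The following is a minimal sketch under that assumption; `mismatch_matrix`, `toy` and the `weights` argument are hypothetical, and how to derive the weights from the group sizes is left open:

```r
# Pairwise count of columns in which two sampled rows differ.
# `weights` lets some columns (groups) count more than others.
mismatch_matrix <- function(df, weights = rep(1, ncol(df))) {
  m <- as.matrix(df)
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (j in seq_len(ncol(m))) {
    # outer(..., "!=") is TRUE wherever rows disagree in column j
    d <- d + weights[j] * outer(m[, j], m[, j], "!=")
  }
  d
}

toy <- data.frame(V1 = c(1, 1, 6), V2 = c(3, 3, 9), V3 = c(7, 8, 10))
d_plain    <- mismatch_matrix(toy)
d_weighted <- mismatch_matrix(toy, weights = c(3, 4, 3))
d_plain
```

Rows 1 and 2 differ in a single column, so their distance is 1, whereas row 3 differs from both in every column.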

Edit No.2

for the following list(c(1L, 5L, 6L), c(3L, 4L, 2L, 9L), c(8L, 7L, 10L)) we get the following data-frame:

structure(list(V1 = c(1L, 5L, 6L, 1L, 6L, 1L, 1L, 6L, 1L, 5L, 
5L, 5L, 1L, 1L, 5L, 6L, 5L, 6L, 6L, 5L, 5L, 5L, 6L, 5L, 6L, 1L, 
6L, 1L, 1L, 1L, 5L, 5L, 6L, 6L, 5L, 1L, 6L, 6L, 5L, 6L, 1L, 1L, 
5L, 5L, 5L, 1L, 6L, 5L, 1L, 5L, 5L, 5L, 5L, 1L, 5L, 5L, 1L, 6L, 
5L, 6L, 5L, 6L, 5L, 1L, 5L, 1L, 5L, 6L, 5L, 1L, 6L, 1L, 6L, 1L, 
1L, 5L, 5L, 6L, 1L, 5L, 1L, 5L, 5L, 6L, 6L, 1L, 1L, 6L, 6L, 6L, 
5L, 5L, 1L, 6L, 1L, 1L, 6L, 5L, 5L, 1L), V2 = c(9L, 3L, 9L, 4L, 
2L, 4L, 3L, 3L, 3L, 2L, 2L, 9L, 3L, 3L, 2L, 2L, 9L, 9L, 9L, 3L, 
4L, 3L, 2L, 3L, 4L, 2L, 2L, 3L, 4L, 9L, 9L, 2L, 3L, 2L, 9L, 9L, 
3L, 2L, 4L, 4L, 3L, 4L, 3L, 2L, 2L, 9L, 9L, 2L, 4L, 4L, 4L, 9L, 
2L, 3L, 9L, 3L, 3L, 2L, 2L, 2L, 4L, 2L, 4L, 3L, 3L, 3L, 2L, 9L, 
9L, 9L, 2L, 9L, 3L, 3L, 9L, 4L, 3L, 3L, 4L, 3L, 4L, 4L, 4L, 4L, 
2L, 9L, 9L, 4L, 9L, 2L, 2L, 9L, 4L, 4L, 9L, 9L, 2L, 4L, 4L, 3L
), V3 = c(7L, 7L, 7L, 8L, 7L, 7L, 7L, 7L, 10L, 8L, 10L, 8L, 7L, 
7L, 10L, 10L, 10L, 8L, 8L, 8L, 8L, 8L, 8L, 7L, 10L, 7L, 10L, 
10L, 7L, 8L, 7L, 8L, 7L, 8L, 8L, 8L, 7L, 8L, 8L, 8L, 10L, 7L, 
8L, 7L, 7L, 10L, 7L, 7L, 10L, 7L, 10L, 8L, 8L, 7L, 10L, 10L, 
10L, 8L, 8L, 10L, 7L, 8L, 8L, 10L, 8L, 10L, 10L, 10L, 8L, 10L, 
10L, 10L, 8L, 10L, 8L, 7L, 10L, 7L, 7L, 10L, 8L, 7L, 8L, 10L, 
7L, 8L, 10L, 7L, 7L, 7L, 7L, 10L, 7L, 7L, 10L, 10L, 7L, 7L, 8L, 
10L)), class = "data.frame", row.names = c(NA, -100L))

Running @Allan Cameron's code produces the following, although there are better choices of 5:

   V1 V2 V3
26  1  2  7
68  6  9 10
7   1  3  7
17  5  9 10
13  1  3  7

Solution

  • As you have described it, the concept of overall "distance" between two groups is a bit vague. It's clear that a pair like c(1, 5, 2, 6) and c(2, 9, 12, 3) are closer than the pair c(1, 5, 2, 6) and c(101, 78, 96, 54), but should there be a penalty for an exact match? Is variance important? In the absence of a clearer notion of distance, the best measure we have is the mean of each group. This is easy to obtain by rowMeans(df).

    There's also some vagueness with regard to the concept of "the K furthest apart groups". Distance is a function of pairs of groups, not of individual groups. If K = 1, then presumably any group is fine. If K = 2, then you want the single pair of groups with the largest difference between their means. For larger K it's not clear what you are looking for, but one approach would be to find the set of K groups whose means have the highest variance.

    So if we do something like:

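    k <- 5
    
    # Summarise each sampled row by its mean and keep track of the row numbers
    group_means <- rowMeans(df)
    indices     <- seq(nrow(df))
    
    # Seed the selection with the rows that have the lowest and highest means
    k_furthest <- c(which.min(group_means), which.max(group_means))
    k_vals     <- c(min(group_means), max(group_means))
    
    # Remove the selected rows from the pool of candidates
    group_means <- group_means[-k_furthest]
    indices     <- indices[-k_furthest]
    
    # Greedily add the candidate whose mean has the largest summed squared
    # difference from the means already selected
    while(length(k_furthest) < k)
    {
      best <- which.max(rowSums(sapply(k_vals, function(x) (x - group_means)^2)))
      k_vals <- c(k_vals, group_means[best])
      k_furthest <- c(k_furthest, indices[best])
      group_means <- group_means[-best]
      indices     <- indices[-best]
    }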
    

    Then k_furthest will contain the set of 5 rows of the data frame with the highest possible variance between all the means. Your result would be obtained like:

     df[k_furthest,]
    #>     V1  V2  V3  V4   V5  V6  V7  V8  V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
    #> 63 236 794 885 300   71 114 725 492  52 468  92 128 948 191 585 441 414 196 156  18
    #> 51 798 536 739 704 1000 883 237 644 299 915 695 860 338  47 972 890 996 939 957 793
    #> 61  41 388 624 689  672 466  55 229 454 164 542 265 338 170  32 271 314 640 922 582
    #> 33 970 598 775 548  228 132 842 644 986 781 818 679 920 287 825 361 562 756 748 929
    #> 12 336 216 774 107   71 801 725 492 642  74 613 297 948 306 124 646  19 439 281 122
    

    Note though that this algorithm effectively just takes the rows with the highest and lowest means alternately on each iteration. Although this produces the largest overall collective "difference" between the samples, you might end up with some samples that are very close together, provided that they are also both very far apart from another sample. This may not be what you are looking for, and it is why it might be a good idea to specify exactly what you mean by "distance" in this context.
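
    If samples that sit close together are a concern, one alternative (not the method used above) is greedy farthest-point selection: start from the most distant pair, then repeatedly add the row whose nearest already-chosen row is as far away as possible. A minimal sketch, assuming any symmetric distance matrix `d`; `farthest_point` and `toy_means` are hypothetical names:

```r
# Greedy max-min (farthest-point) selection on a distance matrix.
farthest_point <- function(d, k) {
  d <- as.matrix(d)
  # Start from the pair of rows that are farthest apart
  chosen <- as.vector(arrayInd(which.max(d), dim(d)))
  while (length(chosen) < k) {
    remaining <- setdiff(seq_len(nrow(d)), chosen)
    # Distance from each remaining row to its nearest chosen row
    nearest <- apply(d[remaining, chosen, drop = FALSE], 1, min)
    chosen <- c(chosen, remaining[which.max(nearest)])
  }
  chosen[seq_len(k)]
}

# Toy example on row means; with real data this could be
# as.matrix(dist(rowMeans(df)))
toy_means <- c(0, 1, 10, 11, 5)
picked <- farthest_point(as.matrix(dist(toy_means)), 3)
picked
```

    Here the third pick is the row with mean 5 rather than a near-duplicate of either extreme. This is a heuristic and is not guaranteed to find the best possible subset.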

    EDIT

    With further clarification and a new example from the OP, it seems that we are looking to maximize the sum of element-wise difference between groups. This means we can do:

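    # For each row i, sum the absolute element-wise differences to every row,
    # then record which row is most distant and how large that distance is
    distances <- as.data.frame(t(sapply(1:nrow(df), function(i) {
      a <- rowSums(apply(df, 2, function(x) abs(x[i] - x)))
      c(row = i, most_distant = which.max(a), difference = max(a))
    })))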
    

    This will give us a data frame which for each row tells us the most "distant" other group.

    head(distances)
    #>   row most_distant difference
    #> 1   1           16         15
    #> 2   2           46         13
    #> 3   3            9         14
    #> 4   4           68         12
    #> 5   5           46         15
    #> 6   6           68         13
    

    If we sort this according to the biggest difference, and take the first K groups mentioned in the first two columns, we will have our result:

    i <- unique(c(t(distances[order(-distances$difference)[seq(k)], 1:2])))[seq(k)]
    
    df[i,]
    #>    V1 V2 V3
    #> 1   1  9  7
    #> 16  6  2 10
    #> 5   6  2  7
    #> 46  1  9 10
    #> 26  1  2  7
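
    The per-row sums of absolute differences used above are Manhattan (L1) distances, so base R's `dist()` with `method = "manhattan"` should give the same numbers in one call. A minimal sketch on a hypothetical three-row data frame `toy`:

```r
# Hypothetical toy data: three sampled rows, one column per group
toy <- data.frame(V1 = c(1, 6, 1), V2 = c(9, 2, 3), V3 = c(7, 10, 7))

# Full symmetric matrix of pairwise Manhattan distances between rows
d <- as.matrix(dist(toy, method = "manhattan"))

# For every row, the index of the most distant row and that distance
most_distant <- unname(apply(d, 1, which.max))
difference   <- unname(apply(d, 1, max))

data.frame(row = seq_len(nrow(toy)), most_distant, difference)
```

    Having the full matrix `d` available also makes it easier to try other selection rules than "most distant partner of each row".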