
Randomly selecting groups of different sizes within different classes


I have a question that is difficult (at least for me): it took me two hours just to write it up, and I can't work out how to program it myself. I'll try to be very clear, and I'm sorry if I'm not. I'm currently doing this in a very crude way in Excel, but I really need to program it. I have a data.frame like this:

id_pix id_lote clase   f1   f2
45       4      Sg    2460 2401
46       4      Sg    2620 2422
47       4      Sg    2904 2627
48       5      M     2134 2044
49       5      M     2180 2104
50       5      M     2127 2069
83      11      S     2124 2062
84      11      S     2189 2336
85      11      S     2235 2162
86      11      S     2162 2153
87      11      S     2108 2124

It has 17451 "id_pixel" (rows), 2080 "id_lote" and 9 "clase".

This is the "id_lote" count per "clase" (v1 is the id_lote count):

   clase   v1
1:     S 1099
2:     P  213
3:    Sg  114
4:     M  302
5:   Alg   27
6:    Az   77
7:    Po  228
8:   Cit   13
9:    Ma    7

I need to split the "id_lote" randomly within each "clase". For example, "clase" "S" has 1099 "id_lote" covering 9339 "id_pixel" (rows), and I want to randomly select 50% of those "id_lote", together with however many "id_pixel" (rows) they contain. I want to do this for every "clase", taking into account that each "clase" has a different number of "id_lote". I would also like to be able to change the size of the selection (50%, 30%, etc.), and I want to keep the set of "id_lote" that was not selected. I hope someone can help me with this!

Here is a reproducible example.

This is the data, with 2 clase (S and Az), 6 id_lote and 13 id_pixel:

id_pix  id_lote clase   f1  f2
1       1        S    2909  2381
2       1        S    2515  2663
3       1        S    2628  3249
30      2        S    3021  2985
31      2        S    3020  2596
71      9        S    4725  4404
72      9        S    4759  4943
75      11       S    2728  2225
218     21       Az   4830  3007
219     21       Az   4574  2761
220     21       Az   5441  3092
1155    126      Az   7209  2449
1156    126      Az   7035  2932

And one result could be:

id_pix  id_lote clase   f1  f2
1       1        S    2909  2381
2       1        S    2515  2663
3       1        S    2628  3249
75      11       S    2728  2225
1155    126      Az   7209  2449
1156    126      Az   7035  2932

Here 50% of the id_lote were randomly selected in clase "S" (2 of 4 id_lote), and all the id_pixel in the selected id_lote were kept. The same holds for clase "Az": one id_lote was randomly selected (1 of 2 in this case), and all the id_pixel in the selected id_lote were kept.

What colemand77 proposed helped a lot. I think the dplyr package is useful for this, but if I do

df %>%
  group_by(clase, id_lote) %>%
  sample_frac(.3, replace = FALSE)

I get 30% of the data of each clase, but not grouped by id_lote as I need. That is, 30% of the rows (id_pixel) are selected instead of 30% of the id_lote. I hope this example helps to explain what I want to do and makes the question useful for everybody. I'm sorry if I wasn't clear enough the first time. Thanks a lot!


Solution

  • At first glance I'd say the dplyr package is your friend here.

    df %>%
    group_by(clase, id_lote) %>%
    sample_frac(.3, replace = FALSE)
    

    So you first use group_by() with the grouping levels you want to sample from, then you use sample_frac() to sample the fraction of rows you want from each group. As near as I can tell this is what you are asking for. If not, please consider restating your question with a reproducible example or a clarification. Cheers.

    To "keep" the not-selected members, I would add a column of unique ids and use anti_join() (also from the dplyr package) to find the ids that are not in common between the two data.frames (the result of the sampling and the original).
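
    A minimal sketch of that anti_join() idea, using made-up toy data rather than the real table. It assumes id_pix is already unique per row, so it can serve as the id column and no extra one is needed:

    ```r
    library(dplyr)

    # Toy data standing in for the real table (values are made up).
    df <- data.frame(
      id_pix  = 1:8,
      id_lote = c(1, 1, 2, 2, 3, 3, 4, 4),
      clase   = c("S", "S", "S", "S", "Az", "Az", "Az", "Az")
    )

    set.seed(1)  # only so the sketch is repeatable

    # Row-level sample: 50% of the rows within each clase / id_lote group.
    selected <- df %>%
      group_by(clase, id_lote) %>%
      sample_frac(0.5, replace = FALSE) %>%
      ungroup()

    # Rows of df whose id_pix does not appear in `selected`.
    not_selected <- anti_join(df, selected, by = "id_pix")
    ```

    Together, `selected` and `not_selected` cover every row of `df` exactly once.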


    ## Update ##

    I understand better now, I believe. Think about this as a two-step process: 1) you want to select x% (50 in the example) of the id_lote from each clase and return those id_lote numbers (I'm assuming that a given id_lote does not exist in multiple clase?); 2) you want to see all of the id_pix rows that correspond to each selected id_lote, all in one data.frame.

    I've broken this down into multiple steps for illustration, not because it is the fastest or prettiest way.

    Raw data (I couldn't read your data into R, so this is randomly generated):

    # Random stand-in data: 200 pixels, up to 20 id_lote, 10 clase (a to j)
    df <- data.frame(id_pix = 1:200,
                     id_lote = sample(1:20, 200, replace = TRUE),
                     clase = sample(letters[1:10], 200, replace = TRUE),
                     f1 = sample(1000:2000, 200, replace = TRUE),
                     f2 = sample(2000:3000, 200, replace = TRUE))
    

    1) Figure out which id_lote correspond to which clase. For this we use the dplyr summarise() function and store the result in a variable:

    summary<-df %>%
      ungroup() %>%
      group_by(clase, id_lote) %>%
      summarise()
    

    This returns:

    Source: local data frame [125 x 2]
    Groups: clase
    
       clase id_lote
    1      a       1
    2      a       2
    3      a       4
    4      a       5
    5      a       6
    6      a       7
    7      a       8
    8      a       9
    9      a      11
    10     a      12
    ..   ...     ...
    

    Then we sample to get 30% of the id_lote for each clase:

    sampled_summary <- summary %>%
      group_by(clase) %>%
      sample_frac(.3,replace = FALSE) 
    

    The result is a data frame with two columns (clase and id_lote) holding 30% of the id_lote for each clase.

    2) So now we have the id_lote randomly selected from each clase, but not the id_pix rows that belong to each selected id_lote. To get those, we join back to the full data set, which brings in id_pix and the other columns:
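
    A quick way to check the sampling step is to compare, per clase, how many id_lote were available and how many were picked. This is only a sketch on randomly generated stand-in data; `lotes`, `sampled` and `shares` are illustrative names, and distinct() is used here as an assumed equivalent of the group_by() plus summarise() step:

    ```r
    library(dplyr)

    set.seed(1)  # only so the sketch is repeatable

    # Random stand-in data, similar in shape to the data above.
    df <- data.frame(
      id_pix  = 1:200,
      id_lote = sample(1:20, 200, replace = TRUE),
      clase   = sample(letters[1:10], 200, replace = TRUE)
    )

    # One row per (clase, id_lote) pair.
    lotes <- distinct(df, clase, id_lote)

    sampled <- lotes %>%
      group_by(clase) %>%
      sample_frac(0.3, replace = FALSE) %>%
      ungroup()

    # Available vs. selected id_lote per clase; a clase with nothing
    # selected would otherwise show NA, so it is set to 0.
    shares <- count(lotes, clase, name = "n_available") %>%
      left_join(count(sampled, clase, name = "n_selected"), by = "clase") %>%
      mutate(n_selected = coalesce(n_selected, 0L))
    ```

    Because the number of id_lote picked has to be a whole number, a clase with only a few id_lote may end up slightly above or below the requested fraction.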

    # Join on both keys so every pixel of each selected id_lote is kept
    result <- sampled_summary %>%
      left_join(df, by = c("clase", "id_lote"))
    

    The steps above copy the data set several times, so if you have a substantial data set you could do it all in one go (here with 50% instead of 30%):

    result <- df %>%
      ungroup() %>%
      group_by(clase, id_lote) %>%
      summarise() %>%
      group_by(clase) %>%
      sample_frac(.5,replace = FALSE) %>%
      left_join(df, by = c("clase", "id_lote"))
    

    If this doesn't get you what you want, let me know and we'll take another crack at it.
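
    To change the fraction easily and also keep the id_lote that were not selected, the same idea can be wrapped in a small function. This is a sketch rather than a tested solution for the full data: `split_by_lote()` is a hypothetical helper (not part of dplyr), it uses the reproducible example from the question, and it uses semi_join() and anti_join() in place of the left join above.

    ```r
    library(dplyr)

    # The reproducible example from the question.
    df <- data.frame(
      id_pix  = c(1, 2, 3, 30, 31, 71, 72, 75, 218, 219, 220, 1155, 1156),
      id_lote = c(1, 1, 1, 2, 2, 9, 9, 11, 21, 21, 21, 126, 126),
      clase   = c(rep("S", 8), rep("Az", 5)),
      f1 = c(2909, 2515, 2628, 3021, 3020, 4725, 4759, 2728,
             4830, 4574, 5441, 7209, 7035),
      f2 = c(2381, 2663, 3249, 2985, 2596, 4404, 4943, 2225,
             3007, 2761, 3092, 2449, 2932)
    )

    # Hypothetical helper: sample `frac` of the id_lote within each clase
    # and return both the selected and the not-selected rows.
    split_by_lote <- function(data, frac) {
      picked <- data %>%
        distinct(clase, id_lote) %>%   # one row per id_lote within each clase
        group_by(clase) %>%
        sample_frac(frac, replace = FALSE) %>%
        ungroup()

      list(
        # all rows belonging to a picked id_lote
        selected     = semi_join(data, picked, by = c("clase", "id_lote")),
        # all remaining rows
        not_selected = anti_join(data, picked, by = c("clase", "id_lote"))
      )
    }

    set.seed(123)  # only so the sketch is repeatable
    parts <- split_by_lote(df, 0.5)
    ```

    `parts$selected` then holds every id_pix row of the sampled id_lote (2 of the 4 in "S" and 1 of the 2 in "Az"), and `parts$not_selected` holds the rest. Calling `split_by_lote(df, 0.3)` would change the fraction. In newer dplyr versions sample_frac() is superseded by slice_sample(prop = ...), which could be swapped in.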