Search code examples
runiquelarge-data

Selecting specific rows from a large dataset using column values


I have a large data set (about 2000 rows and 38 columns) that looks like this (there is missing data in some columns):

     species crab cmass  gill gmass     treatment    months avglw  avgils
222      Cm   65 34.273    p 0.198     Newtons Cove      0 68.108  93.181
223      Cm   57 33.506    p 0.166     Newtons Cove      0 37.908  39.683
225      Cm   65 34.273    p 0.198     Newtons Cove      0 68.108  93.181
231      Cm   62 30.852    p 0.147     Newtons Cove      0 37.285  89.823
239      Cm   65 34.273    p 0.198     Newtons Cove      0 68.108  93.181
240      Cm   57 33.506    p 0.166     Newtons Cove      0 37.908  39.683
241      Cm   62 30.852    p 0.147     Newtons Cove      0 37.285  89.823
242      Cm   63 22.456    p 0.093     Newtons Cove      0 70.005  67.687
243      Cm   59 22.422    p 0.113     Newtons Cove      0 21.834  39.481

There are multiple rows for each crab number and I would like to be able to either average the rows for each crab number or select the first unique row for each crab number and then exclude the subsequent rows.

For example: I would like to average rows 222, 225, 239 as they are all from crab '65'; or: I would like to select row 222 and then exclude 225 and 239 as they are from a crab that has already been selected.

I have tried using unique() and sqldf() but neither have worked for me.

Any advice would be very much appreciated. Thanks!


Solution

  • For the average, you might want to try putting your data in a data.table and then applying a function:

    mydata <- data.table(mydata)
    mydata[, lapply(.SD, mean), .SDcols = c("cmass", "gmass"), by = "crab"] 
    

    Assuming you want to obtain the average for cmass and gmass.

    For the other part of your question, I'm not too sure. You could try setting a key on only the column you are interested in then call unique:

    setkey(mydata, crab)
    unique(mydata)
    

    It would then sort by crab and the unique would remove rows with duplicate values of crab. Is that what you want?