I have a large data set (about 2000 rows and 38 columns) that looks like this (there is missing data in some columns):
species crab cmass gill gmass treatment months avglw avgils
222 Cm 65 34.273 p 0.198 Newtons Cove 0 68.108 93.181
223 Cm 57 33.506 p 0.166 Newtons Cove 0 37.908 39.683
225 Cm 65 34.273 p 0.198 Newtons Cove 0 68.108 93.181
231 Cm 62 30.852 p 0.147 Newtons Cove 0 37.285 89.823
239 Cm 65 34.273 p 0.198 Newtons Cove 0 68.108 93.181
240 Cm 57 33.506 p 0.166 Newtons Cove 0 37.908 39.683
241 Cm 62 30.852 p 0.147 Newtons Cove 0 37.285 89.823
242 Cm 63 22.456 p 0.093 Newtons Cove 0 70.005 67.687
243 Cm 59 22.422 p 0.113 Newtons Cove 0 21.834 39.481
There are multiple rows for each crab number, and I would like either to average the rows for each crab number, or to select the first row for each crab number and exclude the subsequent rows.
For example, I would like to average rows 222, 225, and 239, as they are all from crab 65; or, alternatively, select row 222 and exclude 225 and 239, since they are from a crab that has already been selected.
I have tried using unique() and sqldf(), but neither has worked for me.
Any advice would be very much appreciated. Thanks!
For the average, you might want to try putting your data in a data.table and then applying a function:
library(data.table)
mydata <- data.table(mydata)   # convert the data.frame to a data.table
# average cmass and gmass within each crab; na.rm = TRUE copes with the missing values
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("cmass", "gmass"), by = "crab"]
Assuming you want to obtain the average for cmass and gmass.
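Since your full data set has 38 columns, you may not want to list them all in .SDcols. A rough sketch that averages every numeric column at once (num_cols is just an illustrative name, and this assumes mydata is already a data.table as above):

# pick out the numeric columns by name, excluding the grouping column itself
num_cols <- setdiff(names(mydata)[sapply(mydata, is.numeric)], "crab")
# average each of those columns within each crab
mydata[, lapply(.SD, mean, na.rm = TRUE), .SDcols = num_cols, by = "crab"]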
For the other part of your question, I'm not too sure. You could try setting a key on only the column you are interested in and then calling unique:
setkey(mydata, crab)          # sorts the table by crab
# recent data.table versions compare all columns in unique() unless by= is given
unique(mydata, by = "crab")   # keeps the first row for each crab
It would then sort by crab, and the call to unique would remove rows with duplicate values of crab, keeping only the first row for each. Is that what you want?
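If you would rather keep the first full row for each crab without relying on the key at all, here is a sketch of two equivalent data.table idioms (again assuming mydata is a data.table):

mydata[!duplicated(crab)]      # drop rows whose crab value has already appeared
# or, equivalently, take the first row within each group
mydata[, .SD[1], by = "crab"]

Either of these keeps row 222 and drops 225 and 239 in your example, since they share crab 65.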