r, merge, plyr

Calculate "group characteristics" without ddply and merge


I wonder whether there is a more straightforward way to calculate a certain type of variable than the approach I normally take.

The example below probably explains it best. I have a dataframe with 2 columns (the fruit, and whether the fruit is rotten or not). For each row, I would like to add e.g. the proportion of fruit of the same category that is rotten. For example, there are 4 entries for apples and 2 of them are rotten, so each row for apple should read 0.5. The target values (purely as illustration) are included in the "desired outcome" column.

I have previously approached this problem by:

  • using the "ddply" command on the fruit variable (with sum/length as the function), creating a new 3*2 dataframe
  • using the "merge" command to link these values back into the old dataframe.

This feels like a roundabout way, and I was wondering whether there is a better/faster way of doing this. Ideally it would be a generic approach that is easily adjusted if, instead of the percentage, one needs to determine whether e.g. all fruits are rotten, any fruits are rotten, etc.

Many thanks in advance,

W

    Fruit Rotten Desired_Outcome_PercRotten
1   Apple      1                        0.5
2   Apple      1                        0.5
3   Apple      0                        0.5
4   Apple      0                        0.5
5    Pear      1                       0.75
6    Pear      1                       0.75
7    Pear      1                       0.75
8    Pear      0                       0.75
9  Cherry      0                          0
10 Cherry      0                          0
11 Cherry      0                          0

# create example data frame; the desired outcome column is included purely to illustrate the target values
Fruit=c(rep("Apple",4),rep("Pear",4),rep("Cherry",3))
Rotten=c(1,1,0,0,1,1,1,0,0,0,0)
Desired_Outcome_PercRotten=c(0.5,0.5,0.5,0.5,0.75,0.75,0.75,0.75,0,0,0)
# data.frame() keeps Rotten numeric; cbind() would coerce every column to character
df=data.frame(Fruit,Rotten,Desired_Outcome_PercRotten)
df
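
For reference, the two-step approach described above could be sketched roughly as follows. This is only an illustration of the idea, assuming the plyr package is installed; the names `per_fruit`, `merged` and `PercRotten` are made up for this sketch.

```r
library(plyr)

# Same example data as above, without the illustrative target column
df <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# Step 1: collapse to one row per fruit (a 3 x 2 data frame)
# holding the share of rotten entries in that group
per_fruit <- ddply(df, .(Fruit), summarise,
                   PercRotten = sum(Rotten) / length(Rotten))

# Step 2: link the per-group values back onto the original rows
merged <- merge(df, per_fruit, by = "Fruit")
merged
```

Note that `merge` sorts the result by the key column by default, so the row order may differ from the original data frame.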

Solution

  • You can do this with just ddply and mutate:

    library(plyr)

    # changed summarise to transform on joran's suggestion
    # changed transform to mutate on mnel's suggestion :)
    # mutate keeps every row and adds the per-group share of rotten fruit
    ddply(df, .(Fruit), mutate, Perc = sum(Rotten)/length(Rotten))
    
    #     Fruit Rotten Perc
    # 1   Apple      1 0.50
    # 2   Apple      1 0.50
    # 3   Apple      0 0.50
    # 4   Apple      0 0.50
    # 5  Cherry      0 0.00
    # 6  Cherry      0 0.00
    # 7  Cherry      0 0.00
    # 8    Pear      1 0.75
    # 9    Pear      1 0.75
    # 10   Pear      1 0.75
    # 11   Pear      0 0.75
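
The question also asks for something generic that adapts to "all rotten" or "any rotten". As a possible alternative that avoids plyr altogether, base R's `ave` applies a function within each group and returns a vector aligned with the original rows. The sketch below assumes `Rotten` is numeric (0/1); the column names `PercRotten`, `AllRotten` and `AnyRotten` are only illustrative.

```r
# Example data from the question, with Rotten stored as numeric 0/1
df <- data.frame(
  Fruit  = c(rep("Apple", 4), rep("Pear", 4), rep("Cherry", 3)),
  Rotten = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0)
)

# Share of rotten fruit per group: the mean of a 0/1 column is the proportion of 1s
df$PercRotten <- ave(df$Rotten, df$Fruit, FUN = mean)

# ave() writes results back into a numeric vector, so logical results
# come back as 0/1; comparing with 1 turns them into TRUE/FALSE again
df$AllRotten <- ave(df$Rotten, df$Fruit, FUN = function(x) all(x == 1)) == 1
df$AnyRotten <- ave(df$Rotten, df$Fruit, FUN = function(x) any(x == 1)) == 1

df
```

Unlike `ddply`, `ave` leaves the rows in their original order, and changing the summary only means swapping the function passed as `FUN`.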