How to find the mean of a continuous variable for each categorical variable?


I am trying to calculate the average duration of UFO sightings (a continuous variable) for each shape (a categorical variable). Essentially, what is the average sighting length for each UFO shape?

I tried:

    a <- aggregate(duration..seconds. ~ shape, data=alien, FUN=mean, na.rm=TRUE)
    barplot(a$duration..seconds., names.arg=a$shape)

and got:

    no non-missing arguments to min; returning Inf
    no non-missing arguments to max; returning -Inf
    Error in plot.window(xlim, ylim, log = log, ...) : need finite 'ylim' values

I realize that I need to alter my data somehow. I would like to simply remove every row in which one of the two values is missing (i.e., we know the shape but the duration is missing, or vice versa), but I don't quite know how to do this.

Thanks for your help!

P.S. The column name "duration..seconds." is correct; that is how it came over from the Excel file.

    shape       duration..seconds.
    us  changing    3600    NA  4/27/2004   29.8830556  
    us  changing    300     NA  12/16/2005  29.38421    
    us  changing    3600    NA  1/21/2008   53.2    
    us  changing    900     NA  1/17/2004   28.9783333  
    ca  changing    1200    NA  1/22/2004   21.4180556  
    us  changing    3600    NA  4/27/2007   36.595  

There are 80,000 logged UFO sightings, which is why I am trying to average them, and there are 29 different shapes.


Solution

  • Data

    df <- read.table(text="
    country shape  duration_seconds dummy1 date dummy2
    us  changing    3600    NA  4/27/2004   29.8830556  
    us  changing    300     NA  12/16/2005  29.38421    
    us  changing    3600    NA  1/21/2008   53.2    
    us  changing    900     NA  1/17/2004   28.9783333  
    ca  changing    1200    NA  1/22/2004   21.4180556  
    us  changing    3600    NA  4/27/2007   36.595  
    ", header = TRUE, stringsAsFactors = FALSE)
    

    You can fix the column names with:

    names(df) <- c("country", "shape", "duration_seconds", "dummy1", "date", "dummy2")
    

    Using the dplyr package (na.rm = TRUE skips missing durations):

    library(dplyr)
    df %>% 
      group_by(shape)  %>%
      summarize(mean_duration_seconds = mean(duration_seconds, na.rm = TRUE))
    
    #   shape    mean_duration_seconds
    #   <chr>                    <dbl>
    # 1 changing                 2200.
    

    And using the original code

    names(df) <- c("country", "shape", "duration_seconds", "dummy1", "date", "dummy2")
    a <- aggregate(duration_seconds ~ shape, data=df, FUN=mean, na.rm=TRUE)
    barplot(a$duration_seconds, names.arg=a$shape)
    
    a
    #   shape    duration_seconds
    # 1 changing             2200
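
The question also asks how to drop rows where either the shape or the duration is missing. Here is a minimal sketch of one way to do that in base R. It rests on an assumption the question does not confirm: that the "need finite 'ylim' values" error comes from the duration column being read as text or a factor, so every mean comes back as NA. The small `alien` data frame below is a made-up stand-in, not the real data, and `clean` is just an illustrative name.

```r
# Hypothetical stand-in for the asker's data: durations stored as text,
# with one unparseable entry and one missing shape.
alien <- data.frame(
  shape = c("changing", "changing", "circle", "circle", NA, "light"),
  duration..seconds. = c("3600", "300", "120", "n/a", "60", "45"),
  stringsAsFactors = FALSE
)

# Coerce to numeric. as.character() first guards against factor columns;
# entries that are not numbers become NA (the warning is suppressed here).
alien$duration..seconds. <- suppressWarnings(
  as.numeric(as.character(alien$duration..seconds.))
)

# Keep only rows where both shape and duration are present.
keep  <- complete.cases(alien[, c("shape", "duration..seconds.")])
clean <- alien[keep, ]

# Mean duration per shape, then the bar plot from the question.
a <- aggregate(duration..seconds. ~ shape, data = clean, FUN = mean)
barplot(a$duration..seconds., names.arg = a$shape, las = 2)
```

Running `str(alien)` on the real data first should show whether the duration column is numeric at all; if it already is, only the `complete.cases()` step is needed.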