Tags: r, anova, long-format-data, wide-format-data

Can an ANOVA be calculated using multiple columns?


Can an ANOVA be carried out using a dataframe looking like this?

category_1 category_2 category_4 category_5
0.75 0.82 0.91 0.32
0.71 0.39 0.21 0.76
0.17 0.10 0.43 0.37

I already tried using unlist to transform the data into long format. However, the column names then end up as unnamed labels on the values rather than in a column of their own, each with an extra number appended, so I don't think the result can be used for an ANOVA. Is there another way?

"category_x" is the grouping variable, and I want to check whether some categories are used more often than others (higher category score = used more often).


Solution

  • Let us recreate your data frame and call it df:

    df <- read.table(text = '
      category_1 category_2 category_4 category_5
    1       0.75       0.82       0.91       0.32
    2       0.71       0.39       0.21       0.76
    3       0.17       0.10       0.43       0.37')
    

    To get these data into a suitable format for ANOVA, we can pivot to long format. This puts all the values in one column and creates another column that labels each value according to its original column. We can use pivot_longer from tidyr (loaded as part of the tidyverse) to do this:

    library(tidyverse)
    
    df <- pivot_longer(df, everything(), names_to = 'Category', values_to = 'Value') 
    

    Now our data frame looks like this:

    df
    #> # A tibble: 12 x 2
    #>    Category   Value
    #>    <chr>      <dbl>
    #>  1 category_1  0.75
    #>  2 category_2  0.82
    #>  3 category_4  0.91
    #>  4 category_5  0.32
    #>  5 category_1  0.71
    #>  6 category_2  0.39
    #>  7 category_4  0.21
    #>  8 category_5  0.76
    #>  9 category_1  0.17
    #> 10 category_2  0.1 
    #> 11 category_4  0.43
    #> 12 category_5  0.37
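
If you would rather not depend on the tidyverse, base R's stack() performs a similar reshape. The following is a minimal sketch, not part of the original reprex; the object names wide and long are chosen here purely for illustration, and the columns are renamed to match the Category/Value names used in this answer.

```r
# Recreate the wide data from the question using base R only
wide <- data.frame(
  category_1 = c(0.75, 0.71, 0.17),
  category_2 = c(0.82, 0.39, 0.10),
  category_4 = c(0.91, 0.21, 0.43),
  category_5 = c(0.32, 0.76, 0.37)
)

# stack() returns two columns: `values` (the numbers) and
# `ind` (a factor holding the original column names)
long <- stack(wide)

# Rename to match the column names used with pivot_longer above
names(long) <- c("Value", "Category")
```

Note that stack() orders the rows column by column rather than row by row as pivot_longer does. The row order does not affect the model fitted in the next step.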
    

    We can now create a linear model of the values according to category and review the summary:

    model <- lm(Value ~ Category, data = df)
    
    summary(model)
    #> 
    #> Call:
    #> lm(formula = Value ~ Category, data = df)
    #> 
    #> Residuals:
    #>      Min       1Q   Median       3Q      Max 
    #> -0.37333 -0.19917 -0.06667  0.22417  0.39333 
    #> 
    #> Coefficients:
    #>                    Estimate Std. Error t value Pr(>|t|)  
    #> (Intercept)         0.54333    0.18760   2.896    0.020 *
    #> Categorycategory_2 -0.10667    0.26531  -0.402    0.698  
    #> Categorycategory_4 -0.02667    0.26531  -0.101    0.922  
    #> Categorycategory_5 -0.06000    0.26531  -0.226    0.827  
    #> ---
    #> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    #> 
    #> Residual standard error: 0.3249 on 8 degrees of freedom
    #> Multiple R-squared:  0.02204,    Adjusted R-squared:  -0.3447 
    #> F-statistic: 0.06009 on 3 and 8 DF,  p-value: 0.9794
    

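    Finally, we can pass the model to anova() to get the one-way ANOVA table. Its F test is the same one reported in the last line of the model summary (F = 0.0601 on 3 and 8 DF, p = 0.9794), so these data give no evidence that the category means differ: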

    anova(model)
    #> Analysis of Variance Table
    #> 
    #> Response: Value
    #>           Df  Sum Sq  Mean Sq F value Pr(>F)
    #> Category   3 0.01903 0.006344  0.0601 0.9794
    #> Residuals  8 0.84467 0.105583
    

    Created on 2022-06-12 by the reprex package (v2.0.1)
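
As a possible shortcut, the lm() and anova() steps can be combined with aov(), which fits the same linear model and reports the same one-way ANOVA table. The sketch below is an addition to the original answer; it rebuilds the long-format data by hand so that it runs on its own, and the names long, fit and tab are illustrative only.

```r
# Long-format version of the question's data, built directly
long <- data.frame(
  Category = factor(rep(c("category_1", "category_2",
                          "category_4", "category_5"), each = 3)),
  Value = c(0.75, 0.71, 0.17,   # category_1
            0.82, 0.39, 0.10,   # category_2
            0.91, 0.21, 0.43,   # category_4
            0.32, 0.76, 0.37)   # category_5
)

# One-way ANOVA of Value by Category
fit <- aov(Value ~ Category, data = long)

# summary() on an aov fit returns a list; the first element is the ANOVA table
tab <- summary(fit)[[1]]
print(tab)
```

The printed table should match the anova(model) output shown above, since aov() calls lm() internally.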