python-3.x statsmodels anova

How can I fix statsmodels.AnovaRM claiming "Data is unbalanced" although it isn't?


I am trying to perform a three-way repeated-measures ANOVA with statsmodels' AnovaRM, but I already hit a problem with a two-way ANOVA. When I run

from statsmodels.stats.anova import AnovaRM

aov = AnovaRM(anova_df, depvar='Test', subject='Subject',
    within=["Factor1", "Factor2"], aggregate_func='mean').fit()
print(aov)

it fails with the error "Data is unbalanced.". Here are the shapes of the per-level subsets I extracted from the DataFrame that I fed into it:

Factor1, level 0, shape: (68, 6)
Factor1, level 1, shape: (68, 6)
Factor1, level 2, shape: (68, 6)
Factor2, level a, shape: (68, 6)
Factor2, level b, shape: (68, 6)
Factor2, level c, shape: (68, 6)

Because this is a test, I even aligned the factors with each other:

   Test  Factor1 Factor2
0  32.6        0       a
1  39.3        1       b
2  43.0        2       c
3  32.0        0       a
4  32.8        1       b
5  38.3        2       c
6  36.7        0       a
7  40.4        1       b
8  41.9        2       c

How is that not balanced? What am I doing wrong, and how can I fix this?


Solution

  • I ran into the same issue. A dataset that AnovaRM works with is in this tutorial: https://pythontutorials.eu/numerical/statistics-solution-1/

    I also used your method of checking the shapes by iterating through all the levels of all the variables. My output also showed the same shape everywhere, and so does the dataset in the link above.

    It turned out that having the same shape is not enough. Every subject also needs the same number of rows: if you run something like df[subject_name].value_counts() on your input df, where subject_name is the column you pass as subject, every subject must show the same count. If the counts differ, AnovaRM gives you the unbalanced data error.

    I ran this check on my df and it showed that some subjects had fewer values than others, whereas in the example df from the link above every subject has the same number of values. I then manually subset my df to the subjects that have the same number of values/measurements, and AnovaRM worked for me. Give it a try and let me know whether this helps you understand what unbalanced really means here.
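
The checks described above can be sketched with pandas alone. This is a minimal illustration, not code from the answer: the helper names (`subject_counts`, `cell_counts`, `complete_subjects`) and the toy DataFrame are hypothetical, and the column names simply mirror those in the question. Besides the per-subject row count, the sketch also lists which combinations of factor levels a subject is missing, on the assumption that a balanced repeated-measures design needs every subject to have data for every combination.

```python
import pandas as pd


def subject_counts(df, subject):
    """Rows per subject; in a balanced design all counts are equal."""
    return df[subject].value_counts()


def cell_counts(df, subject, within):
    """Rows per (subject, factor-level combination); 0 marks a missing cell."""
    within = list(within)
    counts = df.groupby([subject] + within).size()
    # Build every possible combination so that absent cells show up as 0.
    levels = [df[subject].unique()] + [df[w].unique() for w in within]
    full = pd.MultiIndex.from_product(levels, names=[subject] + within)
    return counts.reindex(full, fill_value=0)


def complete_subjects(df, subject, within):
    """Subjects that have at least one row in every factor-level combination."""
    counts = cell_counts(df, subject, within)
    has_all = (counts > 0).groupby(level=subject).all()
    return has_all[has_all].index


# Hypothetical toy data: 3 subjects, 2 x 2 within-subject design.
rows = [
    (s, f1, f2, 30.0 + s + f1)
    for s in (1, 2, 3)
    for f1 in (0, 1)
    for f2 in ("a", "b")
]
toy = pd.DataFrame(rows, columns=["Subject", "Factor1", "Factor2", "Test"])
toy = toy.iloc[:-1]  # drop the last row so subject 3 lacks the (1, "b") cell

within = ["Factor1", "Factor2"]
print(subject_counts(toy, "Subject"))   # subject 3 has fewer rows than 1 and 2

missing = cell_counts(toy, "Subject", within)
print(missing[missing == 0])            # the cells that have no data

# Keep only the subjects with a complete set of cells.
balanced = toy[toy["Subject"].isin(complete_subjects(toy, "Subject", within))]
print(balanced["Subject"].unique())
```

The filtered `balanced` frame is what one would then pass to AnovaRM in place of the original DataFrame. Note that under this check, factors whose levels always occur together (0 only with a, 1 only with b, and so on) leave most combinations empty, so no subject would count as complete; that may be worth ruling out for data laid out like the sample in the question.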