
Pandas, compute many means with bootstrap confidence intervals for plotting


I want to compute means with bootstrap confidence intervals for some subsets of a dataframe; the ultimate goal is to produce bar graphs of the means with bootstrap confidence intervals as the error bars. My data frame looks like this:

ATG12 Norm     ATG5 Norm    ATG7 Norm    Cancer Stage    
5.55           4.99         8.99         IIA
4.87           5.77         8.88         IIA
5.98           7.88         8.34         IIC

The subsets I'm interested in are every combination of Norm columns and cancer stage. I've managed to produce a table of means using:

df.groupby('Cancer Stage')[['ATG12 Norm', 'ATG5 Norm', 'ATG7 Norm']].mean()

But I need to compute bootstrap confidence intervals to use as error bars for each of those means, using the approach described at http://www.randalolson.com/2012/08/06/statistical-analysis-made-easy-in-python/ which boils down to:

import scipy
import scikits.bootstrap as bootstrap
CI = bootstrap.ci(data=Series, statfunction=scipy.mean)
# CI[0] and CI[1] are the lower and upper bounds of the confidence interval

I tried to apply this method to each subset of data with a nested-loop script:

for i in data.groupby('Cancer Stage'):
    for p in i.columns[1:3]: # PROBLEM!!
        Series = i[p]
        print p
        print Series.mean()
        ci = bootstrap.ci(data=Series, statfunction=scipy.mean)

This produced the error message:

AttributeError: 'tuple' object has no attribute 'columns'

Not knowing what "tuples" are, I have some reading to do but I'm worried that my current approach of nested for loops will leave me with some kind of data structure I won't be able to easily plot from. I'm new to Pandas so I wouldn't be surprised to find there's a simpler, easier way to produce the data I'm trying to graph. Any and all help will be very much appreciated.


Solution

  • The error comes from how you iterate over the groupby object. Iterating over data.groupby('Cancer Stage') does not yield data frames; it yields one tuple per group of the form (name, group), where name is the value of the grouping column and group is the sub-frame holding that group's rows. Unpack the tuple in the for statement:

    for name, group in data.groupby('Cancer Stage'):
        print(name)
        for p in group.columns[0:3]:  # the three Norm columns
            ...
    

    The pandas documentation on groupby covers this in more detail, and the Python reference explains what tuples are.
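To make the (name, group) tuples concrete, here is a minimal sketch that rebuilds the three sample rows from the question and loops over the groups. The dictionary `group_means` is a name introduced only for this illustration.

```python
import pandas as pd

# The three sample rows shown in the question.
data = pd.DataFrame({
    "ATG12 Norm": [5.55, 4.87, 5.98],
    "ATG5 Norm": [4.99, 5.77, 7.88],
    "ATG7 Norm": [8.99, 8.88, 8.34],
    "Cancer Stage": ["IIA", "IIA", "IIC"],
})

# Each item produced by iterating a groupby object is a 2-tuple.
first_item = next(iter(data.groupby("Cancer Stage")))
first_name, first_group = first_item  # unpack: group label, sub-frame

# Collect the mean of every (stage, column) combination.
group_means = {}
for name, group in data.groupby("Cancer Stage"):
    for col in group.columns[0:3]:  # the three Norm columns
        group_means[(name, col)] = group[col].mean()

print(group_means)
```

Keying the results by (stage, column) keeps them in one flat structure that can later be turned into a Series or DataFrame for plotting.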

    Grouping a data frame and applying a function to each group can be done in a single statement with the apply method of the groupby object:

    import numpy as np  # np.mean is the function that scipy.mean aliased
    cols = data.columns[0:3]  # the three Norm columns
    for col in cols:
        print(data.groupby('Cancer Stage')[col].apply(lambda x: bootstrap.ci(data=x.values, statfunction=np.mean)))
    

    This does everything you need in one statement per column and produces an easily plottable Series, indexed by cancer stage.
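If scikits.bootstrap is not available, the same groupby-plus-apply pattern can be tried with a hand-written interval function. The sketch below uses a plain percentile bootstrap written with NumPy; this is a swapped-in technique, not necessarily the method bootstrap.ci uses, so the bounds will not match it exactly. The helper `percentile_ci` and its parameters (`n_boot`, `alpha`, `seed`) are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd


def percentile_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Hypothetical helper: percentile-bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Draw n_boot resamples (with replacement) and take the mean of each.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    low, high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return pd.Series({"low": low, "high": high})


df = pd.DataFrame({"A": range(24), "B": list("aabb") * 6, "C": range(15, 39)})

# apply returns a Series indexed by (group, bound); unstack turns the
# bounds into columns, giving one row per group.
ci = df.groupby("B")["A"].apply(percentile_ci).unstack()
means = df.groupby("B")["A"].mean()
print(ci)
```

Returning a labelled Series from the helper, rather than a bare array, means the lower and upper bounds end up in named columns instead of inside list-like cells.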

    EDIT: I toyed around with a data frame object I created myself:

    df = pd.DataFrame({'A': range(24), 'B': list('aabb') * 6, 'C': range(15, 39)})
    for col in ['A', 'C']:
        print(df.groupby('B')[col].apply(lambda x: bootstrap.ci(data=x.values)))
    

    yields two Series that look like this (the exact bounds vary from run to run because the bootstrap resamples at random):

    B
    a    [6.58333333333, 14.3333333333]
    b                      [8.5, 16.25]
    
    B
    a    [21.5833333333, 29.3333333333]
    b            [23.4166666667, 31.25]
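For the bar graph itself, the interval bounds have to be converted into distances from the mean, because matplotlib's `yerr` argument expects the lengths of the error bars below and above each bar rather than the absolute bounds. A rough sketch, reusing the bounds printed above for column A; `plot_means` is a hypothetical helper, and matplotlib is imported inside it so the arithmetic can be checked without a plotting backend.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(24), "B": list("aabb") * 6, "C": range(15, 39)})
means = df.groupby("B")["A"].mean()

# Bounds copied from the first Series printed above (column A).
ci = pd.DataFrame(
    {"low": [6.58333333333, 8.5], "high": [14.3333333333, 16.25]},
    index=means.index,
)

# Row 0: distance from the mean down to the lower bound.
# Row 1: distance from the mean up to the upper bound.
yerr = np.vstack([means - ci["low"], ci["high"] - means])


def plot_means(means, yerr):
    """Hypothetical helper: bar chart of means with asymmetric error bars."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(means.index, means.values, yerr=yerr, capsize=4)
    ax.set_ylabel("mean of A")
    return ax
```

Calling `plot_means(means, yerr)` followed by `plt.show()` should then draw one bar per group. The same steps apply to the cancer-stage data by replacing `df`, `"B"` and `"A"` with the real frame and column names.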