Search code examples
pythondataframepandasdata-analysis

Selecting and averaging values in pandas data frames


I am new to the pandas data analysis library. The task I need to get done is the following:

Given a data frame (table) with a set of columns C, select all the rows for which the values in a subset of columns A are identical and combine them to a single row in which the values for the other rows C \ A are averaged.

(The goal is to test multiple runs of an program and average the values like running time in those cases where the parameters are identical.)

For example:

a       b       c       d
eggs    ham     1.0     0.0
eggs    ham     0.0     0.5
spam    eggs    0.0     1.0
spam    eggs    0.0     1.0

should be combined to:

a       b       c       d
eggs    ham     0.5     0.25
spam    eggs    0.0     1.0

when selecting the columns a and b (which means that c and d should be averaged).

What are my options?


Solution

  • This is what groupby does

    http://pandas.pydata.org/pandas-docs/stable/groupby.html

    from StringIO import StringIO
    
    txt = """a       b       c       d
    eggs    ham     1.0     0.0
    eggs    ham     0.0     0.5
    spam    eggs    0.0     1.0
    spam    eggs    0.0     1.0"""
    
    df = pd.read_csv(StringIO(txt), sep="\\s+")
    df.groupby(["a","b"], as_index=False).mean()