
How to calculate mean of specific rows in python dataframe?


I have a dataframe with about 11 million rows. It has several columns, but I am only interested in two of them: Name and Value. A name can appear in many rows. I want to calculate the average value for each name and put the results in a new dataframe. I don't really know how to walk through the rows or how to calculate the average. Any help will be highly appreciated. Thank you!

Name   DataType  TimeStamp            Value       Quality
Food   Float     2019-01-01 13:00:00  105.75      122
Food   Float     2019-01-01 17:30:00  11.8110352  122
Food   Float     2019-01-01 17:45:00  12.7932892  122
Water  Float     2019-01-01 14:01:00  16446.875   122
Water  Float     2019-01-01 14:00:00  146.875     122

RangeIndex: 11140487 entries, 0 to 11140486
Data columns (total 6 columns):
Name     object
Value    object

This is what I have so far. I know it is a beginner's attempt, but I am having a difficult time walking through the rows.

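for i in range(len(df)):
    if df.iloc[i]['DataType'] != 'Undefined':
        # Mean of Value over every row that shares this row's Name
        # (rescans the whole dataframe once per row, so it is very slow)
        print(df.loc[df['Name'] == df.iloc[i]['Name'], 'Value'].mean())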
  
   

Solution

  • Avoid iterating over the rows of a dataframe as much as possible, because it is very inefficient...

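    groupby is the way to go when you want to apply the same processing to several groups of rows, where each group is identified by its values in one or more columns. Here, with TagName and Sample_value standing in for the Name and Value columns of the question, what you want is: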

    df.groupby('TagName')['Sample_value'].mean().reset_index()
    

    As expected, it gives:

         TagName  Sample_value
    0      Steam  1.081447e+06
    1  Utilities  3.536931e+05
    

    Details on the magic words:

    • groupby: identifies the column(s) used to group the rows; rows with the same values in those columns fall into the same group
    • ['Sample_value']: restricts the groupby object to the column of interest
    • mean(): computes the mean per group
    • reset_index(): by default the grouping columns go into the index, which is fine for the mean operation. reset_index turns them back into normal columns
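
The question reports the Value column as object dtype, which usually means the numbers are stored as strings, and mean() needs numeric data. The following is a minimal sketch of one way to handle that, using the five sample rows and the Name/Value column names from the question. That the values are strings is an assumption based on the dtype listing, and the DataType filter is only carried over from the loop in the question.

```python
import pandas as pd

# Sample rows from the question. Value is kept as strings here to mimic
# the object dtype shown in the question (an assumption about the real data).
df = pd.DataFrame({
    "Name": ["Food", "Food", "Food", "Water", "Water"],
    "DataType": ["Float"] * 5,
    "Value": ["105.75", "11.8110352", "12.7932892", "16446.875", "146.875"],
})

# Convert to numbers; anything that cannot be parsed becomes NaN,
# which mean() skips by default.
df["Value"] = pd.to_numeric(df["Value"], errors="coerce")

# Optional filter carried over from the loop in the question.
valid = df[df["DataType"] != "Undefined"]

# as_index=False keeps Name as a regular column, so no reset_index() is needed.
result = valid.groupby("Name", as_index=False)["Value"].mean()
print(result)
```

Passing as_index=False is an alternative to calling reset_index() afterwards; both give a dataframe with one row per name and the mean in the Value column.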