Search code examples
pythonpandasmean

Make multiple means from one column with pandas


With Python 3.10

Sample data:

import pandas as pd

data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
        [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
        [87,10026, 54]]  
df = pd.DataFrame(data, columns=['column', 'data', 'something'])  
print(df)  
df = df.mask(df == 0).fillna(df.mean())  
print(df)    # <---this works but you will see what I mean about looking off..

Updated Solution:
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = round(df['data'].rolling(4, 1).apply(lambda x: np.nanmean(x)), 2)
df['final2'] = np.where(df['data'] > 0, df['data'], df['ma'])
print(df)
# it replaces the zeros and NULLS with a value, (sometimes it fits well, sometimes, not so much).

The idea is I have one or more column(s) with bad or missing data.

  • If I use .fillna(df.mean()) for this it sticks out like a sore thumb.

  • My Goal is to have a percentage of the total number of elements in the dataframe column to make the new mean from...

  • I would like to take a len(df)*0.30 (30%) and use divide it in half.

  • I would collect half the numbers above the index point where the (null/0/bad data) data exists.

  • I would collect half the numbers below the index where the

  • These collected elements would be the be used to calculate the missing or bad index point.

This would be more helpful if there were a data set that irregular or had missing bad data


Solution

  • You can take a rolling mean with min periods = 1 to smooth out the data. or you can do a variant of this method to customise what you want.

    Inside the lambda i used this np.nanmean(x).

    import pandas as pd
    import numpy as np
    
    data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
            [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
            [87,10026, 54]]  
    df = pd.DataFrame(data, columns=['column', 'data', 'something'])  
    df['ma'] = df['data'].rolling(3,1).apply(lambda x : np.nanmean(x))
    df['final'] = np.where(df['data'] >= 0, df['data'], df['ma'])
    
    print(df)
    

    result:

        column     data  something            ma    final
    0        1  14890.0          3  14890.000000  14890.0
    1        4      5.0          6   7447.500000      5.0
    2        7      8.0          9   4967.666667      8.0
    3       11     13.0         14      8.666667     13.0
    4       12      0.0         18      7.000000      0.0
    5       87      NaN         54      6.500000      6.5
    6        1      0.0          3      0.000000      0.0
    7        4      5.0          6      2.500000      5.0
    8        7      8.0          9      4.333333      8.0
    9       11     13.0         14      8.666667     13.0
    10      12      0.0         18      7.000000      0.0
    11      87      NaN         54      6.500000      6.5
    12       1      0.0          3      0.000000      0.0
    13       4      5.0          6      2.500000      5.0
    14       7      8.0          9      4.333333      8.0
    15      11     13.0         14      8.666667     13.0
    16      12      0.0         18      7.000000      0.0
    17      87  10026.0         54   3346.333333  10026.0