Tags: python, python-3.x, pandas, performance, numba

Speed up dropping rows based on pandas column values


I have a very large pandas DataFrame that looks something like this:

import pandas as pd

df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})

and I would like to filter out, as efficiently (quickly) as possible, all rows whose 'Station' value does not occur in exactly n rows.

stations = pd.unique(df['Station'])
n = 3

def complete(temp):
    # For each station, drop all of its rows if that station
    # does not appear in exactly n rows.
    for k in range(len(stations)):
        if len(temp[temp['Station'] == stations[k]].Temp) != n:
            temp.drop(temp.index[temp['Station'] == stations[k]], inplace=True)

I've been looking into using @jit(nopython=True) or Cython along the lines of the pandas "Enhancing performance" tutorial, but in the examples I have found the columns are treated separately from each other. I'm wondering: is the fastest way to somehow use @jit to build a list v of the 'Station' values I want to keep and then use df = df[df.Station.isin(v)] to filter the rows of the entire data frame, or is there a better way?
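
Something along these lines is what I have in mind (a rough, untested sketch assuming integer station codes and that numba is installed; count_per_station is just a placeholder name):

import numpy as np
from numba import njit, types
from numba.typed import Dict

@njit
def count_per_station(station):
    # One pass over the station codes, counting how often each occurs.
    counts = Dict.empty(key_type=types.int64, value_type=types.int64)
    for s in station:
        if s in counts:
            counts[s] += 1
        else:
            counts[s] = 1
    return counts

counts = count_per_station(df['Station'].to_numpy(dtype=np.int64))
v = [s for s, c in counts.items() if c == n]   # station codes appearing exactly n times
df = df[df.Station.isin(v)]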


Solution

  • Use value_counts:

    out = df[df['Station'].isin(df['Station'].value_counts().loc[lambda x: x==n].index)]
    print(out)
    
    # Output
       Station  Day  Temp
    0        2    1  -7.0
    1        2    2   2.7
    2        2    3  -1.3
    3        5    1  -1.9
    4        5    2   0.2
    5        5    3   0.5
    

    Result of value_counts:

    >>> df['Station'].value_counts()
    2    3
    5    3
    6    2
    Name: Station, dtype: int64