Tags: python, python-3.x, pandas, performance, numba

Speed up dropping rows based on pandas column values


I have a very large pandas DataFrame that looks something like this:

import pandas as pd

df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})

and I would like to filter out, as efficiently (quickly) as possible, all rows whose 'Station' value does not occur in exactly n rows.

stations = pd.unique(df['Station'])
n = 3

def complete(temp):
    # For each station, drop all of its rows if that station
    # does not appear in exactly n rows.
    for k in range(len(stations)):
        if len(temp[temp['Station'] == stations[k]].Temp) != n:
            temp.drop(temp.index[temp['Station'] == stations[k]], inplace=True)

I've been looking into using @jit(nopython=True) or Cython along the lines of the pandas "Enhancing performance" tutorial, but in the examples I have found the columns are treated separately from each other. I'm wondering: is the fastest way to somehow use @jit to build a list v of the 'Station' values I want to keep and then use df = df[df.Station.isin(v)] to filter the rows of the entire data frame, or is there a better way?
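
Something along these lines is what I have in mind (a rough, untested sketch assuming integer station codes and that numba is installed; count_per_station is just a placeholder name):

import numpy as np
from numba import njit, types
from numba.typed import Dict

@njit
def count_per_station(station):
    # One pass over the station codes, counting how often each occurs.
    counts = Dict.empty(key_type=types.int64, value_type=types.int64)
    for s in station:
        if s in counts:
            counts[s] += 1
        else:
            counts[s] = 1
    return counts

counts = count_per_station(df['Station'].to_numpy(dtype=np.int64))
v = [s for s, c in counts.items() if c == n]   # station codes appearing exactly n times
df = df[df.Station.isin(v)]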


Solution

  • Use value_counts:

    out = df[df['Station'].isin(df['Station'].value_counts().loc[lambda x: x==n].index)]
    print(out)
    
    # Output
       Station  Day  Temp
    0        2    1  -7.0
    1        2    2   2.7
    2        2    3  -1.3
    3        5    1  -1.9
    4        5    2   0.2
    5        5    3   0.5
    

    Result of value_counts:

    >>> df['Station'].value_counts()
    2    3
    5    3
    6    2
    Name: Station, dtype: int64