I have a very large pandas DataFrame which looks something like this:

import pandas as pd

df = pd.DataFrame({"Station": [2, 2, 2, 5, 5, 5, 6, 6],
                   "Day": [1, 2, 3, 1, 2, 3, 1, 2],
                   "Temp": [-7.0, 2.7, -1.3, -1.9, 0.2, 0.5, 1.3, 6.4]})
and I would like to filter out, as efficiently (quickly) as possible, all rows whose 'Station' value does not occur exactly n times. What I have so far is:
stations = pd.unique(df['Station'])
n = 3

def complete(temp):
    # Drop all rows for stations that do not have exactly n entries
    for k in range(len(stations)):
        if len(temp[temp['Station'] == stations[k]].Temp) != n:
            temp.drop(temp.index[temp['Station'] == stations[k]], inplace=True)
I've been looking into using @jit(nopython=True) or Cython along the lines of the pandas "Enhancing performance" tutorial, but in the examples I have found the columns are treated separately from each other. I'm wondering: is the fastest way to somehow use @jit to build a list v of the 'Station' values I want to keep and then filter the whole data frame with df = df[df.Station.isin(v)], or is there a better way?
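Roughly what I have in mind is something like the sketch below (untested; stations_with_exactly_n is just a name I made up, and I have not profiled it against plain pandas):

import numpy as np
from numba import jit

@jit(nopython=True)
def stations_with_exactly_n(station_values, n):
    # Sort a copy of the station ids and scan runs of equal values,
    # collecting the ids whose run length is exactly n.
    s = np.sort(station_values)
    keep = np.empty_like(s)
    m = 0
    run_start = 0
    for i in range(1, len(s) + 1):
        if i == len(s) or s[i] != s[run_start]:
            if i - run_start == n:
                keep[m] = s[run_start]
                m += 1
            run_start = i
    return keep[:m]

v = stations_with_exactly_n(df['Station'].to_numpy(), n)
df = df[df.Station.isin(v)]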
Use value_counts:
out = df[df['Station'].isin(df['Station'].value_counts().loc[lambda x: x==n].index)]
print(out)
# Output
Station Day Temp
0 2 1 -7.0
1 2 2 2.7
2 2 3 -1.3
3 5 1 -1.9
4 5 2 0.2
5 5 3 0.5
Result of value_counts:
>>> df['Station'].value_counts()
2 3
5 3
6 2
Name: Station, dtype: int64
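An equivalent way to express the same filter, if you prefer groupby, is to compare each group's size to n (I have not benchmarked this against value_counts on a very large frame):

# Keep only the rows whose 'Station' group has exactly n members
out = df[df.groupby('Station')['Station'].transform('size') == n]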