Search code examples
pythonpandasdataframeseriesmethod-chaining

Is there a query method or similar for pandas Series (pandas.Series.query())?


The pandas.DataFrame.query() method is of great usage for (pre/post)-filtering data when loading or plotting. It comes particularly handy for method chaining.

I find myself often wanting to apply the same logic to a pandas.Series, e.g. after having done a method such as df.value_counts which returns a pandas.Series.

Example

Lets assume there is a huge table with the columns Player, Game, Points and I want to plot a histogram of the players with more than 14 times 3 points. I first have to sum the points of each player (groupby -> agg) which will return a Series of ~1000 players and their overall points. Applying the .query logic it would look something like this:

df = pd.DataFrame({
    'Points': [random.choice([1,3]) for x in range(100)], 
    'Player': [random.choice(["A","B","C"]) for x in range(100)]})

(df
     .query("Points == 3")
     .Player.values_count()
     .query("> 14")
     .hist())

The only solutions I find force me to do an unnecessary assignment and break the method chaining:

(points_series = df
     .query("Points == 3")
     .groupby("Player").size()
points_series[points_series > 100].hist()

Method chaining as well as the query method help to keep the code legible meanwhile the subsetting-filtering can get messy quite quickly.

# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape

Please help me out of my dilemma! Thanks


Solution

  • If I understand correctly you can add query("Points > 100"):

    df = pd.DataFrame({'Points':[50,20,38,90,0, np.Inf],
                       'Player':['a','a','a','s','s','s']})
    
    print (df)
      Player     Points
    0      a  50.000000
    1      a  20.000000
    2      a  38.000000
    3      s  90.000000
    4      s   0.000000
    5      s        inf
    
    points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
    print (points_series)     
    a = points_series[points_series > 100]
    print (a)     
    Player
    a    108.0
    Name: Points, dtype: float64
    
    
    points_series = df.query("Points < inf")
                      .groupby("Player")
                      .agg({"Points": "sum"})
                      .query("Points > 100")
    
    print (points_series)     
            Points
    Player        
    a        108.0
    

    Another solution is Selection By Callable:

    points_series = df.query("Points < inf")
                      .groupby("Player")
                      .agg({"Points": "sum"})['Points']
                      .loc[lambda x: x > 100]
    
    print (points_series)     
    Player
    a    108.0
    Name: Points, dtype: float64
    

    Edited answer by edited question:

    np.random.seed(1234)
    df = pd.DataFrame({
        'Points': [np.random.choice([1,3]) for x in range(100)], 
        'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})
    
    print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
    C    19
    B    16
    Name: Player, dtype: int64
    
    print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
    Player
    B    16
    C    19
    dtype: int64