Tags: python, pandas, dataframe, series, boolean-indexing

Using df.loc[] vs df[] shorthand with boolean masks, pandas


Both df[booleanMask] and df.loc[booleanMask] work for me, but I don't understand why. I thought the df[] shorthand (without .loc) applied to columns, whereas I am trying to filter rows, so I assumed I needed .loc.

Here is the specific code:

# Boolean operators
# All the games where a team scored at least 4 goals and won to nil
hw_4_0 = (pl23['FTHG'] >= 4) & (pl23['FTAG'] == 0)
aw_0_4 = (pl23['FTHG'] == 0) & (pl23['FTAG'] >= 4)
pl23.loc[aw_0_4 | hw_4_0]

For example, pl23.loc[aw_0_4 | hw_4_0, :] also works, but pl23.loc[:, aw_0_4 | hw_4_0] doesn't. I thought that df[boolean mask] was shorthand for the latter (as it is when indexing by label), so why does it work in this instance?

I used pl23.loc[aw_0_4 | hw_4_0], which returned the DataFrame the query was designed for, whereas I was expecting IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).


Solution

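  • With labels, df[…] selects columns, whereas df.loc[…] selects rows from the index.

    With a boolean mask (a boolean Series or another boolean list-like), both df[…] and df.loc[…] act on the rows, i.e. the index. A boolean Series is aligned on its own index first, which is why a mask built from the rows fails in the column position. To perform boolean indexing on the columns, you need df.loc[:, …] with a mask that matches the column labels.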

    Example:

    import pandas as pd

    df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
    
    # select "col1" in the columns
    df['col1']
    
    # select "0" in the index
    df.loc[0]
    
    
    # boolean indexing on the index
    df[df['col1'].ge(2)]
    # or
    df.loc[df['col1'].ge(2)]
    # or
    df[[False, True, True]]
    # or
    df.loc[[False, True, True]]
    
    
    # boolean indexing on the columns
    df.loc[:, df.loc[0].ge(2)]
    # or
    df.loc[:, [False, True]]
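
To tie this back to the question, here is a minimal sketch using a small, made-up stand-in for pl23 (the four rows of goal counts are hypothetical, not the asker's data). It suggests that the row mask behaves the same with and without .loc, and that the same mask is rejected in the column position because its index (the row labels) cannot be aligned with the column labels.

```python
import pandas as pd

# Hypothetical stand-in for pl23: home and away goals for four matches
pl23 = pd.DataFrame({
    "FTHG": [4, 0, 2, 5],
    "FTAG": [0, 4, 2, 1],
})

hw_4_0 = (pl23["FTHG"] >= 4) & (pl23["FTAG"] == 0)
aw_0_4 = (pl23["FTHG"] == 0) & (pl23["FTAG"] >= 4)
mask = aw_0_4 | hw_4_0  # boolean Series indexed by the row labels 0..3

# Both forms filter rows, so they should return the same frame
rows_plain = pl23[mask]
rows_loc = pl23.loc[mask]
same = rows_plain.equals(rows_loc)

# In the column position the mask's index (0..3) does not match the
# column labels ("FTHG", "FTAG"), so pandas refuses to use it.
# The question reports this as an IndexingError; a broad except is used
# here only so the sketch does not depend on where that class is imported from.
try:
    pl23.loc[:, mask]
    column_error = None
except Exception as exc:
    column_error = type(exc).__name__

print(same)
print(rows_plain)
print(column_error)
```

If the goal really is to filter columns, the mask has to be indexed by the column labels, as in the df.loc[:, df.loc[0].ge(2)] example above.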