Tags: python, pandas, dataframe, boolean-indexing

How do I index a pandas DataFrame using boolean indexing?


I am starting a new practice module in pandas that deals with indexing and filtering data. I have come across a chained syntax that was not explained in the course, and I was wondering if anyone could help me make sense of it. The dataset is the Fortune 500 company listing.

df = pd.read_csv('f500.csv', index_col = 0)

The issue is that we have been taught to use boolean indexing by passing the boolean condition to the DataFrame like so:

motor_bool = df["industry"] == "Motor Vehicles and Parts"
motor_countries = df.loc[motor_bool, "country"]

The above code finds the countries of the companies whose industry is "Motor Vehicles and Parts". The last exercise in the module asks us to

"Create a series, industry_usa, containing counts of the two most common values in the industry column for companies headquartered in the USA."

And the answer code is

industry_usa = f500["industry"][f500["country"] == "USA"].value_counts().head(2)

I don't understand how we can suddenly use df[col][mask] back to back. Am I not supposed to pass the boolean condition first and then specify which column I want using .loc? The chaining they used is very different from what we have practiced.

Please help. I am truly confused.

As always, thank you, Stack community.


Solution

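  • The chained form works because f500["industry"] returns a Series, and a Series can itself be filtered with a boolean mask. Still, I think that form is not recommended. It is better to use DataFrame.loc, as in your second snippet, to select the industry column by the mask and then get the counts: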

    industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().head(2)
    

    Another solution uses Series.nlargest on the counts (nlargest needs numeric values, so it has to come after value_counts):

    industry_usa = f500.loc[f500["country"] == "USA", "industry"].value_counts().nlargest(2)
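
To make the comparison concrete, here is a minimal sketch of both indexing forms side by side. It assumes a small made-up DataFrame in place of f500.csv (the rows, the index labels, and the names usa_mask, chained, with_loc and top_two are illustrative only, not taken from the real dataset or the course).

```python
import pandas as pd

# Hypothetical stand-in for f500.csv; the rows are made up for illustration.
f500 = pd.DataFrame(
    {
        "country": ["USA", "USA", "China", "USA", "Japan", "USA"],
        "industry": [
            "Banks",
            "Banks",
            "Banks",
            "Retail",
            "Motor Vehicles and Parts",
            "Banks",
        ],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

# Boolean Series with one True/False value per row, aligned on the index.
usa_mask = f500["country"] == "USA"

# Chained form: first pick the column (a Series), then filter it with the mask.
chained = f500["industry"][usa_mask]

# .loc form: pick the rows and the column in a single indexing step.
with_loc = f500.loc[usa_mask, "industry"]

# Both forms select the same values, so the counts are identical as well.
same_selection = chained.equals(with_loc)

# Counts are sorted in descending order, so head(2) keeps the two most common.
industry_usa = with_loc.value_counts().head(2)

# nlargest(2) on the counts picks the two largest counts.
top_two = with_loc.value_counts().nlargest(2)

print(same_selection)
print(industry_usa)
```

For reading a column like this, both forms return the same result. The single-step .loc form tends to matter more when assigning values, where chained indexing may end up modifying a temporary copy instead of the original DataFrame.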