Below is the answer code I received from the Kaggle Pandas course.
def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1

star_ratings_2 = reviews.apply(stars, axis='columns')
The question goes like this:
We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.
Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.
Create a series star_ratings with the number of stars corresponding to each review in the dataset.
The dataset looks like this: [table not shown]
My question is:
star_ratings_2 = reviews.apply(stars, axis='columns')

Why axis='columns' instead of axis='rows'? Since the stars() function has to process the country and points columns of a row, shouldn't we pass a row to the stars() function?

I just didn't expect the correct answer to be axis='columns'. I've asked around, including ChatGPT, but no answer satisfied me. ChatGPT even thinks I'm right and that axis='rows' should be correct.
The terminology is perhaps misleading, but the apply documentation is pretty clear:
axis: {0 or ‘index’, 1 or ‘columns’}, default 0
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
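In other words, with axis='columns' the function receives each row as a Series indexed by the column labels, which is exactly why row.country and row.points work inside stars(). A minimal sketch, using a made-up two-row stand-in for reviews (the real dataset has many more rows and columns):

import pandas as pd

# Made-up miniature stand-in for `reviews`.
reviews = pd.DataFrame({
    'country': ['Canada', 'Italy'],
    'points': [87, 90],
})

# axis='columns' (axis=1): the function receives each ROW as a Series
# indexed by the column labels, so row.country and row.points exist.
labels = reviews.apply(lambda row: f"{row.country}: {row.points}", axis='columns')
print(labels)  # 'Canada: 87' for row 0, 'Italy: 90' for row 1

# axis='index' (axis=0, the default) would instead pass each COLUMN,
# and the same lambda would fail because a column has no .country attribute.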
You can draw a parallel with aggregation functions: df.sum(axis=1) takes each row and aggregates it into a single value. The same thing happens here: apply with axis=1 / axis='columns' takes each row and does something with it.
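For instance, here is a small self-contained sketch of that parallel (the frame and its column names are made up):

import pandas as pd

# Hypothetical toy frame, just to illustrate the aggregation parallel.
df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# Aggregation along axis=1: each row collapses to a single value.
print(df.sum(axis=1))  # 11 and 22

# apply along the same axis: each row is passed to the function,
# which likewise produces one value per row.
print(df.apply(lambda row: row.a + row.b, axis='columns'))  # 11 and 22

Just like stars(), the lambda sees one row at a time; the only difference is what it computes from that row.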