How to filter and a find a subset of a dataframe in which categorical data in two columns occur more than n, m times

I have a dataframe from a csv which contains userId, ISBN and ratings for a bunch of books. I want to find a subset of this dataframe in which both userIds occur more than 200 times and ISBNs occur more than 100 times.

Following is what I tried:

ratings = pd.read_csv('../data/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userId', 'ISBN', 'bookRating']

# Choose users with more than 200 ratings and books with more than 100 ratings

user_rating_count = ratings['userId'].value_counts()
relevant_ratings = ratings[ratings['userId'].isin(user_rating_count[user_rating_count >= 200].index)]
print(relevant_ratings.head())
print(relevant_ratings.shape)
books_rating_count = relevant_ratings['ISBN'].value_counts()
relevant_ratings_book = relevant_ratings[relevant_ratings['ISBN'].isin(
    books_rating_count[books_rating_count >= 100].index)]
print(relevant_ratings_book.head())
print(relevant_ratings_book.shape)

# Check that userId occurs more than 200 times

users_grouped = pd.DataFrame(relevant_ratings.groupby('userId')['bookRating'].count()).reset_index()
users_grouped.columns = ['userId', 'ratingCount']
sorted_users = users_grouped.sort_values('ratingCount')
print(sorted_users.head())

# Check that ISBN occurs more than 100 times

books_grouped = pd.DataFrame(relevant_ratings.groupby('ISBN')['bookRating'].count()).reset_index()
books_grouped.columns = ['ISBN', 'ratingCount']
sorted_books = books_grouped.sort_values('ratingCount')
print(sorted_books.head())

Following is the output I got:

      userId        ISBN  bookRating
1456  277427  002542730X          10
1457  277427  0026217457           0
1458  277427  003008685X           8
1459  277427  0030615321           0
1460  277427  0060002050           0
(527556, 3)
      userId        ISBN  bookRating
1469  277427  0060930535           0
1471  277427  0060934417           0
1474  277427  0061009059           9
1495  277427  0142001740           0
1513  277427  0312966091           0
(13793, 3)
     userId  ratingCount
73    26883          200
298   99955          200
826  252827          200
107   36554          200
240   83671          200
               ISBN  ratingCount
0        0330299891            1
132873   074939918X            1
132874   0749399201            1
132875   074939921X            1
132877   0749399295            1

As seen above when sorting the table in ascending order grouped by userId, it shows userIds only more than 200 times. But when sorting the table in ascending order grouped by ISBN, it shows ISBNs which occurs even 1 time.

I expected both userIds and ISBNs to occur more than 200 and 100 times respectively. Please let me know what I have done wrong and how to get the correct result.

Solution

You should try and produce a small version of the problem that can be solved without access to large csv files. Check this page for more details: https://stackoverflow.com/help/how-to-ask

That said, here is a dummy version of your dataset:

import pandas as pd
import random
import string
n=1000
isbn = [random.choice(['abc','def','ghi','jkl','mno']) for x in range(n)]
rating = [random.choice(range(9)) for x in range(n)]
userId = [random.choice(['x','y','z']) for x in range(n)]
df = pd.DataFrame({'isbn':isbn,'rating':rating,'userId':userId})

You can get the counts by userId and isbns this way:

df_userId_count = df.groupby('userId',as_index=False)['rating'].count()
df_isbn_count = df.groupby('isbn',as_index=False)['rating'].count()

and extract the unique values by:

userId_select = (df_userId_count[df_userId_count.rating>200].userId.values)
isbn_select = (df_isbn_count[df_isbn_count.rating>100].isbn.values)

So that your final filtered dataframe is:

df = df[df.userId.isin(userId_select) & df.isbn.isin(isbn_select) ]