I have a dataframe from a csv which contains userId, ISBN and ratings for a bunch of books. I want to find a subset of this dataframe in which both userIds occur more than 200 times and ISBNs occur more than 100 times.
Following is what I tried:
ratings = pd.read_csv('../data/BX-Book-Ratings.csv', sep=';', error_bad_lines=False, encoding="latin-1")
ratings.columns = ['userId', 'ISBN', 'bookRating']
# Choose users with more than 200 ratings and books with more than 100 ratings
user_rating_count = ratings['userId'].value_counts()
relevant_ratings = ratings[ratings['userId'].isin(user_rating_count[user_rating_count >= 200].index)]
print(relevant_ratings.head())
print(relevant_ratings.shape)
books_rating_count = relevant_ratings['ISBN'].value_counts()
relevant_ratings_book = relevant_ratings[relevant_ratings['ISBN'].isin(
books_rating_count[books_rating_count >= 100].index)]
print(relevant_ratings_book.head())
print(relevant_ratings_book.shape)
# Check that userId occurs more than 200 times
users_grouped = pd.DataFrame(relevant_ratings.groupby('userId')['bookRating'].count()).reset_index()
users_grouped.columns = ['userId', 'ratingCount']
sorted_users = users_grouped.sort_values('ratingCount')
print(sorted_users.head())
# Check that ISBN occurs more than 100 times
books_grouped = pd.DataFrame(relevant_ratings.groupby('ISBN')['bookRating'].count()).reset_index()
books_grouped.columns = ['ISBN', 'ratingCount']
sorted_books = books_grouped.sort_values('ratingCount')
print(sorted_books.head())
Following is the output I got:
userId ISBN bookRating
1456 277427 002542730X 10
1457 277427 0026217457 0
1458 277427 003008685X 8
1459 277427 0030615321 0
1460 277427 0060002050 0
(527556, 3)
userId ISBN bookRating
1469 277427 0060930535 0
1471 277427 0060934417 0
1474 277427 0061009059 9
1495 277427 0142001740 0
1513 277427 0312966091 0
(13793, 3)
userId ratingCount
73 26883 200
298 99955 200
826 252827 200
107 36554 200
240 83671 200
ISBN ratingCount
0 0330299891 1
132873 074939918X 1
132874 0749399201 1
132875 074939921X 1
132877 0749399295 1
As seen above when sorting the table in ascending order grouped by userId, it shows userIds only more than 200 times. But when sorting the table in ascending order grouped by ISBN, it shows ISBNs which occurs even 1 time.
I expected both userIds and ISBNs to occur more than 200 and 100 times respectively. Please let me know what I have done wrong and how to get the correct result.
You should try and produce a small version of the problem that can be solved without access to large csv files. Check this page for more details: https://stackoverflow.com/help/how-to-ask
That said, here is a dummy version of your dataset:
import pandas as pd
import random
import string
n=1000
isbn = [random.choice(['abc','def','ghi','jkl','mno']) for x in range(n)]
rating = [random.choice(range(9)) for x in range(n)]
userId = [random.choice(['x','y','z']) for x in range(n)]
df = pd.DataFrame({'isbn':isbn,'rating':rating,'userId':userId})
You can get the counts by userId and isbns this way:
df_userId_count = df.groupby('userId',as_index=False)['rating'].count()
df_isbn_count = df.groupby('isbn',as_index=False)['rating'].count()
and extract the unique values by:
userId_select = (df_userId_count[df_userId_count.rating>200].userId.values)
isbn_select = (df_isbn_count[df_isbn_count.rating>100].isbn.values)
So that your final filtered dataframe is:
df = df[df.userId.isin(userId_select) & df.isbn.isin(isbn_select) ]