I have a function to find the common and uncommon items, and their rates, between a given list (one list) and other lists (60,000 lists) for each user (4,000 users). Running the loop below takes too long and uses too much memory; the list is only partially constructed before the program crashes. I think this is due to the long returned list and its heavy elements (tuples), so I split the work into the two functions below, but the problem seems to be in appending the tuples to the list:
[(user, [items], rate), (user, [items], rate), ....]
I want to create a dataframe from the returned values. What should I do to the algorithm to get around this and reduce memory usage?
I am using Python 3.7, Windows 10 64-bit, 8 GB RAM.
common items function:
def common_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))
    return user, com_items, com_items_rate
uncommon items function:
def uncommon_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))
    uncom_items = list(set(list2) - set(com_items))  # uncommon items that belong to list2
    uncom_items_rate = len(uncom_items) / len(set(list1).union(set(list2)))
    return user, com_items_rate, uncom_items, uncom_items_rate  # common_items_rate is also needed
Constructing the list:
common_item_rate_tuple_list = []
for usr in users:  # users.shape = 4,000
    list1 = get_user_list(usr)  # a function to get list1; it takes 0:00:00.015632 or less per user
    # print(usr, len(list1))
    for list2 in df["list2"]:  # df.shape = 60,000
        common_item_rate_tuple = common_items(usr, list1, list2)
        common_item_rate_tuple_list.append(common_item_rate_tuple)
print(len(common_item_rate_tuple_list))  # 4,000 * 60,000 = 240,000,000 items
# sample of common_item_rate_tuple_list:
# [(1, [2, 5, 8], 0.676), (1, [7, 4], 0.788), ..., (4000, [1, 5, 7, 9], 0.318), (4000, [8, 9, 6], 0.521)]
I looked at (Memory errors and list limits?) and (Memory error when appending to list in Python), but they deal with a list that is already constructed, and I could not apply the suggested answer for (Python list memory error).
There are a couple things you should consider for speed and memory management with data this big.
Sets

Use sets here, because order has no meaning in your lists and you are doing a lot of intersecting. So, can you change your get_user_list() function to return a set instead of a list? That will prevent all of the unnecessary conversions you are doing. The same goes for list2: make each entry a set right away.
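For illustration, here is a minimal sketch of your two helper functions rewritten to work directly on sets; it assumes get_user_list() (and each entry of list2) has been changed to return a set:

def common_items(user, set1, set2):
    # no list()/set() conversions needed when the inputs are already sets
    com_items = set1 & set2
    com_items_rate = len(com_items) / len(set1 | set2)
    return user, com_items, com_items_rate

def uncommon_items(user, set1, set2):
    union_size = len(set1 | set2)
    com_items = set1 & set2
    uncom_items = set2 - com_items  # uncommon items that belong to set2
    return user, len(com_items) / union_size, uncom_items, len(uncom_items) / union_size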
Edit with an example:

So I think your hope of keeping it all in a data frame is too big. Perhaps you can do what is needed without storing it in a data frame. A dictionary makes sense. You may even be able to compute things "on the fly" and not store the data. Anyhow, here is a toy example that shows the memory problem using 4K users and 10K "other lists". Of course the size of the intersected sets may make this vary, but it is informative:
import sys
import pandas as pd

# create list of users by index
users = list(range(4000))
match_data = list()
size_list2 = 10_000

for user in users:
    for t in range(size_list2):
        match_data.append((user, (1, 5, 6, 9), 0.55))  # 4 dummy matches and a fake percentage

print(match_data[:4])
print(f'size of match: {sys.getsizeof(match_data)/1_000_000} MB')

df = pd.DataFrame(match_data)
print(df.head())
print(f'size of dataframe {sys.getsizeof(df)/1_000_000} MB')
This yields the following:
[(0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55)]
size of match: 335.072536 MB
   0             1     2
0  0  (1, 5, 6, 9)  0.55
1  0  (1, 5, 6, 9)  0.55
2  0  (1, 5, 6, 9)  0.55
3  0  (1, 5, 6, 9)  0.55
4  0  (1, 5, 6, 9)  0.55
size of dataframe 3200.00016 MB
You can see that even this scaled-down version of your idea, with only 10K other lists, takes 3.2 GB in a dataframe. That will be unmanageable.
Here is an idea for a data structure that just uses dictionaries all the way:
del df

# just keep it in a dictionary
data = {}  # intended format: key = (usr, other_list index) : value = {common elements}

# some fake user data
user_items = {1: {2, 3, 5, 7, 99},
              2: {3, 5, 88, 790},
              3: {2, 4, 100}}

# some fake "list 2" data
list2 = [{1, 2, 3, 4, 5},
         {88, 100},
         {11, 13, 200}]

for user in user_items.keys():
    for idx, other_set in enumerate(list2):  # using enumerate to get the index of the other list
        common_elements = user_items.get(user) & other_set  # set intersection
        if common_elements:  # only put it into the dictionary if it is not empty
            data[(user, idx)] = common_elements

# try a couple of data pulls
print(f'for user 1 and other list 0: {data.get((1, 0))}')
print(f'for user 2 and other list 2: {data.get((2, 2))}')  # use .get() to be safe; it returns None if there is no entry
The output here is:
for user 1 and other list 0: {2, 3, 5}
for user 2 and other list 2: None
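If you only need the rates rather than the matching elements themselves, you can also compute things "on the fly", as mentioned above, and keep just one small number per (user, other list) pair. A rough sketch of that idea, reusing the toy user_items and list2 from above:

# keep only the rate per (user, other list) pair, not the element sets
rates = {}  # key = (user, other_list index) : value = common items rate
for user, user_set in user_items.items():
    for idx, other_set in enumerate(list2):
        union_size = len(user_set | other_set)
        if union_size:  # guard against two empty sets
            rates[(user, idx)] = len(user_set & other_set) / union_size

print(rates.get((1, 0)))  # 3 common items / 7 in the union ~ 0.43

That is one float per pair instead of a tuple holding a list, which is dramatically smaller.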
Your other alternative, if you are going to be working with this data a lot, is to put these tables into a database such as sqlite, which is built in and won't bomb out your memory.
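Here is a minimal sketch of that idea with the built-in sqlite3 module, writing each user's rows as they are computed instead of holding everything in memory (the database file, table, and column names are only placeholders):

import sqlite3

conn = sqlite3.connect('matches.db')  # illustrative file name
conn.execute('CREATE TABLE IF NOT EXISTS common_rates (user INTEGER, list_idx INTEGER, rate REAL)')

for user, user_set in user_items.items():
    rows = []
    for idx, other_set in enumerate(list2):
        union = user_set | other_set
        if union:  # skip the degenerate case of two empty sets
            rows.append((user, idx, len(user_set & other_set) / len(union)))
    conn.executemany('INSERT INTO common_rates VALUES (?, ?, ?)', rows)
    conn.commit()  # flush this user's rows to disk so they don't pile up in RAM

conn.close()

Later you can pull just the slices you need back into pandas with pd.read_sql_query() instead of ever building one 240-million-row frame.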