python, list, memory-efficient

Python list appending takes a long time?


I have a function to find the common and uncommon items, and their rates, between a given list (one list) and other lists (60,000 lists) for each user (4,000 users). Running the loop below takes too long and uses too much memory; it crashes with only part of the list constructed. I think this is due to the long returned list and its heavy elements (tuples), so I split it into two functions as shown below, but the problem seems to be in appending the list items as tuples, [(user, [items], rate), (user, [items], rate), ....]. I want to create a DataFrame from the returned values.

What should I do with the algorithm to get around this and reduce memory usage?

I am using Python 3.7, Windows 10 64-bit, 8 GB RAM.

common items function:

def common_items(user, list1, list2):
    # items present in both lists, plus their rate: |intersection| / |union|
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))
    return user, com_items, com_items_rate

uncommon items function:

def uncommon_items(user, list1, list2):
    com_items = list(set(list1).intersection(set(list2)))
    com_items_rate = len(com_items) / len(set(list1).union(set(list2)))

    uncom_items = list(set(list2) - set(com_items))  # uncommon items that belong to list2
    uncom_items_rate = len(uncom_items) / len(set(list1).union(set(list2)))

    return user, com_items_rate, uncom_items, uncom_items_rate  # common_items_rate is also needed

Constructing the list:

common_item_rate_tuple_list = [] 

for usr in users: # users.shape = 4,000
    list1 = get_user_list(usr) # a function to get list1, it takes 0:00:00.015632 or less for a user
#     print(usr, len(list1))            

    for list2 in df["list2"]: # df.shape = 60,000

        common_item_rate_tuple = common_items(usr,list1, list2) 
        common_item_rate_tuple_list.append(common_item_rate_tuple)
        
print(len(common_item_rate_tuple_list)) # 4,000 * 60,000 = 240,000,000 items
# sample of common_item_rate_tuple_list:
# [(1, [2,5,8], 0.676), (1, [7,4], 0.788), .... (4000, [1,5,7,9], 0.318), (4000, [8,9,6], 0.521)]

I looked at (Memory errors and list limits?) and (Memory error when appending to list in Python); they deal with the already-constructed list. And I could not apply the suggested answer for (Python list memory error).


Solution

  • There are a couple of things you should consider for speed and memory management with data this big.

    • You are (or should be) working only with sets here, because order has no meaning in your lists and you are doing a lot of intersecting of sets. So, can you change your get_user_list() function to return a set instead of a list? That will prevent all of the unnecessary conversions you are doing. Same for list2: just make it a set right away.
    • For the "uncommon items" you should just use the symmetric difference operator on the sets. Much faster, and many fewer list -> set conversions (see the sketch after this list).
    • At the end of your loop, do you really want to create a list of 240M sub-lists? That is probably your memory explosion. I would suggest a dictionary keyed by user name, and you only need to create an entry in it if there are common items. If the matches are "sparse", you will get a much smaller data container.
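
    A minimal sketch of the first two points, assuming get_user_list() is changed to return a set (the parameter names here are just illustrative; the question's one-sided difference is noted in a comment):

    def common_items(user, user_set, other_set):
        # both arguments are already sets, so no list -> set conversions are needed
        com_items = user_set & other_set                        # set intersection
        com_items_rate = len(com_items) / len(user_set | other_set)
        return user, com_items, com_items_rate

    def uncommon_items(user, user_set, other_set):
        union_size = len(user_set | other_set)
        com_items_rate = len(user_set & other_set) / union_size
        # symmetric difference: items that appear in exactly one of the two sets;
        # if you only want the items unique to the other list (as in the question),
        # use  other_set - user_set  instead
        uncom_items = user_set ^ other_set
        uncom_items_rate = len(uncom_items) / union_size
        return user, com_items_rate, uncom_items, uncom_items_rate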

    --- Edit w/ example

    So I think your hope of keeping all of it in a data frame is too ambitious. Perhaps you can do what is needed without storing it in a data frame at all; a dictionary makes sense, and you may even be able to compute things "on the fly" and not store the data. Anyhow, here is a toy example that shows the memory problem using 4K users and 10K "other lists". Of course the size of the intersected sets will make this vary, but it is informative:

    import sys
    import pandas as pd
    
    # create list of users by index
    users = list(range(4000))
    
    match_data = list()
    
    size_list2 = 10_000
    
    for user in users:
        for t in range(size_list2):
            match_data.append(( user, (1,5,6,9), 0.55))   # 4 dummy matches and fake percentage
    
    
    print(match_data[:4])
    print(f'size of match: {sys.getsizeof(match_data)/1_000_000} MB')
    
    df = pd.DataFrame(match_data)
    
    print(df.head())
    
    print(f'size of dataframe {sys.getsizeof(df)/1_000_000} MB')
    

    This yields the following:

    [(0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55), (0, (1, 5, 6, 9), 0.55)]
    size of match: 335.072536 MB
       0             1     2
    0  0  (1, 5, 6, 9)  0.55
    1  0  (1, 5, 6, 9)  0.55
    2  0  (1, 5, 6, 9)  0.55
    3  0  (1, 5, 6, 9)  0.55
    4  0  (1, 5, 6, 9)  0.55
    size of dataframe 3200.00016 MB
    

    You can see that even a nutshell version of your idea, with only 10K other lists, comes to 3.2 GB in a dataframe. This will be unmanageable.

    Here is an idea for a data structure just to use dictionaries all the way.

    del df
    
    # just keep it in a dictionary
    data = {}   # intended format:  key= (usr, other_list) : value= [common elements]
    
    # some fake data
    user_items = {  1: {2,3,5,7,99},
                    2: {3,5,88,790},
                    3: {2,4,100} }
    
    # some fake "list 2 data"
    list2 = [   {1,2,3,4,5},
                {88, 100},
                {11, 13, 200}]
    
    for user in user_items.keys():
        for idx, other_set in enumerate(list2):     # using enumerate to get the index of the other list
            common_elements = user_items.get(user) & other_set   # set intersection
            if common_elements:  # only put it into the dictionary if it is not empty
                data[(user, idx)] = common_elements
    
    # try a couple data pulls
    print(f'for user 1 and other list 0: {data.get((1, 0))}')
    print(f'for user 2 and other list 2: {data.get((2, 2))}')  # use .get() to be safe.  It will return None if no entry
    

    The output here is:

    for user 1 and other list 0: {2, 3, 5}
    for user 2 and other list 2: None
    

    Your other alternative, if you are going to be working with this data a lot, is to put these tables into a database such as sqlite, which is built in and won't bomb out your memory.
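
    As a rough sketch of that sqlite idea (reusing the toy user_items and list2 from the dictionary example above; the table and column names are just placeholders):

    import sqlite3

    # the same toy data as in the dictionary example
    user_items = {1: {2, 3, 5, 7, 99}, 2: {3, 5, 88, 790}, 3: {2, 4, 100}}
    list2 = [{1, 2, 3, 4, 5}, {88, 100}, {11, 13, 200}]

    conn = sqlite3.connect("matches.db")   # on disk, so results never have to fit in RAM
    conn.execute("""CREATE TABLE IF NOT EXISTS matches (
                        user INTEGER, list_idx INTEGER,
                        common_items TEXT, rate REAL)""")

    for user, user_set in user_items.items():
        rows = []
        for idx, other_set in enumerate(list2):
            common_elements = user_set & other_set
            if common_elements:   # only store non-empty matches, as in the dictionary version
                rate = len(common_elements) / len(user_set | other_set)
                rows.append((user, idx, ",".join(map(str, sorted(common_elements))), rate))
        conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?)", rows)
        conn.commit()              # commit per user so memory use stays flat

    print(conn.execute("SELECT * FROM matches LIMIT 3").fetchall())
    conn.close()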