Tags: python, performance, list, large-data

Python large list (~150M lists inside with 4 string elements each) conversion speeding up


With the code below, performance is very poor (over 24 hours) on a large list (~150M elements, each a list of 4 strings). I need to delete around 66M of those elements:

def erase_random_elements(elements_list, iterable_random_numbers):
    # Delete from the highest index down so earlier pops don't shift later indices.
    for i in sorted(iterable_random_numbers, reverse=True):
        elements_list.pop(i)
    return elements_list

It seems I've got enough RAM for it, so I don't need to chunk the list. How can I do it faster?
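For reference, here is a small reproducible sketch of my setup with the sizes scaled way down (100k elements instead of ~150M, deleting 40% of them); the data and sizes are made up for illustration:

import random
import time

# Scaled-down reproduction: each element is a list of 4 strings, as in the real data.
data = [["a", "b", "c", "d"] for _ in range(100_000)]
to_delete = random.sample(range(len(data)), 40_000)

start = time.perf_counter()
erase_random_elements(data, to_delete)
print(f"deleted {len(to_delete)} items in {time.perf_counter() - start:.2f}s")

Even at this size the per-pop shifting is noticeable, and it gets much worse as the list grows.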


Solution

  • Don't use list.pop(i): each call takes O(N) time (see the "Delete Item" row in the Python time-complexity table at https://wiki.python.org/moin/TimeComplexity). Since you have enough memory, you can build a new list instead:

    def erase_random_elements(elements_list, iterable_random_numbers):
        s = set(iterable_random_numbers)
        elements_list[:] = [x for i, x in enumerate(elements_list) if i not in s]
        # return elements_list  -> not required, because the list is updated in-place
    

    Note that elements_list[:] is used here because your original function updates the existing list in place rather than creating a new one, which means all existing references to that list stay valid. If that isn't required, you can simply drop the [:], un-comment the return line, and assign the function's return value to a variable; both variants are sketched below.
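    A minimal sketch of both variants, using made-up names and sizes for illustration:

    import random

    # In-place variant: callers holding a reference to `records` see the change.
    records = [["a", "b", "c", "d"] for _ in range(1_000)]
    doomed = random.sample(range(len(records)), 400)
    erase_random_elements(records, doomed)   # mutates records via records[:]
    print(len(records))                      # 600

    # Return-based variant: drop the [:], restore the return, and rebind the name.
    def erase_random_elements_new(elements_list, iterable_random_numbers):
        s = set(iterable_random_numbers)
        return [x for i, x in enumerate(elements_list) if i not in s]

    records = erase_random_elements_new(records, random.sample(range(len(records)), 100))
    print(len(records))                      # 500

    Either way, the whole operation is a single O(N) pass with O(1) set lookups, instead of ~66M O(N) pops.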