I have a nested list that has the following structure:
mylist = [['A', 'Car', '15'], ['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16'], ['A', 'Boat', '16']]
It's super long, with around 10 million elements. And I have many of these lists. What I want to do is:
If the third items (the string numbers) of consecutive elements of mylist are duplicates, keep only the first element of each such run and remove the rest.
For example, ['A', 'Car', '15'] and ['A', 'Car', '15'] are consecutive elements of mylist and both contain '15', so they are consecutive duplicates, and one should be removed.
Similarly, ['A', 'Car', '16'] and ['A', 'Boat', '16'] are consecutive and both contain '16', so one should be removed.
So, what I would end up with is:
newlist = [['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16']]
I initially wrote this:
newlist = []
for ele in mylist:
    eleindex = mylist.index(ele)  # linear search on every iteration
    previousele = mylist[eleindex-1]
    if float(ele[2]) != float(previousele[2]):
        newlist.append(ele)
Unfortunately, the code I wrote took way too long for such long lists. So, I began looking online and learned that the itertools library (using groupby) is useful and very fast at doing these kinds of things. I then found some examples that I tried emulating; however, they were mainly for simple lists, not something a little more complicated like my situation. After tinkering around, I wasn't able to figure out how to use it for my nested lists.
So, does anyone know how to do this very quickly? Also, if you have a solution that will be faster than itertools, that's even better!
A solution with itertools.groupby:
from itertools import groupby
mylist = [['A', 'Car', '15'], ['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16'], ['A', 'Boat', '16']]
out = [next(g) for _, g in groupby(mylist, lambda k: k[2])]
print(out)
Prints:
[['A', 'Car', '15'], ['A', 'Plane', '16'], ['A', 'Bike', '20'], ['A', 'Car', '16']]
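To see why taking next(g) from each group is enough, it may help to look at the groups themselves. groupby only merges adjacent items whose keys are equal, so the two separate runs of '16' stay separate. A small sketch on the same sample data (the name groups is only for illustration):

```python
from itertools import groupby

mylist = [['A', 'Car', '15'], ['A', 'Car', '15'], ['A', 'Plane', '16'],
          ['A', 'Bike', '20'], ['A', 'Car', '16'], ['A', 'Boat', '16']]

# Materialize each run of adjacent rows that share the same third item.
groups = [(key, list(group)) for key, group in groupby(mylist, lambda k: k[2])]

for key, rows in groups:
    print(key, rows)
```

The keys come out as '15', '16', '20', '16', one per run, and the first row of each run is exactly what the list comprehension above keeps.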
Benchmark (with 10_000_000 item list):
from timeit import timeit
from random import randint
from itertools import groupby
mylist = []
for i in range(10_000_000):
    mylist.append(['X', 'X', str(randint(0, 20))])

def f1():
    out = [next(g) for _, g in groupby(mylist, lambda k: k[2])]
    return out

t1 = timeit(lambda: f1(), number=1)
print(t1)
This prints on my machine (AMD 2400G, Python 3.8):
2.408908904006239
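As for alternatives to itertools, one candidate is a plain loop that remembers the previous key. This is only a sketch: the function name drop_consecutive_duplicates and its index parameter are made up for illustration, and it is not timed here, so whether it beats groupby is something to measure with timeit on your own data.

```python
def drop_consecutive_duplicates(rows, index=2):
    """Keep the first row of every run of adjacent rows with equal rows[i][index]."""
    out = []
    previous = object()  # sentinel that never compares equal to a real key
    for row in rows:
        key = row[index]
        if key != previous:
            out.append(row)
            previous = key
    return out


mylist = [['A', 'Car', '15'], ['A', 'Car', '15'], ['A', 'Plane', '16'],
          ['A', 'Bike', '20'], ['A', 'Car', '16'], ['A', 'Boat', '16']]
newlist = drop_consecutive_duplicates(mylist)
print(newlist)
```

Like the groupby version, this compares the third items as strings. If values such as '15' and '15.0' should count as equal, as the float() comparison in the question suggests, convert the key first, for example with float(row[index]) here or lambda k: float(k[2]) as the groupby key.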