I have a big list of tuples (20-30 million entries). Each sub-tuple has three pieces of information (start, end, value)
and is supposed to only represent one position in a sequence. However, the program generating this data has concatenated some of the values of adjacent positions if the value is the same. Thus they look a like this:
[..., (71, 72, -0.2998250126838684),
(72, 73, -0.2858070135116577),
(73, 74, -0.29049500823020935),
(74, 75, -0.3044680058956146),
(75, 76, -0.28386199474334717),
(76, 80, -0.27730199694633484), # PROBLEM: end - start > 1
(80, 81, -0.2726449966430664),
(81, 82, -0.26151400804519653),
(82, 84, -0.2679719924926758), # PROBLEM: end - start > 1
(84, 85, -0.24273300170898438),
(85, 86, -0.23799900710582733),
(86, 87, -0.24745100736618042),
(87, 88, -0.2568419873714447),
(88, 89, -0.2585819959640503), ...]
To fix this I would like to create new entries in the list which separate these tuples which represent multiple positions into new tuples which represent just one position. I thus desire an output like this:
(..., (71, 72, -0.2998250126838684),
(72, 73, -0.2858070135116577),
(73, 74, -0.29049500823020935),
(74, 75, -0.3044680058956146),
(75, 76, -0.28386199474334717),
(76, 77, -0.27730199694633484), # New
(77, 78, -0.27730199694633484), # New
(78, 79, -0.27730199694633484), # New
(79, 80, -0.27730199694633484), # New
(80, 81, -0.2726449966430664),
(81, 82, -0.26151400804519653),
(82, 83, -0.2679719924926758), # New
(83, 84, -0.2679719924926758), # New
(84, 85, -0.24273300170898438),
(85, 86, -0.23799900710582733),
(86, 87, -0.24745100736618042),
(87, 88, -0.2568419873714447),
(88, 89, -0.2585819959640503), ...)
To do this, with my list of tuples called bwList
I have done the following:
replacementTuple = ()
for t in bwList:
if t[1] - t[0] == 1:
replacementTuple = replacementTuple + (t,)
else:
numNewTuples = t[1] - t[0]
st, ed = range(t[0], t[1]), range(t[0] + 1, t[1] + 1)
for m in range(numNewTuples):
replacementTuple = replacementTuple + ((st[m], ed[m], t[2]) ,)
Here the output is a tuple of tuples, not a list. I don't mind too much either way.
This approach appears to work, but is very slow!
Is there a way to speed this up?
Dramatically simpler (and likely faster, mostly by virtue of amortized O(1)
appends) would be to build a list
that expands each tuple
(possibly to the same tuple
just once, but without special-casing):
replacement_list = [(newstart, newstart+1, value)
for start, end, value in bwList
for newstart in range(start, end)]
You can convert that to a tuple
at the end if you like, but building it as a list
first will save a lot of trouble.