Say I have a string:
teststring = "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
That I would like as:
testlist = ["1.3 Hello how are you", "1.4 I am fine, thanks 1.2 Hi There", "1.5 Great!"]
Basically, I want to split only on increasing numbers where the difference is .1 (e.g. 1.2 to 1.3).
Is there a way to split this with regex, but only capturing increasing sequential numbers? I wrote code in Python that iterates through the sequence using a custom re.compile() for each number; it works okay but is extremely unwieldy.
Something like this (where parts1_temp is a given list of the x.x numbers in the string):
import re
parts1_temp = ['1.3', '1.4', '1.2', '1.5']
major, minor = parts1_temp[0].split('.')
parts_num = range(int(minor), int(minor) + 30)
parts_search = ['.'.join([major, str(parts_num_el)]) for parts_num_el in parts_num]
# parts_search should be ['1.3', '1.4', '1.5', ..., '1.32']
parts_fin = []
for k in range(len(parts_search) - 1):
    # Lazily match from one number up to (not including) the next number in the sequence
    rxtemp = re.compile(r"(?:" + re.escape(parts_search[k]) + r")([\s\S]*?)(?=(?:" + re.escape(parts_search[k + 1]) + r"))", re.MULTILINE)
    parts_fin.extend(match.group(0) for match in rxtemp.finditer(teststring))
But man, is it ugly. Is there a way to do this more directly in regex? I imagine this is a feature that someone would have wanted at some point with regex, but I can't find any ideas on how to tackle this (and maybe it is not possible with pure regex).
This method uses finditer to find all locations of \d+\.\d+, then tests whether each match is numerically greater than the number at the last split point. If the test is true, it appends the match's start index to the indices list.
The last line uses a list comprehension, as taken from this answer, to split the string at those indices.
This method only requires that the current number be larger than the last one that was split on. It doesn't check that the numbers are sequential; it works purely on number size. So for a string containing the numbers 1.1, 1.2, 1.4, it would split on each occurrence, since each number is larger than the last.
import re
indices = []
string = "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
regex = re.compile(r"\d+\.\d+")
lastFloat = 0
for m in regex.finditer(string):
    x = float(m.group())
    # Split only when this number is larger than the last one split on
    if lastFloat < x:
        lastFloat = x
        indices.append(m.start(0))
# Slice the string between consecutive split indices (None means "to the end")
print([string[i:j] for i,j in zip(indices, indices[1:]+[None])])
Outputs: ['1.3 Hello how are you ', '1.4 I am fine, thanks 1.2 Hi There ', '1.5 Great!']
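To check the claim about 1.1, 1.2, 1.4, here is a minimal sketch that wraps the same logic in a function and runs it on a made-up string. The function name split_on_increasing and the sample text are only illustrative, not part of the original code.

```python
import re

NUMBER = re.compile(r"\d+\.\d+")

def split_on_increasing(text):
    """Split text at every number larger than the last number split on."""
    indices = []
    last = 0.0
    for m in NUMBER.finditer(text):
        x = float(m.group())
        if last < x:
            last = x
            indices.append(m.start())
    # Slice between consecutive split indices; None means "to the end"
    return [text[i:j] for i, j in zip(indices, indices[1:] + [None])]

# 1.4 is larger than 1.2, so it is split on even though 1.3 is missing
print(split_on_increasing("1.1 a 1.2 b 1.4 c"))
# ['1.1 a ', '1.2 b ', '1.4 c']
```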
This method is very similar to the first; however, in the case of 1.1, 1.2, 1.4, it wouldn't split on 1.4, since 1.4 doesn't follow 1.2 sequentially given the .1 step.
The method below differs only in the if statement, so this logic is fairly customizable to whatever your needs may be.
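import re
indices = []
string = "1.3 Hello how are you 1.4 I am fine, thanks 1.2 Hi There 1.5 Great!"
regex = re.compile(r"\d+\.\d+")
lastFloat = 0
for m in regex.finditer(string):
    x = float(m.group())
    # Always split on the first number; afterwards only on last + .1
    # (round() guards against floating-point error in the addition)
    if (lastFloat == 0) or (x == round(lastFloat + .1, 1)):
        lastFloat = x
        indices.append(m.start(0))
# Slice the string between consecutive split indices (None means "to the end")
print([string[i:j] for i,j in zip(indices, indices[1:]+[None])])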
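One possible customization, if the numbers are really section labels rather than decimals (the question's own code generates 1.3 up to 1.32, which suggests this): with floats, 1.10 is read as 1.1, so it would not count as the successor of 1.9. The sketch below, under that assumption, compares the integer parts instead. The name split_on_sequential and the sample strings are hypothetical.

```python
import re

# Capture the part before and after the dot separately
NUMBER = re.compile(r"(\d+)\.(\d+)")

def split_on_sequential(text):
    """Split where the minor part is exactly one more than at the last split."""
    indices = []
    last = None  # (major, minor) of the last split point
    for m in NUMBER.finditer(text):
        current = (int(m.group(1)), int(m.group(2)))
        # Always split on the first number; afterwards only on the next label
        if last is None or current == (last[0], last[1] + 1):
            last = current
            indices.append(m.start())
    return [text[i:j] for i, j in zip(indices, indices[1:] + [None])]

print(split_on_sequential("1.1 a 1.2 b 1.4 c"))  # ['1.1 a ', '1.2 b 1.4 c']
print(split_on_sequential("1.9 a 1.10 b"))       # ['1.9 a ', '1.10 b']
```

Working with integers also avoids the need for round(), since no floating-point addition is involved.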