I have a list of reference IDs which are alphanumeric. They have 3 digits which are zero-padded to the left, followed by a letter, followed by 3 more digits, again, zero padded to the left.
e.g.
original_ref_list = ["005a004",
"018b003",
"007a029",
"105a015"]
As you can see, both sets of digits are padded with zeros. I want to get the same references without the zero padding on either side of the letter, but not to remove all zeros.
e.g.
fixed_ref_list = ["5a4",
"18b3",
"7a29",
"105a15"]
I can do this by searching for three regex patterns, combining the results and appending this to a list:
import re

fixed_ref_list = list()
for i in original_ref_list:
    # Drop the leading zeros of the first number: match from its first non-zero digit onward
    first_refpat = re.compile(r'[1-9]\d*[a-z]\d+')
    first_refpatiter = first_refpat.finditer(i)
    for first_ref_find in first_refpatiter:
        first_ref = first_ref_find.group()
        # Isolate the (still padded) digits that follow the letter
        second_refpat = re.compile(r'[a-z]\d+')
        second_refpatiter = second_refpat.finditer(first_ref)
        for second_ref_find in second_refpatiter:
            second_ref = second_ref_find.group()[1:]
            # Drop the leading zeros of the second number
            third_refpat = re.compile(r'[1-9]\d*')
            third_refpatiter = third_refpat.finditer(second_ref)
            for third_ref_find in third_refpatiter:
                third_ref = third_ref_find.group()
                fixed_ref_list.append(first_ref[:-len(second_ref)] + third_ref)
But this seems like an awkward solution. Is there a built in way to return only part of a regex pattern, or to remove the padding before returning the result? Alternatively, is there any way to do what I want that's less messy?
You can just group your matches using parentheses like this:
re.match('([0-9a-f]{3})([0-9a-f])([0-9a-f]{3})', '005a004').groups()
> ('005', 'a', '004')
Now you have a tuple to work with. To remove the leading zeros, match them with the pattern ^0+ (the ^ anchor marks the beginning of the string) and replace them with an empty string '':
re.sub('^0+', '', '004')
> '4'
That should give you all you need to make this more compact and readable.
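Putting the two steps together, a minimal sketch could look like the following. The helper name strip_padding and the constant REF_PATTERN are hypothetical, and the pattern assumes every reference is exactly three digits, one lowercase letter, then three digits, as described in the question.

```python
import re

# Assumed format: three digits, one lowercase letter, three digits.
REF_PATTERN = re.compile(r'(\d{3})([a-z])(\d{3})')


def strip_padding(ref):
    """Return ref with the leading zeros removed from both digit groups."""
    match = REF_PATTERN.fullmatch(ref)
    if match is None:
        raise ValueError(f"unexpected reference format: {ref!r}")
    left, letter, right = match.groups()
    # '^0+' only matches zeros at the start, so interior and trailing zeros survive.
    return re.sub(r'^0+', '', left) + letter + re.sub(r'^0+', '', right)


original_ref_list = ["005a004", "018b003", "007a029", "105a015"]
fixed_ref_list = [strip_padding(ref) for ref in original_ref_list]
print(fixed_ref_list)  # ['5a4', '18b3', '7a29', '105a15']
```

Note that a group consisting only of zeros, such as '000', would be reduced to an empty string by this approach, so if that can occur in your data you may want to handle it separately.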