python, regex, zero-padding

Is it possible to use regular expressions to find a pattern which is padded with zeros, and return the value without padding?


I have a list of alphanumeric reference IDs. Each one consists of three digits, left-padded with zeros, followed by a letter, followed by three more digits, again left-padded with zeros.

e.g.

original_ref_list = ["005a004",
                     "018b003",
                     "007a029",
                     "105a015"]

As you can see, both sets of digits are padded with zeros. I want the same references with the leading zeros stripped from each group of digits, while keeping any zeros that are not padding.

e.g.

fixed_ref_list = ["5a4",
                  "18b3",
                  "7a29",
                  "105a15"]

I can do this by searching for three regex patterns, combining the results and appending this to a list:

import re

fixed_ref_list = list()
for i in original_ref_list:
    # Drop the left padding: match from the first non-zero digit onwards.
    first_refpat = re.compile(r'[1-9]\d*[a-z]\d+')
    first_refpatiter = first_refpat.finditer(i)
    for first_ref_find in first_refpatiter:
        first_ref = first_ref_find.group()
        # Isolate the letter plus the (still padded) right-hand digits.
        second_refpat = re.compile(r'[a-z]\d+')
        second_refpatiter = second_refpat.finditer(first_ref)
        for second_ref_find in second_refpatiter:
            second_ref = second_ref_find.group()[1:]
            # Drop the padding from the right-hand digits.
            third_refpat = re.compile(r'[1-9]\d*')
            third_refpatiter = third_refpat.finditer(second_ref)
            for third_ref_find in third_refpatiter:
                third_ref = third_ref_find.group()
    fixed_ref_list.append(first_ref[:-len(second_ref)] + third_ref)

But this seems like an awkward solution. Is there a built-in way to return only part of a regex match, or to remove the padding before returning the result? Alternatively, is there a less messy way to do what I want?


Solution

  • You can group the parts of your match using parentheses, like this:

    re.match(r'(\d{3})([a-z])(\d{3})', '005a004').groups()
    > ('005', 'a', '004')
    

    Now you have a tuple to work with. To remove the leading zeros from a group, match them with the pattern ^0+, where ^ anchors the match at the beginning of the string, and replace them with an empty string '':

    re.sub(r'^0+', '', '004')
    > '4'
    

    That should give you all you need to make this more compact and readable.
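Putting the two steps together, a minimal sketch might look like the following. It assumes every reference has exactly the shape described in the question (three digits, one lowercase letter, three digits). The names REF_PATTERN, strip_padding and strip_padding_sub are made up for this illustration. The first function uses int() rather than re.sub to drop the zeros, which also keeps a single 0 when a group is all zeros. The second is an alternative that does it in a single substitution using lookarounds.

```python
import re

# Assumed shape of a reference: three digits, one lowercase letter, three digits.
REF_PATTERN = re.compile(r'(\d{3})([a-z])(\d{3})')


def strip_padding(ref):
    """Return ref with the leading zeros removed from both digit groups."""
    match = REF_PATTERN.fullmatch(ref)
    if match is None:
        raise ValueError(f"unexpected reference format: {ref!r}")
    left, letter, right = match.groups()
    # int() drops leading zeros and keeps a single 0 for an all-zero group.
    return f"{int(left)}{letter}{int(right)}"


def strip_padding_sub(ref):
    """Alternative: remove zeros that start a run of digits, in one substitution."""
    # (?<!\d) = not preceded by a digit, (?=\d) = still followed by a digit,
    # so zeros inside a number (the 0 in 105) and a lone final 0 are kept.
    return re.sub(r'(?<!\d)0+(?=\d)', '', ref)


original_ref_list = ["005a004", "018b003", "007a029", "105a015"]
fixed_ref_list = [strip_padding(ref) for ref in original_ref_list]
print(fixed_ref_list)  # ['5a4', '18b3', '7a29', '105a15']
```

The first version raises an error on anything that does not fit the expected format, which may or may not be what you want. The second leaves non-matching text untouched.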