Tags: python, regex, leading-zero

Remove leading zeros in middle of string with regex


I have a large number of strings in the format YYYYYYYYXXXXXXXXZZZZZZZZ, where X, Y, and Z are numbers of fixed length (eight digits each). I need to parse out the middle sequence of digits and remove any leading zeros. Unfortunately, the only way to determine where each of the three sequences begins and ends is to count the number of digits.

I am currently doing it in two steps, i.e.:

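import re

m = re.match(
    r"(?P<first_sequence>\d{8})"
    r"(?P<second_sequence>\d{8})"
    r"(?P<third_sequence>\d{8})",
    string)
second_sequence = m.group("second_sequence")
# lstrip takes a string of characters to strip and returns a new string
second_sequence = second_sequence.lstrip("0")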

Which does work, and gives me the right results, e.g.:

112233441234567855667788 --> 12345678
112233440012345655667788 --> 123456
112233001234567855667788 --> 12345678
112233000012345655667788 --> 123456

But is there a better method? Is it possible to write a single regular expression that matches the second sequence, sans the leading zeros?

I guess I am looking for a regex which does the following:

  1. Skips over the first eight digits.
  2. Skips any leading zeros.
  3. Captures anything after that, up to the point where there are sixteen characters behind and eight in front.

The above solution does work, as mentioned, so the purpose of this problem is more to improve my knowledge of regex. I appreciate any pointers.
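The three steps listed above can be sketched as one anchored pattern. This is only an illustration, assuming every input is exactly 24 ASCII digits; the names `MIDDLE_RE` and `middle_without_zeros` are made up for the example.

```python
import re

# Hypothetical names, for illustration only.
# (?=\d{24}$)  check that the string is 24 digits long before consuming anything
# \d{8}        step 1: skip the first eight digits
# 0*           step 2: skip any leading zeros of the middle sequence
# (\d*)        step 3: capture what is left of the middle sequence
# \d{8}        the last eight digits; fullmatch forces them to end the string
MIDDLE_RE = re.compile(r"(?=\d{24}$)\d{8}0*(\d*)\d{8}")


def middle_without_zeros(s):
    """Return the middle eight digits of s with leading zeros removed."""
    m = MIDDLE_RE.fullmatch(s)
    if m is None:
        raise ValueError("expected exactly 24 digits: {!r}".format(s))
    return m.group(1)


for s in ("112233441234567855667788", "112233440012345655667788"):
    print(s, "-->", middle_without_zeros(s))
```

Because the total length is pinned to 24 and the trailing `\d{8}` must end the string, the zeros and the capture group together always cover exactly the middle eight digits. A middle sequence of all zeros yields an empty string, the same as `lstrip("0")` would.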


Solution

  • Just to show that it is possible with regex:

    https://regex101.com/r/8RSxaH/2

    # Adapted from the code generated by regex101.com (see link above)
    import re

    # Lookbehind: eight digits must precede the match.
    # Group 1: the whole second sequence; group 2: the same without leading zeros.
    # Lookahead: eight digits must follow the match.
    regex = r"(?<=\d{8})((?:0*)(\d{,8}))(?=\d{8})"

    test_str = ("112233441234567855667788\n"
        "112233440012345655667788\n"
        "112233001234567855667788\n"
        "112233000012345655667788")

    for line in test_str.splitlines():
        # re.search returns the leftmost match, which starts right after the first eight digits
        match = re.search(regex, line)
        print("{line} --> {result}".format(line=line, result=match.group(2)))
    

    That said, you don't really need a regex to do what you're asking.
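Since the field widths are fixed, plain slicing is one way to do it without a regex. The following is a minimal sketch, assuming each input is exactly 24 ASCII digits; the function name `middle_by_slicing` is hypothetical.

```python
def middle_by_slicing(s):
    """Return characters 8..15 of s with leading zeros removed."""
    # Assumption: inputs are exactly 24 ASCII digits; reject anything else.
    if len(s) != 24 or not (s.isascii() and s.isdigit()):
        raise ValueError("expected exactly 24 digits: {!r}".format(s))
    # s[8:16] is the middle sequence; lstrip("0") drops its leading zeros.
    return s[8:16].lstrip("0")


for s in ("112233001234567855667788", "112233000012345655667788"):
    print(s, "-->", middle_by_slicing(s))
```

If the numeric value is what you need rather than the digit string, `int(s[8:16])` also discards the leading zeros.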