python, python-re

Find and replace numeric IDs (which can have decimal points)


I have a bunch of numeric IDs that I need to renumber with new numeric IDs:

id="12.03"

id="23.343.Fdf--"

id="12-B.fdas7232"

id="12."

id="1."

id="1.-2"

id="2.02-R.-vdfs--erev-j"

id="48-34JJf"

id="5.01-G.f"

Using this regex:

 id="[1-9]\d*(\.\d+)?

at https://regexr.com/, I am able to get the correct matches.

However, when I run the Python script, it fails. I think it has to do with the capturing groups returning too many values.

Here are two examples of the printed output:

(' id="5.01', ' id="', '5.01', '.01')
(' id="48', ' id="', '48', '')

I don't know how to stop it from returning the 4th value '.01' or '' in the above 2 examples.

I get this error: too many values to unpack (expected 3)

I've tried several different Regex variations to try to get it to return a single string, like adding additional parentheses, ^ and $ to mark the beginning and end of the string, etc.

    PID_REPLACEMENTS = {
        "48": '9',
        "23.343": '8',
        "12.03": '7',
        "12": '6',
        "5.01": '5',
        "2.02": '4',
        "1": '3.08'}

    my_text = substitute_oldid_index(my_text)

    def substitute_oldid_index(my_text):
        return substitute_newid(r"""((?P<pre> id=")(?P<post>[1-9]\d*(\.\d+)?))""", my_text)


    def substitute_newid(findallnewid_regex, my_text):
        data_oldids = re.findall(findallnewid_regex, my_text, re.I)

        print(data_oldids)

        for combined, pre, post in data_oldids:
            if post.title() not in PID_REPLACEMENTS:
                continue

            my_text = re.sub(combined, "{}{}".format(pre, PID_REPLACEMENTS[post.title()]), my_text)

        return my_text
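
A note on the error above: re.findall returns one tuple item per capturing group, so the nested (\.\d+)? group appears to be what adds the fourth item. The following is a minimal sketch, assuming the only goal is to get three-item tuples back; it makes that inner group non-capturing with (?:...). The sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample text, only for illustration.
sample = ' id="5.01-G.f" id="48-34JJf"'

# Pattern from the question: four capturing groups, so findall yields 4-tuples.
with_capture = r'((?P<pre> id=")(?P<post>[1-9]\d*(\.\d+)?))'

# Same pattern with the optional decimal part made non-capturing.
without_capture = r'((?P<pre> id=")(?P<post>[1-9]\d*(?:\.\d+)?))'

four = re.findall(with_capture, sample)
three = re.findall(without_capture, sample)

print(four)   # tuples of (combined, pre, post, decimal part)
print(three)  # tuples of (combined, pre, post)
```

With three-item tuples, the loop header `for combined, pre, post in data_oldids:` unpacks cleanly. This only addresses the unpacking error, not the replacement logic itself.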

Is there a better way to find numeric IDs (which may contain decimal points, and may be followed by additional periods or text that should remain unchanged) and replace them with new numeric IDs (which may or may not contain decimal points)? I assume we want to process them in descending order so that lower numbers aren't found more than once?

Is there a way to fix my regex and script to achieve this goal?


As a follow-up question, I have a bunch of ranges in a spreadsheet that need to be converted to the new ID numbers.

EXAMPLE 1: 5.01-48; 151.01-168; 224-382; 415-510; 218-249

EXAMPLE 2: 128-211; 257-281; 386-401

Is there a way to search these numbers and replace them with a new number?

For example, find 5.01 and replace it with 5, as in the dictionary above.


Solution

  • I think you're making this harder than it needs to be with the pre- and post-matches. Why not just look for the digits that follow id=", optionally followed by a dot and more digits, and replace them if that string is in your dictionary? This does that:

    import re
    
    # Old ID -> new ID. The pattern below also consumes a trailing dot,
    # so the key for 1 is written as "1.".
    PID_REPLACEMENTS = {
        "48": '9',
        "23.343": '8',
        "12.03": '7',
        "12": '6',
        "5.01": '5',
        "2.02": '4',
        "1.": '3.08'}
    
    # Raw string so the backslashes in the sample are kept as written.
    sample = r"""
    id="12.03"         12.03
    id="23.343.Fdf--"  23.343
    id="12-B.fdas7232"
    id="12."           12.
    id="1."            1.
    id="1.-2"
    id="2.02-R.-vdfs--erev-j"
    id="48-34JJf"
    id="5.01-G.f"      5.01
    id="[1-9]\d*(\.\d+)?
    EXAMPLE 1: 5.01-48; 151.01-168; 224-382; 415-510; 218-249
    EXAMPLE 2: 128-211; 257-281; 386-401
    """
    
    def subst(m):
        # Return the new ID if the matched text is a known key, else leave it as-is.
        m = m.group(0)
        return PID_REPLACEMENTS.get(m, m)
    
    def substitute_newid(my_text):
        # The lookbehind restricts matches to numbers that directly follow id=".
        return re.sub(r'(?<=id=")\d+(\.\d*)?', subst, my_text)
    
    print( substitute_newid(sample) )
    

    Output:

    
    id="7"         12.03
    id="8.Fdf--"  23.343
    id="6-B.fdas7232"
    id="12."           12.
    id="3.08"            1.
    id="3.08-2"
    id="4-R.-vdfs--erev-j"
    id="9-34JJf"
    id="5-G.f"      5.01
    id="[1-9]\d*(\.\d+)?
    EXAMPLE 1: 5.01-48; 151.01-168; 224-382; 415-510; 218-249
    EXAMPLE 2: 128-211; 257-281; 386-401
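
For the follow-up about ranges, the same callback idea seems to carry over once the id=" lookbehind is dropped, which is why the EXAMPLE lines above come through unchanged. Here is a minimal sketch, assuming the question's original keys ("1" rather than "1."), that each spreadsheet cell holds only the ranges (no "EXAMPLE 1:" label), and that numbers missing from the dictionary should stay as they are. The names NUMBER and replace_ids_in_ranges are hypothetical.

```python
import re

# The mapping from the question (old ID -> new ID).
PID_REPLACEMENTS = {
    "48": "9", "23.343": "8", "12.03": "7", "12": "6",
    "5.01": "5", "2.02": "4", "1": "3.08",
}

# A run of digits with an optional decimal part, matched as one token.
NUMBER = re.compile(r"\d+(?:\.\d+)?")

def replace_ids_in_ranges(cell_text):
    """Replace every number that is a key in the mapping; leave the rest as-is."""
    return NUMBER.sub(
        lambda m: PID_REPLACEMENTS.get(m.group(0), m.group(0)),
        cell_text,
    )

print(replace_ids_in_ranges("5.01-48; 151.01-168; 224-382; 415-510; 218-249"))
print(replace_ids_in_ranges("128-211; 257-281; 386-401"))
```

Because re.sub makes a single left-to-right pass and does not rescan the text it has just inserted, a new ID such as 3.08 is never matched again, so the order of the dictionary entries should not matter here.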