python, pandas, dataframe, replace, str-replace

Replacing numerical IDs contained as lists of strings and lists of ints in a Pandas DataFrame


To anonymize the data, I need to replace the original IDs with a new set of IDs, while making sure that identical original IDs still match across all fields after replacement. The challenge is doing this elegantly across 4 different presentations of IDs in this Pandas DataFrame.

I have real-world data where numerical IDs are found in 4 possible formats:

  1. Within a list of strings: '["38", "15", "42"]'
  2. Within a list of numbers: [14, 42, 94]
  3. As an integer: 42
  4. As a float: 1.0

Here is a generalized small DataFrame with all 4 data types.

import pandas as pd

df = pd.DataFrame(
    [['["38", "15", "42"]', [14, 42, 94], 42, 1.0],
     ['["8", "28"]', [1, 4], 8, 94.0],
     ['["12"]', [12], 12, 12.0]],
    columns=['CommentsID', 'AgentID', 'CaseID', 'TicketID'])

df
| CommentsID            | AgentID         | CaseID  | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['38', '15', '42']    | [14, 42, 94]    | 42      | 1.0      |
| ['8', '28']           | [1, 4]          | 8       | 94.0     |
| ['12']                | [12]            | 12      | 12.0     |

For ease of use in this generalized example, I'm just adding 100 to generate a list of 'new IDs'. In the actual problem, however, the corresponding new IDs are randomly generated, so simply adding 100 everywhere is not a valid solution.

orig_ids = list(range(100))
new_ids = [x + 100 for x in orig_ids]
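
(Just to illustrate what "randomly generated" means: the real mapping is built roughly along these lines, though the variable names and sampling range below are placeholders, and the rest of this example keeps the +100 IDs.)

import random

# Illustrative sketch only; the rest of this example sticks with the +100 new_ids.
random_new_ids = random.sample(range(1000, 2000), len(orig_ids))  # distinct random new IDs
random_id_map = dict(zip(orig_ids, random_new_ids))               # same original ID -> same new ID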

I would like to find the most efficient way to replace all of the original IDs in the DataFrame with the new IDs across these four data types.

The best solution I have so far is to split the problem into three parts:

  1. Tackle the float and int versions with a replace() function (this does not affect the lists, even with regex=True):
df = df.replace(orig_ids, new_ids)
  2. For the lists of integers, use a far-from-Pythonic doubled for-loop to match indices on the ID lists:
def newIDnumbers(datacol):
    newlist = []
    for i in range(len(datacol)):
        # position of each original ID from this row's list within orig_ids
        numlist = [orig_ids.index(x) for x in datacol[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist

df.AgentID = newIDnumbers(df.AgentID)                     
df
  3. For the lists of strings, build string lists of original IDs and new IDs, then use a far-from-Pythonic doubled for-loop to match indices on the ID lists:
str_orig_ids = [str(x) for x in orig_ids]
str_new_ids = [str(x) for x in new_ids]

def newIDstrings(datacol):
    newlist = []
    for i in range(len(datacol)):
        # extract the quoted IDs from the string, then look up each one's position in str_orig_ids
        numlist = [str_orig_ids.index(x) for x in datacol.str.findall(r'"(\d*)"')[i]]
        newlistrow = []
        for idx in range(len(numlist)):
            newlistrow.append(str_new_ids[numlist[idx]])
        newlist.append(newlistrow)
    return newlist

df.CommentsID = [str(x) for x in newIDstrings(df.CommentsID)]
df       

Does anyone have a more elegant and less computationally heavy way to achieve this output?

df
| CommentsID            | AgentID         | CaseID  | TicketID |
| --------------------- | --------------- | ------- | -------- |
| ['138', '115', '142'] | [114, 142, 194] | 142     | 100.0    |
| ['108', '128']        | [101, 104]      | 108     | 194.0    |
| ['112']               | [112]           | 112     | 112.0    |

Solution

  • Here is one way using reshaping and factorize to anonymize the IDs:

    from ast import literal_eval
    
    def df_factorize(df):
        idx = df.index
        s = df.reset_index(drop=True).stack()  # stack to ensure consistent
        s[:] = s.factorize()[0]                # factors among all columns
        return s.unstack().set_index(idx)
    
    df2 = (df
       .assign(CommentsID=df['CommentsID'].apply(literal_eval))   # parse list-strings into lists
       .explode(['CommentsID', 'AgentID'])                        # lists to rows
       .astype({'CommentsID': int})                               # change str to int
       .pipe(df_factorize)                                        # anonymize
       .groupby(level=0)                                          # below to reshape
       .agg({'CommentsID': lambda s: str(list(s.astype(str))),    # to original form
             'AgentID': list,
             'CaseID': 'first',
             'TicketID': 'first',
            })
       .astype(df.dtypes)                                         # original dtypes
    )
    

    output:

            CommentsID    AgentID  CaseID  TicketID
    0  ['0', '4', '2']  [1, 2, 5]       2       3.0
    1       ['6', '7']     [3, 8]       6       5.0
    2            ['9']        [9]       9       9.0
    

    In case you want "real" anonymization, you could use a uuid:

    from uuid import uuid4
    
    def df_uuid(df):
        idx = df.index
        s = df.reset_index(drop=True).stack()   # stack so all columns share one mapping
        f = s.unique()                          # every distinct ID across the DataFrame
        s = s.map(dict(zip(f, [str(uuid4()) for _ in range(len(f))])))  # one uuid per ID
        return s.unstack().set_index(idx)
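
    Presumably this is then piped into the same pipeline in place of df_factorize (the
    call is not shown explicitly above); a minimal sketch, with df3 as an illustrative name:

    df3 = (df
       .assign(CommentsID=df['CommentsID'].apply(literal_eval))   # parse list-strings into lists
       .explode(['CommentsID', 'AgentID'])                        # lists to rows
       .astype({'CommentsID': int})                               # so "42" and 42 share one uuid
       .pipe(df_uuid)                                             # uuid-based anonymization
       .groupby(level=0)                                          # reshape back to one row per case
       .agg({'CommentsID': lambda s: str(list(s)),                # uuids are already strings
             'AgentID': list,
             'CaseID': 'first',
             'TicketID': 'first',
            })
    )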
    

    output:

                                                                                                                     CommentsID                                                                                                             AgentID                                CaseID                              TicketID
    0  ['e9fa80ed-58e2-4a96-a8da-0bc0c3a87e14', 'c87ca55a-e0a4-4618-ab5a-2eaf68d5a70f', '7cba72a8-3f2a-42e7-a781-b8ed242ea2ac']  [65f95f56-877e-4a9f-be88-4302fe1f77ea, 7cba72a8-3f2a-42e7-a781-b8ed242ea2ac, 3cf48cd7-e1b0-4c7f-bbeb-41836e309511]  7cba72a8-3f2a-42e7-a781-b8ed242ea2ac  549f1768-39b5-4e74-a8dd-4edbf3bb591f
    1                                          ['6e54d7d6-ae2f-4319-950e-7629128d518a', '8fef2e1c-99d1-4460-b8a4-f4dbbf213c5f']                                        [549f1768-39b5-4e74-a8dd-4edbf3bb591f, 98f82da0-fa9c-447a-943b-b11875736128]  6e54d7d6-ae2f-4319-950e-7629128d518a  3cf48cd7-e1b0-4c7f-bbeb-41836e309511
    2                                                                                  ['3396c9db-1244-4b2d-bb76-14bb0221778f']                                                                              [3396c9db-1244-4b2d-bb76-14bb0221778f]  3396c9db-1244-4b2d-bb76-14bb0221778f  3396c9db-1244-4b2d-bb76-14bb0221778f