Search code examples
pandasdataframeformatstring-length

How to change data in a column using the longer length of a string in PANDAS?


I have a dataframe with two columns: name and id. In some cases the names are repeated, and can also be abbreviated, but they'll always have the same ID, example:

     NAMES          ID
 1. Peter Elliot  12345678

 2. Peter Elliot  12345678

 3. Peter E.      12345678

 4. Lucas Kershaw 87654321

 5. Lucas Kershaw 87654321

 6. Lucas Kershaw 87654321

 7. L. Kershaw    87654321

... other's names and id's...

So my objective is to format the names using the longer length of string, using as reference the ID.

The result should be:

     NAMES          ID
 1. Peter Elliot  12345678

 2. Peter Elliot  12345678

 3. Peter Elliot  12345678

 4. Lucas Kershaw 87654321

 5. Lucas Kershaw 87654321

 6. Lucas Kershaw 87654321

 7. Lucas Kershaw 87654321

... other's names and id's...

I've tried something like this:

a = np.asarray(general['ids']

for j in range(len(a)):

    ToF = df['ids'] == a[j]   
    filts = df[ToF] 
    filtdf = filts.loc[filt, ['names']] 
    read_longer_string = filt400.names.astype(str).map(len)

    if read_longer_string.argmax() == read_longer_string.argmin():
        print('nothing should happen')

    elif read_longer_string.argmax() != read_longer_string.argmin():
        longer_string = filtfd.loc[read_longer_string.argmax(), 'names']   
        df['names'] = df['names'].where(filt == False, df['names'].apply(longer_string.format), errors = 'ignore')

Solution

  • Use Series.str.len with groupby.transform:

    df['NAMES'] = (df.loc[df['NAMES']
                     .str.len()
                     .groupby(df['ID'], sort=False)
                     .transform('idxmax'),'NAMES'].values)
    print(df)
               NAMES        ID
    1   Peter Elliot  12345678
    2   Peter Elliot  12345678
    3   Peter Elliot  12345678
    4  Lucas Kershaw  87654321
    5  Lucas Kershaw  87654321
    6  Lucas Kershaw  87654321
    7  Lucas Kershaw  87654321