I have a dataframe with two columns: name and id. In some cases the names are repeated, and can also be abbreviated, but they'll always have the same ID, example:
NAMES ID
1. Peter Elliot 12345678
2. Peter Elliot 12345678
3. Peter E. 12345678
4. Lucas Kershaw 87654321
5. Lucas Kershaw 87654321
6. Lucas Kershaw 87654321
7. L. Kershaw 87654321
... other's names and id's...
So my objective is to format the names using the longer length of string, using as reference the ID.
The result should be:
NAMES ID
1. Peter Elliot 12345678
2. Peter Elliot 12345678
3. Peter Elliot 12345678
4. Lucas Kershaw 87654321
5. Lucas Kershaw 87654321
6. Lucas Kershaw 87654321
7. Lucas Kershaw 87654321
... other's names and id's...
I've tried something like this:
a = np.asarray(general['ids']
for j in range(len(a)):
ToF = df['ids'] == a[j]
filts = df[ToF]
filtdf = filts.loc[filt, ['names']]
read_longer_string = filt400.names.astype(str).map(len)
if read_longer_string.argmax() == read_longer_string.argmin():
print('nothing should happen')
elif read_longer_string.argmax() != read_longer_string.argmin():
longer_string = filtfd.loc[read_longer_string.argmax(), 'names']
df['names'] = df['names'].where(filt == False, df['names'].apply(longer_string.format), errors = 'ignore')
Use Series.str.len
with groupby.transform
:
df['NAMES'] = (df.loc[df['NAMES']
.str.len()
.groupby(df['ID'], sort=False)
.transform('idxmax'),'NAMES'].values)
print(df)
NAMES ID
1 Peter Elliot 12345678
2 Peter Elliot 12345678
3 Peter Elliot 12345678
4 Lucas Kershaw 87654321
5 Lucas Kershaw 87654321
6 Lucas Kershaw 87654321
7 Lucas Kershaw 87654321