Search code examples
pythonpandasreplacegroup-bydata-cleaning

How to replace string values of a Pandas column with a string except some values?


Example DataFrame:

import pandas as pd
df = pd.DataFrame({'Age' : [70.0, 58.0, 44.0, 40.0, 21.0, 35.0, 12.0, 43.0, 45.0, 65.0, 56.0, 31.0, 30.0,
                            52.0, 59.0, 52.0, 31.0, 55.0, 42.0, 73.0],
                   'MarketSegment' : ['Travel Agent/Operator', 'Other', 'Other', 'Other', 'Other',
                                      'Direct', 'Groups', 'Other', 'Other', 'Direct', 'Other',
                                      'Other', 'Other', 'Other', 'Groups', 'Groups', 'Other', 'Other',
                                      'Groups', 'Other'],
                   'Nationality' : ['CAN', 'ESP', 'FRA', 'DEU', 'GBR', 'RUS', 'IRL', 'FRA', 'IRL',
                                    'BRA', 'LTU', 'CHE', 'FRA', 'GBR', 'FRA', 'PRT', 'DEU', 'ESP',
                                    'CHE', 'USA']})

First, I only want the top 3 most common nationalities. I used the code below:

top_nat = df.groupby('Nationality').count().sort_values \
(by='Age', ascending = False).head(3).iloc[:, 0].index.to_list()

(Is there any way to do it using only the frequency of unique values in the 'Nationalities' column? And by not using any other column, like 'Age'?)

Now I want all the values in 'Nationalities' to be replaced by 'OTR' except the values == those in top_nat. I tried stuff like these:

df['Nationality'].replace(~top_nat,'OTR', inplace=True)

df["Nationality"] = df["Nationality"].apply(lambda x: x.replace(~top_nat, "OTR"))

for x in top_nat:
    df.loc[df['Nationality'] != x, 'Nationality'] = 'OTR'

Nothing's working. Maybe I want something like:

if values in df.Nationality != values in top_nat:
    replace that value in df.Nationality with 'OTR'
else:
    continue

Shape of original dataset is (82580, 30) and I need top 15 nationalities. Please help.


Solution

  • First get the top 3:

    top = df['Nationality'].value_counts().nlargest(3).index
    

    then set the nationality

    df.loc[~df['Nationality'].isin(top), 'Nationality'] = 'OTR'
    

    This leaves the top 3 nationalities as they are and replaces everything else to 'OTR'