python pandas jupyter anonymity

How to replace non-duplicated values in the columns of a CSV file with stars ("*")?


Hello everybody. I need to anonymize a raw table to produce an anonymized table. In other words, I need to replace the non-duplicated values with stars.

So far, I have run this code:

    for j in range(len(zz_new)):
        for i in range(len(zz)):
            if zz_new.iloc[j][0] != zz.iloc[i][0]:
                zz_new.iat[j,0]="*"

            if zz_new.iloc[j][1] != zz.iloc[i][1]:
                zz_new.iat[j,1]="*"

            if zz_new.iloc[j][2] != zz.iloc[i][2]:
                zz_new.iat[j,2]="*"

            if zz_new.iloc[j][3] != zz.iloc[i][3]:
                zz_new.iat[j,3]="*"

            if zz_new.iloc[j][4] != zz.iloc[i][4]:
                zz_new.iat[j,4]="*"

However, the result looks like this: [image: My anonymized table]. Could you help me get the correct anonymized table?


Solution

  • Use the value_counts() method:

    df                                                                                                                   
         age  education
    0  30-39    HS-grad
    1  40-49  Bachelors
    2  30-39    HS-grad
    3  30-39       11th
    
    # True for every education value that occurs exactly once
    vcnt = df.education.value_counts().eq(1)
    
    HS-grad      False
    Bachelors     True
    11th          True
    Name: education, dtype: bool
    
    # Replace those single-occurrence values with "*"
    df["education"] = df.education.replace(vcnt.loc[vcnt].index, "*")
    
         age education
    0  30-39   HS-grad
    1  40-49         *
    2  30-39   HS-grad
    3  30-39         *
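
The example above handles one column, while the question involves several. A minimal sketch of one way to apply the same idea to every column is shown below. It swaps value_counts() for Series.duplicated(keep=False), which flags every value that appears more than once. The helper name mask_unique_values and its mask parameter are made up for illustration, and the sample frame is the one from the answer. It assumes "non-duplicated" means a value that occurs exactly once within its own column.

```python
import pandas as pd


def mask_unique_values(df: pd.DataFrame, mask: str = "*") -> pd.DataFrame:
    """Return a copy of df where values that occur only once in their column are replaced by mask."""
    # Work on an object-typed copy so the mask string fits in any column.
    out = df.astype(object)
    for col in df.columns:
        # duplicated(keep=False) is True for every member of a repeated group.
        is_repeated = df[col].duplicated(keep=False)
        out.loc[~is_repeated, col] = mask
    return out


# Sample data taken from the answer above.
df = pd.DataFrame({
    "age": ["30-39", "40-49", "30-39", "30-39"],
    "education": ["HS-grad", "Bachelors", "HS-grad", "11th"],
})

anonymized = mask_unique_values(df)
print(anonymized)
```

With this per-column rule, the single "40-49" in the age column is masked as well, because it also occurs only once. The original df is left unchanged.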