Search code examples
pythonpandasfaker

Pandas loop and anonymise data


I need to read an excel file and loop over rows, retain value for a column Location, anonymise the other values for the same column, and output the result for all combinations to seperate pdf files

Original DF

| Location  | Patients | Deceased | Rank |
|-----------|----------|----------|------|
| Leicester | 2000     | 13       | 1    |
| Coventry  | 2200     | 24       | 2    |
| Norwich   | 3000     | 56       | 3    |
| Sheffield | 2800     | 65       | 4    |
| Luton     | 1800     | 90       | 5    |

Desired DF #1

| Location  | Patients | Deceased | Rank |
|-----------|----------|----------|------|
| Leicester | 2000     | 13       | 1    |
| ########  | 2200     | 24       | 2    |
| ########  | 3000     | 56       | 3    |
| ########  | 2800     | 65       | 4    |
| ########  | 1800     | 90       | 5    |

Desired DF #2

| Location  | Patients | Deceased | Rank |
|-----------|----------|----------|------|
| ########  | 2000     | 13       | 1    |
| Coventry  | 2200     | 24       | 2    |
| ########  | 3000     | 56       | 3    |
| ########  | 2800     | 65       | 4    |
| ########  | 1800     | 90       | 5    |

I want to write out each DF using pd.to_html and then convert.

I'm finding it hard to figure out how to implement this in pandas or how best use iterrows and faker to achieve this or whether a replace() would do in each iteration even.


Solution

  • The easiest way to do it, it's to play with the index to change one value by one value

    import pandas as pd
    
    import pandas as pd
    
    # Create the fake data
    data = {
        'Location': ['Leicester', 'Coventry', 'Norwich', 'Sheffield', 'Luton'],
        'Patients': [2000, 2200, 3000, 2800, 1800],
        'Deceased': [13, 24, 56, 65, 90],
        'Rank': [1, 2, 3, 4, 5]
    }
    
    # Create a DataFrame
    df = pd.DataFrame(data)
    
    # Step 2: Function to create a new DataFrame with anonymized Locations except for one row
    def anonymize_locations(df, index_to_keep):
        # Create a copy of the original DataFrame to avoid modifying it
        df_copy = df.copy()
        # Replace all Location values with '########' except for the index_to_keep
        df_copy['Location'] = df_copy['Location'].apply(lambda x: '########')
        df_copy.at[index_to_keep, 'Location'] = df.at[index_to_keep, 'Location']
        return df_copy
    
    
    # Step 3: Loop over each row and generate the desired DataFrames and save to html
    for index in df.index:
        anonymized_df = anonymize_locations(df, index)
        file_name = f"output_{index + 1}.html"
        anonymized_df.to_html(file_name)
        print(f"Saved {file_name}")