Search code examples
pythonpandasdataframepivotdata-cleaning

pivot data, in case of multiple values


I have a pandas timeseries dataframe df with columns date, week, week_start_date, country, campaign_name, active for some date(s) we have multiple campaign information.

for example

data = [["2023.01.02", 1, "2023.01.01", "BR", "SALE-1", 1],
        ["2023.01.02", 1, "2023.01.01", "BR", "SALE-2", 1],
        ["2023.01.02", 1, "2023.01.01", "NL", "SALE-1", 1],
        ["2023.01.02", 1, "2023.01.01", "DE", "SALE-1", 1]]

df = pd.DataFrame(data, columns=["date", "week", "week_start_date", "country", "campaign_name", "active"])



date        week     week_start_date     country    campaign_name     active 
2023.01.02   1       2023.01.01          BR         SALE-1              1
2023.01.02   1       2023.01.01          BR         SALE-2              1
2023.01.02   1       2023.01.01          NL         SALE-1              1
2023.01.02   1       2023.01.01          DE         SALE-1              1

I don't mind having separate date country time-series combination but for the same country in case we have 2 campaigns then I would like to pivot it

date       week  week_start_date  country  campaign_name  active  campaign_name_n active_n total_active
2023.01.02   1     2023.01.01       BR         SALE-1       1       SALE-2         1            2
2023.01.02   1     2023.01.01       NL         SALE-1       1         NaN          NaN          1
2023.01.02   1     2023.01.01       DE         SALE-1       1         NaN          NaN          1

now campaign_name_n , active_n could be any number based on the values we find while running the loop.

I am trying to use

import pandas as pd

# Load your data into a pandas DataFrame
df = pd.read_csv("data.csv")

# Group the data by date, week, week_start_date, country, and days_active
grouped = df.groupby(["date", "week", "week_start_date", "country", "days_active"])

# Create a dictionary to store the campaign names for each group
campaign_names = {}

# Iterate through the groups
for name, group in grouped:
    # Check if there are multiple entries for a particular date
    if len(group) > 1:
        # Create new columns for the campaign names
        for i, row in enumerate(group.itertuples()):
            campaign_name = "campaign_name_{}".format(i + 1)
            campaign_names[row.Index] = campaign_name
            df.at[row.Index, campaign_name] = row.campaign_name

# Add the campaign name columns to the DataFrame
for index, campaign_name in campaign_names.items():
    df.at[index, "campaign_name"] = campaign_name

# Drop the original campaign_name column
df = df.drop(columns=["campaign_name"])

# Save the grouped and modified data to a new file
df.to_csv("grouped_data.csv", index=False)

but I am getting all the campaigns pivoted. which is not intended. Would be great if someone can help here. Thank you!


Solution

  • Try:

    x = df.groupby(["date", "week", "week_start_date", "country"]).agg(
        {"campaign_name": list, "active": [list, "sum"]}
    )
    x.columns = (f"{a}_{b}".replace("_list", "") for a, b in x.columns)
    
    tmp = pd.DataFrame(
        x[["campaign_name", "active"]]
        .apply(
            lambda x: {
                f"{a}_{i}": v for a, b in zip(x.index, x.values) for i, v in enumerate(b, 1)
            },
            axis=1,
        )
        .to_list()
    )
    x = pd.concat([x.reset_index(), tmp], axis=1).drop(columns=["campaign_name", "active"])
    print(x)
    

    Prints:

             date  week week_start_date country  active_sum campaign_name_1 campaign_name_2  active_1  active_2
    0  2023.01.02     1      2023.01.01      BR           2          SALE-1          SALE-2         1       1.0
    1  2023.01.02     1      2023.01.01      DE           1          SALE-1             NaN         1       NaN
    2  2023.01.02     1      2023.01.01      NL           1          SALE-1             NaN         1       NaN