Search code examples
pythonpandasdataframemelt

How to handle duplicated or blank columns and add number in duplicated ones in python dataframe?


Dataframe has 200+columns with repeated dates and empty columns

         weight height total  weight height total
          2019   2019   2019  2020   2020   2020
Species  jan1   jan1    ''    jan1   jan1    ''
cat      1.0    2.0     3     4.0    3.0     7
dog      3.0    4.0     9     4.0    5.0     9

I tried:

[x for x in df.columns if df.columns.count(x) >1]

#error: 'MultiIndex' object has no attribute 'count'

df.stack(dropna=False)

#error: cannot reindex from a duplicate axis

Objective: To add any string value like 'a.jan1' repeated column and blank columns renamed as a, b.... so on.

Output required in tabular form for further processing and storage ..


 class    year    Month    cat    dog   
 weight   2019    jan1     1       3    
 height   2019    jan1     2       4
 weight   2020    jan1     4       4
 height   2020    jan1     3       5

Solution

  • So, given the following dataframe:

    import pandas as pd
    
    df = pd.DataFrame(
        {
            ("weight", 2019, "jan1"): {"cat": 1, "dog": 3},
            ("height", 2019, "jan1"): {"cat": 2, "dog": 4},
            ("total", 2019, ""): {"cat": 3, "dog": 9},
            ("weight", 2020, "jan1"): {"cat": 4, "dog": 4},
            ("height", 2020, "jan1"): {"cat": 3, "dog": 5},
            ("total", 2020, ""): {"cat": 7, "dog": 9},
        }
    )
    print(df)
    # Outputs
        weight height total weight height total
          2019   2019  2019   2020   2020  2020
          jan1   jan1         jan1   jan1      
    cat      1      2     3      4      3     7
    dog      3      4     9      4      5     9
    

    You could try this:

    # UNpivot the dataframe
    new_df = df.reset_index().melt(
        id_vars=[("index", "", "")],
        value_vars=[
            ("weight", 2019, "jan1"),
            ("height", 2019, "jan1"),
            ("weight", 2020, "jan1"),
            ("height", 2020, "jan1"),
        ],
    )
    new_df.columns = ["species", "class", "year", "month", "value"]
    
    # Make separate dataframes for "cats" and "dogs" and store them in a list
    temp_dfs = []
    for species in new_df["species"].unique():
        temp_df = new_df.loc[new_df["species"] == species, :]
        temp_df = temp_df.rename(columns={"value": species}).drop(columns="species")
        temp_dfs.append(temp_df)
    
    # Merge "cats" and "dogs"
    final_df = temp_dfs[0]
    for temp_df in temp_dfs[1:]:
        final_df = pd.merge(final_df, temp_df, on=["class", "year", "month"], how="outer")
    

    And so:

    print(final_df)
    # Output
        class  year month  cat  dog
    0  weight  2019  jan1    1    3
    1  height  2019  jan1    2    4
    2  weight  2020  jan1    4    4
    3  height  2020  jan1    3    5