Search code examples
pythonpandasmelt

reshaping the dataset in python


I have this dataset:

Account lookup FY11USD FY12USD FY11local FY12local
Sales CA 1000 5000 800 4800
Sales JP 5000 6500 10 15

Trying to arrive to get the data in this format: (below example has 2 years of data but no. of years can vary)

Account lookup Year USD Local
Sales CA FY11 1000 800
Sales CA FY12 5000 4800
Sales JP FY11 5000 10
Sales JP FY12 6500 15

I tried using the below script, but it doesn't segregate USD and local for the same year. How should I go about that?

df.melt(id_vars=["Account", "lookup"], 
    var_name="Year", 
    value_name="Value")

Solution

  • One efficient option is to transform to long form with pivot_longer from pyjanitor, using the .value placeholder ---> the .value determines which parts of the columns remain as headers:

    # pip install pyjanitor
    import pandas as pd
    import janitor
    
    df.pivot_longer(
         index = ['Account', 'lookup'], 
         names_to = ('Year', '.value'), 
         names_pattern = r"(FY\d+)(.+)")
    
      Account lookup  Year   USD  local
    0   Sales     CA  FY11  1000    800
    1   Sales     JP  FY11  5000     10
    2   Sales     CA  FY12  5000   4800
    3   Sales     JP  FY12  6500     15
    

    Another option is to use stack:

    temp = df.set_index(['Account', 'lookup'])
    temp.columns = temp.columns.str.split('(FY\d+)', expand = True).droplevel(0)
    temp.columns.names = ['Year', None]
    temp.stack('Year').reset_index()
    
      Account lookup  Year   USD  local
    0   Sales     CA  FY11  1000    800
    1   Sales     CA  FY12  5000   4800
    2   Sales     JP  FY11  5000     10
    3   Sales     JP  FY12  6500     15
    

    You can also pull it off with pd.wide_to_long after reshaping the columns:

    index = ['Account', 'lookup']
    temp = df.set_index(index)
    temp.columns = (temp
                    .columns
                    .str.split('(FY\d+)')
                    .str[::-1]
                    .str.join('')
                   )
    (pd.wide_to_long(
          temp.reset_index(), 
          stubnames = ['USD', 'local'], 
          i = index, 
          j = 'Year', 
          suffix = '.+')
    .reset_index()
    )
    
      Account lookup  Year   USD  local
    0   Sales     CA  FY11  1000    800
    1   Sales     CA  FY12  5000   4800
    2   Sales     JP  FY11  5000     10
    3   Sales     JP  FY12  6500     15