Search code examples
pythonpandasdataframecategories

How to create a category column and explode that into new rows


I have this extremely messed up dataframe:

id     letter_1     letter_2    letter_3     number_1    number_2
1      abc                                   123         
2      def           ghi          jkl                    456
3                    mno          pqr        789         

and basically the dataframe I expect would be:

id     letter/number    data
1      letter           abc
1      number           123
2      letter           def
2      letter           ghi
2      letter           jkl
2      number           456
3      letter           mno
3      letter           pqr
4      number           789

I thought I'd go with the letters first and then the numbers. So I have my dataframe 'data':

data = pd.DataFrame({'id':['1','2','3'],letter_1':['abc','def',''],'letter_2':['','ghi','mno'],'letter_3':['','jkl','pqr'],'number_1':['123','','789'],'number_2':['','456','']})

1- Create a column in 'Category' format by concatenating the columns 'letter_1', 'letter_2' and 'letter_3' *here I'm having difficulties with null values not being part of the category but I'm using:

data['new_col_category'] = data.apply(lambda row: row['letter_1'] + "," + row['letter_2'] + "," + row['letter_3'], axis=1).astype('category')

2- explode that column turning each combination into a new row:

import numpy as np
from itertools import chain

# return list from series of comma-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split(',')))

# calculate lengths of splits
lens = data['new_col_category'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'id': np.repeat(data['id'], lens),
                    'number_1': np.repeat(data['number_1'], lens),
                    'number_2': np.repeat(data['number_2'], lens),
                    'new_col_category': chainer(data['new_col_category'])})

After that, I thought of creating the 'Letter/Number' column and assigning everything as 'Letter'. Then repeat the whole process using the columns of numbers and at the end assign data['Letter/Number'] = 'Number'

Is that make sense? I think I'm missing something. Any help?


Solution

  • Here is a way using stack. First remove the _n from the columns names, then set_index the column id, mask the cell with empty string that would be remove when stacking the data. Then use reset_index and rename to fit the expected output.

    # to keep original data if needed
    res = data.copy()
    # remove the _n from columns names
    res.columns = [c.split('_')[0] for c in res.columns]
    res = (
        res.set_index('id')
           .mask(lambda x: x=='')
           .stack()
           .reset_index(name='data')
           .rename(columns={'level_1':'letter/number'})
    )
    print(res)
    #   id letter/number data
    # 0  1        letter  abc
    # 1  1        number  123
    # 2  2        letter  def
    # 3  2        letter  ghi
    # 4  2        letter  jkl
    # 5  2        number  456
    # 6  3        letter  mno
    # 7  3        letter  pqr
    # 8  3        number  789