Tags: python, pandas, dataframe, torch, categorical-data

How to generate numeric mapping for categorical columns in pandas?


I want to manipulate categorical data in a pandas DataFrame and then convert it to a NumPy array for model training.

Say I have the following data frame in pandas.

import pandas as pd
df2 = pd.DataFrame({"c1": ['a','b',None], "c2": ['d','e','f']})

>>> df2
     c1 c2
0     a  d
1     b  e
2  None  f

Now I want to "compress the categories" horizontally, as follows:

   compressed_categories
0     c1-a,   c2-d           <--- this could be a string, e.g. "c1-a, c2-d", an array ["c1-a", "c2-d"], or categorical data
1     c1-b,   c2-e
2     c1-nan, c2-f

Next, I want to generate a dictionary/vocabulary from the unique values in compressed_categories, plus a "nan" entry for each column, e.g.:

volcab = {
    "c1-a": 0,
    "c1-b": 1,
    "c1-c": 2,
    "c1-nan": 3,
    "c2-d": 4,
    "c2-e": 5,
    "c2-f": 6,
    "c2-nan": 7,
}

That way I can encode each row numerically, as follows:

   compressed_categories_numeric
0     [0,   4]
1     [1,   5]
2     [3,   6]

My ultimate goal is to make it easy to convert each row to a NumPy array, which I can then convert to a tensor:

input_data = np.asarray(df2['compressed_categories_numeric'].tolist())

Then I can train my model using input_data.

Can anyone show me an example of how to do this series of conversions? Thanks in advance!


Solution

  • To build the volcab dictionary and the compressed_categories_numeric column, you can use:

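    import numpy as np
    # Missing values become the string 'nan'; each value is prefixed with its column name.
    df3 = df2.fillna(np.nan).astype(str).apply(lambda x: x.name + '-' + x)
    # Number the sorted unique tokens, then replace each token with its id, row by row.
    volcab = {k: v for v, k in enumerate(np.unique(df3))}
    df2['compressed_categories_numeric'] = df3.replace(volcab).agg(list, axis=1)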
    

    Output:

    >>> volcab
    {'c1-a': 0, 'c1-b': 1, 'c1-nan': 2, 'c2-d': 3, 'c2-e': 4, 'c2-f': 5}
    
    >>> df2
         c1 c2 compressed_categories_numeric
    0     a  d                        [0, 3]
    1     b  e                        [1, 4]
    2  None  f                        [2, 5]
    
    >>> np.array(df2['compressed_categories_numeric'].tolist())
    array([[0, 3],
           [1, 4],
           [2, 5]])
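
Note that the volcab shown above only numbers tokens that actually occur in the data, so it has no "c2-nan" entry, whereas the question asks for a "nan" slot in every column. The following is a minimal sketch of one way to reserve that slot and to build the intermediate compressed_categories strings. The names cols, prefixed, compressed, tokens and vocab are illustrative assumptions, not part of the original answer, and the resulting ids are not meant to match the numbering in the question (which also lists a "c1-c" that never appears in df2).

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"c1": ["a", "b", None], "c2": ["d", "e", "f"]})
cols = ["c1", "c2"]  # assumption: the categorical columns to encode

# Prefix each value with its column name; missing values become "<col>-nan".
prefixed = pd.DataFrame(
    {c: [f"{c}-nan" if pd.isna(v) else f"{c}-{v}" for v in df2[c]] for c in cols},
    index=df2.index,
)

# Optional string form of each "compressed" row, e.g. "c1-a, c2-d".
compressed = pd.Series(
    [", ".join(row) for row in prefixed.to_numpy()], index=df2.index
)

# Vocabulary: sorted observed tokens per column, always followed by "<col>-nan",
# so every column has a slot for missing values even if none occur in the data.
tokens = []
for c in cols:
    observed = sorted(set(prefixed[c]) - {f"{c}-nan"})
    tokens.extend(observed + [f"{c}-nan"])
vocab = {tok: i for i, tok in enumerate(tokens)}

# Look up each token and stack the rows into a 2-D integer array.
input_data = np.array([[vocab[t] for t in row] for row in prefixed.to_numpy()])
```

Because the lookup uses a plain dictionary, a token that is missing from vocab raises a KeyError instead of passing through silently, which can help catch unseen categories when the same vocabulary is reused on new data.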