python-3.x pandas statistics normal-distribution

Generating a list of categorical variables where categorical count is normally distributed


I aim to generate some synthetic data of 1000 rows (to be represented in a pd.DataFrame object) populated with a set of categorical variables.

Suppose I have a dict object of all possible categorical variables which can exist.

The dict is sorted in priority order, with 'Aaa' being the highest priority and 'NR' the lowest.

credit_score_types = {
    'Aaa':0,
    'Aa1':1,
    'Aa2':2,
    'Aa3':3,
    'A1':4,
    'A2':5,
    'A3':6,
    'Baa1':7,
    'Baa2':8,
    'Baa3':9,
    'Ba1':10,
    'Ba2':11,
    'Ba3':12,
    'B1':13,
    'B2':14,
    'B3':15,
    'Caa':16,
    'Ca':17,
    'C':18,
    'e, p':19,
    'WR':20,
    'unsolicited':21,
    'NR':22
}

The dict object key with the median value will represent the "peak" of the Normal Distribution.

In this case 'Ba2' will be the "peak" of the Normal Distribution.

The expected outcome:

To randomly populate a pd.DataFrame with 1000 rows (or a list of length 1000) with categorical variables from the above dict object, such that the count of each category follows a Normal Distribution.

'Ba2' will have the highest count.

If a bar chart is plotted with the count of each categorical occurrence, I would observe a graph of normally distributed shape (similar to below).

Graph to illustrate the expected shape when plotting the categorical variables.


Solution

  • A normal distribution is continuous, not categorical. You might consider binning normally distributed data into intervals of width 1.0 centred on each category's value: i.e. 'Ba2', the peak with value 11, will count all normally distributed values in the interval [10.5, 11.5), 'Ba1' will count all values in the interval [9.5, 10.5)... 'Aaa' will count all values in the interval [-0.5, 0.5), and so on. Values that fall outside [-0.5, 22.5] are not counted.

    import numpy as np 
    import matplotlib.pyplot as plt
    
    credit_score_types = {
        'Aaa':0,
        'Aa1':1,
        'Aa2':2,
        'Aa3':3,
        'A1':4,
        'A2':5,
        'A3':6,
        'Baa1':7,
        'Baa2':8,
        'Baa3':9,
        'Ba1':10,
        'Ba2':11,
        'Ba3':12,
        'B1':13,
        'B2':14,
        'B3':15,
        'Caa':16,
        'Ca':17,
        'C':18,
        'e, p':19,
        'WR':20,
        'unsolicited':21,
        'NR':22
    }
    
    # generate normally distributed data, fix random state 
    np.random.seed(42)
    mu, sigma = credit_score_types['Ba2'], 5
    X = np.random.normal(mu, sigma, 1000)
    
    fig, ax = plt.subplots()
    
    # 24 edges from -0.5 to 22.5 give 23 bins of width 1.0, one per category
    counts, bins = np.histogram(X, bins = np.linspace(-0.5, 22.5, 24))
    
    # create a new dictionary of category names and counts
    data = dict(zip(credit_score_types.keys(), counts))
    ax.bar(data.keys(), data.values())
    plt.xticks(rotation = 'vertical')
    
    plt.show()
    

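The plot above only shows per-category counts, whereas the question asks for a list of 1000 labels. A minimal sketch of one way to get that, assuming the same peak ('Ba2') and the same sigma of 5: compute the probability mass of each unit-width bin from the normal CDF, renormalise it (the tails outside [-0.5, 22.5] are discarded), and sample labels with those probabilities. The names `categories`, `normal_cdf`, `probs` and `labels` are illustrative and not part of the original answer.

```python
import math
import numpy as np

# Category names in priority order; the list index equals the dict value above.
categories = [
    'Aaa', 'Aa1', 'Aa2', 'Aa3', 'A1', 'A2', 'A3',
    'Baa1', 'Baa2', 'Baa3', 'Ba1', 'Ba2', 'Ba3',
    'B1', 'B2', 'B3', 'Caa', 'Ca', 'C',
    'e, p', 'WR', 'unsolicited', 'NR',
]

# Assumption: peak at 'Ba2' (value 11) and sigma = 5, as in the code above.
mu, sigma = categories.index('Ba2'), 5


def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, written with the standard library's erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


# Bin edges -0.5, 0.5, ..., 22.5: one unit-width bin per category.
edges = np.arange(len(categories) + 1) - 0.5
cdf = np.array([normal_cdf(e, mu, sigma) for e in edges])

# Probability mass per bin, renormalised so the 23 probabilities sum to 1.
probs = np.diff(cdf)
probs /= probs.sum()

# Draw 1000 category labels with those probabilities.
rng = np.random.default_rng(42)
labels = rng.choice(categories, size=1000, p=probs)
```

Because every draw lands in one of the 23 categories, the result always has exactly 1000 entries. It can then be wrapped as, for example, `pd.DataFrame({'credit_score': labels})`, where the column name is a placeholder.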