I want to generate synthetic data of 1000 rows (represented as a pd.DataFrame object) populated with a set of categorical variables.
Suppose I have a dict object of all possible categorical values.
The dict is sorted in priority order, with 'Aaa' being the highest priority and 'NR' being the lowest.
credit_score_types = {
'Aaa':0,
'Aa1':1,
'Aa2':2,
'Aa3':3,
'A1':4,
'A2':5,
'A3':6,
'Baa1':7,
'Baa2':8,
'Baa3':9,
'Ba1':10,
'Ba2':11,
'Ba3':12,
'B1':13,
'B2':14,
'B3':15,
'Caa':16,
'Ca':17,
'C':18,
'e, p':19,
'WR':20,
'unsolicited':21,
'NR':22
}
The dict key with the median value will represent the "peak" of the Normal Distribution.
In this case, 'Ba2' (value 11) will be the peak.
The expected outcome:
To randomly populate a pd.DataFrame with 1000 rows (or a list of length 1000) using categorical values from the above dict object. The assignment of categorical values will follow a Normal Distribution.
'Ba2' will have the highest count.
If a bar chart is plotted with the count of each category, I would observe a graph with a normally distributed shape.
A normal distribution is continuous, not categorical. You might consider binning normally distributed data into intervals of width 1.0: 'Ba2', which has the value 11 and sits at the peak, counts all normally distributed values in the interval [10.5, 11.5], 'Ba1' counts all values in the interval [9.5, 10.5], and so on down to 'Aaa', which counts all values in the interval [-0.5, 0.5].
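As a quick illustration of the interval rule, here is a minimal sketch that maps a few continuous values to their categories by rounding to the nearest integer. The sample values and the names `labels`, `values`, `codes` and `names` are made up for this example; only the category order comes from the question.

```python
import numpy as np

# Category labels in priority order; the list index equals the dict value
labels = ['Aaa', 'Aa1', 'Aa2', 'Aa3', 'A1', 'A2', 'A3', 'Baa1', 'Baa2', 'Baa3',
          'Ba1', 'Ba2', 'Ba3', 'B1', 'B2', 'B3', 'Caa', 'Ca', 'C', 'e, p',
          'WR', 'unsolicited', 'NR']

# Hypothetical continuous draws, chosen to sit inside known intervals
values = np.array([10.6, 11.4, 9.7, -0.3])

# Rounding to the nearest integer is the same as using unit-width bins
# centred on each category value, e.g. [10.5, 11.5] for 'Ba2'
codes = np.rint(values).astype(int)
names = [labels[i] for i in codes]
print(names)  # ['Ba2', 'Ba2', 'Ba1', 'Aaa']
```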
import numpy as np
import matplotlib.pyplot as plt
credit_score_types = {
'Aaa':0,
'Aa1':1,
'Aa2':2,
'Aa3':3,
'A1':4,
'A2':5,
'A3':6,
'Baa1':7,
'Baa2':8,
'Baa3':9,
'Ba1':10,
'Ba2':11,
'Ba3':12,
'B1':13,
'B2':14,
'B3':15,
'Caa':16,
'Ca':17,
'C':18,
'e, p':19,
'WR':20,
'unsolicited':21,
'NR':22
}
# generate normally distributed data, fix random state
np.random.seed(42)
mu, sigma = credit_score_types['Ba2'], 5
X = np.random.normal(mu, sigma, 1000)
fig, ax = plt.subplots()
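# 24 edges give 23 unit-width bins, one per category;
# values outside [-0.5, 22.5] are not counted
counts, bins = np.histogram(X, bins = np.linspace(-0.5, 22.5, 24))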
# create a new dictionary of category names and counts
data = dict(zip(credit_score_types.keys(), counts))
ax.bar(data.keys(), data.values())
plt.xticks(rotation = 'vertical')
plt.show()
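The histogram above gives counts per category, but the question also asks for the 1000 labels themselves. One possible way to get them, sketched under some assumptions: round each draw to the nearest category value and clip the tails into the end categories so that exactly 1000 rows remain. The clipping step, the column name `credit_score` and the use of `np.random.default_rng` are choices made for this sketch, not part of the original code; `sigma = 5` is carried over from above.

```python
import numpy as np
import pandas as pd

# Category labels in priority order; the list index equals the dict value
labels = ['Aaa', 'Aa1', 'Aa2', 'Aa3', 'A1', 'A2', 'A3', 'Baa1', 'Baa2', 'Baa3',
          'Ba1', 'Ba2', 'Ba3', 'B1', 'B2', 'B3', 'Caa', 'Ca', 'C', 'e, p',
          'WR', 'unsolicited', 'NR']

rng = np.random.default_rng(42)
mu, sigma = labels.index('Ba2'), 5  # peak at 'Ba2' (value 11)
X = rng.normal(mu, sigma, 1000)

# Round to the nearest category value, then clip so that tail draws
# below -0.5 or above 22.5 land in 'Aaa' or 'NR' (assumption: clip, not drop)
codes = np.clip(np.rint(X), 0, len(labels) - 1).astype(int)

# Ordered categorical column, so sorting and plotting keep the priority order
df = pd.DataFrame({
    'credit_score': pd.Categorical.from_codes(codes, categories=labels, ordered=True)
})

# Counts in category order, ready for a bar chart
counts = df['credit_score'].value_counts(sort=False)
print(counts)
```

Clipping slightly inflates the two end categories. If that matters, the out-of-range draws could be redrawn instead, or a smaller `sigma` could be used so that fewer values fall outside the range.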