What is the best way to convert a dictionary of per-category counts into a column of a DataFrame?
The number of categories in the dictionary varies, but the values in the dictionary always sum to the length of the DataFrame.
The only important requirement is to preserve the order of the categories: A first, then B, and so on.
Here is my situation:
import pandas as pd
import numpy as np
# I have dictionaries with categorical data
dic = {"A":2 , "B": 3 , "C" : 1, "D" : 3 }
# And a separate dataframe with data
df = pd.DataFrame(np.random.rand(9,2), columns=['x','y'])
# For my data this test should always be true
sum(dic.values()) == len(df)
I want to create a new column df['Cat'] that holds the categories from the dictionary and maintains the same order. E.g. the final output will look like this:
IN: df
OUT:
          x         y Cat
0  0.741620  0.319183   A
1  0.908586  0.547509   A
2  0.767401  0.106174   B
3  0.315343  0.236445   B
4  0.774537  0.415653   B
5  0.306377  0.721040   C
6  0.114037  0.751824   D
7  0.580801  0.869796   D
8  0.413643  0.980575   D
Here's one way of doing that. I split the list comprehension into two steps for clarity:
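dic = {"A": 2, "B": 3, "C": 1, "D": 3}
# One list per category, e.g. ["A", "A"], in the dictionary's insertion order
l1 = [[k] * v for k, v in dic.items()]
# Flatten the nested lists into a single list of labels
l2 = [i for l in l1 for i in l]
# Store the labels as a categorical column (df is the DataFrame from the question)
df["Cat"] = pd.Series(l2, dtype="category")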
The output is:
          x         y Cat
0  0.741620  0.319183   A
1  0.908586  0.547509   A
2  0.767401  0.106174   B
3  0.315343  0.236445   B
4  0.774537  0.415653   B
5  0.306377  0.721040   C
6  0.114037  0.751824   D
7  0.580801  0.869796   D
8  0.413643  0.980575   D
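As an alternative sketch (not part of the approach above), np.repeat can build the same labels in one call. This assumes Python 3.7+, where dictionaries keep insertion order, and the `labels` name is just illustrative. Wrapping the result in pd.Categorical instead of pd.Series means the assignment is positional, so it should also work if df does not have the default 0..n-1 index.

```python
import numpy as np
import pandas as pd

dic = {"A": 2, "B": 3, "C": 1, "D": 3}
df = pd.DataFrame(np.random.rand(9, 2), columns=["x", "y"])

# Repeat each key by its count; order follows the dictionary's insertion order.
labels = np.repeat(list(dic.keys()), list(dic.values()))

# Fail early if the counts do not add up to the number of rows.
assert len(labels) == len(df)

# A Categorical carries no index, so the values are assigned by position.
# Passing categories explicitly keeps the category order A, B, C, D.
df["Cat"] = pd.Categorical(labels, categories=list(dic.keys()))
```

Passing the categories explicitly is optional here, but it keeps the category order tied to the dictionary rather than to whatever order pandas infers from the values.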