I have 10 item pairs (call them 1A and 1B, 2A and 2B, 3A and 3B, and so on up to 10A and 10B) in a frame like this:
Item_col1 Item_col2
1A 1B
2A 2B
3A 3B
Each item (e.g., 2A) has an associated cumulative distribution function (CDF). I have the CDFs stored in a list of np.arrays, [CDF_1A, CDF_2A, CDF_3A, CDF_4A]; each array has 100 elements and looks a little like this:
[0.0000, 0.0100, 0.2000,...0.9999, 1.0]
I'd like to add the CDFs to the frame, ultimately to compare them to each other (e.g., 1A compared to 1B, 2A to 2B), but I am at a loss as to the best way to store them in the frame.
Would it be better (and is it even possible) to store them like this:
Item_col1 Item_col2 CDF_Item_col1 CDF_Item_col2
1A 1B CDF_1A CDF_1B
2A 2B CDF_2A CDF_2B
3A 3B CDF_3A CDF_3B
Or should it be, or does it have to be, like this:
Item_col1 Item_col2 (As) CDF_Element1 CDF_Element2....CDF_Element100 (Bs) CDF_Element1 CDF_Element2....CDF_Element100
1A 1B 0.0000 0.0100 1.0000 0.0000 0.0100 1.0000
2A 2B 0.0000 0.0100 1.0000 0.0000 0.0100 1.0000
3A 3B 0.0000 0.0100 1.0000 0.0000 0.0100 1.0000
I think you can store them in a long format, something like this:
df
item1 item2 cdfA cdfB
0 1A 1B 0.574843 0.501655
1 1A 1B 0.574843 0.638855
2 1A 1B 0.574843 0.827372
3 1A 1B 0.574843 0.450464
4 1A 1B 0.162894 0.501655
5 1A 1B 0.162894 0.638855
6 1A 1B 0.162894 0.827372
7 1A 1B 0.162894 0.450464
8 1A 1B 0.479719 0.501655
9 1A 1B 0.479719 0.638855
10 1A 1B 0.479719 0.827372
11 1A 1B 0.479719 0.450464
12 1A 1B 0.724478 0.501655
13 1A 1B 0.724478 0.638855
14 1A 1B 0.724478 0.827372
15 1A 1B 0.724478 0.450464
16 2A 2B 0.827809 0.709354
17 2A 2B 0.827809 0.657139
18 2A 2B 0.827809 0.115151
19 2A 2B 0.827809 0.942483
20 2A 2B 0.717945 0.709354
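For what it's worth, here is a rough sketch of how a long frame with these columns could be put together from two lists of arrays. The CDF values are random stand-ins, and the names fake_cdf, labels_a, labels_b, cdfs_a, cdfs_b and the extra element column are only assumptions for illustration. Note that this sketch lines up element k of A with element k of B, so it gives one row per element of each pair rather than the repeated combinations in the printout above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def fake_cdf(n=100):
    # Hypothetical stand-in for a real CDF array: sorted uniform draws,
    # rescaled so the last value is exactly 1.0 (non-decreasing by construction).
    x = np.sort(rng.random(n))
    return x / x[-1]

# Assumed inputs: one label and one 100-element array per item.
labels_a = ['1A', '2A', '3A']
labels_b = ['1B', '2B', '3B']
cdfs_a = [fake_cdf() for _ in labels_a]
cdfs_b = [fake_cdf() for _ in labels_b]

# Build one small frame per pair and stack them.
# The scalar labels are broadcast down the length of the arrays.
df_long = pd.concat(
    [
        pd.DataFrame({
            'item1': a,
            'item2': b,
            'element': np.arange(len(ca)),  # position within the CDF
            'cdfA': ca,
            'cdfB': cb,
        })
        for a, b, ca, cb in zip(labels_a, labels_b, cdfs_a, cdfs_b)
    ],
    ignore_index=True,
)

print(df_long.head())
```

Keeping the element column means you can always recover the original order of each CDF after sorting or filtering.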
As you said, you may later want to compare the CDF values between 1A and 1B, 2A and 2B, and so on. With the dataframe laid out this way, I think those comparisons will be easier to make. If you are worried that it will take up more RAM, you can convert the item1 and item2 columns to Categorical, since their values repeat:
cols = ['item1', 'item2']
for col in cols:
    # each repeated label is stored once as a category, which saves memory
    df[col] = df[col].astype('category')
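To give an idea of the kind of comparison this layout allows, here is a minimal sketch that computes the largest absolute gap between the two CDFs of each pair. The toy numbers, the abs_diff column and the max_gap name are made up for illustration, and the maximum gap is just one possible comparison, not necessarily the one you need.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data in the long layout: 2 pairs, 5 CDF points each.
df = pd.DataFrame({
    'item1': ['1A'] * 5 + ['2A'] * 5,
    'item2': ['1B'] * 5 + ['2B'] * 5,
    'cdfA': [0.0, 0.2, 0.5, 0.9, 1.0, 0.0, 0.1, 0.4, 0.8, 1.0],
    'cdfB': [0.0, 0.3, 0.6, 0.8, 1.0, 0.0, 0.1, 0.2, 0.5, 1.0],
})

# Same categorical conversion as above.
for col in ['item1', 'item2']:
    df[col] = df[col].astype('category')

# Row-wise absolute difference between the paired CDF values.
df['abs_diff'] = (df['cdfA'] - df['cdfB']).abs()

# Largest gap per pair; observed=True skips label combinations
# that never occur together (e.g. 1A with 2B).
max_gap = df.groupby(['item1', 'item2'], observed=True)['abs_diff'].max()

print(max_gap)
```

The same groupby pattern works for other per-pair summaries, such as the mean difference.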