
Storing CDF in single dataframe cell or several columns


I have 10 item pairs (call them 1A and 1B, 2A and 2B, 3A and 3B, and so on up to 10A and 10B) in a frame like this:

Item_col1    Item_col2   
1A           1B         
2A           2B         
3A           3B         

Each item (e.g., 2A) has an associated cumulative distribution function (CDF). I have the CDFs stored in a list of np.arrays, [CDF_1A, CDF_2A, CDF_3A, CDF_4A, ...]. Each array has 100 elements and looks a little like this:

[0.0000, 0.0100, 0.2000,...0.9999, 1.0]

I'd like to add the CDFs to the frame, ultimately to compare them to each other (e.g., 1A compared to 1B, 2A to 2B), but I am at a loss on the best way to store them in the frame.

Would it be better (and is it even possible) to store them like this, with one whole CDF array per cell:

Item_col1    Item_col2    CDF_Item_col1    CDF_Item_col2
1A           1B           CDF_1A           CDF_1B
2A           2B           CDF_2A           CDF_2B
3A           3B           CDF_3A           CDF_3B

Or should it (or does it have to) be like this, with one column per CDF element:

Item_col1    Item_col2 (As) CDF_Element1    CDF_Element2....CDF_Element100   (Bs) CDF_Element1    CDF_Element2....CDF_Element100 
1A           1B             0.0000          0.0100          1.0000                0.0000          0.0100          1.0000    
2A           2B             0.0000          0.0100          1.0000                0.0000          0.0100          1.0000
3A           3B             0.0000          0.0100          1.0000                0.0000          0.0100          1.0000

Solution

  • I think you can store them in a long format, something like this:

    df
       item1 item2      cdfA      cdfB
    0     1A    1B  0.574843  0.501655
    1     1A    1B  0.574843  0.638855
    2     1A    1B  0.574843  0.827372
    3     1A    1B  0.574843  0.450464
    4     1A    1B  0.162894  0.501655
    5     1A    1B  0.162894  0.638855
    6     1A    1B  0.162894  0.827372
    7     1A    1B  0.162894  0.450464
    8     1A    1B  0.479719  0.501655
    9     1A    1B  0.479719  0.638855
    10    1A    1B  0.479719  0.827372
    11    1A    1B  0.479719  0.450464
    12    1A    1B  0.724478  0.501655
    13    1A    1B  0.724478  0.638855
    14    1A    1B  0.724478  0.827372
    15    1A    1B  0.724478  0.450464
    16    2A    2B  0.827809  0.709354
    17    2A    2B  0.827809  0.657139
    18    2A    2B  0.827809  0.115151
    19    2A    2B  0.827809  0.942483
    20    2A    2B  0.717945  0.709354
    

    As you said, you may later want to compare the CDF values between 1A and 1B, 2A and 2B, and so on. With the dataframe laid out this way, I think those comparisons will be easier to make. If you are worried about it taking up more RAM, you can convert the item1 and item2 columns to Categorical, since their values repeat:

    # The pair labels repeat on every row, so store them as categoricals
    cols = ['item1', 'item2']
    for col in cols:
        df[col] = df[col].astype('category')
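
To make the idea concrete, here is a minimal sketch of how such a long frame could be built from two lists of arrays and then used to compare each A with its B. It makes several assumptions that are not in the question: the CDFs are random stand-ins produced by a hypothetical `fake_cdf` helper, both CDFs of a pair are sampled at the same 100 points, and there is one row per sample point (tracked in an extra `point` column), which is a simpler row layout than the frame printed above. The comparison used here, the largest absolute gap between the two curves, is just one possible choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_points = 100  # assumed: every CDF is sampled at the same 100 points


def fake_cdf(rng, n):
    """Hypothetical helper: a non-decreasing array running from 0 to 1."""
    steps = rng.random(n - 1)
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    return cdf / cdf[-1]


# Stand-in data: three pairs instead of ten, to keep the example small.
pairs = [("1A", "1B"), ("2A", "2B"), ("3A", "3B")]
cdfs_a = [fake_cdf(rng, n_points) for _ in pairs]
cdfs_b = [fake_cdf(rng, n_points) for _ in pairs]

# Build one block of n_points rows per pair, then stack the blocks.
# The scalar labels are broadcast down each block by the DataFrame constructor.
df = pd.concat(
    [
        pd.DataFrame(
            {
                "item1": a,
                "item2": b,
                "point": np.arange(n_points),
                "cdfA": cdf_a,
                "cdfB": cdf_b,
            }
        )
        for (a, b), cdf_a, cdf_b in zip(pairs, cdfs_a, cdfs_b)
    ],
    ignore_index=True,
)

# The labels repeat on every row, so categoricals keep memory use down.
for col in ["item1", "item2"]:
    df[col] = df[col].astype("category")

# Compare each A with its B: largest absolute gap between the two curves.
# observed=True keeps only label combinations that actually occur.
max_gap = (
    df.assign(gap=(df["cdfA"] - df["cdfB"]).abs())
    .groupby(["item1", "item2"], observed=True)["gap"]
    .max()
)
print(max_gap)
```

With this layout, any per-pair comparison becomes a `groupby` on the two label columns, which is the main practical advantage over keeping a whole array in each cell.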